Introducing the {pipeflow} package


Efficiently managing complex data analysis workflows can be a challenge. In standard R programming, chaining functions, tracking intermediate results, and maintaining dependencies between steps often lead to cluttered code that is difficult to scale or modify. Enter {pipeflow} — a beginner-friendly R package designed to simplify and streamline data analysis pipelines by making them modular, intuitive, and adaptable.


In this post, we’ll contrast the traditional approach with {pipeflow}, showcasing how it empowers users to build robust workflows while reducing complexity. Let’s dive in!

The Problem: Standard R Workflow

Consider an analysis of R’s airquality data set,

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

where we want to:

- convert the temperature from Fahrenheit to Celsius,
- fit a linear model of ozone as a function of temperature, and
- visualize the model fit together with the data.

Here’s how the workflow might look using standard R:

library(ggplot2)

# Step 1: Prepare the data
airquality$Temp.Celsius <- (airquality$Temp - 32) * 5 / 9

# Step 2: Fit a linear model
model <- lm(Ozone ~ Temp.Celsius, data = airquality)

# Step 3: Generate the plot
coeffs <- coefficients(model)
ggplot(airquality) +
  geom_point(aes(Temp.Celsius, Ozone)) +
  geom_abline(intercept = coeffs[1], slope = coeffs[2]) +
  labs(title = "Linear model fit")

While functional, this approach has clear drawbacks:

- Intermediate objects (the modified airquality, model, coeffs) accumulate in the global environment, and the original data set is overwritten in place.
- Dependencies between steps exist only implicitly in the script order; nothing checks that they are consistent.
- Any change, whether to the data, the predictor variable, or the plot title, requires manually deciding which steps to rerun, or rerunning the whole script.
- As the analysis grows, the script becomes harder to read, reuse, and maintain.

The {pipeflow} Solution: Modular and Manageable

{pipeflow} addresses these challenges by organizing workflows into modular, dependency-aware steps. Let’s rewrite the same workflow using {pipeflow}.

Step 1: Initialize the Pipeline

First, we create a pipeline named “my-pipeline” and load the airquality data set as input. This also defines the pipeline’s first step, data, which simply holds the input data and can be referenced by later steps:

library(pipeflow)

pip <- Pipeline$new("my-pipeline", data = airquality)

Step 2: Add a Data Preparation Step

Next, we add a step named data_prep that calculates Temp.Celsius. The default data = ~data tells {pipeflow} to pass in the output of the data step (the pipeline’s input); this tilde syntax is how dependencies between steps are declared:

pip$add(
  "data_prep",
  function(data = ~data) {
    replace(data, "Temp.Celsius", (data[, "Temp"] - 32) * 5 / 9)
  }
)

Step 3: Fit a Linear Model

We add another step to fit a linear model using the transformed data, referencing the output of the previous step via data = ~data_prep:

pip$add(
  "model_fit",
  function(data = ~data_prep, xVar = "Temp.Celsius") {
    lm(paste("Ozone ~", xVar), data = data)
  }
)

Step 4: Visualize the Model Fit

Finally, we create a visualization step that uses the outputs from model_fit and data_prep:

pip$add(
  "model_plot",
  function(
    model = ~model_fit,
    data = ~data_prep,
    xVar = "Temp.Celsius",
    title = "Linear model fit"
  ) {
    coeffs <- coefficients(model)
    ggplot(data) +
      geom_point(aes(.data[[xVar]], .data[["Ozone"]])) +
      geom_abline(intercept = coeffs[1], slope = coeffs[2]) +
      labs(title = title)
  }
)
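
At this point the pipeline is fully defined but has not been run yet. If you want a quick overview of the steps it contains, printing the object is enough (a minimal sketch; the exact output of the print method is not shown here):

# Print the pipeline object for an overview of its steps
print(pip)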

Now, we can run the pipeline and inspect the model plot:

pip$run()
INFO  [2024-12-23 11:36:10.291] Start run of 'my-pipeline' pipeline:
INFO  [2024-12-23 11:36:10.321] Step 1/4 data
INFO  [2024-12-23 11:36:10.332] Step 2/4 data_prep
INFO  [2024-12-23 11:36:10.357] Step 3/4 model_fit
INFO  [2024-12-23 11:36:10.361] Step 4/4 model_plot
INFO  [2024-12-23 11:36:10.368] Finished execution of steps.
INFO  [2024-12-23 11:36:10.368] Done.
pip$get_out("model_plot")

(Figure: scatter plot of Ozone versus Temp.Celsius with the fitted regression line, titled “Linear model fit”.)
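
Assuming get_out() returns the stored output of any named step (as its use above suggests), intermediate results such as the fitted model can be inspected the same way:

# Pull the lm object produced by the "model_fit" step and summarize it
fit <- pip$get_out("model_fit")
summary(fit)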

Why {pipeflow}?

Here are the key advantages of using {pipeflow} over the standard approach:

- Modularity: each step is a self-contained function with named parameters, making it easy to reuse, test, or swap out.
- Automatic dependency tracking: steps reference each other via the tilde syntax, so {pipeflow} knows what depends on what.
- Integrity checks: misconfigurations, such as references to non-existent steps, are caught when the pipeline is defined.
- Selective re-execution: when parameters or data change, only the affected steps are rerun.
- Transparency: the pipeline can be inspected and visualized as a dependency graph.

Visualizing the Pipeline

With {pipeflow}, you can easily visualize your pipeline using the visNetwork package to produce a diagram showing the flow from data to data_prep, model_fit, and model_plot, making the workflow immediately understandable.

library(visNetwork)
do.call(visNetwork, args = pip$get_graph()) |>
    visHierarchicalLayout(direction = "LR")
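
The do.call() construct works because get_graph() evidently returns the node and edge tables that visNetwork() takes as its first arguments; if you are curious, you can inspect them directly with plain base R:

# Peek at the node/edge data backing the pipeline diagram
str(pip$get_graph(), max.level = 1)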

Ensuring Integrity

{pipeflow} also verifies pipeline integrity at definition time. For example, trying to reference a non-existent step triggers an error:

pip$add(
  "invalid_step",
  function(data = ~non_existent) {
    data
  }
)
Error: step 'invalid_step': dependency 'non_existent' not found

This proactive error-checking ensures that pipelines remain robust and free from misconfigurations.

Dynamic Updates

One of {pipeflow}’s standout features is its ability to dynamically update parameters and rerun only the affected steps. For example:

# Change the predictor variable
pip$set_params(list(xVar = "Solar.R"))
pip$run()
INFO  [2024-12-23 11:36:10.782] Start run of 'my-pipeline' pipeline:
INFO  [2024-12-23 11:36:10.783] Step 1/4 data - skip 'done' step
INFO  [2024-12-23 11:36:10.784] Step 2/4 data_prep - skip 'done' step
INFO  [2024-12-23 11:36:10.785] Step 3/4 model_fit
INFO  [2024-12-23 11:36:10.789] Step 4/4 model_plot
INFO  [2024-12-23 11:36:10.798] Finished execution of steps.
INFO  [2024-12-23 11:36:10.799] Done.

Only the steps depending on xVar (i.e., model_fit and model_plot) are rerun.

# Update input data using only the first 10 rows
pip$set_data(airquality[1:10, ])
pip$run()
INFO  [2024-12-23 11:36:10.828] Start run of 'my-pipeline' pipeline:
INFO  [2024-12-23 11:36:10.829] Step 1/4 data
INFO  [2024-12-23 11:36:10.831] Step 2/4 data_prep
INFO  [2024-12-23 11:36:10.837] Step 3/4 model_fit
INFO  [2024-12-23 11:36:10.840] Step 4/4 model_plot
INFO  [2024-12-23 11:36:10.846] Finished execution of steps.
INFO  [2024-12-23 11:36:10.847] Done.

The entire pipeline is rerun, as all steps depend on the input data.

# Update the plot title
pip$set_params(list(title = "Updated Plot Title"))
pip$run()
INFO  [2024-12-23 11:36:10.878] Start run of 'my-pipeline' pipeline:
INFO  [2024-12-23 11:36:10.879] Step 1/4 data - skip 'done' step
INFO  [2024-12-23 11:36:10.881] Step 2/4 data_prep - skip 'done' step
INFO  [2024-12-23 11:36:10.882] Step 3/4 model_fit - skip 'done' step
INFO  [2024-12-23 11:36:10.884] Step 4/4 model_plot
INFO  [2024-12-23 11:36:10.889] Finished execution of steps.
INFO  [2024-12-23 11:36:10.890] Done.

Only the model_plot step is rerun. Let’s inspect the final result with the updated x-axis variable and plot title:

pip$get_out("model_plot")

(Figure: updated model plot with Solar.R on the x-axis and the title “Updated Plot Title”.)

{pipeflow} vs. targets

The R ecosystem already includes powerful tools like targets, designed for advanced, reproducible workflows. However, targets involves additional setup (pipelines are defined in a dedicated _targets.R script and typically run in a separate R process) and a steeper learning curve, whereas {pipeflow} emphasizes simplicity: the pipeline is an ordinary R object that you build, run, and modify interactively in your session.

While targets excels at highly complex workflows, {pipeflow} offers a versatile solution that is both beginner-friendly and capable of supporting demanding tasks.
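
For a rough sense of the contrast, here is what the first two steps of our example might look like as a minimal targets pipeline (a sketch only, assuming the usual _targets.R layout):

# _targets.R (sketch): a targets pipeline is defined in its own script file
library(targets)

list(
  tar_target(data_prep, transform(airquality, Temp.Celsius = (Temp - 32) * 5 / 9)),
  tar_target(model_fit, lm(Ozone ~ Temp.Celsius, data = data_prep))
)

You would then call tar_make() to run it and tar_read(model_fit) to retrieve a result, whereas {pipeflow} keeps the whole pipeline in a single in-memory object that you manipulate directly.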

Conclusion

{pipeflow} transforms the way you build and manage data analysis workflows in R. By automating dependency tracking, ensuring pipeline integrity, and enabling dynamic updates, it reduces complexity and enhances productivity. Whether you’re working on a simple analysis or a large-scale project, {pipeflow} helps you focus on insights rather than infrastructure.

Ready to give {pipeflow} a try? Explore the documentation to learn more and start building smarter pipelines today!
