Efficiently managing complex data analysis workflows can be a challenge. In standard R programming, chaining functions, tracking intermediate results, and maintaining dependencies between steps often lead to cluttered code that is difficult to scale or modify. Enter {pipeflow} — a beginner-friendly R package designed to simplify and streamline data analysis pipelines by making them modular, intuitive, and adaptable.
In this post, we’ll contrast the traditional approach with {pipeflow}, showcasing how it empowers users to build robust workflows while reducing complexity. Let’s dive in!
Consider an analysis of R’s airquality data set:
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
where we want to (1) convert the temperature from Fahrenheit to Celsius, (2) fit a linear model of Ozone as a function of temperature, and (3) plot the data together with the fitted regression line.
Here’s how the workflow might look using standard R:
library(ggplot2)
# Step 1: Prepare the data
airquality$Temp.Celsius <- (airquality$Temp - 32) * 5 / 9
# Step 2: Fit a linear model
model <- lm(Ozone ~ Temp.Celsius, data = airquality)
# Step 3: Generate the plot
coeffs <- coefficients(model)
ggplot(airquality) +
geom_point(aes(Temp.Celsius, Ozone)) +
geom_abline(intercept = coeffs[1], slope = coeffs[2]) +
labs(title = "Linear model fit")
While functional, this approach has clear drawbacks: intermediate results clutter the global environment, the dependencies between steps exist only in the analyst’s head, and after any change you must manually decide which steps to rerun — or simply rerun everything.
{pipeflow} addresses these challenges by organizing workflows into modular, dependency-aware steps. Let’s rewrite the same workflow using {pipeflow}.
First, we create a pipeline named “my-pipeline” and load the airquality dataset as input:
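A minimal sketch of this step, assuming {pipeflow}’s R6 Pipeline class (whose $add(), $run(), and related methods are used throughout the rest of this post):

```r
library(pipeflow)
library(ggplot2)

# Create a new pipeline; the supplied data set becomes
# the pipeline's first step, named "data".
pip <- Pipeline$new("my-pipeline", data = airquality)
```

Note that the input data is itself treated as a step, which is why the run logs below report four steps in total.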
Next, we add a step to calculate Temp.Celsius:
pip$add(
"data_prep",
function(data = ~data) {
replace(data, "Temp.Celsius", (data[, "Temp"] - 32) * 5 / 9)
}
)
We add another step to fit a linear model using the transformed data:
pip$add(
"model_fit",
function(data = ~data_prep, xVar = "Temp.Celsius") {
lm(paste("Ozone ~", xVar), data = data)
}
)
Finally, we create a visualization step that uses the outputs from model_fit and data_prep:
pip$add(
"model_plot",
function(
model = ~model_fit,
data = ~data_prep,
xVar = "Temp.Celsius",
title = "Linear model fit"
) {
coeffs <- coefficients(model)
ggplot(data) +
geom_point(aes(.data[[xVar]], .data[["Ozone"]])) +
geom_abline(intercept = coeffs[1], slope = coeffs[2]) +
labs(title = title)
}
)
Now, we can run the pipeline and inspect the model plot:
pip$run()
INFO [2024-12-23 11:36:10.291] Start run of 'my-pipeline' pipeline:
INFO [2024-12-23 11:36:10.321] Step 1/4 data
INFO [2024-12-23 11:36:10.332] Step 2/4 data_prep
INFO [2024-12-23 11:36:10.357] Step 3/4 model_fit
INFO [2024-12-23 11:36:10.361] Step 4/4 model_plot
INFO [2024-12-23 11:36:10.368] Finished execution of steps.
INFO [2024-12-23 11:36:10.368] Done.
pip$get_out("model_plot")
Here are the key advantages of using {pipeflow} over the standard approach: steps are modular and self-contained, dependencies between them are tracked automatically, the pipeline can be visualized and verified for integrity, and changes to parameters or data trigger reruns of only the affected steps.
With {pipeflow}, you can easily visualize your pipeline using the visNetwork package to produce a diagram showing the flow from data to data_prep, model_fit, and model_plot, making the workflow immediately understandable.
library(visNetwork)
do.call(visNetwork, args = pip$get_graph()) |>
visHierarchicalLayout(direction = "LR")
{pipeflow} also verifies pipeline integrity at definition time. For example, trying to reference a non-existent step triggers an error:
pip$add(
"invalid_step",
function(data = ~non_existent) {
data
}
)
Error: step 'invalid_step': dependency 'non_existent' not found
This proactive error-checking ensures that pipelines remain robust and free from misconfigurations.
One of {pipeflow}’s standout features is its ability to dynamically update parameters and rerun only the affected steps. For example:
# Change the predictor variable
pip$set_params(list(xVar = "Solar.R"))
pip$run()
INFO [2024-12-23 11:36:10.782] Start run of 'my-pipeline' pipeline:
INFO [2024-12-23 11:36:10.783] Step 1/4 data - skip 'done' step
INFO [2024-12-23 11:36:10.784] Step 2/4 data_prep - skip 'done' step
INFO [2024-12-23 11:36:10.785] Step 3/4 model_fit
INFO [2024-12-23 11:36:10.789] Step 4/4 model_plot
INFO [2024-12-23 11:36:10.798] Finished execution of steps.
INFO [2024-12-23 11:36:10.799] Done.
Only the steps depending on xVar (i.e., model_fit and model_plot) are rerun.
# Update input data using only the first 10 rows
pip$set_data(airquality[1:10, ])
pip$run()
INFO [2024-12-23 11:36:10.828] Start run of 'my-pipeline' pipeline:
INFO [2024-12-23 11:36:10.829] Step 1/4 data
INFO [2024-12-23 11:36:10.831] Step 2/4 data_prep
INFO [2024-12-23 11:36:10.837] Step 3/4 model_fit
INFO [2024-12-23 11:36:10.840] Step 4/4 model_plot
INFO [2024-12-23 11:36:10.846] Finished execution of steps.
INFO [2024-12-23 11:36:10.847] Done.
The entire pipeline is rerun, as all steps depend on the input data.
# Update the plot title
pip$set_params(list(title = "Updated Plot Title"))
pip$run()
INFO [2024-12-23 11:36:10.878] Start run of 'my-pipeline' pipeline:
INFO [2024-12-23 11:36:10.879] Step 1/4 data - skip 'done' step
INFO [2024-12-23 11:36:10.881] Step 2/4 data_prep - skip 'done' step
INFO [2024-12-23 11:36:10.882] Step 3/4 model_fit - skip 'done' step
INFO [2024-12-23 11:36:10.884] Step 4/4 model_plot
INFO [2024-12-23 11:36:10.889] Finished execution of steps.
INFO [2024-12-23 11:36:10.890] Done.
Only the model_plot step is rerun. Let’s inspect the final result with the updated x-axis variable and plot title:
pip$get_out("model_plot")
The R ecosystem includes powerful tools like targets, designed for advanced, reproducible workflows. However, targets may involve additional setup and a steeper learning curve, while {pipeflow} emphasizes simplicity. While targets excels in highly complex workflows, {pipeflow} offers a versatile solution that is both beginner-friendly and capable of supporting demanding tasks.
{pipeflow} transforms the way you build and manage data analysis workflows in R. By automating dependency tracking, ensuring pipeline integrity, and enabling dynamic updates, it reduces complexity and enhances productivity. Whether you’re working on a simple analysis or a large-scale project, {pipeflow} helps you focus on insights rather than infrastructure.
Ready to give {pipeflow} a try? Explore the documentation to learn more and start building smarter pipelines today!
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".