ReciPies¶
ReciPies is a Python package for feature engineering and data preprocessing with a focus on medical and clinical data. It provides a unified interface for working with both Polars and Pandas DataFrames while maintaining column role information throughout data transformations.
Summary¶
Declarative, reproducible data preprocessing
Human-readable and transparent pipelines
No trade-off between readability, performance, or flexibility
Backend flexibility: works with Polars and Pandas
Reduces cognitive overhead in feature engineering
Installation¶
Install ReciPies using pip:
pip install recipies
For development installation:
git clone https://github.com/rvandewater/ReciPies.git
cd ReciPies
pip install -e '.[dev]'
Quick Start¶
Here’s a simple example of using ReciPies:
# Import necessary libraries
import polars as pl
import numpy as np
from datetime import datetime, MINYEAR
from recipies import Ingredients, Recipe
from recipies.selector import all_numeric_predictors, all_predictors
from recipies.step import StepSklearn, StepHistorical, Accumulator, StepImputeFill
from sklearn.impute import MissingIndicator
# Set up random state for reproducible results
rand_state = np.random.RandomState(42)
# Create time columns for two different groups
timecolumn = pl.concat([
pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True)
])
# Create sample DataFrame
df = pl.DataFrame({
"id": [1] * 6 + [2] * 4,
"time": timecolumn,
"y": rand_state.normal(size=(10,)),
"x1": rand_state.normal(loc=10, scale=5, size=(10,)),
"x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
"x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
"x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
})
# Introduce some missing values
df = df.with_columns(
pl.when(pl.int_range(pl.len()).is_in([1, 2, 4, 7]))
.then(None)
.otherwise(pl.col("x1"))
.alias("x1")
)
df2 = df.clone()
# Create Ingredients and Recipe
ing = Ingredients(df)
rec = Recipe(
ing,
outcomes=["y"],
predictors=["x1", "x2", "x3", "x4"],
groups=["id"],
sequences=["time"]
)
rec.add_step(StepSklearn(MissingIndicator(features="all"), sel=all_predictors()))
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
# Apply the recipe to the ingredients
df = rec.prep()
# Apply the recipe to a new DataFrame (e.g., test set)
df2 = rec.bake(df2)
Core Concepts¶
- Ingredients
A wrapper around DataFrames that maintains column role information, ensuring data semantics are preserved during transformations.
- Recipe
A collection of processing steps that can be applied to Ingredients objects to create reproducible data pipelines.
- Step
Individual data transformation operations that understand column roles and can work with both Polars and Pandas backends.
- Selector
Utilities for selecting columns based on their roles or other criteria.
Backend Support¶
ReciPies supports both Polars and Pandas backends:
Polars: High-performance DataFrame library with lazy evaluation
Pandas: Traditional DataFrame library with extensive ecosystem support
The package automatically detects the backend and provides a consistent API regardless of the underlying DataFrame implementation.
Examples¶
Check out the examples/ directory for Jupyter notebooks demonstrating various use cases of ReciPies. Check out the benchmarks/ directory for performance comparisons between Polars and Pandas backends.
Contributing¶
Contributions are welcome! Please see our contributing guidelines and open an issue or submit a pull request on the GitHub repository.
License¶
This project is licensed under the MIT License. See the LICENSE file for details.
Contents:
- API Reference
- recipies.ingredients.Ingredients
- recipies.recipe.Recipe
- recipies.step.Step
- recipies.step.StepImputeFill
- recipies.step.StepImputeFastZeroFill
- recipies.step.StepImputeFastForwardFill
- recipies.step.StepImputeModel
- recipies.step.Accumulator
- recipies.step.StepHistorical
- recipies.step.StepSklearn
- recipies.step.StepResampling
- recipies.step.StepScale
- recipies.step.StepFunction
- recipies.selector.Selector
- recipies.selector.enlist_dt
- recipies.selector.enlist_str
- recipies.selector.intersection
- recipies.selector.all_of
- recipies.selector.regex_names
- recipies.selector.starts_with
- recipies.selector.ends_with
- recipies.selector.contains
- recipies.selector.has_role
- recipies.selector.has_type
- recipies.selector.all_predictors
- recipies.selector.all_numeric_predictors
- recipies.selector.all_outcomes
- recipies.selector.all_groups
- recipies.selector.select_groups
- recipies.selector.all_sequences
- recipies.selector.select_sequence