ReciPies Basics
This notebook introduces the basic concepts of ReciPies. First, we install the package and import the necessary modules. It is often useful to create a virtual environment for your project before installing anything.
# Install ReciPies within the notebook:
# !pip install recipies
import numpy as np
import polars as pl
from recipies import Recipe
from recipies.ingredients import Ingredients
from datetime import datetime, MINYEAR
from IPython.display import display
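Before running the imports, it can help to check that the notebook kernel is actually running inside the virtual environment and that the packages can be found there. The snippet below is a minimal sketch that uses only the standard library; the module list is taken from the import cell above, and the variable names are made up for illustration.

```python
import sys
from importlib.util import find_spec

# Inside a venv-style environment, sys.prefix points at the environment,
# while sys.base_prefix still points at the base Python installation.
in_virtualenv = sys.prefix != sys.base_prefix

# Top-level modules imported by this notebook.
required = ["numpy", "polars", "recipies", "IPython"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if find_spec(name) is None]

print(f"Running inside a virtual environment: {in_virtualenv}")
print(f"Modules not found yet: {', '.join(missing) if missing else 'none'}")
```

Note that the prefix comparison only recognises venv-style environments; other environment managers may need a different check.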
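Creating our data as a Polars DataFrame
We will create a small dataset to demonstrate the functionality of ReciPies. It has ten rows in two groups, with six hourly time steps for id 1 and four for id 2, so the data has a temporal aspect. It also mixes data types: float columns (y, x1), an integer 0/1 column (x2) and two categorical columns (x3, x4). Finally, we set x1 to missing in rows 1, 2, 4 and 7, as missing values are common in real data.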
# Fixed seed so that the example is reproducible
rand_state = np.random.RandomState(42)
# Hourly timestamps: six for id 1 and four for id 2
timecolumn = pl.concat(
    [
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True),
    ]
)
df = pl.DataFrame(
    {
        "id": [1] * 6 + [2] * 4,
        "time": timecolumn,
        "y": rand_state.normal(size=(10,)),
        "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
        "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
        "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
        "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
    }
)
# Set x1 to missing in rows 1, 2, 4 and 7
df[[1, 2, 4, 7], "x1"] = None
df
| id | time | y | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | f64 | i64 | cat | cat |
| 1 | 0001-01-01 00:00:00 | 0.496714 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 01:00:00 | -0.138264 | null | 1 | "b" | "y" |
| 1 | 0001-01-01 02:00:00 | 0.647689 | null | 0 | "c" | "y" |
| 1 | 0001-01-01 03:00:00 | 1.52303 | 0.433599 | 0 | "a" | "x" |
| 1 | 0001-01-01 04:00:00 | -0.234153 | null | 0 | "c" | "y" |
| 1 | 0001-01-01 05:00:00 | -0.234137 | 7.188562 | 0 | "b" | "y" |
| 2 | 0001-01-01 00:00:00 | 1.579213 | 4.935844 | 0 | "c" | "x" |
| 2 | 0001-01-01 01:00:00 | 0.767435 | null | 0 | "a" | "x" |
| 2 | 0001-01-01 02:00:00 | -0.469474 | 5.45988 | 0 | "b" | "y" |
| 2 | 0001-01-01 03:00:00 | 0.54256 | 2.938481 | 1 | "c" | "x" |
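To make the structure of this dataset explicit, the sketch below rebuilds the id, time and x1 columns with the standard library only and summarises them per group. The x1 values are copied from the table above (rounded to six decimals), None stands in for null, and the variable names are made up for illustration; neither Polars nor ReciPies is used.

```python
from collections import Counter
from datetime import MINYEAR, datetime, timedelta

start = datetime(MINYEAR, 1, 1, 0)

# Six hourly timestamps for id 1 and four for id 2; each group starts again at hour 0.
ids = [1] * 6 + [2] * 4
times = [start + timedelta(hours=h) for h in range(6)] + [
    start + timedelta(hours=h) for h in range(4)
]

# x1 as shown in the table; None marks the missing rows 1, 2, 4 and 7.
x1 = [7.682912, None, None, 0.433599, None, 7.188562, 4.935844, None, 5.45988, 2.938481]

rows_per_group = Counter(ids)
missing_per_group = Counter(i for i, value in zip(ids, x1) if value is None)

for group in sorted(rows_per_group):
    last = max(t for i, t in zip(ids, times) if i == group)
    print(f"id {group}: {rows_per_group[group]} rows, "
          f"{missing_per_group[group]} missing x1, last time step {last}")
# id 1: 6 rows, 3 missing x1, last time step 0001-01-01 05:00:00
# id 2: 4 rows, 1 missing x1, last time step 0001-01-01 03:00:00
```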
Creating Ingredients
To get started, we create an Ingredients object from our DataFrame. This object will later be used to create a recipe.
ing = Ingredients(df)
The Ingredients object should also contain the roles of the columns. Roles determine how a column can be processed. For example, the column "y" can be given the outcome role, which we can use later to define what should happen to all columns with that role:
roles = {"y": ["outcome"]}
ing = Ingredients(df, copy=False, roles=roles)
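Conceptually, the roles argument is a mapping from column names to lists of roles. As a rough illustration in plain Python, such a mapping can be queried in the other direction, from a role to the columns that carry it. The two helper functions are hypothetical and are not part of the ReciPies API.

```python
# Column -> roles, in the same shape as the roles argument above.
roles = {"y": ["outcome"]}

# Columns of the example DataFrame, in order.
columns = ["id", "time", "y", "x1", "x2", "x3", "x4"]


def columns_with_role(role, roles, columns):
    """Return the columns that carry `role`, keeping the DataFrame's column order."""
    return [name for name in columns if role in roles.get(name, [])]


def columns_without_role(roles, columns):
    """Return the columns that have not been assigned any role yet."""
    return [name for name in columns if not roles.get(name)]


print(columns_with_role("outcome", roles, columns))  # ['y']
print(columns_without_role(roles, columns))  # ['id', 'time', 'x1', 'x2', 'x3', 'x4']
```

In this toy mapping only "y" has a role so far; the remaining columns still need one.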
Creating a recipe
We can also create a recipe directly and specify the roles as arguments at instantiation. A recipe always needs an Ingredients object. Optionally, it also takes the outcome (target) columns, the predictor (feature) columns, the group columns, and the sequence (time) column:
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])
Displaying the recipe gives a summary you can use to check that your configuration is correct.
display(rec)
Recipe
Inputs:
| | role | amount of variables |
|---|---|---|
| 0 | outcome | 1 |
| 1 | predictor | 4 |
| 2 | group | 1 |
| 3 | sequence | 1 |
Operations:
As expected, ReciPies reports 1 outcome variable (y), 4 predictor variables (x1, x2, x3, x4), 1 grouping variable (id), and 1 sequence variable (time).
The Operations section is still empty, because no operations have been defined yet. We have to add steps to our recipe to define what we want to do with the data. But first, we need a way to select which columns a step should work on.
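The Inputs summary can also be reproduced by hand, which is a quick way to double-check a configuration. The sketch below counts the columns per role from the same four lists that were passed to Recipe above and reports columns that did not receive exactly one role. It is plain Python; summarise_roles is a hypothetical helper, and the one-role-per-column check is an assumption made for this example rather than a ReciPies rule.

```python
from collections import Counter

columns = ["id", "time", "y", "x1", "x2", "x3", "x4"]

# The same lists that were passed to Recipe(...) above, keyed by role name.
config = {
    "outcome": ["y"],
    "predictor": ["x1", "x2", "x3", "x4"],
    "group": ["id"],
    "sequence": ["time"],
}


def summarise_roles(config, columns):
    """Return (variables per role, columns whose number of roles is not exactly one)."""
    per_role = {role: len(names) for role, names in config.items()}
    assigned = Counter(name for names in config.values() for name in names)
    problems = [name for name in columns if assigned[name] != 1]
    return per_role, problems


per_role, problems = summarise_roles(config, columns)
print(per_role)  # {'outcome': 1, 'predictor': 4, 'group': 1, 'sequence': 1}
print(problems)  # []
```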
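Selectors
Selectors are used to select columns based on their roles. For example, we can select all outcome columns, or all predictor columns. We can also combine selectors to select multiple roles at once. Here we select all numeric predictor columns, that is, the columns that have the predictor role and a numeric data type (x1 and x2 in our data, since x3 and x4 are categorical):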
from recipies.selector import all_numeric_predictors
all_numeric_predictors()
all numeric predictors
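The idea behind such a selector can be sketched in plain Python as a filter over column metadata. This is not how ReciPies implements selectors: the schema and roles dictionaries are written out by hand from the table and the recipe above (with simplified type names), and select_numeric_predictors is a hypothetical helper.

```python
# Data types as shown in the header of the DataFrame above (time unit omitted).
schema = {"id": "i64", "time": "datetime", "y": "f64",
          "x1": "f64", "x2": "i64", "x3": "cat", "x4": "cat"}

# Roles as configured in the recipe above.
roles = {"y": "outcome", "x1": "predictor", "x2": "predictor",
         "x3": "predictor", "x4": "predictor", "id": "group", "time": "sequence"}

NUMERIC_DTYPES = {"i64", "f64"}


def select_numeric_predictors(schema, roles):
    """Keep the columns that are predictors and have a numeric data type."""
    return [
        name
        for name, dtype in schema.items()
        if roles.get(name) == "predictor" and dtype in NUMERIC_DTYPES
    ]


print(select_numeric_predictors(schema, roles))  # ['x1', 'x2']
```

Both conditions matter here: y and id are numeric but not predictors, while x3 and x4 are predictors but not numeric.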
Adding steps
Let's preprocess our data! We know that there is some missing data in our predictors: x1 has four missing values. We can add a step that fills in the missing values of all numeric predictors using the mean.
from recipies.selector import all_numeric_predictors
from recipies.step import StepImputeFill
rec = rec.add_step(StepImputeFill(sel=all_numeric_predictors(), strategy="mean"))
print(rec)
Recipe
Inputs:
| | role | amount of variables |
|---|---|---|
| 0 | outcome | 1 |
| 1 | predictor | 4 |
| 2 | group | 1 |
| 3 | sequence | 1 |
Operations:
Impute with mean for all numeric predictors
We see that Operations now contains one step: impute missing values with the mean for all numeric predictor columns. The step has only been added to the recipe so far; it has not been applied to any data yet.
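One way to picture what adding a step does is an ordered list of pending operations, each pairing a description with the selector it will use. The toy class below is a sketch under that assumption, in plain Python. ToyRecipe and its methods are hypothetical and do not reflect how ReciPies is implemented.

```python
class ToyRecipe:
    """Minimal stand-in that only records steps and reports them in order."""

    def __init__(self):
        self.steps = []  # (description, selector description) pairs, in insertion order

    def add_step(self, description, selector):
        """Record a step without touching any data; return self so calls can be chained."""
        self.steps.append((description, selector))
        return self

    def operations(self):
        """Return one line per recorded step, in the order the steps were added."""
        return [f"{description} for {selector}" for description, selector in self.steps]


toy = ToyRecipe()
print(toy.operations())  # []

toy = toy.add_step("Impute with mean", "all numeric predictors")
print(toy.operations())  # ['Impute with mean for all numeric predictors']
```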
Prepping the recipe
Let's prep the recipe. This "trains" the steps we added to the recipe on the data, in the order in which they were added. The result is a recipe that is ready to bake any data that has the same schema as the data we used for prepping. This is useful, for example, when we want to apply the same preprocessing steps to a test set or to new data.
rec.prep(df)
display(rec)
Recipe
Inputs:
| | role | amount of variables |
|---|---|---|
| 0 | outcome | 1 |
| 1 | predictor | 4 |
| 2 | group | 1 |
| 3 | sequence | 1 |
Operations:
Impute with mean for ['x1', 'x2'] [trained]
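The selector has now been resolved to the concrete columns x1 and x2, and the step is marked [trained]. This indicates that the step has been fitted on the data passed to prep, so that anything it needs to estimate from training data is available and the step can be applied to ("baked" on) other datasets, such as a test set.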
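The split between prepping and baking resembles the familiar fit and transform pattern. The sketch below shows that pattern in plain Python with two made-up steps that each learn one number during prep and reuse it during bake. CenterStep, ScaleStep and ToyPipeline are hypothetical names for illustration, and neither step is part of this notebook's recipe.

```python
class CenterStep:
    """Hypothetical step: learn the mean during prep, subtract it during bake."""

    trained = False

    def fit(self, values):
        self.mean = sum(values) / len(values)
        self.trained = True

    def transform(self, values):
        return [v - self.mean for v in values]


class ScaleStep:
    """Hypothetical step: learn the largest absolute value, divide by it during bake."""

    trained = False

    def fit(self, values):
        self.scale = max(abs(v) for v in values)
        self.trained = True

    def transform(self, values):
        return [v / self.scale for v in values]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def prep(self, values):
        """Fit each step in order; later steps see the output of earlier ones."""
        for step in self.steps:
            step.fit(values)
            values = step.transform(values)
        return values

    def bake(self, values):
        """Apply the already fitted steps in order, without fitting again."""
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = ToyPipeline([CenterStep(), ScaleStep()])
pipeline.prep([1.0, 2.0, 3.0, 6.0])  # learns mean 3.0, then scale 3.0
print([step.trained for step in pipeline.steps])  # [True, True]
print(pipeline.bake([3.0, 9.0]))  # [0.0, 2.0]
```

Because the steps run in order, ScaleStep learns its scale from the centered values, and baking new data reuses the mean and scale learned during prep.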
Baking the recipe
We now bake the recipe. This applies the pre-computed (frozen) transformations we specified in the recipe to the data, in order and without refitting. Because the recipe has already been prepped, this can be done for any DataFrame with the same variables. The result is a new DataFrame with the preprocessed data.
baked_df = rec.bake(data=df)
display(baked_df)
| id | time | y | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | f64 | i64 | cat | cat |
| 1 | 0001-01-01 00:00:00 | 0.496714 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 01:00:00 | -0.138264 | 5.101691 | 1 | "b" | "y" |
| 1 | 0001-01-01 02:00:00 | 0.647689 | 5.101691 | 0 | "c" | "y" |
| 1 | 0001-01-01 03:00:00 | 1.52303 | 0.433599 | 0 | "a" | "x" |
| 1 | 0001-01-01 04:00:00 | -0.234153 | 5.101691 | 0 | "c" | "y" |
| 1 | 0001-01-01 05:00:00 | -0.234137 | 7.188562 | 0 | "b" | "y" |
| 2 | 0001-01-01 00:00:00 | 1.579213 | 4.935844 | 0 | "c" | "x" |
| 2 | 0001-01-01 01:00:00 | 0.767435 | 4.444735 | 0 | "a" | "x" |
| 2 | 0001-01-01 02:00:00 | -0.469474 | 5.45988 | 0 | "b" | "y" |
| 2 | 0001-01-01 03:00:00 | 0.54256 | 2.938481 | 1 | "c" | "x" |
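Let's compare the baked DataFrame to the original data. The rows where x1 was missing (rows 1, 2, 4 and 7) now hold 5.101691 for id 1 and 4.444735 for id 2, which is the mean of the observed x1 values within each group rather than the mean of the whole column. All other values are unchanged.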
display(df)
| id | time | y | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | f64 | i64 | cat | cat |
| 1 | 0001-01-01 00:00:00 | 0.496714 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 01:00:00 | -0.138264 | null | 1 | "b" | "y" |
| 1 | 0001-01-01 02:00:00 | 0.647689 | null | 0 | "c" | "y" |
| 1 | 0001-01-01 03:00:00 | 1.52303 | 0.433599 | 0 | "a" | "x" |
| 1 | 0001-01-01 04:00:00 | -0.234153 | null | 0 | "c" | "y" |
| 1 | 0001-01-01 05:00:00 | -0.234137 | 7.188562 | 0 | "b" | "y" |
| 2 | 0001-01-01 00:00:00 | 1.579213 | 4.935844 | 0 | "c" | "x" |
| 2 | 0001-01-01 01:00:00 | 0.767435 | null | 0 | "a" | "x" |
| 2 | 0001-01-01 02:00:00 | -0.469474 | 5.45988 | 0 | "b" | "y" |
| 2 | 0001-01-01 03:00:00 | 0.54256 | 2.938481 | 1 | "c" | "x" |
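The comparison can also be made programmatically. The sketch below lists the rows in which the x1 column differs between the two tables and checks each new value against the mean of the observed x1 values in the same id group. It is plain Python with values copied from the tables (rounded to six decimals); changed_rows and group_mean are hypothetical helpers, not ReciPies functions.

```python
ids = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
original = [7.682912, None, None, 0.433599, None, 7.188562, 4.935844, None, 5.45988, 2.938481]
baked = [7.682912, 5.101691, 5.101691, 0.433599, 5.101691, 7.188562, 4.935844, 4.444735, 5.45988, 2.938481]


def changed_rows(before, after):
    """Return the row indices where the two columns hold different values."""
    return [row for row, (old, new) in enumerate(zip(before, after)) if old != new]


def group_mean(ids, values, group):
    """Mean of the observed (non-None) values that belong to `group`."""
    observed = [v for i, v in zip(ids, values) if i == group and v is not None]
    return sum(observed) / len(observed)


rows = changed_rows(original, baked)
print(rows)  # [1, 2, 4, 7]

# Each changed row was missing before and now holds its group's mean (up to display rounding).
for row in rows:
    expected = group_mean(ids, original, ids[row])
    print(row, ids[row], f"{expected:.6f}", abs(baked[row] - expected) < 1e-6)
# 1 1 5.101691 True
# 2 1 5.101691 True
# 4 1 5.101691 True
# 7 2 4.444735 True
```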
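Let's bake the recipe with a different DataFrame that has the same schema but more missing values in the "x1" column. Here we set rows 1 to 8 to missing, so that each group has only one observed x1 value left (row 0 for id 1 and row 9 for id 2). The recipe should fill in the missing values without being prepped again: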
# Start from a copy of the original DataFrame
df2 = df.clone()
# Set x1 to missing in rows 1 to 8
df2[list(range(1, 9)), "x1"] = None
baked_df2 = rec.bake(data=df2)
display(baked_df2)
| id | time | y | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|
| i64 | datetime[μs] | f64 | f64 | i64 | cat | cat |
| 1 | 0001-01-01 00:00:00 | 0.496714 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 01:00:00 | -0.138264 | 7.682912 | 1 | "b" | "y" |
| 1 | 0001-01-01 02:00:00 | 0.647689 | 7.682912 | 0 | "c" | "y" |
| 1 | 0001-01-01 03:00:00 | 1.52303 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 04:00:00 | -0.234153 | 7.682912 | 0 | "c" | "y" |
| 1 | 0001-01-01 05:00:00 | -0.234137 | 7.682912 | 0 | "b" | "y" |
| 2 | 0001-01-01 00:00:00 | 1.579213 | 2.938481 | 0 | "c" | "x" |
| 2 | 0001-01-01 01:00:00 | 0.767435 | 2.938481 | 0 | "a" | "x" |
| 2 | 0001-01-01 02:00:00 | -0.469474 | 2.938481 | 0 | "b" | "y" |
| 2 | 0001-01-01 03:00:00 | 0.54256 | 2.938481 | 1 | "c" | "x" |
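In the table above, the missing x1 values are filled with 7.682912 for id 1 and 2.938481 for id 2. These are the only x1 values still observed in each group of df2, not the 5.101691 and 4.444735 seen when baking df, so for this step the mean appears to be computed from the DataFrame that is being baked. Baking new data with an already prepped recipe is useful when we want to apply the same preprocessing steps to a test set, for example to prevent data leakage, but it is worth checking for each step which data its statistics come from.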
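To trace where the fill values in the last baked table come from, the sketch below computes two candidates per group in plain Python: the group mean of x1 in the original df and the group mean of x1 in df2. The numbers are copied from the tables above, and group_means is a hypothetical helper, not the ReciPies implementation.

```python
def group_means(ids, values):
    """Mean of the observed (non-None) values per group."""
    groups = {}
    for group, value in zip(ids, values):
        if value is not None:
            groups.setdefault(group, []).append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


ids = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
x1_df = [7.682912, None, None, 0.433599, None, 7.188562, 4.935844, None, 5.45988, 2.938481]

# df2: rows 1 to 8 are set to missing, so one observed value is left per group.
x1_df2 = [None if 1 <= row <= 8 else value for row, value in enumerate(x1_df)]

means_df = group_means(ids, x1_df)
means_df2 = group_means(ids, x1_df2)

print({g: f"{m:.6f}" for g, m in means_df.items()})   # {1: '5.101691', 2: '4.444735'}
print({g: f"{m:.6f}" for g, m in means_df2.items()})  # {1: '7.682912', 2: '2.938481'}
```

The last baked table shows 7.682912 and 2.938481, which matches the second line, that is, the group means of df2 itself.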
This concludes the introduction to the basic concepts of ReciPies! You can now create ingredients and recipes, add steps, and prep and bake your data. Explore more advanced features and steps in the documentation!