Using ReciPies with Pandas and Polars
This notebook demonstrates how ReciPies works with both Pandas and Polars backends, and how to convert between them during a preprocessing flow.
We'll:
- Build a tiny synthetic dataset
- Run the same Recipe with Pandas and with Polars
- Convert between backends while keeping roles and steps consistent
In [1]:
import numpy as np
import pandas as pd
import polars as pl
from datetime import datetime, timedelta
from IPython.display import display
from recipies import Ingredients, Recipe
from recipies.selector import all_predictors
from recipies.step import StepImputeFill, StepHistorical, Accumulator
from recipies.constants import Backend
rng = np.random.default_rng(42)
# Build a tiny panel/time series dataset
n = 8
ids = np.array([1] * (n // 2) + [2] * (n // 2))
base_time = datetime(2020, 1, 1, 0, 0, 0)
times = np.array([base_time + timedelta(hours=i) for i in range(n)])
x1 = rng.normal(10, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = rng.normal(0, 1, size=n)
# Inject some missing values
x1[[1, 5]] = np.nan
pdf = pd.DataFrame(
    {
        "id": ids,
        "time": times,
        "x1": x1,
        "x2": x2,
        "y": y,
    }
)
pldf = pl.from_pandas(pdf)
Use Pandas backend
You can pass a Pandas DataFrame directly to Recipe. Roles are assigned via constructor arguments.
As a simple example, we forward-fill missing values and then compute a cumulative historical mean: for each row, the mean of all values seen so far within its group.
In [2]:
rec_pd = Recipe(
    pdf,
    outcomes=["y"],
    predictors=["x1", "x2"],
    groups=["id"],
    sequences=["time"],
)
rec_pd.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec_pd.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
train_pd = rec_pd.prep()
train_pd.head()
Out[2]:
| | id | time | y | x1 | x2 | x1mean_hist | x2mean_hist |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020-01-01 00:00:00 | 0.066031 | 10.609434 | 1 | 10.609434 | 1.000000 |
| 1 | 1 | 2020-01-01 01:00:00 | 1.127241 | 10.609434 | 0 | 10.609434 | 0.500000 |
| 2 | 1 | 2020-01-01 02:00:00 | 0.467509 | 11.500902 | 1 | 10.906590 | 0.666667 |
| 3 | 1 | 2020-01-01 03:00:00 | -0.859292 | 11.881129 | 0 | 11.150225 | 0.500000 |
| 4 | 2 | 2020-01-01 04:00:00 | 0.368751 | 6.097930 | 1 | 6.097930 | 1.000000 |
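The pattern in the output above (a filled value repeating the previous one, and a mean that restarts when the id changes) can be reproduced in plain pandas. The block below is a minimal sketch of what the two steps appear to compute, assuming the forward fill and the cumulative mean are both applied within each id group. It uses a small hand-written frame and is not ReciPies' own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, one missing value in each.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "x1": [10.0, np.nan, 12.0, 6.0, np.nan, 8.0],
    }
)

# Forward fill within each group (assumed analogue of strategy="forward").
filled = df.groupby("id")["x1"].ffill()

# Expanding (cumulative) mean within each group (assumed analogue of Accumulator.MEAN).
hist_mean = filled.groupby(df["id"]).transform(lambda s: s.expanding().mean())

result = df.assign(x1=filled, x1mean_hist=hist_mean)
print(result)
```

Because both operations are grouped by id, the first row of group 2 has a historical mean equal to its own value, which matches the row with id 2 in the table above.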
Switch to Polars
There are two common ways:
- Start with a Polars DataFrame and create your Recipe normally.
- Convert a Pandas DataFrame to Polars on the fly by constructing Ingredients with backend=Backend.POLARS.
Both preserve roles and steps.
In [3]:
# Option 1: Start directly with Polars DataFrame
rec_pl = Recipe(
    pldf,
    outcomes=["y"],
    predictors=["x1", "x2"],
    groups=["id"],
    sequences=["time"],
)
rec_pl.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec_pl.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
train_pl = rec_pl.prep()
train_pl.head()
Out[3]:
shape: (5, 7)
| id | time | x1 | x2 | y | x1mean_hist | x2mean_hist |
|---|---|---|---|---|---|---|
| i64 | datetime[ns] | f64 | i64 | f64 | f64 | f64 |
| 1 | 2020-01-01 00:00:00 | 10.609434 | 1 | 0.066031 | 10.609434 | 1.0 |
| 1 | 2020-01-01 01:00:00 | 10.609434 | 0 | 1.127241 | 10.609434 | 0.5 |
| 1 | 2020-01-01 02:00:00 | 11.500902 | 1 | 0.467509 | 10.90659 | 0.666667 |
| 1 | 2020-01-01 03:00:00 | 11.881129 | 0 | -0.859292 | 11.150225 | 0.5 |
| 2 | 2020-01-01 04:00:00 | 6.09793 | 1 | 0.368751 | 6.09793 | 1.0 |
In [4]:
# Option 2: Convert a Pandas DataFrame into Polars via Ingredients
# (This preserves roles and allows you to keep working in Polars.)
ing_pd_to_pl = Ingredients(pdf, backend=Backend.POLARS)
rec_conv = Recipe(
    ing_pd_to_pl,
    outcomes=["y"],
    predictors=["x1", "x2"],
    groups=["id"],
    sequences=["time"],
)
rec_conv.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec_conv.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
train_conv = rec_conv.prep()
train_conv.head()
Out[4]:
shape: (5, 7)
| id | time | x1 | x2 | y | x1mean_hist | x2mean_hist |
|---|---|---|---|---|---|---|
| i64 | datetime[ns] | f64 | i64 | f64 | f64 | f64 |
| 1 | 2020-01-01 00:00:00 | 10.609434 | 1 | 0.066031 | 10.609434 | 1.0 |
| 1 | 2020-01-01 01:00:00 | 10.609434 | 0 | 1.127241 | 10.609434 | 0.5 |
| 1 | 2020-01-01 02:00:00 | 11.500902 | 1 | 0.467509 | 10.90659 | 0.666667 |
| 1 | 2020-01-01 03:00:00 | 11.881129 | 0 | -0.859292 | 11.150225 | 0.5 |
| 2 | 2020-01-01 04:00:00 | 6.09793 | 1 | 0.368751 | 6.09793 | 1.0 |
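The Pandas and Polars outputs above hold the same values but list the columns in a different order (y comes before x1 in the Pandas result and after x2 in the Polars results). To confirm that two backends agree, it can help to compare frames by column name rather than by position. The helper below is a hypothetical sketch (assert_same_content is not part of ReciPies) that uses only pandas.

```python
import pandas as pd


def assert_same_content(left: pd.DataFrame, right: pd.DataFrame) -> None:
    """Compare two frames by column name, ignoring column order and dtypes."""
    cols = sorted(left.columns)
    assert cols == sorted(right.columns), "column names differ"
    pd.testing.assert_frame_equal(
        left[cols].reset_index(drop=True),
        right[cols].reset_index(drop=True),
        check_dtype=False,
    )


# Toy frames with identical content but different column order.
a = pd.DataFrame({"id": [1, 2], "y": [0.1, 0.2], "x1": [10.0, 6.0]})
b = pd.DataFrame({"id": [1, 2], "x1": [10.0, 6.0], "y": [0.1, 0.2]})
assert_same_content(a, b)  # passes silently

# With the notebook's objects, one possible call (assuming prep() returned
# DataFrames and pyarrow is installed for the conversion) would be:
# assert_same_content(train_pd, train_pl.to_pandas())
```

Ignoring dtypes is a deliberate choice here, since the two backends may represent the same values with different underlying types.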
Converting outputs explicitly
If you need to explicitly convert between backends while keeping the data and roles, work with Ingredients and to_df:
- Ingredients.to_df(output_format=Backend.PANDAS)
- Ingredients.to_df(output_format=Backend.POLARS)
In [5]:
# Start with an Ingredients object (Pandas), then convert to Polars and back
ing_pd = Ingredients(pdf, backend=Backend.PANDAS)
# Convert to Polars DataFrame first
pl_df_from_pd = ing_pd.to_df(output_format=Backend.POLARS)
# Create a new Ingredients from the converted Polars DataFrame
ing_pl = Ingredients(pl_df_from_pd, backend=Backend.POLARS)
# Convert back to Pandas
pd_df_from_pl = ing_pl.to_df(output_format=Backend.PANDAS)
display(pl_df_from_pd.head())
print(type(pl_df_from_pd))
print(type(pd_df_from_pl))
shape: (5, 5)
| id | time | x1 | x2 | y |
|---|---|---|---|---|
| i64 | datetime[ns] | f64 | i64 | f64 |
| 1 | 2020-01-01 00:00:00 | 10.609434 | 1 | 0.066031 |
| 1 | 2020-01-01 01:00:00 | null | 0 | 1.127241 |
| 1 | 2020-01-01 02:00:00 | 11.500902 | 1 | 0.467509 |
| 1 | 2020-01-01 03:00:00 | 11.881129 | 0 | -0.859292 |
| 2 | 2020-01-01 04:00:00 | 6.09793 | 1 | 0.368751 |
<class 'polars.dataframe.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>