Benchmarking

This page provides an overview of the benchmarking process implemented in the perform_benchmark.py script. The benchmark evaluates the performance of the preprocessing steps in the ReciPies library across different data sizes and two DataFrame backends (Polars and Pandas).


Overview

The benchmarking script measures the execution time and memory usage of preprocessing steps applied to synthetic ICU data. The results are aggregated and saved as a CSV file for further analysis.


How It Works

1. Synthetic Data Generation

The script uses the generate_icu_data function to create synthetic ICU data with the following configurable parameters (a call sketch follows this list):

  • Data sizes: Number of rows in the dataset.
  • Missingness thresholds: Proportion of missing values in the dataset.
  • Random seeds: For reproducibility.
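
A minimal call sketch is shown below. The keyword names n_rows, missingness, and seed are assumptions for illustration; the actual signature of generate_icu_data may differ.

    # Illustrative call only: the keyword names are assumptions, not the real signature.
    from perform_benchmark import generate_icu_data  # assumed to be importable from the script

    df = generate_icu_data(
        n_rows=10_000,    # data size: number of rows (assumed name)
        missingness=0.2,  # proportion of missing values (assumed name)
        seed=42,          # random seed for reproducibility (assumed name)
    )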

2. Backends

The benchmarking supports two backends:

  • Polars: A high-performance DataFrame library.
  • Pandas: The standard Python DataFrame library.

The script converts the synthetic data to the appropriate backend format before running the benchmarks.
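
Assuming the generator returns a Pandas DataFrame, the conversion uses the standard helpers shown below.

    import pandas as pd
    import polars as pl

    # Stand-in for the generator's output: synthetic data as a Pandas DataFrame.
    pdf = pd.DataFrame({"hr": [80.0, 82.0, None], "sbp": [120.0, None, 118.0]})

    # Polars backend: convert once before the benchmark runs.
    pldf = pl.from_pandas(pdf)

    # And back again, e.g. if a comparison on the Pandas backend is needed.
    pdf_roundtrip = pldf.to_pandas()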

3. Preprocessing Steps

The following preprocessing steps are benchmarked (the wrapped sklearn estimators are shown in a sketch after this list):

  • Imputation:
    • StepImputeFill: Forward-fill and zero-fill strategies.
    • StepSklearn: Using MissingIndicator from sklearn.
  • Scaling:
    • StepSklearn: Using scalers such as StandardScaler and MinMaxScaler.
  • Discretization:
    • StepSklearn: Using KBinsDiscretizer.
  • Historical Accumulation:
    • StepHistorical: Using accumulators like MIN, MAX, MEAN, and COUNT.
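
The sklearn estimators named above can be constructed as follows; how each one is wrapped into a StepSklearn instance is specific to ReciPies, and the n_bins value is illustrative.

    from sklearn.impute import MissingIndicator
    from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

    # Estimators that the benchmark wraps via StepSklearn.
    estimators = [
        MissingIndicator(features="all"),              # binary missingness indicators
        StandardScaler(),                              # zero mean, unit variance
        MinMaxScaler(),                                # rescale to [0, 1]
        KBinsDiscretizer(n_bins=5, encode="ordinal"),  # bin continuous values (n_bins illustrative)
    ]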

4. Metrics

For each preprocessing step, the script measures the following (a measurement sketch follows this list):

  • Execution Time: The time taken to apply the step.
  • Memory Usage: The peak memory usage during the step.
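
A simplified sketch of capturing both metrics for a single call, using time.perf_counter and memory_profiler.memory_usage; the actual run_step_benchmark function may measure them differently.

    import time

    from memory_profiler import memory_usage

    def measure(step_fn, *args, **kwargs):
        """Return (duration in seconds, peak memory in MiB) for one call of step_fn."""
        start = time.perf_counter()
        # memory_usage executes the callable and samples memory while it runs;
        # max_usage=True returns the peak observed value.
        peak_mem = memory_usage((step_fn, args, kwargs), max_usage=True)
        duration = time.perf_counter() - start
        return duration, peak_mem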

Running the Benchmark

Command

Run the script using the following command:

python perform_benchmark.py --data_sizes 1000 10000 --seeds 42 41

Arguments

  • --data_sizes: A list of data sizes to benchmark (e.g., 1000, 10000).
  • --seeds: A list of random seeds for reproducibility.

Example

python perform_benchmark.py --data_sizes 1000 10000 100000 --seeds 42 43
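
Internally, these flags can be parsed with argparse along the following lines; this is a sketch, and the default values shown are illustrative rather than the script's actual defaults.

    import argparse

    parser = argparse.ArgumentParser(description="Benchmark ReciPies preprocessing steps.")
    parser.add_argument("--data_sizes", nargs="+", type=int, default=[1000, 10000],
                        help="Dataset sizes (number of rows) to benchmark.")
    parser.add_argument("--seeds", nargs="+", type=int, default=[42, 43],
                        help="Random seeds for reproducible synthetic data.")
    args = parser.parse_args()
    print(args.data_sizes, args.seeds)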

Results

Output

The script generates a CSV file with the benchmarking results. The filename includes the data sizes, seeds, and a timestamp. For example:

results_datasizes_[1000, 10000]_seeds_[42, 43]_datetime_2025-11-19_12-00-00.csv
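
The pattern above can be assembled with a simple f-string; this sketch reproduces the naming scheme, not necessarily the script's exact code.

    from datetime import datetime

    data_sizes, seeds = [1000, 10000], [42, 43]
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"results_datasizes_{data_sizes}_seeds_{seeds}_datetime_{timestamp}.csv"
    # e.g. results_datasizes_[1000, 10000]_seeds_[42, 43]_datetime_2025-11-19_12-00-00.csv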

Aggregated Metrics

The results include the following aggregated columns (an aggregation sketch follows this list):

  • Mean Execution Time (duration_mean)
  • Standard Deviation of Execution Time (duration_std)
  • Mean Memory Usage (memory_mean)
  • Standard Deviation of Memory Usage (memory_std)
  • Speed Difference: Difference in execution time between Pandas and Polars.
  • Speedup: Ratio of Pandas execution time to Polars execution time.
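
The sketch below shows one way to compute these columns with pandas, assuming raw per-run records with columns step, backend, data_size, duration, and memory (the raw column names, and the Pandas-minus-Polars direction of the speed difference, are assumptions).

    import pandas as pd

    # Raw per-run measurements (column names are assumptions for this sketch).
    raw = pd.DataFrame({
        "step":      ["StepImputeFill"] * 4,
        "backend":   ["Pandas", "Pandas", "Polars", "Polars"],
        "data_size": [1000] * 4,
        "duration":  [0.051, 0.049, 0.034, 0.033],  # seconds
        "memory":    [10.1, 9.9, 8.1, 7.9],         # MiB
    })

    agg = (
        raw.groupby(["step", "backend", "data_size"])
           .agg(duration_mean=("duration", "mean"),
                duration_std=("duration", "std"),
                memory_mean=("memory", "mean"),
                memory_std=("memory", "std"))
           .reset_index()
    )

    # Backend comparison: speed difference (Pandas minus Polars) and speedup ratio.
    wide = agg.pivot(index=["step", "data_size"], columns="backend", values="duration_mean")
    wide["speed_difference"] = wide["Pandas"] - wide["Polars"]
    wide["speedup"] = wide["Pandas"] / wide["Polars"]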

Example Workflow

1. Dynamic Recipe Benchmark

The benchmark_dynamic_recipe function benchmarks a dynamic recipe with multiple preprocessing steps applied sequentially.

2. Step-Specific Benchmark

The benchmark_step function benchmarks a single preprocessing step.

3. Backend Comparison

The benchmark_backend function compares the performance of Polars and Pandas for the same preprocessing steps.
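
The hypothetical helper below illustrates the shape of such a comparison; it reuses the measure sketch from the Metrics section and is not the actual benchmark_backend implementation.

    def compare_backends(apply_step, pandas_df, polars_df):
        """apply_step(df) runs one preprocessing step on a DataFrame of either backend."""
        results = {}
        for backend, df in [("Pandas", pandas_df), ("Polars", polars_df)]:
            duration, peak_mem = measure(apply_step, df)  # measure() from the Metrics sketch
            results[backend] = {"duration": duration, "memory": peak_mem}
        results["speedup"] = results["Pandas"]["duration"] / results["Polars"]["duration"]
        return results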


Example Results

Sample Output

Data Size  Step                Backend  Duration Mean (ms)  Memory Mean (MB)  Speedup
1000       StepImputeFill      Pandas   50.0                10.0              1.5
1000       StepImputeFill      Polars   33.3                8.0               1.5
10000      StepHistoricalMean  Pandas   500.0               50.0              2.0
10000      StepHistoricalMean  Polars   250.0               30.0              2.0

Customization

Adding New Steps

To benchmark additional steps (see the example after this list):

  1. Define the step in the steps_complete or steps_missing list.
  2. Ensure the step is compatible with both Pandas and Polars.
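
For example, an additional sklearn-backed step could be registered as shown below; the (label, estimator) entry format is an assumption and may not match the actual structure of steps_complete.

    from sklearn.preprocessing import RobustScaler

    steps_complete = []  # defined in perform_benchmark.py; shown empty here for illustration
    # Hypothetical entry format: a (label, estimator) pair to be wrapped by StepSklearn.
    steps_complete.append(("RobustScaler", RobustScaler()))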

Modifying Metrics

To add or modify metrics:

  1. Update the run_step_benchmark function to include the new metric.
  2. Update the aggregation logic in the benchmark_backend function.

Notes

  • Dependencies: Ensure all required libraries (e.g., polars, pandas, scikit-learn, tqdm, memory_profiler) are installed.
  • Performance: Polars is generally faster and more memory-efficient than Pandas for large datasets.
  • Reproducibility: Use the --seeds argument to ensure consistent results across runs.

Next Steps

  • Analyze the results using the aggregate_results.ipynb notebook.
  • Use the benchmarking results to optimize preprocessing pipelines in the ReciPies library.