# Benchmarking
This page describes the benchmarking process implemented in the `perform_benchmark.py` script. The benchmark evaluates the performance of various preprocessing steps in the ReciPies library across different data sizes and backends (Polars and Pandas).
## Overview
The benchmarking script measures the execution time and memory usage of preprocessing steps applied to synthetic ICU data. The results are aggregated and saved as a CSV file for further analysis.
## How It Works

### 1. Synthetic Data Generation
The script uses the `generate_icu_data` function to create synthetic ICU data with the following configurable parameters (a usage sketch follows the list):
- Data sizes: Number of rows in the dataset.
- Missingness thresholds: Proportion of missing values in the dataset.
- Random seeds: For reproducibility.
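A minimal sketch of how these parameters might be swept. It assumes `generate_icu_data` is importable from the script and takes the three parameters in this order; the argument names are illustrative guesses, so check the script for the real signature.

```python
from itertools import product

# Assumption: generate_icu_data can be imported from the benchmark script and
# accepts (number of rows, missingness threshold, seed).
from perform_benchmark import generate_icu_data

data_sizes = [1000, 10000]       # number of rows per dataset
missingness_thresholds = [0.25]  # proportion of missing values
seeds = [42, 43]                 # random seeds for reproducibility

for n_rows, missingness, seed in product(data_sizes, missingness_thresholds, seeds):
    df = generate_icu_data(n_rows, missingness, seed)  # hypothetical call
```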
### 2. Backends
The benchmarking supports two backends:
- Polars: A high-performance DataFrame library.
- Pandas: The standard Python DataFrame library.
The script converts the synthetic data to the appropriate backend format before running the benchmarks.
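A minimal conversion sketch. `polars.from_pandas` is part of the public Polars API; the helper name `to_backend` is ours:

```python
import pandas as pd
import polars as pl

def to_backend(df: pd.DataFrame, backend: str):
    """Return the synthetic data in the requested backend's native format."""
    if backend == "polars":
        return pl.from_pandas(df)  # converts via Apache Arrow
    return df  # the data is generated as pandas, so no conversion is needed
```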
### 3. Preprocessing Steps

The following preprocessing steps are benchmarked (a definition sketch follows the list):
- Imputation:
  - `StepImputeFill`: forward-fill and zero-fill strategies.
  - `StepSklearn`: using `MissingIndicator` from `sklearn`.
- Scaling:
  - `StepSklearn`: using scalers such as `StandardScaler` and `MinMaxScaler`.
- Discretization:
  - `StepSklearn`: using `KBinsDiscretizer`.
- Historical Accumulation:
  - `StepHistorical`: using accumulators such as `MIN`, `MAX`, `MEAN`, and `COUNT`.
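A sketch of the kinds of step lists involved. The step class names match the ones above, but the import path and constructor arguments are assumptions about the ReciPies API; consult the library documentation for the exact signatures.

```python
from sklearn.impute import MissingIndicator
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Assumed import path and constructor arguments; check the ReciPies docs.
from recipys.step import Accumulator, StepHistorical, StepImputeFill, StepSklearn

steps_missing = [
    StepImputeFill(method="ffill"),   # forward-fill (argument name assumed)
    StepImputeFill(value=0),          # zero-fill (argument name assumed)
    StepSklearn(MissingIndicator()),  # flag missing entries
]

steps_complete = [
    StepSklearn(StandardScaler()),    # z-score scaling
    StepSklearn(MinMaxScaler()),      # min-max scaling
    StepSklearn(KBinsDiscretizer(encode="ordinal")),  # binning
    StepHistorical(fun=Accumulator.MEAN),             # running mean
]
```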
### 4. Metrics

For each preprocessing step, the script measures the following (a measurement sketch follows the list):
- Execution Time: The time taken to apply the step.
- Memory Usage: The peak memory usage during the step.
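A sketch of how a single measurement could be taken with `time.perf_counter` and `memory_profiler.memory_usage` (the script lists `memory_profiler` as a dependency, but this exact bookkeeping is only an assumption about what `run_step_benchmark` does internally):

```python
import time
from memory_profiler import memory_usage

def measure(apply_step, data):
    """apply_step is any callable that runs one preprocessing step on data."""
    start = time.perf_counter()
    # max_usage=True makes memory_usage return the peak memory (in MiB)
    # observed while the callable runs.
    peak_mem = memory_usage((apply_step, (data,)), max_usage=True)
    duration = time.perf_counter() - start
    return duration, peak_mem
```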
## Running the Benchmark

### Command

Run the script using the following command:
`python perform_benchmark.py --data_sizes 1000 10000 --seeds 42 41`
### Arguments

- `--data_sizes`: a list of data sizes to benchmark (e.g., `1000`, `10000`).
- `--seeds`: a list of random seeds for reproducibility.
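A minimal parsing sketch, assuming a plain `argparse` CLI (the defaults shown are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark ReciPies preprocessing steps.")
parser.add_argument("--data_sizes", type=int, nargs="+", default=[1000, 10000],
                    help="Dataset sizes (number of rows) to benchmark.")
parser.add_argument("--seeds", type=int, nargs="+", default=[42, 43],
                    help="Random seeds for reproducible data generation.")
args = parser.parse_args()
```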
### Example

`python perform_benchmark.py --data_sizes 1000 10000 100000 --seeds 42 43`
## Results

### Output
The script generates a CSV file with the benchmarking results. The filename includes the data sizes, seeds, and a timestamp, for example:

`results_datasizes_[1000, 10000]_seeds_[42, 43]_datetime_2025-11-19_12-00-00.csv`
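The timestamp format suggests a construction like the following sketch; the exact format string lives in the script.

```python
from datetime import datetime

data_sizes, seeds = [1000, 10000], [42, 43]
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
filename = f"results_datasizes_{data_sizes}_seeds_{seeds}_datetime_{timestamp}.csv"
```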
### Aggregated Metrics

The results include the following columns (an aggregation sketch follows the list):

- Mean execution time (`duration_mean`)
- Standard deviation of execution time (`duration_std`)
- Mean memory usage (`memory_mean`)
- Standard deviation of memory usage (`memory_std`)
- Speed difference: the difference in execution time between Pandas and Polars.
- Speedup: the ratio of Pandas execution time to Polars execution time.
## Example Workflow

### 1. Dynamic Recipe Benchmark

The `benchmark_dynamic_recipe` function benchmarks a dynamic recipe in which multiple preprocessing steps are applied sequentially.

### 2. Step-Specific Benchmark

The `benchmark_step` function benchmarks a single preprocessing step in isolation.

### 3. Backend Comparison

The `benchmark_backend` function compares the performance of Polars and Pandas on the same preprocessing steps, as in the sketch below.
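A simplified sketch of what such a comparison loop looks like; `compare_backends` and `apply_step` are hypothetical names standing in for the script's internals:

```python
import pandas as pd
import polars as pl

def compare_backends(df: pd.DataFrame, steps, apply_step):
    """apply_step(step, data) -> (duration, memory); see the measurement sketch."""
    records = []
    for backend in ("pandas", "polars"):
        # Run every step on the same data in each backend's native format.
        data = pl.from_pandas(df) if backend == "polars" else df
        for step in steps:
            duration, memory = apply_step(step, data)
            records.append({
                "step": type(step).__name__,
                "backend": backend,
                "duration": duration,
                "memory": memory,
            })
    return records
```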
## Example Results

### Sample Output
| Data Size | Step | Backend | Duration Mean (ms) | Memory Mean (MB) | Speedup |
|---|---|---|---|---|---|
| 1000 | StepImputeFill | Pandas | 50.0 | 10.0 | 1.5 |
| 1000 | StepImputeFill | Polars | 33.3 | 8.0 | 1.5 |
| 10000 | StepHistoricalMean | Pandas | 500.0 | 50.0 | 2.0 |
| 10000 | StepHistoricalMean | Polars | 250.0 | 30.0 | 2.0 |
## Customization

### Adding New Steps

To benchmark additional steps (see the sketch below):

- Define the step in the `steps_complete` or `steps_missing` list.
- Ensure the step is compatible with both Pandas and Polars.
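Continuing the step-definition sketch above (so `StepSklearn` and `steps_complete` are in scope), adding a step could look like this; `RobustScaler` is just an arbitrary example:

```python
from sklearn.preprocessing import RobustScaler

# Any transformer that StepSklearn can wrap and that behaves identically on
# both backends is a candidate.
steps_complete.append(StepSklearn(RobustScaler()))
```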
### Modifying Metrics

To add or modify metrics:

- Update the `run_step_benchmark` function to include the new metric.
- Update the aggregation logic in the `benchmark_backend` function.
## Notes

- Dependencies: ensure all required libraries (e.g., `polars`, `pandas`, `sklearn`, `tqdm`, `memory_profiler`) are installed.
- Performance: Polars is generally faster and more memory-efficient than Pandas for large datasets.
- Reproducibility: use the `--seeds` argument to ensure consistent results across runs.
## Next Steps

- Analyze the results using the `aggregate_results.ipynb` notebook.
- Use the benchmarking results to optimize preprocessing pipelines in the ReciPies library.