# SCDL Speedtest

## Overview

SCDL Speedtest is a single-script benchmark that measures the performance of the BioNeMo Framework's Single Cell
Data Loader (SCDL) on your hardware, so you can confirm it is performing as expected. It is designed to be easy to run,
to work with your own AnnData files, and to report a small set of metrics representative of real performance
for applications using SCDL in a PyTorch DataLoader.

## Quick Start

### 0. Use a virtual environment

```bash
python -m venv bionemo_scdl_speedtest
source bionemo_scdl_speedtest/bin/activate
```

### 1. Install Dependencies

```bash
pip install torch pandas psutil tqdm bionemo-scdl
```

**For baseline comparison** (optional):

```bash
pip install anndata scipy
```

**Note**: If you have the BioNeMo source code, you can install bionemo-scdl locally:

```bash
cd /path/to/bionemo-framework
pip install -e sub-packages/bionemo-scdl/
```

### 2. Run Basic Benchmark

```bash
# Download the example dataset and run a quick benchmark / smoke test.
python scdl_speedtest.py

# Benchmark your own AnnData dataset
python scdl_speedtest.py -i your_dataset.h5ad

# Export a detailed CSV file
python scdl_speedtest.py --csv
```

### 3. Deactivate the virtual environment to return to your original shell state

```bash
deactivate
```

## More Usage Examples

```bash
# Basic speedtest, using an automatically downloaded example dataset
python scdl_speedtest.py

# Test SCDL's expected performance on a specific AnnData dataset using sequential sampling
python scdl_speedtest.py -i my_data.h5ad -s sequential

# Generate CSV files for analysis
python scdl_speedtest.py --csv -o report.txt

# Run the speedtest with a custom batch size and runtime limit
python scdl_speedtest.py --batch-size 64 --max-time 60

# Baseline comparison (SCDL vs AnnData in backed mode with lazy loading)
python scdl_speedtest.py --generate-baseline
```

## Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `-i, --input` | Dataset path (.h5ad, directory of .h5ad files, or SCDL directory) | Auto-download example |
| `-o, --output` | Save report to file | Print to screen (stdout) |
| `-s, --sampling-scheme` | Sampling method (shuffle/sequential/random) | shuffle |
| `--batch-size` | Batch size used in the PyTorch DataLoader | 32 |
| `--max-time` | Max benchmark runtime (seconds); a smaller dataset may finish its epochs sooner | 30 |
| `--warmup-time` | Warmup period (seconds). The dataloader runs for this long before measurement begins, to better reflect average expected performance. | 2 |
| `--csv` | Export detailed CSV files | False |
| `--generate-baseline` | Compare SCDL vs AnnData performance | False |
| `--num-epochs` | Number of epochs (passes through the training dataset) | 1 |

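Under the hood, options like `--warmup-time` and `--max-time` bound a timed iteration loop. The sketch below is an illustration only, not the script's actual implementation; `loader` stands in for a PyTorch DataLoader, and `len(batch)` for the per-batch sample count:

```python
import time

def benchmark(loader, warmup_time=2.0, max_time=30.0):
    """Iterate over `loader`, skip a warmup window, then measure throughput.

    Simplified sketch of a --warmup-time / --max-time style timed loop;
    scdl_speedtest.py's internals may differ.
    """
    start = time.perf_counter()
    samples = batches = 0
    t0 = None  # set when the warmup window ends
    for batch in loader:
        now = time.perf_counter()
        if t0 is None and now - start >= warmup_time:
            t0 = now  # warmup over: start measuring from here
        if t0 is not None:
            batches += 1
            samples += len(batch)
            if now - t0 >= max_time:
                break  # --max-time budget reached
    elapsed = (time.perf_counter() - t0) if t0 is not None else 0.0
    rate = samples / elapsed if elapsed > 0 else float("inf")
    return {"samples": samples, "batches": batches, "samples_per_sec": rate}
```

With the defaults, the first two seconds of iteration are discarded and measurement stops after 30 seconds or when the loader is exhausted, whichever comes first.
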
## Sample Output

```
============================================================
SCDL BENCHMARK REPORT
============================================================

Dataset: cellxgene_example_25k.h5ad
Method: SCDL
Sampling: shuffle
Epochs: 1

PERFORMANCE METRICS:
  Throughput: 20,098 samples/sec
  Instantiation: 0.066 seconds
  Avg Batch Time: 0.0016 seconds

MEMORY USAGE:
  Baseline: 446.6 MB
  Peak (Benchmark): 703.2 MB
  Dataset on Disk: 207.30 MB

DATA PROCESSED:
  Total Samples: 25,382 (25,382/epoch)
  Total Batches: 794 (794/epoch)
============================================================
SCDL version: 0.0.8
Anndata version: 0.11.4
```

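The counts in this report are internally consistent: with the default batch size of 32 and the final partial batch included, the batch total follows directly from the sample total:

```python
import math

total_samples = 25_382  # "Total Samples" from the report above
batch_size = 32         # default --batch-size
total_batches = math.ceil(total_samples / batch_size)
print(total_batches)  # 794, matching "Total Batches"
```
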
## Baseline Comparison Output

When using `--generate-baseline`, you get a comprehensive comparison:

```
================================================================================
SCDL vs ANNDATA COMPARISON REPORT
================================================================================

Dataset: cellxgene_example_25k.h5ad
Sampling: shuffle

THROUGHPUT COMPARISON:
  SCDL: 22,668 samples/sec
  AnnData: 2,529 samples/sec
  Performance: 8.96x speedup with SCDL

MEMORY COMPARISON:
  SCDL Peak: 703.5 MB
  AnnData Peak: 568.8 MB
  Memory Efficiency: SCDL uses 1.24x more memory

DISK USAGE COMPARISON:
  SCDL Size: 0.20 GB
  AnnData Size: 0.14 GB
  Storage Efficiency: SCDL uses 1.43x more disk space

LOADING TIME COMPARISON:
  SCDL Conversion: 0.00 seconds (cached)
  AnnData Load: 0.25 seconds

SUMMARY:
  SCDL provides 9.0x throughput improvement
  SCDL uses 1.2x more memory
  SCDL disk usage: 0.20 GB
  AnnData disk usage: 0.14 GB
  SCDL uses 1.4x more disk space
================================================================================
```

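
The ratios in the summary come straight from the raw numbers above; for example:

```python
scdl_tput, anndata_tput = 22_668, 2_529   # samples/sec from the report
scdl_mem, anndata_mem = 703.5, 568.8      # peak memory, MB
scdl_disk, anndata_disk = 0.20, 0.14      # disk usage, GB

print(f"{scdl_tput / anndata_tput:.2f}x speedup")    # 8.96x
print(f"{scdl_mem / anndata_mem:.2f}x more memory")  # 1.24x
print(f"{scdl_disk / anndata_disk:.2f}x more disk")  # 1.43x
```
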
## CSV Export

When using `--csv`, the script generates:

- **`summary.csv`**: Overall benchmark metrics and configuration
- **`detailed_breakdown.csv`**: Per-epoch performance breakdown

Perfect for analysis in Excel, Python, R, or other data tools.

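The CSVs can be consumed with nothing more than the standard library. The column names below are illustrative assumptions; check the header row of the files on your system, as the actual names may differ:

```python
import csv
import io

# Stand-in for open("summary.csv"); the column names here are hypothetical.
sample = io.StringIO(
    "method,sampling,throughput_samples_per_sec,peak_memory_mb\n"
    "SCDL,shuffle,20098,703.2\n"
)

for row in csv.DictReader(sample):
    print(row["method"], float(row["throughput_samples_per_sec"]))
```
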
## Troubleshooting

### Dataset Issues

- **H5AD files**: Converted automatically to SCDL format (conversion time is reported)
- **Large datasets**: Accessed via memory mapping for efficiency
- **Download failures**: Check your internet connection and try again
- **Conversion caching**: H5AD files are converted once, then reused on subsequent runs

### Performance Tips

- **Faster throughput**: Use `--batch-size 64` or higher
- **Longer runs**: Increase `--max-time 120` for more stable measurements
- **Memory profiling**: Use `--csv` to get detailed memory usage per epoch
- **Clearing the page cache**: With lazy loading, data may persist in the page cache between runs; this especially affects SCDL. Between runs, the page cache can be cleared with:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

## Example Datasets

The script automatically downloads a 25K-cell example dataset from CellxGene. For other datasets:

- **10X Genomics**: Convert .h5 files to .h5ad using `scanpy.read_10x_h5()`
- **AnnData files**: Use directly with `-i dataset.h5ad`
- **Large datasets**: Pre-convert to SCDL format for faster loading

### Tahoe 100M

The Tahoe 100M dataset (described in [Zhang _et al_. 2025](https://doi.org/10.1101/2025.02.20.639398)) contains data
from 1,100 small-molecule perturbations across 50,000 cancer cell lines, totaling 100 million cells. This dataset was
used by [D'Ascenzo and Cultrera di Montesano 2025](https://github.com/Kidara/scDataset) to benchmark
dataloaders for single-cell data.

To download the full Tahoe 100M dataset in AnnData format (1 file per plate, 14 plates total):

**Warning**: Downloading will trigger egress charges, which can be significant.

**Note**: This dataset is 314 GB. The corresponding SCDL dataset after conversion is 1.1 TB,
so ensure that you have sufficient disk space if using the entire dataset.

**Note**: You will need the Google Cloud CLI (`gcloud`) installed to download this dataset.

```bash
gcloud storage cp -R gs://arc-ctc-tahoe100/2025-02-25/* .
```

This will download 19 files in total (14 from the full set + 5 related to the tutorial).

One way to process this data is to run `python scdl_speedtest.py --generate-baseline -i <path to h5ad>`,
which automatically converts the files to the SCDL format. Alternatively, with bionemo-scdl installed, run
`convert_h5ad_to_scdl --data-path <path to h5ad> --save-path <SCDL path>`. The full conversion takes multiple
hours; however, running a single plate of the data should give you a good idea of expected SCDL performance
on your system. The following command runs the speedtest on the first plate, as downloaded above:

```bash
python scdl_speedtest.py --generate-baseline -i tahoe-100m/h5ad/plate1_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
```
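
Before converting a plate, it helps to estimate the disk required. Assuming the full-dataset conversion ratio (1.1 TB of SCDL from 314 GB of h5ad) holds for a single plate, a rough back-of-the-envelope estimate:

```python
h5ad_total_gb = 314    # full Tahoe 100M download, h5ad
scdl_total_gb = 1_100  # ~1.1 TB after SCDL conversion
plates = 14

growth = scdl_total_gb / h5ad_total_gb    # ~3.5x size growth on conversion
per_plate_h5ad = h5ad_total_gb / plates   # ~22 GB of h5ad per plate
per_plate_scdl = per_plate_h5ad * growth  # ~79 GB of SCDL per plate
print(f"~{per_plate_scdl:.0f} GB of SCDL output per plate")
```

Actual plate sizes vary, so treat this as an order-of-magnitude guide only.
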

## Support

For support, please [file an issue in the BioNeMo Framework GitHub repository](https://github.com/NVIDIA/bionemo-framework/issues).
This code will be updated and refactored once a general benchmarking framework is in place.