Commit 0c27085

Add README and SpeedTest. (#1005)

Authored by edawson, polinabinder1, and skothenhill-nv
### Description

Adds the speedtest script for SCDL, a single-point-of-entry script for estimating expected performance of the single cell data loader on local hardware.

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels: SKIP the CI.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

```bash
python scdl_speedtest.py
```

That's it. Users can also bring their own AnnData files:

```bash
python scdl_speedtest.py -i mydata.h5ad
```

### Pre-submit Checklist

<!--- Ensure all items are completed before submitting -->

- [X] I have tested these changes locally
- [X] I have updated the documentation accordingly
- [X] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---------

Signed-off-by: Eric T. Dawson <edawson@nvidia.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Co-authored-by: Polina Binder <pbinder@nvidia.com>
Co-authored-by: Steven Kothen-Hill <148821680+skothenhill-nv@users.noreply.github.com>
1 parent 1a1edf0 commit 0c27085

File tree

4 files changed: +1610 −1 lines changed

3rdparty/NeMo

Submodule NeMo updated from 91470a0 to 164d12b
Lines changed: 228 additions & 0 deletions
# SCDL Speedtest

## Overview

The SCDL speedtest is a single-script benchmark that measures the performance of BioNeMo Framework's Single Cell Data Loader (SCDL) on your hardware, so you can confirm it is performing as expected. It is designed to be easy to run, to work with your own AnnData files, and to produce a simple set of reported metrics representative of real performance for applications that use SCDL in a PyTorch DataLoader.
## Quick Start

### 0. Use a virtual environment

```bash
python -m venv bionemo_scdl_speedtest
source bionemo_scdl_speedtest/bin/activate
```
### 1. Install Dependencies

```bash
pip install torch pandas psutil tqdm bionemo-scdl
```

**For baseline comparison** (optional):

```bash
pip install anndata scipy
```

**Note**: If you have the BioNeMo source code, you can install bionemo-scdl locally:

```bash
cd /path/to/bionemo-framework
pip install -e sub-packages/bionemo-scdl/
```
### 2. Run Basic Benchmark

```bash
# Download an example dataset and run a quick benchmark / smoke test.
python scdl_speedtest.py

# Benchmark your own AnnData dataset.
python scdl_speedtest.py -i your_dataset.h5ad

# Export a detailed CSV file.
python scdl_speedtest.py --csv
```
### 3. Deactivate your virtual environment

When you are done, deactivate the virtual environment to return to your original shell state:

```bash
deactivate
```
## More Usage Examples

```bash
# Basic speedtest, using an automatically downloaded example dataset.
python scdl_speedtest.py

# Test SCDL's expected performance on a specific AnnData dataset using sequential sampling.
python scdl_speedtest.py -i my_data.h5ad -s sequential

# Generate CSV files for analysis and save the report to a file.
python scdl_speedtest.py --csv -o report.txt

# Run the speedtest with a custom batch size and runtime limit.
python scdl_speedtest.py --batch-size 64 --max-time 60

# Baseline comparison (SCDL vs. AnnData in backed mode with lazy loading).
python scdl_speedtest.py --generate-baseline
```
## Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `-i, --input` | Dataset path (.h5ad file, directory of .h5ad files, or SCDL directory) | Auto-download example |
| `-o, --output` | Save the report to a file | Print to screen (stdout) |
| `-s, --sampling-scheme` | Sampling method (shuffle/sequential/random) | shuffle |
| `--batch-size` | Batch size used in the PyTorch DataLoader | 32 |
| `--max-time` | Max benchmark runtime (seconds); a run may finish sooner on smaller datasets | 30 |
| `--warmup-time` | Warmup period (seconds). The dataloader runs before measurement begins to better reflect average expected performance. | 2 |
| `--csv` | Export detailed CSV files | False |
| `--generate-baseline` | Compare SCDL vs. AnnData performance | False |
| `--num-epochs` | Number of epochs (passes through the training dataset) | 1 |
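For readers who want to script around the tool, the option surface above can be mirrored with `argparse`. This is a hypothetical reconstruction inferred from the table, not the script's actual parser; flag names and defaults are taken from the table, everything else is illustrative.

```python
import argparse

# Hypothetical reconstruction of the CLI surface described in the table above;
# the real scdl_speedtest.py implementation may differ.
parser = argparse.ArgumentParser(description="SCDL speedtest (sketch)")
parser.add_argument("-i", "--input", default=None,
                    help="Dataset path; an example dataset is auto-downloaded if omitted")
parser.add_argument("-o", "--output", default=None,
                    help="Save the report to a file (default: stdout)")
parser.add_argument("-s", "--sampling-scheme", default="shuffle",
                    choices=["shuffle", "sequential", "random"])
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--max-time", type=float, default=30)
parser.add_argument("--warmup-time", type=float, default=2)
parser.add_argument("--csv", action="store_true")
parser.add_argument("--generate-baseline", action="store_true")
parser.add_argument("--num-epochs", type=int, default=1)

# Example invocation, mirroring the usage examples earlier in this README.
args = parser.parse_args(["-i", "my_data.h5ad", "--batch-size", "64"])
print(args.input, args.batch_size, args.sampling_scheme)
```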
## Sample Output

```
============================================================
SCDL BENCHMARK REPORT
============================================================

Dataset: cellxgene_example_25k.h5ad
Method: SCDL
Sampling: shuffle
Epochs: 1

PERFORMANCE METRICS:
Throughput: 20,098 samples/sec
Instantiation: 0.066 seconds
Avg Batch Time: 0.0016 seconds

MEMORY USAGE:
Baseline: 446.6 MB
Peak (Benchmark): 703.2 MB
Dataset on Disk: 207.30 MB

DATA PROCESSED:
Total Samples: 25,382 (25,382/epoch)
Total Batches: 794 (794/epoch)
============================================================
SCDL version: 0.0.8
Anndata version: 0.11.4
```
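The reported metrics are internally consistent: throughput is roughly batch size divided by average batch time, and the batch count follows from the dataset size. A quick sanity check, using values from the sample report:

```python
import math

batch_size = 32          # default --batch-size
avg_batch_time = 0.0016  # seconds, from the sample report
total_samples = 25_382   # from the sample report

# Throughput is approximately batch size / average batch time.
approx_throughput = batch_size / avg_batch_time
print(f"{approx_throughput:.0f} samples/sec")  # 20000, close to the reported 20,098

# Total batches per epoch is ceil(samples / batch size).
total_batches = math.ceil(total_samples / batch_size)
print(total_batches)  # 794, matching the report
```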
## Baseline Comparison Output

When using `--generate-baseline`, you get a comprehensive comparison:

```
================================================================================
SCDL vs ANNDATA COMPARISON REPORT
================================================================================

Dataset: cellxgene_example_25k.h5ad
Sampling: shuffle

THROUGHPUT COMPARISON:
SCDL: 22,668 samples/sec
AnnData: 2,529 samples/sec
Performance: 8.96x speedup with SCDL

MEMORY COMPARISON:
SCDL Peak: 703.5 MB
AnnData Peak: 568.8 MB
Memory Efficiency: SCDL uses 1.24x more memory

DISK USAGE COMPARISON:
SCDL Size: 0.20 GB
AnnData Size: 0.14 GB
Storage Efficiency: SCDL uses 1.43x more disk space

LOADING TIME COMPARISON:
SCDL Conversion: 0.00 seconds (cached)
AnnData Load: 0.25 seconds

SUMMARY:
SCDL provides 9.0x throughput improvement
SCDL uses 1.2x more memory
SCDL disk usage: 0.20 GB
AnnData disk usage: 0.14 GB
SCDL uses 1.4x more disk space
================================================================================
```
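The ratios in the summary are derived directly from the detailed figures. Recomputing them with the numbers from the sample comparison above:

```python
# Values taken from the sample comparison report above.
scdl_throughput, anndata_throughput = 22_668, 2_529
scdl_peak_mb, anndata_peak_mb = 703.5, 568.8
scdl_disk_gb, anndata_disk_gb = 0.20, 0.14

print(f"{scdl_throughput / anndata_throughput:.2f}x speedup")    # 8.96x
print(f"{scdl_peak_mb / anndata_peak_mb:.2f}x more memory")      # 1.24x
print(f"{scdl_disk_gb / anndata_disk_gb:.2f}x more disk space")  # 1.43x
```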
## CSV Export

When using `--csv`, the script generates:

- **`summary.csv`**: Overall benchmark metrics and configuration
- **`detailed_breakdown.csv`**: Per-epoch performance breakdown

Both files are ready for analysis in Excel, Python, R, or other data tools.
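As a starting point for analysis in Python, the CSV output can be loaded with the standard library alone. The column names below are illustrative stand-ins, not the script's actual headers; inspect your own `detailed_breakdown.csv` for the real ones.

```python
import csv
import io

# Illustrative excerpt of a per-epoch breakdown; real column names may differ.
sample = """epoch,throughput_samples_per_sec,peak_memory_mb
1,20098,703.2
2,21310,705.8
"""

# With a real file, replace io.StringIO(sample) with open("detailed_breakdown.csv").
rows = list(csv.DictReader(io.StringIO(sample)))
throughputs = [float(r["throughput_samples_per_sec"]) for r in rows]
print(sum(throughputs) / len(throughputs))  # mean throughput across epochs
```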
## Troubleshooting

### Dataset Issues

- **H5AD files**: Converted automatically to SCDL format (conversion time is reported)
- **Large datasets**: Memory-mapped access is used for efficiency
- **Download failures**: Check your internet connection and try again
- **Conversion caching**: H5AD files are converted once, then reused on subsequent runs

### Performance Tips

- **Faster throughput**: Use `--batch-size 64` or higher
- **Longer runs**: Increase `--max-time 120` for more stable measurements
- **Memory profiling**: Use `--csv` to get detailed memory usage per epoch
- **Clearing the page cache**: With lazy loading, data may persist in the page cache between runs; this particularly affects SCDL. Between runs, the page cache can be cleared with:

  ```bash
  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  ```
## Example Datasets

The script automatically downloads a 25K-cell example dataset from CellxGene. For other datasets:

- **10X Genomics**: Convert .h5 files to .h5ad using `scanpy.read_10x_h5()`
- **AnnData files**: Use directly with `-i dataset.h5ad`
- **Large datasets**: Pre-convert to SCDL format for faster loading
### Tahoe 100M

The Tahoe 100M dataset (described in [Zhang _et al_. 2025](https://doi.org/10.1101/2025.02.20.639398)) contains data from 1,100 small-molecule perturbations across 50,000 cancer cell lines, totaling 100 million cells. This dataset was used by [D'Ascenzo and Cultrera di Montesano 2025](https://github.com/Kidara/scDataset) to benchmark dataloaders for single-cell data.

To download the full Tahoe 100M dataset in AnnData format (1 file per plate, 14 plates total):

**Warning**: This will trigger egress charges, which can be significant.

**Note**: This dataset is 314 GB. The corresponding SCDL dataset after conversion is 1.1 TB, so ensure that you have sufficient disk space if using the entire dataset.

**Note**: You will need the Google Cloud CLI (`gcloud`) installed to download this dataset.

```bash
gcloud storage cp -R gs://arc-ctc-tahoe100/2025-02-25/* .
```

This will download 19 files in total (14 from the full set plus 5 related to the tutorial).
To process this data, one option is to run `python scdl_speedtest.py --generate-baseline -i <path to h5ad>`,
which automatically converts the files to the SCDL format. Alternatively, with bionemo-scdl installed, run
`convert_h5ad_to_scdl --data-path <path to h5ad> --save-path <SCDL path>`. The full conversion takes multiple
hours; however, running a single plate of the data should give you a good idea of expected SCDL performance
on your system. The following command runs the speedtest on the first plate, as downloaded above:

```bash
python scdl_speedtest.py --generate-baseline -i tahoe-100m/h5ad/plate1_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
```
## Support

For support, please [file an issue in the BioNeMo Framework GitHub repository](https://github.com/NVIDIA/bionemo-framework/issues).
This code will be updated and refactored once a general benchmarking framework is in place.