# [FEATURE] Direct Zarr Streaming for Memory-Efficient MSI Conversion

## Problem Statement
Mass Spectrometry Imaging (MSI) datasets present unique computational challenges due to their inherent structure. A single MSI experiment captures a full mass spectrum at every spatial position (pixel), creating a 3D data cube where:
- **Spatial dimensions:** tissue sections typically range from 100x100 to 2000x2000+ pixels
- **Spectral dimension:** each pixel carries a full spectrum with hundreds of thousands of raw m/z values
- **Data volume:** a modest 1000x1000-pixel dataset with 300k mass channels amounts to 300 billion data points
For example, our Xenium dataset contains ~920,000 spectra, and when converted to a sparse table format, this creates matrices with billions of potential entries. Even with sparse storage (CSR/CSC/COO), the conversion process itself can exhaust available memory because intermediate representations must be held in RAM before writing.
The current workflow built on `SpatialData.write()` requires the complete `AnnData` object in memory, which becomes prohibitive for large-scale MSI studies.
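As a back-of-envelope illustration (the non-zero count and dtypes below are assumptions for the sketch, not measurements), holding even a sparse intermediate fully in RAM quickly becomes infeasible:

```python
# Rough estimate of RAM needed to hold a sparse intermediate before writing.
# Spectrum count matches the Xenium example above; non-zeros per spectrum
# and dtypes are assumed for illustration.
n_spectra = 920_000
nnz_per_spectrum = 10_000      # assumed non-zeros retained per spectrum
bytes_per_entry = 4 + 4        # float32 value + int32 column index (CSR)

ram_gb = n_spectra * nnz_per_spectrum * bytes_per_entry / 1e9
print(f"~{ram_gb:.0f} GB for the in-memory intermediate alone")  # ~74 GB
```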
## Proposed Solution
Implement direct Zarr streaming that bypasses `SpatialData.write()` entirely:
- **Single-pass streaming:** iterate through spectra once, writing directly to Zarr arrays
- **Bounded memory:** buffer only ~500k values at a time, regardless of dataset size
- **Incremental CSR/CSC construction:** build the `indptr`, `indices`, and `data` arrays on the fly (see the sketch after this list)
- **SpatialData compatible:** output follows the expected Zarr structure so `SpatialData.read()` works without modification
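A minimal sketch of the incremental CSR construction, assuming the zarr v2 Python API. The layout mirrors AnnData's on-disk CSR encoding (`X/data`, `X/indices`, `X/indptr`; required encoding attributes omitted), and the mock `spectra` generator stands in for whatever the reader yields:

```python
import numpy as np
import zarr

BUFFER_SIZE = 500_000  # flush threshold, matching the ~500k value budget

# Stand-in for the reader: yields (column_indices, intensities) per pixel.
rng = np.random.default_rng(0)
spectra = (
    (np.sort(rng.choice(50_000, 300, replace=False)),
     rng.exponential(100.0, 300).astype("f4"))
    for _ in range(1_000)
)

root = zarr.open_group("table.zarr", mode="w")
X = root.create_group("X")
data = X.create_dataset("data", shape=(0,), chunks=(BUFFER_SIZE,), dtype="f4")
indices = X.create_dataset("indices", shape=(0,), chunks=(BUFFER_SIZE,), dtype="i4")

indptr = [0]                   # one entry per row + 1; tiny, kept in RAM
buf_data, buf_indices = [], []

def flush():
    """Append buffered values to the on-disk arrays and clear the buffers."""
    data.append(np.asarray(buf_data, dtype="f4"))
    indices.append(np.asarray(buf_indices, dtype="i4"))
    buf_data.clear()
    buf_indices.clear()

for col_idx, values in spectra:
    buf_data.extend(values)
    buf_indices.extend(col_idx)
    indptr.append(indptr[-1] + len(col_idx))
    if len(buf_data) >= BUFFER_SIZE:
        flush()                # bounded memory: spill to Zarr, reset buffers

flush()                        # write any remainder
X.create_dataset("indptr", data=np.asarray(indptr, dtype="i8"))
```

Peak memory here is the buffer plus `indptr`, independent of the total number of non-zeros.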
## Architecture
```
Reader.iter_spectra()
        |
        v
[Process spectrum -> resample to common mass axis]
        |
        v
[Buffer indices/data in memory (~500k values)]
        |
        v
[Flush to Zarr arrays when buffer full]
        |
        v
[Build indptr incrementally]
        |
        v
Final: SpatialData-compatible Zarr store
```
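The "resample to common mass axis" stage might look like the following; the axis bounds, bin count, and simple binning scheme are all assumptions, and the real converter may centroid or interpolate differently:

```python
import numpy as np

# Hypothetical common mass axis shared by all pixels (range/size assumed).
common_axis = np.linspace(100.0, 2000.0, 50_000)

def resample(mz, intensity):
    """Assign raw m/z values to bins on the common axis, keeping non-zero
    intensities only. Clip guards the out-of-range index searchsorted can
    return at the upper axis boundary."""
    bins = np.clip(np.searchsorted(common_axis, mz), 0, len(common_axis) - 1)
    keep = intensity > 0
    return bins[keep], intensity[keep]

# Demo on one synthetic spectrum; a real run would call this once per
# spectrum from the reader before buffering, as in the diagram above.
rng = np.random.default_rng(0)
mz = np.sort(rng.uniform(100.0, 2000.0, 500))
intensity = rng.exponential(50.0, 500).astype("f4")
col_idx, values = resample(mz, intensity)
print(col_idx[:5], values[:5])
```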
## Results
Tested with synthetic data (250k pixels, 50k mass channels, 31M non-zeros):
| Metric | Standard Approach | Streaming Approach |
|---|---|---|
| Peak Memory | ~2-4 GB | 60 MB |
| Memory Scaling | Linear with data | Constant |
| SpatialData Compatible | Yes | Yes |
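The peak-memory numbers can be reproduced with `tracemalloc` from the standard library (the converter entry point below is a hypothetical placeholder; NumPy buffer allocations are visible to tracemalloc, but memory held by non-Python libraries is not):

```python
import tracemalloc

def convert_streaming(src, dst):
    # Placeholder: substitute the actual streaming converter entry point.
    ...

tracemalloc.start()
convert_streaming("input.imzML", "output.zarr")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak Python-level allocation: {peak / 1e6:.1f} MB")
```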
## Related Work
This connects to ongoing SpatialData memory-optimization efforts:
- scverse/spatialdata-io#337 - load Xenium mask labels using Dask
- scverse/spatialdata-io#228 - reduce reader memory consumption (`output_path` parameter for element-by-element saving)
- scverse/spatialdata-io#279 - chunkwise image loader
## Implementation
- `thyra/converters/spatialdata/streaming_converter.py` - core implementation
- `mock_msi_generator.py` - synthetic data generator for testing without real datasets
## Mock Data Generator
For easy collaboration and testing without large real datasets:
```bash
# Quick test (100x100 pixels, ~5s)
poetry run python mock_msi_generator.py small

# Realistic test (500x500 pixels, ~2min)
poetry run python mock_msi_generator.py medium

# Stress test (1000x1000 pixels)
poetry run python mock_msi_generator.py large
```
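For reference, the core of such a generator is only a few lines; the grid size, peak count, and intensity distribution below are illustrative assumptions, not necessarily what `mock_msi_generator.py` actually does:

```python
import numpy as np

rng = np.random.default_rng(0)

def iter_mock_spectra(nx=100, ny=100, n_channels=50_000, n_peaks=300):
    """Yield (x, y, column_indices, intensities) over a synthetic MSI grid."""
    for x in range(nx):
        for y in range(ny):
            cols = np.sort(rng.choice(n_channels, size=n_peaks, replace=False))
            vals = rng.exponential(scale=100.0, size=n_peaks).astype("f4")
            yield x, y, cols, vals

# e.g. feed straight into the streaming writer sketched above
x, y, cols, vals = next(iter_mock_spectra())
```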
## Questions for Discussion
- Is direct Zarr construction (bypassing `SpatialData.write()`) a supported/recommended pattern?
- Are there required attributes we might be missing for full compatibility?
- Can images/shapes be added lazily after the table is created?
- Would this approach be useful to generalize for other large-scale SpatialData use cases?
## Checklist
- Single-pass streaming implementation
- CSR matrix direct-to-Zarr writing
- SpatialData.read() compatibility verified
- Mock data generator for testing
- Add images/shapes lazily after table creation
- Integration with existing converter API
- Documentation