feat: streaming converter for memory-efficient large dataset conversion#69
Tomatokeftes merged 7 commits into main
Conversation
Implements direct Zarr streaming that bypasses SpatialData.write() to handle large MSI datasets (900k+ spectra) without memory exhaustion.

Key features:
- Single-pass streaming with bounded memory (~60 MB regardless of dataset size)
- Incremental CSR matrix construction directly to Zarr arrays
- SpatialData.read() compatible output
- Mock data generator for testing without real datasets

Closes #68
Complexity Monitoring Report (Threshold: 10)
This commit implements a memory-efficient two-pass streaming approach for converting large MSI datasets to SpatialData format without disk caching.

Streaming Converter Improvements:
- Add no-cache CSC streaming: a two-pass approach (prescan + scatter) that eliminates ~200 GB of cache file I/O for large datasets
- Use memory-mapped files (numpy memmap) for CSC arrays; the OS manages virtual memory, keeping RAM usage minimal regardless of dataset size
- Add TIC image and pixel shapes generation during CSC conversion
- Suppress the expected SpatialData warning about the table annotating shapes that are written immediately afterwards
- Fix reader reset handling for second-pass iteration
- Remove ~500 lines of dead code (cache-based CSC methods)
- Use the dynamic version from the package instead of a hardcoded string

Reader Enhancements:
- Add an iter_spectra() generator factory pattern to all readers
- Add a get_peak_counts_per_pixel() method for CSR indptr construction
- Fix coordinate iteration order consistency across readers

Test Coverage:
- Increase streaming converter test coverage from 78% to 85%
- Add 7 new tests: auto mode detection, custom temp directory, optical image loading (grayscale and RGB), chunk writing
- Move mock_msi_generator.py to tests/fixtures/
- Fix line endings across the codebase (pre-commit)

The no-cache approach processes spectra twice but eliminates massive temporary file I/O, resulting in faster conversion for datasets where I/O is the bottleneck.
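The memmap-backed CSC arrays mentioned above can be illustrated with a small sketch. This is not the PR's actual code; the file name and sizes are hypothetical, and it only shows the numpy memmap mechanism that lets the OS page data in and out so resident memory stays bounded.

```python
import os
import tempfile

import numpy as np

# Illustrative sketch (not the PR's actual code): back a CSC data array with a
# file via numpy memmap, so the OS pages it in and out and resident memory
# stays small regardless of the array's size on disk.
path = os.path.join(tempfile.mkdtemp(), "csc_data.bin")
n_nonzeros = 1_000_000  # hypothetical non-zero count

data = np.memmap(path, dtype=np.float64, mode="w+", shape=(n_nonzeros,))
data[:10] = np.arange(10)  # writes go through the page cache, flushed lazily
data.flush()

# A second pass (or another process) can reopen the same file read-only.
reopened = np.memmap(path, dtype=np.float64, mode="r", shape=(n_nonzeros,))
print(reopened[:5])
```

The same pattern applies to the indices and indptr arrays; only the dtype changes.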
Extract the two-pass COO building logic into smaller helper methods:
- _coo_pass1_count_nonzeros: handles the counting pass
- _coo_setup_zarr_arrays: sets up the Zarr structure
- _coo_pass2_write_data: handles the data-writing pass
- _flush_chunk_to_zarr: flushes chunk buffers

This brings the function complexity below the threshold of 10 and improves maintainability.
Extract the mass range scanning logic into smaller helper methods:
- _init_peak_counts_array: initializes the per-pixel peak counts array
- _scan_all_spectra: main scanning loop
- _process_spectrum_for_range: processes a single spectrum
- _store_pixel_peak_count: stores the peak count for a pixel

This brings the function complexity below the threshold of 10.
Complexity Monitoring Report (Threshold: 10): Excellent! No complexity violations found.
- Remove the unused zero_copy parameter (always True, never used as False)
- Remove the _convert_with_scipy() method (a duplicate of the COO path in convert())
- Simplify the convert() method docstring

This removes ~50 lines of untested dead code.
- Update module and class docstrings to reflect the current implementation
- Remove outdated zero_copy references from the documentation
- Add a _suppress_reader_progress() helper to consolidate a repeated pattern
- Simplify _process_spectrum with an early return for the non-resampling case
- Remove duplicate fallback code for non-NN resampling
Summary
Implements memory-efficient streaming conversion for large MSI datasets (900k+ spectra, 100GB+ dense equivalent) to SpatialData/Zarr format without memory exhaustion.
Closes #68
The Challenge: Why Large MSI Data is Hard
Mass Spectrometry Imaging datasets are inherently sparse - each pixel has peaks at only ~500-2000 m/z positions out of 50,000+ possible bins. However, the standard conversion approach:
- Loads everything into AnnData, which may densify during operations
- Calls SpatialData.write(), which can trigger additional copies

For a 900x1000 pixel dataset with 50k m/z bins, this means:
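A back-of-envelope calculation makes the scale concrete. The figures below use the dataset dimensions stated above (900x1000 pixels, 50k bins) and the ~2000 peaks-per-pixel upper bound as assumptions:

```python
# Back-of-envelope memory cost for the example dataset (assumed figures).
n_pixels = 900 * 1000        # 900x1000 pixel grid
n_bins = 50_000              # m/z bins
dense_bytes = n_pixels * n_bins * 8  # float64 dense matrix

# Sparse reality: ~500-2000 peaks per pixel; use 2000 as the upper bound.
nnz = n_pixels * 2000
# CSR storage: data (float64) + indices (int32) + indptr (int64)
sparse_bytes = nnz * (8 + 4) + (n_pixels + 1) * 8

print(f"dense:  {dense_bytes / 1e9:.1f} GB")   # dense: 360 GB
print(f"sparse: {sparse_bytes / 1e9:.1f} GB")  # sparse: ~21.6 GB
```

Even the sparse representation is too large to hold comfortably in RAM alongside intermediate copies, which is what motivates streaming.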
Our Solution: Two-Pass Direct-to-Zarr Streaming
We bypass the standard SpatialData write path entirely and write directly to Zarr arrays in a streaming fashion.
CSR vs CSC: Why Format Matters
CSR (Compressed Sparse Row) - stores row pointers, easier for row-wise iteration:
CSC (Compressed Sparse Column) - stores column pointers, harder to build from row-wise data:
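The difference is easy to see on a toy matrix with scipy (an illustration, not code from this PR):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny 3x4 matrix standing in for pixels (rows) x m/z bins (columns).
dense = np.array([
    [0, 5, 0, 1],
    [2, 0, 0, 0],
    [0, 3, 4, 0],
])

csr = csr_matrix(dense)
csc = csr.tocsc()

# CSR: one indptr entry per row boundary -> cheap row (spectrum) access.
print(csr.indptr)  # [0 2 3 5]; row i spans data[indptr[i]:indptr[i+1]]
# CSC: one indptr entry per column boundary -> cheap column (ion image) access.
print(csc.indptr)  # [0 1 3 4 5]
```

Slicing one column of the CSC matrix touches a single contiguous `data` span, which is exactly the ion-image access pattern mentioned below.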
Why CSC is Harder (and Why We Use It)
MSI data arrives row-by-row (spectrum by spectrum), but CSC requires column-sorted data. This creates a fundamental mismatch:
Why CSC anyway? SpatialData/AnnData expects CSC format for efficient column (m/z bin) slicing, which is the common access pattern for ion images.
Our Two-Pass No-Cache Approach
Key insight: Processing spectra twice is faster than writing a 200GB cache file to disk.
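The two passes can be sketched in miniature. This is a simplified model of the prescan + scatter idea, assuming a re-invocable `spectra()` source yielding `(column indices, values)` per row; the real converter writes into memmap/Zarr arrays rather than in-memory numpy arrays:

```python
import numpy as np

def spectra():
    """Stand-in for a reader's iter_spectra(): yields (mz bin indices, intensities)."""
    yield np.array([1, 3]), np.array([5.0, 1.0])
    yield np.array([0]), np.array([2.0])
    yield np.array([1, 2]), np.array([3.0, 4.0])

n_cols = 4

# Pass 1 (prescan): count non-zeros per column to build the CSC indptr.
col_counts = np.zeros(n_cols, dtype=np.int64)
for cols, _ in spectra():
    col_counts[cols] += 1
indptr = np.concatenate([[0], np.cumsum(col_counts)])

# Pass 2 (scatter): place each value at the next free slot of its column.
# In the real converter these would be on-disk memmap / Zarr arrays.
data = np.zeros(indptr[-1])
indices = np.zeros(indptr[-1], dtype=np.int64)  # row index of each value
cursor = indptr[:-1].copy()                     # next write position per column
for row, (cols, vals) in enumerate(spectra()):
    pos = cursor[cols]
    data[pos] = vals
    indices[pos] = row
    cursor[cols] += 1
```

No row ever needs to be buffered beyond its own peaks, so peak memory stays bounded by the per-column cursors, not by the matrix size.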
Implementation Details
Memory Management
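The fresh-iterator requirement can be sketched as a generator factory (hypothetical names, not the PR's exact API): each call to `iter_spectra()` returns a new generator, so the two-pass converter never tries to rewind a shared iterator.

```python
# Illustrative sketch of the generator-factory pattern.
class MockReader:
    def __init__(self, spectra):
        self._spectra = spectra

    def iter_spectra(self):
        # Return a fresh generator per call; a real reader would re-open
        # or re-seek the underlying file here instead.
        return (s for s in self._spectra)

reader = MockReader([10, 20, 30])
pass1 = list(reader.iter_spectra())  # prescan pass consumes one iterator
pass2 = list(reader.iter_spectra())  # scatter pass gets a fresh one
print(pass1 == pass2)  # True: both passes see the full stream
```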
- iter_spectra() creates fresh iterators for each pass

SpatialData Compatibility

- SpatialData.read() understands the streamed output

Results
Tested with real MSI data (PEA dataset, 250k+ pixels):
Potential for Non-MSI Data
This approach could benefit any large sparse tabular data in SpatialData:
The core technique (two-pass streaming to CSC Zarr) is format-agnostic. The main MSI-specific parts are:
Question for @LucaMarconato: Would a generalized version of this streaming writer be useful in spatialdata-io? The pattern could be:

Code Quality
Files Changed
- streaming_converter.py
- imzml_extractor.py
- base_reader.py
- test_streaming_converter.py

Test Plan
References
- output_path parameter for streaming