
feat: streaming converter for memory-efficient large dataset conversion #69

Merged
Tomatokeftes merged 7 commits into main from feature/streaming-large-datasets
Jan 23, 2026
Conversation


@Tomatokeftes commented Jan 12, 2026

Summary

Implements memory-efficient streaming conversion for large MSI datasets (900k+ spectra, 100GB+ dense equivalent) to SpatialData/Zarr format without memory exhaustion.

Closes #68

The Challenge: Why Large MSI Data is Hard

Mass Spectrometry Imaging datasets are inherently sparse - each pixel has peaks at only ~500-2000 m/z positions out of 50,000+ possible bins. However, the standard conversion approach:

  1. Builds a dense or scipy sparse matrix in memory
  2. Passes it to AnnData which may densify during operations
  3. Writes through SpatialData.write() which can trigger additional copies

For a 900x1000 pixel dataset with 50k m/z bins, this means:

  • Dense: 900k * 50k * 4 bytes = ~170 GB (impossible)
  • Scipy sparse in-memory: Still requires ~2-4 GB RAM for intermediate structures
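The dense figure is easy to verify with back-of-envelope arithmetic (sizes as quoted above, float32 assumed):

```python
# Back-of-envelope check of the dense footprint quoted above.
n_pixels = 900 * 1000                        # 900k spectra
n_bins = 50_000                              # m/z bins
dense_gib = n_pixels * n_bins * 4 / 2**30    # 4 bytes per float32 value
print(round(dense_gib))                      # 168 GiB, i.e. the "~170 GB" above
```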

Our Solution: Two-Pass Direct-to-Zarr Streaming

We bypass the standard SpatialData write path entirely and write directly to Zarr arrays in a streaming fashion.

CSR vs CSC: Why Format Matters

CSR (Compressed Sparse Row) - stores row pointers, easier for row-wise iteration:

indptr:  [0, 3, 5, 8, ...]  # Where each row starts
indices: [2, 5, 9, ...]      # Column indices
data:    [1.0, 2.5, ...]     # Values

CSC (Compressed Sparse Column) - stores column pointers, harder to build from row-wise data:

indptr:  [0, 2, 4, 7, ...]  # Where each column starts  
indices: [0, 3, 1, ...]      # Row indices
data:    [1.0, 2.5, ...]     # Values
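The two layouts can be made concrete on a toy matrix with scipy (illustration only, not code from this PR):

```python
import numpy as np
from scipy import sparse

# Toy 3x4 matrix: rows are spectra, columns are m/z bins.
dense = np.array([
    [0.0, 1.0, 0.0, 2.5],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 5.0, 0.0],
])

csr = sparse.csr_matrix(dense)   # indptr marks where each ROW starts
csc = sparse.csc_matrix(dense)   # indptr marks where each COLUMN starts

print(csr.indptr)   # [0 2 3 5]
print(csc.indptr)   # [0 1 3 4 5]
```

Note that both formats hold identical values; only the grouping (by row vs by column) differs.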

Why CSC is Harder (and Why We Use It)

MSI data arrives row-by-row (spectrum by spectrum), but CSC requires column-sorted data. This creates a fundamental mismatch:

| Approach | Memory | I/O | Complexity |
| --- | --- | --- | --- |
| CSR (row-major) | Low | Low | Simple: data arrives in order |
| CSC (column-major) | Higher | Higher | Hard: must transpose on the fly |

Why CSC anyway? SpatialData/AnnData expects CSC format for efficient column (m/z bin) slicing, which is the common access pattern for ion images.
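That access pattern is visible on a small CSC matrix (scipy here only for illustration): an ion image for bin j is one contiguous slice of the data array, with no scan over other bins.

```python
import numpy as np
from scipy import sparse

# Toy pixels-by-bins matrix stored as CSC.
x = sparse.random(100, 50, density=0.05, format="csc", random_state=0)

# The "ion image" for m/z bin j is column j: one contiguous slice of
# data/indices, located by two indptr lookups.
j = 7
start, stop = x.indptr[j], x.indptr[j + 1]
col_vals = x.data[start:stop]       # intensities for bin j
col_rows = x.indices[start:stop]    # which pixels carry them
```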

Our Two-Pass No-Cache Approach

Pass 1: Pre-scan
  - Count entries per column (for indptr)
  - Compute TIC values and average spectrum
  - No disk caching needed

Pass 2: Scatter to CSC
  - Allocate memmap files for indices/data
  - Process each spectrum, scatter to correct column positions
  - OS manages virtual memory via memmap

Key insight: Processing spectra twice is faster than writing a 200GB cache file to disk.
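The two passes can be sketched in plain numpy. Names below are illustrative, not the PR's actual code; the real converter backs `indices`/`data` with np.memmap files and streams spectra from a reader.

```python
import numpy as np

def build_csc_two_pass(iter_spectra, n_cols):
    """Two-pass CSC build from row-ordered spectra.

    iter_spectra is a zero-argument factory returning a fresh
    (row, col_indices, values) generator on each call.
    """
    # Pass 1: count nonzeros per column, cumsum into indptr.
    counts = np.zeros(n_cols, dtype=np.int64)
    for _row, cols, _vals in iter_spectra():
        counts[cols] += 1              # cols are distinct within a spectrum
    indptr = np.zeros(n_cols + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])

    # In the real converter these would be np.memmap files on disk.
    indices = np.zeros(indptr[-1], dtype=np.int64)
    data = np.zeros(indptr[-1], dtype=np.float64)

    # Pass 2: scatter each spectrum's entries into its columns' next free slots.
    cursor = indptr[:-1].copy()        # per-column write positions
    for row, cols, vals in iter_spectra():
        pos = cursor[cols]
        indices[pos] = row
        data[pos] = vals
        cursor[cols] += 1
    return indptr, indices, data

# Three toy spectra (nonzero columns, values), arriving row by row.
spectra = [(np.array([1, 3]), np.array([1.0, 2.5])),
           (np.array([0]), np.array([3.0])),
           (np.array([1, 2]), np.array([4.0, 5.0]))]

def iter_spectra():
    for row, (cols, vals) in enumerate(spectra):
        yield row, cols, vals

indptr, indices, data = build_csc_two_pass(iter_spectra, n_cols=4)
```

Because rows are visited in increasing order, indices within each column come out sorted, which downstream CSC consumers expect.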

Implementation Details

Memory Management

  • Memmap arrays: OS handles paging, RAM usage stays ~100MB regardless of dataset size
  • Chunked writes: Flush to Zarr every N spectra to bound memory
  • Generator factory pattern: iter_spectra() creates fresh iterators for each pass
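The generator-factory bullet is worth a tiny illustration: a single generator object would be exhausted after pass 1, so readers expose a factory that yields a fresh iterator per call (names below are illustrative):

```python
def make_spectrum_source(spectra):
    # Factory pattern: return a zero-argument callable, not a generator.
    # Each call opens a fresh iteration, so pass 2 sees all spectra again.
    def iter_spectra():
        for row, (cols, vals) in enumerate(spectra):
            yield row, cols, vals
    return iter_spectra

source = make_spectrum_source([((1,), (1.0,)), ((2,), (2.0,))])
pass1 = sum(1 for _ in source())
pass2 = sum(1 for _ in source())   # 2 again; a bare generator would give 0
```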

SpatialData Compatibility

  • Writes Zarr structure that SpatialData.read() understands
  • Includes TIC image and pixel shapes for visualization
  • Preserves metadata (mass axis, coordinates, and other essential fields)

Results

Tested with real MSI data (PEA dataset, 250k+ pixels):

| Metric | Standard Approach | Streaming (This PR) |
| --- | --- | --- |
| Peak Memory | ~2-4 GB | ~100 MB |
| Memory Scaling | O(n) with data size | O(1) constant |
| Disk Cache | N/A | None needed |

Potential for Non-MSI Data

This approach could benefit any large sparse tabular data in SpatialData:

  • Single-cell RNA-seq: Similar sparsity patterns (genes x cells)
  • Spatial transcriptomics: Spot-based methods with sparse gene expression
  • CODEX/IMC: Protein expression matrices

The core technique (two-pass streaming to CSC Zarr) is format-agnostic. The main MSI-specific parts are:

  • Spectrum resampling to common m/z axis
  • TIC image generation
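As a rough sketch of the first MSI-specific step, nearest-neighbour resampling onto a common axis can look like the following (`resample_nn` is a hypothetical name, not this PR's implementation, which may bin or interpolate differently):

```python
import numpy as np

def resample_nn(mz, intensity, common_axis):
    """Assign each peak to the nearest bin of common_axis, summing collisions."""
    idx = np.searchsorted(common_axis, mz)
    idx = np.clip(idx, 1, len(common_axis) - 1)
    left = common_axis[idx - 1]
    right = common_axis[idx]
    idx -= (mz - left) < (right - mz)   # step back when the left bin is closer
    out = np.zeros(len(common_axis))
    np.add.at(out, idx, intensity)      # accumulate peaks sharing a bin
    return out

axis = np.array([100.0, 200.0, 300.0])
print(resample_nn(np.array([120.0, 290.0]), np.array([1.0, 2.0]), axis))
# [1. 0. 2.]
```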

Question for @LucaMarconato: Would a generalized version of this streaming writer be useful in spatialdata-io? The pattern could be:

# Hypothetical API
with StreamingTableWriter(output_path, n_rows, n_cols) as writer:
    for row_idx, (indices, values) in enumerate(data_generator):
        writer.add_row(row_idx, indices, values)

Code Quality

  • Test coverage: 85% on streaming converter
  • Complexity: all functions below the threshold of 10 (maximum reduced from 16 to ~5)
  • 256 unit tests passing

Files Changed

| File | Purpose |
| --- | --- |
| streaming_converter.py | Main streaming implementation |
| imzml_extractor.py | Per-pixel peak counts for pre-allocation |
| base_reader.py | Generator factory pattern for readers |
| test_streaming_converter.py | 17 comprehensive tests |

Test Plan

  • Unit tests for all streaming paths (CSR, CSC, COO)
  • Optical image loading tests
  • Auto-mode threshold detection
  • TIC image and shapes verification
  • Rectangular grid handling
  • Integration test with real large dataset

References

…ersion

Implements direct Zarr streaming that bypasses SpatialData.write() to handle
large MSI datasets (900k+ spectra) without memory exhaustion.

Key features:
- Single-pass streaming with bounded memory (~60 MB regardless of dataset size)
- Incremental CSR matrix construction directly to Zarr arrays
- SpatialData.read() compatible output
- Mock data generator for testing without real datasets

Closes #68
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 3
Average Complexity: 3.07
Maximum Complexity: 16

Complexity Distribution

  • 1-5 (Low): 408
  • 6-10 (Moderate): 65
  • 11-15 (High): 1
  • 16-20 (Very High): 2
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/converters/spatialdata/streaming_converter.py:610 - _stream_write_direct (16)
  2. thyra/converters/spatialdata/streaming_converter.py:893 - _stream_write_direct_zarr (16)
  3. thyra/converters/spatialdata/streaming_converter.py:186 - _stream_build_coo (15)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

…aset conversion

This commit implements a memory-efficient two-pass streaming approach for
converting large MSI datasets to SpatialData format without disk caching.

Key changes:

Streaming Converter Improvements:
- Add no-cache CSC streaming: two-pass approach (prescan + scatter) that
  eliminates ~200GB cache file I/O for large datasets
- Use memory-mapped files (numpy memmap) for CSC arrays - OS manages
  virtual memory, keeping RAM usage minimal regardless of dataset size
- Add TIC image and pixel shapes generation during CSC conversion
- Suppress expected SpatialData warning about table annotating shapes
  that are written immediately after
- Fix reader reset handling for second pass iteration
- Remove ~500 lines of dead code (cache-based CSC methods)
- Use dynamic version from package instead of hardcoded string

Reader Enhancements:
- Add iter_spectra() generator factory pattern to all readers
- Add get_peak_counts_per_pixel() method for CSR indptr construction
- Fix coordinate iteration order consistency across readers

Test Coverage:
- Increase streaming converter test coverage from 78% to 85%
- Add 7 new tests: auto mode detection, custom temp directory,
  optical image loading (grayscale and RGB), chunk writing
- Move mock_msi_generator.py to tests/fixtures/
- Fix line endings across codebase (pre-commit)

The no-cache approach processes spectra twice but eliminates massive
temporary file I/O, resulting in faster conversion for datasets where
I/O is the bottleneck.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 2
Average Complexity: 3.06
Maximum Complexity: 16

Complexity Distribution

  • 1-5 (Low): 417
  • 6-10 (Moderate): 69
  • 11-15 (High): 1
  • 16-20 (Very High): 1
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/converters/spatialdata/streaming_converter.py:281 - _stream_build_coo (16)
  2. thyra/metadata/extractors/imzml_extractor.py:110 - _get_mass_range_complete (13)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

Extract the two-pass COO building logic into smaller helper methods:
- _coo_pass1_count_nonzeros: handles counting pass
- _coo_setup_zarr_arrays: sets up Zarr structure
- _coo_pass2_write_data: handles data writing pass
- _flush_chunk_to_zarr: flushes chunk buffers

This brings the function complexity below the threshold of 10
and improves maintainability.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 1
Average Complexity: 3.05
Maximum Complexity: 13

Complexity Distribution

  • 1-5 (Low): 420
  • 6-10 (Moderate): 71
  • 11-15 (High): 1
  • 16-20 (Very High): 0
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/metadata/extractors/imzml_extractor.py:110 - _get_mass_range_complete (13)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

Extract the mass range scanning logic into smaller helper methods:
- _init_peak_counts_array: initializes per-pixel peak counts array
- _scan_all_spectra: main scanning loop
- _process_spectrum_for_range: processes single spectrum
- _store_pixel_peak_count: stores peak count for a pixel

This brings the function complexity below the threshold of 10.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

- Remove unused zero_copy parameter (always True, never used False)
- Remove _convert_with_scipy() method (duplicate of COO path in convert())
- Simplify convert() method docstring

This removes ~50 lines of untested dead code.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

- Update module and class docstrings to reflect current implementation
- Remove outdated zero_copy references from documentation
- Add _suppress_reader_progress() helper to consolidate repeated pattern
- Simplify _process_spectrum with early return for non-resampling case
- Remove duplicate fallback code for non-NN resampling
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

Tomatokeftes marked this pull request as ready for review January 23, 2026 10:22
Tomatokeftes merged commit fe81d0e into main Jan 23, 2026
7 checks passed
Tomatokeftes deleted the feature/streaming-large-datasets branch January 23, 2026 10:22

Development

Successfully merging this pull request may close these issues.

[FEATURE] Direct Zarr Streaming for Memory-Efficient MSI Conversion
