
feat: streaming converter for memory-efficient large dataset conversion #69

Merged
Tomatokeftes merged 7 commits into main from feature/streaming-large-datasets
Jan 23, 2026
Conversation


@Tomatokeftes commented Jan 12, 2026

Summary

Implements memory-efficient streaming conversion for large MSI datasets (900k+ spectra, 100GB+ dense equivalent) to SpatialData/Zarr format without memory exhaustion.

Closes #68

The Challenge: Why Large MSI Data is Hard

Mass Spectrometry Imaging datasets are inherently sparse - each pixel has peaks at only ~500-2000 m/z positions out of 50,000+ possible bins. However, the standard conversion approach:

  1. Builds a dense or scipy sparse matrix in memory
  2. Passes it to AnnData which may densify during operations
  3. Writes through SpatialData.write() which can trigger additional copies

For a 900x1000 pixel dataset with 50k m/z bins, this means:

  • Dense: 900k * 50k * 4 bytes = ~170 GB (impossible)
  • Scipy sparse in-memory: Still requires ~2-4 GB RAM for intermediate structures
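The dense figure is easy to verify with back-of-envelope arithmetic (sizes as quoted above, float32 assumed):

```python
# Back-of-envelope check of the dense footprint quoted above.
n_pixels = 900 * 1000                        # 900k spectra
n_bins = 50_000                              # m/z bins
dense_gib = n_pixels * n_bins * 4 / 2**30    # 4 bytes per float32 value
print(round(dense_gib))                      # 168 GiB, i.e. the "~170 GB" above
```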

Our Solution: Two-Pass Direct-to-Zarr Streaming

We bypass the standard SpatialData write path entirely and write directly to Zarr arrays in a streaming fashion.

CSR vs CSC: Why Format Matters

CSR (Compressed Sparse Row) - stores row pointers, easier for row-wise iteration:

indptr:  [0, 3, 5, 8, ...]  # Where each row starts
indices: [2, 5, 9, ...]      # Column indices
data:    [1.0, 2.5, ...]     # Values

CSC (Compressed Sparse Column) - stores column pointers, harder to build from row-wise data:

indptr:  [0, 2, 4, 7, ...]  # Where each column starts  
indices: [0, 3, 1, ...]      # Row indices
data:    [1.0, 2.5, ...]     # Values
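The two layouts can be made concrete on a toy matrix with scipy (illustration only, not code from this PR):

```python
import numpy as np
from scipy import sparse

# Toy 3x4 matrix: rows are spectra, columns are m/z bins.
dense = np.array([
    [0.0, 1.0, 0.0, 2.5],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 5.0, 0.0],
])

csr = sparse.csr_matrix(dense)   # indptr marks where each ROW starts
csc = sparse.csc_matrix(dense)   # indptr marks where each COLUMN starts

print(csr.indptr)   # [0 2 3 5]
print(csc.indptr)   # [0 1 3 4 5]
```

Note that both formats hold identical values; only the grouping (by row vs by column) differs.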

Why CSC is Harder (and Why We Use It)

MSI data arrives row-by-row (spectrum by spectrum), but CSC requires column-sorted data. This creates a fundamental mismatch:

| Approach | Memory | I/O | Complexity |
| --- | --- | --- | --- |
| CSR (row-major) | Low | Low | Simple: data arrives in order |
| CSC (column-major) | Higher | Higher | Hard: must transpose on the fly |

Why CSC anyway? SpatialData/AnnData expects CSC format for efficient column (m/z bin) slicing, which is the common access pattern for ion images.
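That access pattern is visible on a small CSC matrix (scipy here only for illustration): an ion image for bin j is one contiguous slice of the data array, with no scan over other bins.

```python
import numpy as np
from scipy import sparse

# Toy pixels-by-bins matrix stored as CSC.
x = sparse.random(100, 50, density=0.05, format="csc", random_state=0)

# The "ion image" for m/z bin j is column j: one contiguous slice of
# data/indices, located by two indptr lookups.
j = 7
start, stop = x.indptr[j], x.indptr[j + 1]
col_vals = x.data[start:stop]       # intensities for bin j
col_rows = x.indices[start:stop]    # which pixels carry them
```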

Our Two-Pass No-Cache Approach

Pass 1: Pre-scan
  - Count entries per column (for indptr)
  - Compute TIC values and average spectrum
  - No disk caching needed

Pass 2: Scatter to CSC
  - Allocate memmap files for indices/data
  - Process each spectrum, scatter to correct column positions
  - OS manages virtual memory via memmap

Key insight: Processing spectra twice is faster than writing a 200GB cache file to disk.
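The two passes can be sketched in plain numpy. Names below are illustrative, not the PR's actual code; the real converter backs `indices`/`data` with np.memmap files and streams spectra from a reader.

```python
import numpy as np

def build_csc_two_pass(iter_spectra, n_cols):
    """Two-pass CSC build from row-ordered spectra.

    iter_spectra is a zero-argument factory returning a fresh
    (row, col_indices, values) generator on each call.
    """
    # Pass 1: count nonzeros per column, cumsum into indptr.
    counts = np.zeros(n_cols, dtype=np.int64)
    for _row, cols, _vals in iter_spectra():
        counts[cols] += 1              # cols are distinct within a spectrum
    indptr = np.zeros(n_cols + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])

    # In the real converter these would be np.memmap files on disk.
    indices = np.zeros(indptr[-1], dtype=np.int64)
    data = np.zeros(indptr[-1], dtype=np.float64)

    # Pass 2: scatter each spectrum's entries into its columns' next free slots.
    cursor = indptr[:-1].copy()        # per-column write positions
    for row, cols, vals in iter_spectra():
        pos = cursor[cols]
        indices[pos] = row
        data[pos] = vals
        cursor[cols] += 1
    return indptr, indices, data

# Three toy spectra (nonzero columns, values), arriving row by row.
spectra = [(np.array([1, 3]), np.array([1.0, 2.5])),
           (np.array([0]), np.array([3.0])),
           (np.array([1, 2]), np.array([4.0, 5.0]))]

def iter_spectra():
    for row, (cols, vals) in enumerate(spectra):
        yield row, cols, vals

indptr, indices, data = build_csc_two_pass(iter_spectra, n_cols=4)
```

Because rows are visited in increasing order, indices within each column come out sorted, which downstream CSC consumers expect.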

Implementation Details

Memory Management

  • Memmap arrays: OS handles paging, RAM usage stays ~100MB regardless of dataset size
  • Chunked writes: Flush to Zarr every N spectra to bound memory
  • Generator factory pattern: iter_spectra() creates fresh iterators for each pass
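The generator-factory bullet is worth a tiny illustration: a single generator object would be exhausted after pass 1, so readers expose a factory that yields a fresh iterator per call (names below are illustrative):

```python
def make_spectrum_source(spectra):
    # Factory pattern: return a zero-argument callable, not a generator.
    # Each call opens a fresh iteration, so pass 2 sees all spectra again.
    def iter_spectra():
        for row, (cols, vals) in enumerate(spectra):
            yield row, cols, vals
    return iter_spectra

source = make_spectrum_source([((1,), (1.0,)), ((2,), (2.0,))])
pass1 = sum(1 for _ in source())
pass2 = sum(1 for _ in source())   # 2 again; a bare generator would give 0
```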

SpatialData Compatibility

  • Writes Zarr structure that SpatialData.read() understands
  • Includes TIC image and pixel shapes for visualization
  • Preserves metadata (mass axis, coordinates, and other essential fields)

Results

Tested with real MSI data (PEA dataset, 250k+ pixels):

| Metric | Standard Approach | Streaming (This PR) |
| --- | --- | --- |
| Peak Memory | ~2-4 GB | ~100 MB |
| Memory Scaling | O(n) with data size | O(1) constant |
| Disk Cache | N/A | None needed |

Potential for Non-MSI Data

This approach could benefit any large sparse tabular data in SpatialData:

  • Single-cell RNA-seq: Similar sparsity patterns (genes x cells)
  • Spatial transcriptomics: Spot-based methods with sparse gene expression
  • CODEX/IMC: Protein expression matrices

The core technique (two-pass streaming to CSC Zarr) is format-agnostic. The main MSI-specific parts are:

  • Spectrum resampling to common m/z axis
  • TIC image generation
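As a rough sketch of the first MSI-specific step, nearest-neighbour resampling onto a common axis can look like the following (`resample_nn` is a hypothetical name, not this PR's implementation, which may bin or interpolate differently):

```python
import numpy as np

def resample_nn(mz, intensity, common_axis):
    """Assign each peak to the nearest bin of common_axis, summing collisions."""
    idx = np.searchsorted(common_axis, mz)
    idx = np.clip(idx, 1, len(common_axis) - 1)
    left = common_axis[idx - 1]
    right = common_axis[idx]
    idx -= (mz - left) < (right - mz)   # step back when the left bin is closer
    out = np.zeros(len(common_axis))
    np.add.at(out, idx, intensity)      # accumulate peaks sharing a bin
    return out

axis = np.array([100.0, 200.0, 300.0])
print(resample_nn(np.array([120.0, 290.0]), np.array([1.0, 2.0]), axis))
# [1. 0. 2.]
```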

Question for @LucaMarconato: Would a generalized version of this streaming writer be useful in spatialdata-io? The pattern could be:

# Hypothetical API
with StreamingTableWriter(output_path, n_rows, n_cols) as writer:
    for row_idx, (indices, values) in enumerate(data_generator):
        writer.add_row(row_idx, indices, values)

Code Quality

  • Test coverage: 85% on streaming converter
  • Complexity: all functions below the threshold of 10 (maximum reduced from 16 to ~5)
  • 256 unit tests passing

Files Changed

| File | Purpose |
| --- | --- |
| streaming_converter.py | Main streaming implementation |
| imzml_extractor.py | Per-pixel peak counts for pre-allocation |
| base_reader.py | Generator factory pattern for readers |
| test_streaming_converter.py | 17 comprehensive tests |

Test Plan

  • Unit tests for all streaming paths (CSR, CSC, COO)
  • Optical image loading tests
  • Auto-mode threshold detection
  • TIC image and shapes verification
  • Rectangular grid handling
  • Integration test with real large dataset

References

…ersion

Implements direct Zarr streaming that bypasses SpatialData.write() to handle
large MSI datasets (900k+ spectra) without memory exhaustion.

Key features:
- Single-pass streaming with bounded memory (~60 MB regardless of dataset size)
- Incremental CSR matrix construction directly to Zarr arrays
- SpatialData.read() compatible output
- Mock data generator for testing without real datasets

Closes #68
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 3
Average Complexity: 3.07
Maximum Complexity: 16

Complexity Distribution

  • 1-5 (Low): 408
  • 6-10 (Moderate): 65
  • 11-15 (High): 1
  • 16-20 (Very High): 2
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/converters/spatialdata/streaming_converter.py:610 - _stream_write_direct (16)
  2. thyra/converters/spatialdata/streaming_converter.py:893 - _stream_write_direct_zarr (16)
  3. thyra/converters/spatialdata/streaming_converter.py:186 - _stream_build_coo (15)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

…aset conversion

This commit implements a memory-efficient two-pass streaming approach for
converting large MSI datasets to SpatialData format without disk caching.

Key changes:

Streaming Converter Improvements:
- Add no-cache CSC streaming: two-pass approach (prescan + scatter) that
  eliminates ~200GB cache file I/O for large datasets
- Use memory-mapped files (numpy memmap) for CSC arrays - OS manages
  virtual memory, keeping RAM usage minimal regardless of dataset size
- Add TIC image and pixel shapes generation during CSC conversion
- Suppress expected SpatialData warning about table annotating shapes
  that are written immediately after
- Fix reader reset handling for second pass iteration
- Remove ~500 lines of dead code (cache-based CSC methods)
- Use dynamic version from package instead of hardcoded string

Reader Enhancements:
- Add iter_spectra() generator factory pattern to all readers
- Add get_peak_counts_per_pixel() method for CSR indptr construction
- Fix coordinate iteration order consistency across readers

Test Coverage:
- Increase streaming converter test coverage from 78% to 85%
- Add 7 new tests: auto mode detection, custom temp directory,
  optical image loading (grayscale and RGB), chunk writing
- Move mock_msi_generator.py to tests/fixtures/
- Fix line endings across codebase (pre-commit)

The no-cache approach processes spectra twice but eliminates massive
temporary file I/O, resulting in faster conversion for datasets where
I/O is the bottleneck.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 2
Average Complexity: 3.06
Maximum Complexity: 16

Complexity Distribution

  • 1-5 (Low): 417
  • 6-10 (Moderate): 69
  • 11-15 (High): 1
  • 16-20 (Very High): 1
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/converters/spatialdata/streaming_converter.py:281 - _stream_build_coo (16)
  2. thyra/metadata/extractors/imzml_extractor.py:110 - _get_mass_range_complete (13)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

Extract the two-pass COO building logic into smaller helper methods:
- _coo_pass1_count_nonzeros: handles counting pass
- _coo_setup_zarr_arrays: sets up Zarr structure
- _coo_pass2_write_data: handles data writing pass
- _flush_chunk_to_zarr: flushes chunk buffers

This brings the function complexity below the threshold of 10
and improves maintainability.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 1
Average Complexity: 3.05
Maximum Complexity: 13

Complexity Distribution

  • 1-5 (Low): 420
  • 6-10 (Moderate): 71
  • 11-15 (High): 1
  • 16-20 (Very High): 0
  • 21+ (Critical): 0

Top Complex Functions

  1. thyra/metadata/extractors/imzml_extractor.py:110 - _get_mass_range_complete (13)

Warning: Consider refactoring functions with complexity > 15 for better maintainability.

Extract the mass range scanning logic into smaller helper methods:
- _init_peak_counts_array: initializes per-pixel peak counts array
- _scan_all_spectra: main scanning loop
- _process_spectrum_for_range: processes single spectrum
- _store_pixel_peak_count: stores peak count for a pixel

This brings the function complexity below the threshold of 10.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

- Remove unused zero_copy parameter (always True, never used False)
- Remove _convert_with_scipy() method (duplicate of COO path in convert())
- Simplify convert() method docstring

This removes ~50 lines of untested dead code.
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

- Update module and class docstrings to reflect current implementation
- Remove outdated zero_copy references from documentation
- Add _suppress_reader_progress() helper to consolidate repeated pattern
- Simplify _process_spectrum with early return for non-resampling case
- Remove duplicate fallback code for non-NN resampling
@github-actions

Complexity Monitoring Report

Threshold: 10
Total Violations: 0

Excellent! No complexity violations found.

Tomatokeftes marked this pull request as ready for review January 23, 2026 10:22
Tomatokeftes merged commit fe81d0e into main Jan 23, 2026
7 checks passed
Tomatokeftes deleted the feature/streaming-large-datasets branch January 23, 2026 10:22

Development

Successfully merging this pull request may close these issues.

[FEATURE] Direct Zarr Streaming for Memory-Efficient MSI Conversion
