
[FEATURE] Direct Zarr Streaming for Memory-Efficient MSI Conversion #68

@Tomatokeftes

Description


Problem Statement

Mass Spectrometry Imaging (MSI) datasets present unique computational challenges due to their inherent structure. A single MSI experiment captures a full mass spectrum at every spatial position (pixel), creating a 3D data cube where:

  • Spatial dimensions: Tissue sections typically range from 100x100 to 2000x2000+ pixels
  • Spectral dimension: Each pixel contains hundreds of thousands of raw m/z values
  • Data volume: A modest 1000x1000 pixel dataset with 300k mass channels results in 300 billion data points

For example, our Xenium dataset contains ~920,000 spectra, and when converted to a sparse table format, this creates matrices with billions of potential entries. Even with sparse storage (CSR/CSC/COO), the conversion process itself can exhaust available memory because intermediate representations must be held in RAM before writing.

The current workflow using SpatialData.write() requires the complete AnnData object in memory, which becomes prohibitive for large-scale MSI studies.
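A quick back-of-envelope calculation makes the memory pressure concrete. The figures below are illustrative, not measured; the 0.1% density is an assumed example, not a property of any specific dataset:

```python
# Back-of-envelope memory estimate for the dataset sizes described above.
n_pixels = 1000 * 1000        # 1000x1000 tissue section
n_channels = 300_000          # raw m/z values per spectrum
dense_points = n_pixels * n_channels
print(dense_points)           # 300_000_000_000 data points

# A dense float32 cube is clearly infeasible:
dense_gb = dense_points * 4 / 1e9
print(f"dense float32: {dense_gb:.0f} GB")   # dense float32: 1200 GB

# Even at an assumed 0.1% density, a CSR intermediate held fully in RAM
# (float32 data + int32 indices per non-zero) is gigabytes before writing:
nnz = int(dense_points * 0.001)
sparse_gb = nnz * (4 + 4) / 1e9
print(f"sparse CSR: ~{sparse_gb:.1f} GB")    # sparse CSR: ~2.4 GB
```

This is why holding the full sparse intermediate in memory, as `SpatialData.write()` requires, does not scale.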

Proposed Solution

Implement direct Zarr streaming that bypasses SpatialData.write() entirely:

  1. Single-pass streaming: Iterate through spectra once, writing directly to Zarr arrays
  2. Bounded memory: Buffer only ~500k values at a time regardless of dataset size
  3. Incremental CSR/CSC construction: Build indptr, indices, and data arrays on-the-fly
  4. SpatialData compatible: Output follows the expected Zarr structure so SpatialData.read() works without modifications

Architecture

```
Reader.iter_spectra()
    |
    v
[Process spectrum -> resample to common mass axis]
    |
    v
[Buffer indices/data in memory (~500k values)]
    |
    v
[Flush to Zarr arrays when buffer full]
    |
    v
[Build indptr incrementally]
    |
    v
Final: SpatialData-compatible Zarr store
```
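The buffer/flush/indptr steps above can be sketched as follows. This is a minimal illustration, not the actual `streaming_converter.py`: the function name `stream_csr` and the `(channel_indices, intensities)` per-spectrum protocol are assumptions, and for brevity each flush appends to in-memory NumPy arrays where the real converter would append to resizable Zarr arrays:

```python
import numpy as np

BUFFER_SIZE = 500_000  # ~500k buffered values, independent of dataset size


def stream_csr(spectra, buffer_size=BUFFER_SIZE):
    """Single-pass CSR build: `spectra` yields (channel_indices, intensities)
    per pixel, already resampled to the common mass axis."""
    indptr = [0]
    idx_buf, dat_buf = [], []
    indices_out, data_out = [], []   # stand-ins for appendable Zarr arrays
    nnz = 0

    def flush():
        # In the real converter this would be zarr_array.append(...).
        if idx_buf:
            indices_out.append(np.asarray(idx_buf, dtype=np.int64))
            data_out.append(np.asarray(dat_buf, dtype=np.float32))
            idx_buf.clear()
            dat_buf.clear()

    for channels, values in spectra:
        idx_buf.extend(channels)
        dat_buf.extend(values)
        nnz += len(channels)
        indptr.append(nnz)               # indptr grows by one entry per pixel
        if len(idx_buf) >= buffer_size:  # bounded memory: flush once full
            flush()
    flush()

    indices = np.concatenate(indices_out) if indices_out else np.empty(0, np.int64)
    data = np.concatenate(data_out) if data_out else np.empty(0, np.float32)
    return indices, data, np.asarray(indptr, dtype=np.int64)
```

Peak memory is then governed by `buffer_size` plus the growing `indptr` (one integer per pixel), not by the number of non-zeros.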

Results

Tested with synthetic data (250k pixels, 50k mass channels, 31M non-zeros):

| Metric | Standard Approach | Streaming Approach |
| --- | --- | --- |
| Peak Memory | ~2-4 GB | 60 MB |
| Memory Scaling | Linear with data | Constant |
| SpatialData Compatible | Yes | Yes |

Related Work

This connects to ongoing SpatialData memory optimization efforts.

Implementation

  • thyra/converters/spatialdata/streaming_converter.py - Core implementation
  • mock_msi_generator.py - Synthetic data generator for testing without real datasets

Mock Data Generator

For easy collaboration and testing without large real datasets:

```bash
# Quick test (100x100 pixels, ~5s)
poetry run python mock_msi_generator.py small

# Realistic test (500x500 pixels, ~2min)
poetry run python mock_msi_generator.py medium

# Stress test (1000x1000 pixels)
poetry run python mock_msi_generator.py large
```
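The contents of `mock_msi_generator.py` are not shown in this issue; a minimal stand-in for the kind of synthetic sparse spectra it would feed into the streaming converter might look like this (function name, peak count, and intensity distribution are all illustrative assumptions):

```python
import numpy as np


def iter_mock_spectra(n_pixels, n_channels, peaks_per_pixel=100, seed=0):
    """Yield (channel_indices, intensities) per pixel: a random subset of
    mass channels with exponentially distributed peak intensities."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pixels):
        channels = np.sort(
            rng.choice(n_channels, size=peaks_per_pixel, replace=False)
        )
        intensities = rng.exponential(
            scale=1000.0, size=peaks_per_pixel
        ).astype(np.float32)
        yield channels, intensities


# "small" preset from above: 100x100 pixels
spectra = iter_mock_spectra(100 * 100, n_channels=50_000)
channels, values = next(spectra)
print(channels.shape, values.shape)   # (100,) (100,)
```

Because the generator yields one spectrum at a time, it exercises the same bounded-memory path as a real reader's `iter_spectra()`.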

Questions for Discussion

  1. Is direct Zarr construction (bypassing SpatialData.write()) a supported/recommended pattern?
  2. Are there required attributes we might be missing for full compatibility?
  3. Can images/shapes be added lazily after the table is created?
  4. Would this approach be useful to generalize for other large-scale SpatialData use cases?

Checklist

  • Single-pass streaming implementation
  • CSR matrix direct-to-Zarr writing
  • SpatialData.read() compatibility verified
  • Mock data generator for testing
  • Add images/shapes lazily after table creation
  • Integration with existing converter API
  • Documentation
