
[FEATURE] Direct Zarr Streaming for Memory-Efficient MSI Conversion #68

@Tomatokeftes

Description


Problem Statement

Mass Spectrometry Imaging (MSI) datasets present unique computational challenges due to their inherent structure. A single MSI experiment captures a full mass spectrum at every spatial position (pixel), creating a 3D data cube where:

  • Spatial dimensions: Tissue sections typically range from 100x100 to 2000x2000+ pixels
  • Spectral dimension: Each pixel contains hundreds of thousands of raw m/z values
  • Data volume: A modest 1000x1000 pixel dataset with 300k mass channels results in 300 billion data points

For example, our Xenium dataset contains ~920,000 spectra, and when converted to a sparse table format, this creates matrices with billions of potential entries. Even with sparse storage (CSR/CSC/COO), the conversion process itself can exhaust available memory because intermediate representations must be held in RAM before writing.

The current workflow using SpatialData.write() requires the complete AnnData object in memory, which becomes prohibitive for large-scale MSI studies.
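A quick back-of-envelope calculation makes the memory pressure concrete. The figures below are illustrative, not measured; the 0.1% density is an assumed example, not a property of any specific dataset:

```python
# Back-of-envelope memory estimate for the dataset sizes described above.
n_pixels = 1000 * 1000        # 1000x1000 tissue section
n_channels = 300_000          # raw m/z values per spectrum
dense_points = n_pixels * n_channels
print(dense_points)           # 300_000_000_000 data points

# A dense float32 cube is clearly infeasible:
dense_gb = dense_points * 4 / 1e9
print(f"dense float32: {dense_gb:.0f} GB")   # dense float32: 1200 GB

# Even at an assumed 0.1% density, a CSR intermediate held fully in RAM
# (float32 data + int32 indices per non-zero) is gigabytes before writing:
nnz = int(dense_points * 0.001)
sparse_gb = nnz * (4 + 4) / 1e9
print(f"sparse CSR: ~{sparse_gb:.1f} GB")    # sparse CSR: ~2.4 GB
```

This is why holding the full sparse intermediate in memory, as `SpatialData.write()` requires, does not scale.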

Proposed Solution

Implement direct Zarr streaming that bypasses SpatialData.write() entirely:

  1. Single-pass streaming: Iterate through spectra once, writing directly to Zarr arrays
  2. Bounded memory: Buffer only ~500k values at a time regardless of dataset size
  3. Incremental CSR/CSC construction: Build indptr, indices, and data arrays on-the-fly
  4. SpatialData compatible: Output follows the expected Zarr structure so SpatialData.read() works without modifications

Architecture

```
Reader.iter_spectra()
    |
    v
[Process spectrum -> resample to common mass axis]
    |
    v
[Buffer indices/data in memory (~500k values)]
    |
    v
[Flush to Zarr arrays when buffer full]
    |
    v
[Build indptr incrementally]
    |
    v
Final: SpatialData-compatible Zarr store
```
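The buffer/flush/indptr steps above can be sketched as follows. This is a minimal illustration, not the actual `streaming_converter.py`: the function name `stream_csr` and the `(channel_indices, intensities)` per-spectrum protocol are assumptions, and for brevity each flush appends to in-memory NumPy arrays where the real converter would append to resizable Zarr arrays:

```python
import numpy as np

BUFFER_SIZE = 500_000  # ~500k buffered values, independent of dataset size


def stream_csr(spectra, buffer_size=BUFFER_SIZE):
    """Single-pass CSR build: `spectra` yields (channel_indices, intensities)
    per pixel, already resampled to the common mass axis."""
    indptr = [0]
    idx_buf, dat_buf = [], []
    indices_out, data_out = [], []   # stand-ins for appendable Zarr arrays
    nnz = 0

    def flush():
        # In the real converter this would be zarr_array.append(...).
        if idx_buf:
            indices_out.append(np.asarray(idx_buf, dtype=np.int64))
            data_out.append(np.asarray(dat_buf, dtype=np.float32))
            idx_buf.clear()
            dat_buf.clear()

    for channels, values in spectra:
        idx_buf.extend(channels)
        dat_buf.extend(values)
        nnz += len(channels)
        indptr.append(nnz)               # indptr grows by one entry per pixel
        if len(idx_buf) >= buffer_size:  # bounded memory: flush once full
            flush()
    flush()

    indices = np.concatenate(indices_out) if indices_out else np.empty(0, np.int64)
    data = np.concatenate(data_out) if data_out else np.empty(0, np.float32)
    return indices, data, np.asarray(indptr, dtype=np.int64)
```

Peak memory is then governed by `buffer_size` plus the growing `indptr` (one integer per pixel), not by the number of non-zeros.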

Results

Tested with synthetic data (250k pixels, 50k mass channels, 31M non-zeros):

| Metric | Standard Approach | Streaming Approach |
| --- | --- | --- |
| Peak Memory | ~2-4 GB | 60 MB |
| Memory Scaling | Linear with data | Constant |
| SpatialData Compatible | Yes | Yes |

Related Work

This connects to ongoing SpatialData memory optimization efforts.

Implementation

  • thyra/converters/spatialdata/streaming_converter.py - Core implementation
  • mock_msi_generator.py - Synthetic data generator for testing without real datasets

Mock Data Generator

For easy collaboration and testing without large real datasets:

```bash
# Quick test (100x100 pixels, ~5s)
poetry run python mock_msi_generator.py small

# Realistic test (500x500 pixels, ~2min)
poetry run python mock_msi_generator.py medium

# Stress test (1000x1000 pixels)
poetry run python mock_msi_generator.py large
```
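The contents of `mock_msi_generator.py` are not shown in this issue; a minimal stand-in for the kind of synthetic sparse spectra it would feed into the streaming converter might look like this (function name, peak count, and intensity distribution are all illustrative assumptions):

```python
import numpy as np


def iter_mock_spectra(n_pixels, n_channels, peaks_per_pixel=100, seed=0):
    """Yield (channel_indices, intensities) per pixel: a random subset of
    mass channels with exponentially distributed peak intensities."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pixels):
        channels = np.sort(
            rng.choice(n_channels, size=peaks_per_pixel, replace=False)
        )
        intensities = rng.exponential(
            scale=1000.0, size=peaks_per_pixel
        ).astype(np.float32)
        yield channels, intensities


# "small" preset from above: 100x100 pixels
spectra = iter_mock_spectra(100 * 100, n_channels=50_000)
channels, values = next(spectra)
print(channels.shape, values.shape)   # (100,) (100,)
```

Because the generator yields one spectrum at a time, it exercises the same bounded-memory path as a real reader's `iter_spectra()`.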

Questions for Discussion

  1. Is direct Zarr construction (bypassing SpatialData.write()) a supported/recommended pattern?
  2. Are there required attributes we might be missing for full compatibility?
  3. Can images/shapes be added lazily after the table is created?
  4. Would this approach be useful to generalize for other large-scale SpatialData use cases?

Checklist

  • Single-pass streaming implementation
  • CSR matrix direct-to-Zarr writing
  • SpatialData.read() compatibility verified
  • Mock data generator for testing
  • Add images/shapes lazily after table creation
  • Integration with existing converter API
  • Documentation
