
Refactor to_imaris() for memory-safe chunked processing#170

Draft
Copilot wants to merge 20 commits into main from copilot/make-to-imaris-memory-safe

Conversation

Contributor

Copilot AI commented Dec 15, 2025

The to_imaris() method materialized entire datasets in RAM via compute(), causing out-of-memory failures on volumes larger than 100 GB. This PR refactors the export to chunk-by-chunk processing with bounded memory usage.

Changes

Core Refactoring

  • Removed global compute(): data stays a lazy Dask/Zarr reference throughout
  • Chunked HDF5 writes: datasets are created empty and populated incrementally (16 Z-slices per iteration)
  • Streaming statistics: a two-pass approach computes global min/max first, then writes data and accumulates the histogram
  • Streaming MIP thumbnails: a running maximum is maintained across chunks, keeping only a single Y×X plane in memory

Memory Impact

# Before: a 100 GB dataset required 100 GB of RAM
data = ngff_image_to_save.data.compute()  # ❌ materializes the full array

# After: a 100 GB dataset needs ~256 MB of RAM
for z_start in range(0, z, chunk_z_size):  # ✅ process 16 Z-slices at a time
    z_end = min(z_start + chunk_z_size, z)
    chunk = channel_data[z_start:z_end, :, :]
    chunk_data = chunk.compute()  # only this chunk is materialized

Algorithm

  1. Pass 1: Iterate chunks to compute global min/max for histogram range
  2. Pass 2: Iterate chunks to write HDF5 data + accumulate histogram bins
  3. Thumbnail: Iterate chunks to compute incremental MIP (np.maximum per chunk)

Memory usage: O(chunk_z × Y × X) instead of O(Z × Y × X), independent of total Z dimension.
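
A minimal sketch of this two-pass loop (illustrative names, not the actual implementation), assuming channel_data is a ZYX array that may be NumPy, Zarr, or Dask:

import numpy as np

def streaming_stats_and_histogram(channel_data, z, chunk_z_size=16, hist_bins=256):
    """Two-pass streaming reduction: global min/max first, then histogram bins."""
    # Pass 1: global min/max, holding one Z-slab in memory at a time
    global_min, global_max = np.inf, -np.inf
    for z_start in range(0, z, chunk_z_size):
        slab = np.asarray(channel_data[z_start:z_start + chunk_z_size])
        global_min = min(global_min, float(slab.min()))
        global_max = max(global_max, float(slab.max()))

    # Pass 2: accumulate histogram counts over the fixed [min, max] range
    # (the real export also writes each slab to HDF5 during this pass)
    hist = np.zeros(hist_bins, dtype=np.uint64)
    for z_start in range(0, z, chunk_z_size):
        slab = np.asarray(channel_data[z_start:z_start + chunk_z_size])
        counts, _ = np.histogram(slab, bins=hist_bins, range=(global_min, global_max))
        hist += counts.astype(np.uint64)

    return global_min, global_max, hist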

Testing

  • Added 6 tests verifying correctness: streaming statistics match full-array computation, the MIP is identical, and round-trip integrity is preserved (see the sketch after this list)
  • All 26 Imaris tests pass (20 existing + 6 new)
  • Tested edge cases: Z < chunk_size, multi-channel data, various dtypes
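
The correctness checks boil down to one invariant: streaming reductions over chunks must equal their full-array counterparts. A hypothetical test illustrating the idea (not the exact test added in this PR):

import numpy as np

def test_streaming_matches_full_array():
    rng = np.random.default_rng(0)
    volume = rng.integers(0, 4096, size=(40, 64, 64), dtype=np.uint16)
    chunk_z = 16

    run_min, run_max = np.inf, -np.inf
    mip = np.zeros(volume.shape[1:], dtype=volume.dtype)
    for z0 in range(0, volume.shape[0], chunk_z):
        slab = volume[z0:z0 + chunk_z]
        run_min = min(run_min, slab.min())
        run_max = max(run_max, slab.max())
        mip = np.maximum(mip, slab.max(axis=0))  # running MIP

    assert run_min == volume.min()
    assert run_max == volume.max()
    assert np.array_equal(mip, volume.max(axis=0))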

Documentation

  • docs/memory_safe_imaris.md: Technical details on chunking strategy
  • examples/memory_safe_imaris_export.py: Demonstration with verification
  • Updated method docstring with memory-safe implementation notes
Original prompt

Agent Task: Make to_imaris() Memory-Safe (Chunked Zarr/Dask → HDF5)

Context

We have a to_imaris() method that exports NGFF/Zarr-backed image data to Imaris (.ims, HDF5).
The current implementation is NOT memory-safe: it loads the entire image into RAM, which fails for large volumes.

This function currently:

  • Calls compute() on Dask arrays
  • Converts full images to NumPy
  • Writes HDF5 datasets in one shot
  • Computes min/max, histograms, and thumbnails from full arrays

Your task is to refactor this method so it operates chunk-by-chunk, with bounded memory usage, while preserving exact Imaris compatibility.


Critical Problems to Fix (Must Address All)

  1. Global compute()
    The function currently forces the full dataset into memory by calling compute() on Dask arrays.
    This must be removed entirely. No full-array materialization is allowed.

  2. Whole-array HDF5 writes
    The current implementation writes the full image data directly when creating the HDF5 dataset.
    Instead, the dataset must be created empty and populated incrementally in chunks.

  3. Global reductions
    Global min, max, and histogram calculations are currently done on full arrays.
    These must be rewritten as streaming reductions, updated per chunk.

  4. Thumbnail generation
    Current MIP and downsampling logic uses full-resolution arrays.
    Thumbnail generation must be rewritten to stream over chunks and only keep small intermediate arrays in memory.


Target Design (Required)

1. No full image in memory

  • Never call compute() on the full dataset
  • Never convert the full image to a NumPy array
  • Only operate on chunk-sized NumPy arrays

2. Chunk-wise HDF5 writing

  • Create HDF5 datasets with correct shape, dtype, compression, and chunk layout
  • Write data chunk-by-chunk
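
A sketch of this write pattern with h5py (the dataset name, chunk layout, and compression shown here are placeholders, not the exact Imaris settings):

import h5py
import numpy as np

def write_channel_chunked(h5_group, channel_data, shape, dtype, chunk_z=16):
    """Create an empty, chunked, compressed dataset and fill it Z-slab by Z-slab."""
    z, y, x = shape
    dset = h5_group.create_dataset(
        "Data",  # placeholder; the real path must follow the Imaris group layout
        shape=(z, y, x),
        dtype=dtype,
        chunks=(min(chunk_z, z), min(256, y), min(256, x)),
        compression="gzip",
    )
    for z_start in range(0, z, chunk_z):
        z_end = min(z_start + chunk_z, z)
        # Only this Z-slab is materialized before being written
        dset[z_start:z_end] = np.asarray(channel_data[z_start:z_end])
    return dset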

3. Streaming statistics

  • Compute HistogramMin and HistogramMax incrementally
  • Accumulate histogram bins per chunk
  • Final attribute values must match the current behavior

4. Streaming thumbnail generation

  • Compute a maximum-intensity projection incrementally along Z
  • Downsample progressively or after the MIP is complete
  • Only keep 256×256 (or similarly small) arrays in memory
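
A sketch of the streaming thumbnail, assuming a 256-pixel target and a simple strided downsample to avoid new dependencies (illustrative only):

import numpy as np

def streaming_thumbnail(channel_data, shape, chunk_z=16, thumbnail_size=256):
    """Running maximum-intensity projection along Z, then a cheap downsample."""
    z, y, x = shape
    mip = None
    for z_start in range(0, z, chunk_z):
        slab = np.asarray(channel_data[z_start:z_start + chunk_z])  # one Z-slab in memory
        plane = slab.max(axis=0)                                    # Y×X plane
        mip = plane if mip is None else np.maximum(mip, plane)

    # Downsample the Y×X MIP to roughly thumbnail_size × thumbnail_size
    step_y = max(1, y // thumbnail_size)
    step_x = max(1, x // thumbnail_size)
    return mip[::step_y, ::step_x]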

5. Preserve Imaris compatibility

  • HDF5 group and dataset structure must remain unchanged
  • Attribute names, formats, and byte-array encoding must remain exactly the same
  • Output files must open in Imaris without warnings

Inputs You Can Assume

  • ngff_image_to_save.data may be:
    • a NumPy array
    • a Zarr array
    • a Dask array backed by Zarr
  • Axis reordering (XYZ to ZYX) has already been handled upstream
  • Scale metadata is correct and must not change

Implementation Guidance (Preferred)

  • Use explicit iteration over Zarr or Dask chunks
  • Avoid loading more than one chunk per channel into memory at a time
  • Avoid introducing new heavy dependencies
  • Keep behavior identical for small datasets
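
One way to meet these constraints is a small helper that yields Z-slabs as NumPy arrays regardless of the backing container (a sketch, not part of the required API):

import numpy as np

def iter_z_slabs(array, chunk_z=16):
    """Yield (z_start, z_end, slab) without materializing the full array.

    Slicing reads only the requested region for Zarr, stays lazy for Dask,
    and is a view for NumPy; np.asarray() materializes at most one slab.
    """
    z = array.shape[0]
    for z_start in range(0, z, chunk_z):
        z_end = min(z_start + chunk_z, z)
        yield z_start, z_end, np.asarray(array[z_start:z_end])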

Deliverables

  1. A refactored implementation of to_imaris()
  2. Clear comments explaining:
    • chunk iteration strategy
    • streaming min/max and histogram logic
    • streaming thumbnail logic
  3. No changes to the public API
  4. No regression in Imaris compatibility

Definition of Done

  • Exporting a >100 GB Zarr dataset does not exceed a few hundred MB of RAM
  • Works for single-channel and multi-channel data
  • Resulting .ims files open cleanly in Imaris (user to confirm)



Copilot AI and others added 3 commits December 15, 2025 02:31
Copilot AI changed the title from "[WIP] Refactor to_imaris() for memory-safe chunked export" to "Refactor to_imaris() for memory-safe chunked processing" on Dec 15, 2025
Copilot AI requested a review from akhanf December 15, 2025 02:39
Copilot AI and others added 16 commits December 15, 2025 14:11
Update Imaris HDF5 chunking to 16×256×256 (ZYX) with dask rechunking
Implement multi-resolution pyramid generation for Imaris export
Fix memory blowup from Z-only chunking in to_imaris()