
Conversation

@thewtex (Collaborator) commented Feb 9, 2026

Summary

Fixes conversion of large pyramidal TIFF files (e.g., JPEG2000 OME-TIFF) that previously caused excessive memory usage, task graph explosion, and hour-long processing times.

Resolves #310

Problem

Converting a 3GB JPEG2000 OME-TIFF (602a12_z_stack.qupath.j2k.ome.tif) with 512×512 tiles was:

  1. Crashing with AttributeError: 'Group' object has no attribute 'ndim' — pyramidal TIFFs return a zarr.Group, not an Array
  2. Exploding the dask task graph from 100K → 1.6M+ tasks due to rechunking from 512px input tiles to 128px default output chunks
  3. Taking ~1 hour because it discarded the TIFF's existing 4-level pyramid and regenerated 10 levels from scratch via expensive Gaussian downsampling

Changes

1. Handle zarr Groups from pyramidal TIFFs (to_ngff_image.py)

  • Added _extract_array_from_group() to extract the full-resolution array from a zarr.Group, using multiscales metadata when available or falling back to the largest array (see the sketch below)
  • to_ngff_image() now transparently handles zarr.Group inputs
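A minimal sketch of the Group-handling idea (the helper name comes from the PR, but the body here is illustrative, not the actual code):

```python
import zarr


def _extract_array_from_group(group: zarr.Group) -> zarr.Array:
    """Return the full-resolution array from a pyramidal zarr Group."""
    multiscales = group.attrs.get("multiscales")
    if multiscales:
        # OME-NGFF multiscales list datasets from highest to lowest resolution.
        full_res_path = multiscales[0]["datasets"][0]["path"]
        return group[full_res_path]
    # No multiscales metadata: fall back to the largest array by element count.
    return max((arr for _, arr in group.arrays()), key=lambda a: a.size)
```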

2. Reduce dask task explosion (to_multiscales.py)

  • Input-aligned output chunks: detects tiled sources (e.g., 512px TIFF tiles) and aligns output chunks to match, avoiding 16× task multiplication from rechunking (see the sketch after this list)
  • Preserve channel chunking: keeps the channel dimension intact instead of forcing it to 1, which tripled the task count unnecessarily
  • Re-enabled the task count guard: the previously commented-out task_count() check now triggers disk caching when the graph exceeds config.task_target
  • Extracted _find_optimal_chunk_size() to module level for reuse
  • Added _cache_2d_strips() and _cache_1d_segments() for strip-based caching of 2D/1D large images (previously a TODO)
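A hedged sketch of the alignment rule, assuming a dask-backed image with named dims; the helper name and signature are illustrative, not the PR's actual code:

```python
import dask.array as da

_SPATIAL_DIMS = {"x", "y", "z"}


def _aligned_out_chunks(data: da.Array, dims: tuple, default: int = 128) -> tuple:
    """Pick output chunk sizes that never subdivide the input tiles."""
    out_chunks = []
    for dim, dim_chunks in zip(dims, data.chunks):
        largest_input = max(dim_chunks)
        if dim in _SPATIAL_DIMS:
            # max(default, input chunk): a 512px input tile stays 512px
            # instead of being rechunked into 128px pieces.
            out_chunks.append(max(default, largest_input))
        else:
            # Preserve channel chunking rather than forcing it to 1.
            out_chunks.append(largest_input)
    return tuple(out_chunks)
```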

3. Reuse existing pyramid levels (cli.py)

  • New _multiscales_from_tifffile_pyramid(): when a TIFF already contains multiple resolution levels, builds Multiscales directly from them instead of regenerating via to_multiscales()
  • Computes proper scale/translation metadata for each level using shape ratios, matching the formula from _next_scale_metadata() (sketched after this list)
  • Refactored _apply_cli_metadata_overrides() out of _ngff_image_to_multiscales() for reuse by both code paths
  • Single-level TIFFs fall back to the existing to_multiscales() path
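A sketch of the per-level metadata computation; the PR states it matches _next_scale_metadata(), while the half-spacing translation shift below is an assumption (the usual pixel-center convention):

```python
def _level_scale_translation(base_shape, level_shape, base_scale, base_translation):
    """All arguments are dicts keyed by dim name, e.g. {"y": 57128, "x": 153122}."""
    scale, translation = {}, {}
    for dim, base_size in base_shape.items():
        ratio = base_size / level_shape[dim]  # e.g. 57128 / 28564 -> 2.0
        scale[dim] = base_scale[dim] * ratio
        # Downsampled voxel centers shift by half of the added spacing
        # (pixel-center convention; an assumption here).
        translation[dim] = base_translation[dim] + (scale[dim] - base_scale[dim]) / 2
    return scale, translation
```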

Performance impact

For the 3GB test TIFF (shape: 3×57128×153122×3, 4 pyramid levels, 512×512 JPEG2000 tiles):

| Metric | Before | After |
|---|---|---|
| Task graph | 1.6M+ tasks (crash) | ~107K tasks |
| Pyramid levels | 10 (recomputed) | 4 (from TIFF) |
| Downsampling computation | Full Gaussian blur on all levels | None (reused) |
| Total pixels processed | ~131B | ~84B |

Testing

  • 16 new tests in test_large_image_chunking.py covering input-aligned chunks, channel preservation, 2D strip caching, and 1D segment caching
  • All 309 existing tests pass, all lint checks pass

Commits

Fixes excessive memory usage and slow graph construction when converting
large tiled TIFF files (e.g. 3GB JPEG2000 OME-TIFF with 512×512 tiles).

The root cause was that the default 128px output chunks (for 3D images)
are smaller than typical input tiles (512px), causing a 16x task count
explosion during dask rechunking. A 100K-task input became 1.6M+ tasks,
consuming 2GB RAM just for graph construction.
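That multiplier is just the per-axis ratio squared: each 512px tile splits into 512 / 128 = 4 chunks along y and 4 along x, so 4 × 4 = 16 output chunks per input tile.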

Changes:
- Enable the existing task_count check (was commented out) to trigger
  disk caching when task count exceeds config.task_target (50K)
- Auto-detect tiled input sources and use max(default, input_chunk) for
  spatial dimensions to avoid unnecessary rechunking
- Preserve channel dimension chunking (don't split RGB from 3 to 1)
- Extract _find_optimal_chunk_size to module-level function
- Add _cache_2d_strips for strip-based caching of large 2D images
  (replaces the TODO at the old line 289; see the sketch below)
- Add _cache_1d_segments for 1D image caching edge case
- Add comprehensive tests for all new functionality

Refs: #310, dask/dask#8570
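A hypothetical sketch of the strip-based caching idea behind _cache_2d_strips: materialize a large 2D dask array to an on-disk zarr store one horizontal strip at a time, so peak memory stays bounded by the strip. The store layout, strip size, and signature are illustrative, not the PR's implementation:

```python
import dask.array as da
import numpy as np
import zarr


def _cache_2d_strips(image: da.Array, store_path: str, strip_rows: int = 4096) -> da.Array:
    sink = zarr.open_array(
        store_path,
        mode="w",
        shape=image.shape,
        chunks=(strip_rows, image.shape[1]),
        dtype=image.dtype,
    )
    for start in range(0, image.shape[0], strip_rows):
        stop = min(start + strip_rows, image.shape[0])
        # Compute and persist just this strip of rows.
        sink[start:stop] = np.asarray(image[start:stop])
    # Hand back a lazy dask view over the on-disk cache.
    return da.from_zarr(store_path)
```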
When tifffile opens a multi-level pyramidal TIFF (e.g., OME-TIFF with
pyramid levels), it returns a zarr Group instead of an Array. This
caused an AttributeError on 'Group' object has no attribute 'ndim'
in to_ngff_image().

Extract the full-resolution array from the Group automatically, using
multiscales metadata when available or falling back to the largest
array by size.

When converting pyramidal OME-TIFFs that already contain multiple
resolution levels, reuse the pre-built pyramid directly instead of
regenerating all levels from scratch via expensive downsampling.

This avoids Gaussian blur + downsample computation entirely and
eliminates redundant reads of the full-resolution data. For a 3GB
JPEG2000 OME-TIFF with 4 pyramid levels, this reduces total work
from ~131B pixels (10 recomputed levels) to ~84B pixels (4 existing
levels, read-and-write only).

Also refactors CLI metadata override logic into a reusable
_apply_cli_metadata_overrides() function.
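For reference, tifffile already exposes the pre-built levels, so discovering them is cheap; a hedged sketch (the helper name is illustrative, not the PR's):

```python
import tifffile
import zarr


def _pyramid_level_arrays(path: str) -> list:
    """Open one lazy zarr array per existing resolution level in the TIFF."""
    tif = tifffile.TiffFile(path)  # keep open; levels are read lazily from it
    return [
        zarr.open(level.aszarr(), mode="r")  # one zarr store per pyramid level
        for level in tif.series[0].levels
    ]
```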
@thewtex changed the title from "tiff long time, lots of memory, many tasks (vibe-kanban)" to "fix(py): fix large pyramidal TIFF conversion performance and crashes" on Feb 9, 2026
Linked issue: imagecodecs.Jpeg2kError: opj_decode or opj_end_decompress failed