
Conversation

@thewtex (Collaborator) commented Feb 9, 2026

Summary

Fixes conversion of large pyramidal TIFF files (e.g., JPEG2000 OME-TIFF) that previously caused excessive memory usage, task graph explosion, and hour-long processing times.

Resolves #310

Problem

Converting a 3GB JPEG2000 OME-TIFF (602a12_z_stack.qupath.j2k.ome.tif) with 512×512 tiles was:

  1. Crashing with AttributeError: 'Group' object has no attribute 'ndim' — pyramidal TIFFs return a zarr.Group, not an Array
  2. Exploding the dask task graph from 100K → 1.6M+ tasks due to rechunking from 512px input tiles to 128px default output chunks
  3. Taking ~1 hour because it discarded the TIFF's existing 4-level pyramid and regenerated 10 levels from scratch via expensive Gaussian downsampling

Changes

1. Handle zarr Groups from pyramidal TIFFs (to_ngff_image.py)

  • Added _extract_array_from_group() to extract the full-resolution array from a zarr.Group, using multiscales metadata when available or falling back to the largest array (see the sketch below)
  • to_ngff_image() now transparently handles zarr.Group inputs
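A minimal sketch of the Group-handling idea (the helper name comes from the PR, but the body here is illustrative, not the actual code):

```python
import zarr


def _extract_array_from_group(group: zarr.Group) -> zarr.Array:
    """Return the full-resolution array from a pyramidal zarr Group."""
    multiscales = group.attrs.get("multiscales")
    if multiscales:
        # OME-NGFF multiscales list datasets from highest to lowest resolution.
        full_res_path = multiscales[0]["datasets"][0]["path"]
        return group[full_res_path]
    # No multiscales metadata: fall back to the largest array by element count.
    return max((arr for _, arr in group.arrays()), key=lambda a: a.size)
```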

2. Reduce dask task explosion (to_multiscales.py)

  • Input-aligned output chunks: detects tiled sources (e.g., 512px TIFF tiles) and aligns output chunks to match, avoiding 16× task multiplication from rechunking (see the sketch after this list)
  • Preserve channel chunking: keeps the channel dimension intact instead of forcing it to 1, which tripled the task count unnecessarily
  • Re-enabled the task count guard: the previously commented-out task_count() check now triggers disk caching when the graph exceeds config.task_target
  • Extracted _find_optimal_chunk_size() to module level for reuse
  • Added _cache_2d_strips() and _cache_1d_segments() for strip-based caching of 2D/1D large images (previously a TODO)
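A hedged sketch of the alignment rule, assuming a dask-backed image with named dims; the helper name and signature are illustrative, not the PR's actual code:

```python
import dask.array as da

_SPATIAL_DIMS = {"x", "y", "z"}


def _aligned_out_chunks(data: da.Array, dims: tuple, default: int = 128) -> tuple:
    """Pick output chunk sizes that never subdivide the input tiles."""
    out_chunks = []
    for dim, dim_chunks in zip(dims, data.chunks):
        largest_input = max(dim_chunks)
        if dim in _SPATIAL_DIMS:
            # max(default, input chunk): a 512px input tile stays 512px
            # instead of being rechunked into 128px pieces.
            out_chunks.append(max(default, largest_input))
        else:
            # Preserve channel chunking rather than forcing it to 1.
            out_chunks.append(largest_input)
    return tuple(out_chunks)
```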

3. Reuse existing pyramid levels (cli.py)

  • New _multiscales_from_tifffile_pyramid(): when a TIFF already contains multiple resolution levels, builds Multiscales directly from them instead of regenerating via to_multiscales()
  • Computes proper scale/translation metadata for each level using shape ratios, matching the formula from _next_scale_metadata() (sketched after this list)
  • Refactored _apply_cli_metadata_overrides() out of _ngff_image_to_multiscales() for reuse by both code paths
  • Single-level TIFFs fall back to the existing to_multiscales() path
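A sketch of the per-level metadata computation; the PR states it matches _next_scale_metadata(), while the half-spacing translation shift below is an assumption (the usual pixel-center convention):

```python
def _level_scale_translation(base_shape, level_shape, base_scale, base_translation):
    """All arguments are dicts keyed by dim name, e.g. {"y": 57128, "x": 153122}."""
    scale, translation = {}, {}
    for dim, base_size in base_shape.items():
        ratio = base_size / level_shape[dim]  # e.g. 57128 / 28564 -> 2.0
        scale[dim] = base_scale[dim] * ratio
        # Downsampled voxel centers shift by half of the added spacing
        # (pixel-center convention; an assumption here).
        translation[dim] = base_translation[dim] + (scale[dim] - base_scale[dim]) / 2
    return scale, translation
```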

Performance impact

For the 3GB test TIFF (shape: 3×57128×153122×3, 4 pyramid levels, 512×512 JPEG2000 tiles):

| Metric | Before | After |
|---|---|---|
| Task graph | 1.6M+ tasks (crash) | ~107K tasks |
| Pyramid levels | 10 (recomputed) | 4 (from TIFF) |
| Downsampling computation | Full Gaussian blur on all levels | None (reused) |
| Total pixels processed | ~131B | ~84B |

Testing

  • 16 new tests in test_large_image_chunking.py covering input-aligned chunks, channel preservation, 2D strip caching, and 1D segment caching
  • All 309 existing tests pass, all lint checks pass

Commits

Fixes excessive memory usage and slow graph construction when converting
large tiled TIFF files (e.g. 3GB JPEG2000 OME-TIFF with 512×512 tiles).

The root cause was that the default 128px output chunks (for 3D images)
are smaller than typical input tiles (512px), causing a 16x task count
explosion during dask rechunking. A 100K-task input became 1.6M+ tasks,
consuming 2GB RAM just for graph construction.
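That multiplier is just the per-axis ratio squared: each 512px tile splits into 512 / 128 = 4 chunks along y and 4 along x, so 4 × 4 = 16 output chunks per input tile.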

Changes:
- Enable the existing task_count check (was commented out) to trigger
  disk caching when task count exceeds config.task_target (50K)
- Auto-detect tiled input sources and use max(default, input_chunk) for
  spatial dimensions to avoid unnecessary rechunking
- Preserve channel dimension chunking (don't split RGB from 3 to 1)
- Extract _find_optimal_chunk_size to module-level function
- Add _cache_2d_strips for strip-based caching of large 2D images
  (replaces the TODO at the old line 289; see the sketch below)
- Add _cache_1d_segments for 1D image caching edge case
- Add comprehensive tests for all new functionality

Refs: #310, dask/dask#8570
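A hypothetical sketch of the strip-based caching idea behind _cache_2d_strips: materialize a large 2D dask array to an on-disk zarr store one horizontal strip at a time, so peak memory stays bounded by the strip. The store layout, strip size, and signature are illustrative, not the PR's implementation:

```python
import dask.array as da
import numpy as np
import zarr


def _cache_2d_strips(image: da.Array, store_path: str, strip_rows: int = 4096) -> da.Array:
    sink = zarr.open_array(
        store_path,
        mode="w",
        shape=image.shape,
        chunks=(strip_rows, image.shape[1]),
        dtype=image.dtype,
    )
    for start in range(0, image.shape[0], strip_rows):
        stop = min(start + strip_rows, image.shape[0])
        # Compute and persist just this strip of rows.
        sink[start:stop] = np.asarray(image[start:stop])
    # Hand back a lazy dask view over the on-disk cache.
    return da.from_zarr(store_path)
```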
When tifffile opens a multi-level pyramidal TIFF (e.g., OME-TIFF with
pyramid levels), it returns a zarr Group instead of an Array. This
caused an AttributeError on 'Group' object has no attribute 'ndim'
in to_ngff_image().

Extract the full-resolution array from the Group automatically, using
multiscales metadata when available or falling back to the largest
array by size.

When converting pyramidal OME-TIFFs that already contain multiple
resolution levels, reuse the pre-built pyramid directly instead of
regenerating all levels from scratch via expensive downsampling.

This avoids Gaussian blur + downsample computation entirely and
eliminates redundant reads of the full-resolution data. For a 3GB
JPEG2000 OME-TIFF with 4 pyramid levels, this reduces total work
from ~131B pixels (10 recomputed levels) to ~84B pixels (4 existing
levels, read-and-write only).

Also refactors CLI metadata override logic into a reusable
_apply_cli_metadata_overrides() function.
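For reference, tifffile already exposes the pre-built levels, so discovering them is cheap; a hedged sketch (the helper name is illustrative, not the PR's):

```python
import tifffile
import zarr


def _pyramid_level_arrays(path: str) -> list:
    """Open one lazy zarr array per existing resolution level in the TIFF."""
    tif = tifffile.TiffFile(path)  # keep open; levels are read lazily from it
    return [
        zarr.open(level.aszarr(), mode="r")  # one zarr store per pyramid level
        for level in tif.series[0].levels
    ]
```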
@thewtex changed the title from "tiff long time, lots of memory, many tasks (vibe-kanban)" to "fix(py): fix large pyramidal TIFF conversion performance and crashes" on Feb 9, 2026
Linked issue: imagecodecs.Jpeg2kError: opj_decode or opj_end_decompress failed