Development Process

This document records how this repository was created, the tools and methods used, what was learned about TIAToolbox in the context of IDC, and known limitations.

How This Repository Was Created

Date

Initial creation: February 16, 2026

Tools Used

Claude Code (Anthropic's CLI agent, model: Claude Opus 4.6) — used to research both TIAToolbox and IDC, design the project structure, and generate all notebooks and supporting files
IDC imaging-data-commons skill — a Claude Code skill (from idc-claude-skill) that provides structured knowledge about IDC's data model, idc-index API, index tables (sm_index, ann_index, ann_group_index, seg_index, etc.), SQL query patterns, download workflows, and digital pathology guide
idc-index 0.11.9 (IDC data version v23) — installed locally; used for verifying API patterns and table schemas

Process

Research phase: Claude Code explored TIAToolbox capabilities via web search (GitHub, PyPI, readthedocs, published papers) and IDC slide microscopy data via the imaging-data-commons skill. Three parallel exploration agents researched:
- TIAToolbox analysis tools inventory (all modules, classes, pretrained models)
- TIAToolbox data format support (WSI readers, DICOM support, dependencies)
- IDC slide microscopy collections, annotation tables, and digital pathology workflows
Design phase: A planning agent synthesized the research into a detailed notebook-by-notebook plan, including:
- Which TIAToolbox tool to pair with which IDC collection (matching model training domains)
- Data size budgets per notebook
- Colab compatibility constraints
- Code patterns validated against existing IDC-Tutorials conventions
Implementation phase: Notebooks were written sequentially (01-07), each building on patterns established in earlier ones. The existing IDC-Tutorials pathomics notebooks were used as reference for IDC API conventions (download patterns, dirTemplate, DICOM ANN parsing with highdicom).

Prompting

The initial prompt was:

"we are starting a new project that will explore the various analysis tools available in TIAToolbox, evaluate their applicability for processing slide microscopy data available in Imaging Data Commons, and will create working examples helping users get started with applying that open source tool for IDC data analysis. Use /imaging-data-commons skill for everything related to IDC."

Follow-up decisions made interactively:

GitHub hosting: github.com/fedorov/idc-tiatoolbox
Scope: All 7 notebooks (not incremental)
Environment: Colab-first design
License: Apache 2.0

Understanding of TIAToolbox in the IDC Context

Key Integration Point: DICOMWSIReader

The critical connector between TIAToolbox and IDC is DICOMWSIReader:

TIAToolbox's WSIReader.open() auto-detects DICOM WSI directories (checks for .dcm files) and returns a DICOMWSIReader
Under the hood, DICOMWSIReader uses the wsidicom library
IDC's DICOM WSI files (downloaded as <SeriesInstanceUID>/<crdc_instance_uuid>.dcm) are directly compatible
The download pattern dirTemplate='%SeriesInstanceUID' creates the per-series folders that WSIReader.open() expects

Important findings from testing:

DICOMWSIReader does not populate info.objective_power from DICOM metadata (it remains None). This must be set manually from IDC's sm_index metadata after opening the reader.
info.mpp may also be None depending on the DICOM file; can be computed from sm_index.min_PixelSpacing_2sf (mm → µm).
read_rect with coord_space="baseline" (the default) may cause WsiDicomOutOfBoundsError with the DICOMWSIReader. Using coord_space="resolution" is more reliable.
Tools like SlidingWindowPatchExtractor should be passed the fixed reader object (not a path) so they inherit the corrected metadata.

TIAToolbox Analysis Tools Applicable to IDC Data

Tool	Class	Pretrained Models	Best IDC Collections	Notes
WSI Reading	`WSIReader` / `DICOMWSIReader`	N/A	All SM collections	Foundation for everything else
Tissue Masking	`OtsuTissueMasker`	N/A	All	Essential preprocessing; Otsu works well on H&E
Patch Extraction	`SlidingWindowPatchExtractor`	N/A	All	Pairs with masking; configurable resolution/stride
Stain Normalization	`MacenkoNormalizer`, `ReinhardNormalizer`, `VahadaneNormalizer`	N/A	Multi-collection studies	Important when combining data from different scanners/sites
Patch Classification	`PatchPredictor`	`resnet18-kather100k` (9-class CRC), various PCam models	`tcga_coad`, `tcga_read` (for Kather100K); any for PCam	Model-data domain match matters
Semantic Segmentation	`SemanticSegmentor`	`fcn_resnet50_unet-bcss` (5-class breast)	`tcga_brca`, `cptac_brca`	BCSS model trained on breast cancer; using on other tissue types will produce poor results
Nucleus Segmentation	`NucleusInstanceSegmentor`	`hovernet_fast-pannuke` (6 nuclei types), `hovernet_original-kumar`	All H&E collections	PanNuke is multi-organ; works broadly
Feature Extraction	`DeepFeatureExtractor`	ImageNet ResNets	All	For downstream graph/clustering analysis
Foundation Models	via `DeepFeatureExtractor` + `timm`	UNI, Prov-GigaPath, H-optimus-0	All	Requires model access (gated HuggingFace models)
Graph Analysis	`SlideGraphConstructor`	SlideGraph+	All	Requires feature extraction first

IDC Slide Microscopy Data Profile

Based on queries via sm_index:

Multiple TCGA collections have hundreds of H&E slides (BRCA, LUAD, LUSC, COAD, etc.)
CPTAC collections also have pathology slides with different scanners than TCGA
ObjectiveLensPower varies (20x and 40x are common); 20x slides are ~2-4x smaller in file size
Pan-Cancer-Nuclei-Seg-DICOM provides nucleus centroid annotations across many TCGA collections (DICOM ANN format)
Annotation parsing requires highdicom; coordinates may be in mm (SCOORD3D) requiring pixel spacing conversion

Collection-to-Model Domain Matching

This is a key design consideration: pretrained models perform best on tissue types matching their training data.

Model	Training Data	Best IDC Collections
`resnet18-kather100k`	NCT-CRC-HE-100K (colorectal)	`tcga_coad`, `tcga_read`
`fcn_resnet50_unet-bcss`	BCSS (breast cancer)	`tcga_brca`, `cptac_brca`
`hovernet_fast-pannuke`	PanNuke (19 organs)	Broad applicability
PCam models	Patch Camelyon (lymph node)	`camelyon16`, `camelyon17` (if available)

Issues Found and Fixed During Colab Testing

Notebook 01 was partially tested on Google Colab on February 16, 2026. Several issues were discovered and fixed:

1. Colab Dependency Conflicts

Problem: Installing tiatoolbox on Colab upgrades numpy, but the old numpy version is already loaded in memory, causing ValueError: numpy.dtype size changed, may indicate binary incompatibility.

Fix: Changed install cells in all notebooks to use %pip install (instead of !pip install) followed by automatic runtime restart:

%pip install tiatoolbox idc-index openslide-bin "numcodecs<0.16"
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

2. Missing OpenSlide Shared Library

Problem: Colab does not have OpenSlide pre-installed. Importing tiatoolbox fails with OSError: libopenslide.so.1: cannot open shared object file.

Fix: Added openslide-bin to the pip install line in all notebooks. This Python package bundles the OpenSlide shared library.

3. zarr / numcodecs Incompatibility

Problem: tiatoolbox 1.6.0 requires zarr<=2.18.3, but pip resolves numcodecs>=0.16 which renamed cbuffer_sizes to _cbuffer_sizes, breaking zarr's import. This was reproduced locally.

Fix: Pinned "numcodecs<0.16" in all notebook install cells. Verified locally that numcodecs 0.15.1 + zarr 2.18.3 + tiatoolbox 1.6.0 work together.

4. DICOMWSIReader Does Not Populate `objective_power`

Problem: WSIReader.open() on DICOM WSI directories returns a DICOMWSIReader where info.objective_power is None. This breaks any code that uses units="power" for reading regions, generating thumbnails, etc.

Fix: After opening the reader, populate objective_power and mpp from IDC's sm_index metadata (which we already query):

info = reader.info
if info.objective_power is None:
    info.objective_power = float(selected['ObjectiveLensPower'])
if info.mpp is None:
    pixel_spacing_um = float(selected['pixel_spacing_mm']) * 1000
    info.mpp = np.array([pixel_spacing_um, pixel_spacing_um])

Applied to all 7 notebooks.

5. `read_bounds`/`read_rect` Coordinate Bug with DICOMWSIReader

Problem: Both read_bounds() and read_rect() pass baseline coordinates unscaled to wsidicom when the requested resolution maps to a non-baseline pyramid level. This causes WsiDicomOutOfBoundsError whenever baseline coordinates exceed the target level's dimensions.

Root cause (confirmed by local testing): The test slide (C3L-03262, 9959x9023 at 20x) has two pyramid levels: level 0 (9959x9023, baseline) and level 1 (2489x2255, ~5x). When requesting at 5x, TIAToolbox reads from level 1 but passes baseline coordinates (e.g., position 3843,3951) which exceed level 1's size (2489x2255). The same error occurs with read_bounds, read_rect (both coord_space="baseline" and coord_space="resolution"), and at any resolution that maps to level 1 (including mpp ≥ 2.0). Resolutions that read from level 0 and post-process (e.g., 10x, 7.5x, 5.1x) work fine because baseline coordinates are valid for level 0.

Fix: Read at native resolution and resize with PIL:

from PIL import Image

# Read at native (always reads from level 0 where baseline coords are valid)
region_native = reader.read_bounds(
    bounds=bounds, resolution=info.objective_power, units="power"
)

# Resize to target magnification
scale = target_power / info.objective_power
region = np.array(Image.fromarray(region_native).resize(
    (int(region_native.shape[1] * scale), int(region_native.shape[0] * scale)),
    Image.LANCZOS
))

For fixed-size patch extraction, read_rect() at the native resolution with coord_space="resolution" works reliably (since at native resolution, coordinates equal baseline coordinates):

patch = reader.read_rect(location=(loc_x, loc_y), size=(256, 256),
                         resolution=native_power, units="power",
                         coord_space="resolution")

All TIAToolbox read APIs confirmed affected (tested locally with dev/test_dicom_reader_bug.py):

read_bounds, read_rect, PointsPatchExtractor, SlidingWindowPatchExtractor, TilePyramidGenerator
All fail at resolutions mapping to non-baseline pyramid levels; all work at native resolution

Applied to all 7 notebooks (not just Notebook 01):

NB 01: Multi-resolution demo uses native read + PIL resize
NB 02: SlidingWindowPatchExtractor uses native resolution
NB 03: find_tissue_patch() uses read_rect at native with coord_space="resolution"
NB 04: Patch visualization uses read_bounds at native
NB 05-07: Tile extraction uses read_bounds at native + PIL resize to target size

Remaining Known Limitations

Notebooks Not Fully Executed

Notebook 01 has been partially tested through the data discovery, download, and region reading cells. Notebooks 02-07 have not been executed. They still need end-to-end testing on Google Colab to verify:

API surface accuracy: TIAToolbox API was researched via web search, not by inspecting installed source code. Specific concerns:
- PatchPredictor.predict() parameter names and return format may differ between versions
- SemanticSegmentor output file naming (.raw.0.npy) may vary
- NucleusInstanceSegmentor output format (.dat via joblib) may have changed
- The predictor.labels attribute for class names may not exist on all model wrappers
Coordinate systems: The tile extraction and annotation coordinate conversion code involves multiple coordinate transforms (baseline ↔ thumbnail, baseline ↔ mpp-scaled, mm ↔ pixel). Further issues may be discovered during testing.
Data availability: The specific IDC collections and slides selected by the SQL queries depend on the current IDC data version (v23). Collections, slide counts, and metadata may change in future IDC releases.

TIAToolbox Version Sensitivity

Tested with tiatoolbox 1.6.0, zarr 2.18.3, numcodecs 0.15.1, openslide-bin
TIAToolbox is under active development; API changes between versions are possible
The pretrained_model string identifiers (e.g., "hovernet_fast-pannuke", "resnet18-kather100k") need to match exactly what the installed version supports
Foundation model support (UNI, Prov-GigaPath) was not included in the notebooks because these models require gated HuggingFace access

Scope Limitations

No custom model training: All notebooks use pretrained models only. No fine-tuning or training workflows are demonstrated.
No full WSI processing: Notebooks 04-07 process tiles/regions, not entire slides, to keep Colab runtime manageable. Users wanting full-slide analysis need to adapt the code.
No DICOM output: TIAToolbox results are saved in native formats (numpy, joblib). Converting results back to DICOM (e.g., using highdicom to create DICOM SEG or ANN objects) is not demonstrated.
No multiplexed/immunofluorescence: Only H&E-stained slides are covered. IDC also has multiplexed immunofluorescence data that TIAToolbox can potentially process.
No clinical data integration: IDC's clinical_index is not used. Correlating analysis results with clinical outcomes would be a natural extension.
Single-slide workflows: Each notebook processes one slide. Batch processing across many slides (cohort analysis) is not demonstrated.

IDC-Specific Limitations

Download sizes: Even "small" slides are ~200-600 MB. On slow connections, downloads may timeout.
Colab disk space: The total across all notebooks is ~2-4 GB, well within Colab's ~100 GB, but users running multiple notebooks in one session should watch disk usage.
IDC data versioning: Queries use the current IDC index. When idc-index is upgraded, query results may change (different slides, different sizes).

Comparison Notebook (07) Limitations

The Pan-Cancer-Nuclei-Seg-DICOM annotations contain centroids (POINT graphic type), while HoVer-Net produces full contours. The comparison is therefore centroid-to-centroid, not contour-to-contour.
No formal detection matching (e.g., Hungarian matching with distance threshold) is implemented; only spatial density correlation is computed.
The comparison is on a single tile from a single slide; it's illustrative, not a rigorous benchmark.

Recommendations for Next Steps

Run all notebooks on Colab and fix any issues (this is the highest priority)
Verify TIAToolbox DICOM compatibility by testing WSIReader.open() on several IDC collections
Add error handling for common failure modes (download failures, missing wsidicom, GPU not available)
Consider adding a notebook for feature extraction with foundation models (requires addressing HuggingFace model access)
Consider DICOM output — using highdicom to convert TIAToolbox results back to DICOM SEG/ANN for re-upload or SLIM visualization
Consider batch processing — a notebook showing analysis across an entire collection (e.g., all TCGA-BRCA slides)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Development Process

How This Repository Was Created

Date

Tools Used

Process

Prompting

Understanding of TIAToolbox in the IDC Context

Key Integration Point: DICOMWSIReader

TIAToolbox Analysis Tools Applicable to IDC Data

IDC Slide Microscopy Data Profile

Collection-to-Model Domain Matching

Issues Found and Fixed During Colab Testing

1. Colab Dependency Conflicts

2. Missing OpenSlide Shared Library

3. zarr / numcodecs Incompatibility

4. DICOMWSIReader Does Not Populate `objective_power`

5. `read_bounds`/`read_rect` Coordinate Bug with DICOMWSIReader

Remaining Known Limitations

Notebooks Not Fully Executed

TIAToolbox Version Sensitivity

Scope Limitations

IDC-Specific Limitations

Comparison Notebook (07) Limitations

Recommendations for Next Steps

FilesExpand file tree

PROCESS.md

Latest commit

History

PROCESS.md

File metadata and controls

Development Process

How This Repository Was Created

Date

Tools Used

Process

Prompting

Understanding of TIAToolbox in the IDC Context

Key Integration Point: DICOMWSIReader

TIAToolbox Analysis Tools Applicable to IDC Data

IDC Slide Microscopy Data Profile

Collection-to-Model Domain Matching

Issues Found and Fixed During Colab Testing

1. Colab Dependency Conflicts

2. Missing OpenSlide Shared Library

3. zarr / numcodecs Incompatibility

4. DICOMWSIReader Does Not Populate objective_power

5. read_bounds/read_rect Coordinate Bug with DICOMWSIReader

Remaining Known Limitations

Notebooks Not Fully Executed

TIAToolbox Version Sensitivity

Scope Limitations

IDC-Specific Limitations

Comparison Notebook (07) Limitations

Recommendations for Next Steps

4. DICOMWSIReader Does Not Populate `objective_power`

5. `read_bounds`/`read_rect` Coordinate Bug with DICOMWSIReader