
Forward-merge release/26.02 into main #1010

Merged
jameslamb merged 4 commits into main from release/26.02 on Feb 3, 2026

Conversation

@rapids-bot rapids-bot bot commented Jan 26, 2026

Forward-merge triggered by a push to release/26.02 that creates a PR to keep main up-to-date. If this PR cannot be merged immediately due to conflicts, it will remain open for the team to merge manually. See the forward-merger docs for more info.

## Description
We should be building packages when commits are merged into the `release/` branches; otherwise, projects can get stuck waiting for nightlies. Additionally, some packages like `rapids-dask-dependency` don't get built in the nightly runs.

xref: rapidsai/build-planning#224
@rapids-bot rapids-bot bot requested a review from a team as a code owner January 26, 2026 19:04
@rapids-bot rapids-bot bot requested a review from gforsyth January 26, 2026 19:04
rapids-bot bot commented Jan 26, 2026

FAILURE - Unable to forward-merge due to an error; a manual merge is necessary. Do not use the Resolve conflicts option in this PR; follow these instructions: https://docs.rapids.ai/maintainers/forward-merger/

IMPORTANT: When merging this PR, do not use the auto-merger (i.e., the `/merge` comment). Instead, an admin must manually merge by changing the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches will become incompatible.

grlee77 and others added 2 commits January 27, 2026 15:21
While working on a couple of new things I came across a few issues in the existing benchmark code. This PR:

- Fixes a bug that prevented benchmarks from being run on the GPU only via the `--no_cpu` command-line argument.
- Fixes a bug with replicated device names in the generated benchmark tables.
- Adds a new `CUCIM_BENCHMARK_MAX_DURATION` environment variable for setting the benchmark case duration without modifying the bash scripts.
- Stores any kwargs that were passed to the function in the benchmark table.
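
A minimal sketch of how a benchmark script might read the new environment variable; the helper name and default value below are illustrative, not the actual benchmark code:

```python
import os

# Hypothetical helper: read CUCIM_BENCHMARK_MAX_DURATION (in seconds) with a
# fallback default, so the duration can be changed without editing the bash scripts.
def get_max_duration(default: float = 10.0) -> float:
    value = os.environ.get("CUCIM_BENCHMARK_MAX_DURATION")
    return float(value) if value else default

print(f"Each benchmark case will run for at most {get_max_duration()} s")
```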

Authors:
  - Gregory Lee (https://github.com/grlee77)
  - https://github.com/jakirkham

Approvers:
  - Gigon Bae (https://github.com/gigony)

URL: #1002
- Replace `strlen()` with `strnlen()` in `cuimage.cpp` to prevent a potential buffer overread if strings are unexpectedly not null-terminated.
- Add maximum length constraints for `spacing_units` (256 bytes) and `coord_sys` (16 bytes) based on expected string sizes.
- Address SonarQube security analysis findings for safe C string handling.

Authors:
  - Gigon Bae (https://github.com/gigony)

Approvers:
  - Gregory Lee (https://github.com/grlee77)

URL: #1015
@rapids-bot rapids-bot bot requested a review from a team as a code owner January 28, 2026 16:47
This PR implements batch ROI decoding for cuslide2 using nvImageCodec v0.7.0+'s native batch decoding API.

### Background

Decoding multiple ROIs through nvImageCodec's native batch API improves performance by:
- amortizing GPU kernel launch overhead across multiple regions
- enabling parallel decoding of multiple ROIs
- reducing memory allocation overhead through batching
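
As a rough illustration of where the batching helps, compare issuing one `read_region()` call per ROI with a single call over a list of locations. This is only a sketch using cuCIM's Python API with a placeholder file name; actual timings depend on the image, codec, and GPU:

```python
import time
from cucim import CuImage

img = CuImage("slide.tiff")  # placeholder path
locations = [(x, x) for x in range(0, 4096, 256)]
size = (256, 256)

# N independent calls: every ROI pays its own decode setup and launch overhead.
t0 = time.perf_counter()
for loc in locations:
    img.read_region(loc, size, level=0)
t_loop = time.perf_counter() - t0

# One batched call: setup and kernel launches are shared across all ROIs.
t0 = time.perf_counter()
for _ in img.read_region(locations, size, level=0, num_workers=4):
    pass
t_batch = time.perf_counter() - t0

print(f"per-ROI loop: {t_loop:.3f}s, batched: {t_batch:.3f}s")
```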

## Changes

### New Files

- `cpp/plugins/cucim.kit.cuslide2/src/cuslide/loader/nvimgcodec_processor.h`
- `cpp/plugins/cucim.kit.cuslide2/src/cuslide/loader/nvimgcodec_processor.cpp`
  - `NvImageCodecProcessor` class inheriting from `BatchDataProcessor`
  - Integrates with existing `ThreadBatchDataLoader` infrastructure
  - Supports both CPU and CUDA output devices

- `python/cucim/tests/unit/clara/test_batch_decoding.py`
  - Comprehensive test suite with 47 tests

### Modified Files

- `cpp/plugins/cucim.kit.cuslide2/src/cuslide/nvimgcodec/nvimgcodec_decoder.h`
  - Added `RoiRegion` and `BatchDecodeResult` structs
  - Added `decode_batch_regions_nvimgcodec()` function declaration

- `cpp/plugins/cucim.kit.cuslide2/src/cuslide/nvimgcodec/nvimgcodec_decoder.cpp`
  - Implemented `decode_batch_regions_nvimgcodec()` using:
    1. `nvimgcodecCodeStreamGetSubCodeStream()` with ROI for each region
    2. Single `nvimgcodecDecoderDecode()` call with all streams
    3. Batch result processing

- `cpp/plugins/cucim.kit.cuslide2/src/cuslide/tiff/ifd.cpp`
  - Updated `IFD::read()` to use `ThreadBatchDataLoader` with `NvImageCodecProcessor`
  - Supports `num_workers`, `batch_size`, `prefetch_factor`, `shuffle`, `drop_last` parameters

- `cpp/plugins/cucim.kit.cuslide2/CMakeLists.txt`
  - Added new loader source files to build

## Architecture

```
IFD::read()
    |
    +-- Single Location (location_len=1)
    |   +-- decode_ifd_region_nvimgcodec()
    |
    +-- Multiple Locations (location_len>1 or batch_size>1)
        +-- ThreadBatchDataLoader + NvImageCodecProcessor
            +-- decode_batch_regions_nvimgcodec()
                +-- nvimgcodecCodeStreamGetSubCodeStream() x N
                +-- nvimgcodecDecoderDecode() (single batch call)
```
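
From the Python API, the two branches above correspond roughly to passing a single location versus a list of locations. A sketch with a placeholder file name; shapes are examples for an RGB slide:

```python
from cucim import CuImage
import numpy as np

img = CuImage("slide.tiff")  # placeholder path

# location_len == 1: the direct decode_ifd_region_nvimgcodec() branch;
# read_region() returns the decoded region itself.
region = img.read_region((0, 0), (256, 256), level=0)
print(np.asarray(region).shape)  # e.g. (256, 256, 3)

# location_len > 1 (or batch_size > 1): the ThreadBatchDataLoader +
# NvImageCodecProcessor branch; read_region() returns an iterable of regions.
locations = [(0, 0), (256, 256), (512, 512)]
for region in img.read_region(locations, (256, 256), level=0, num_workers=2):
    print(np.asarray(region).shape)
```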

## Test Results

All 47 tests passing:

| Test Category | Compression Types | Count | Status |
|---------------|-------------------|-------|--------|
| TestBatchDecoding (CPU) | JPEG, Deflate, Raw | 21 | PASS |
| TestBatchDecodingCUDA | JPEG | 2 | PASS |
| TestBatchDecodingPerformance | JPEG, Deflate, Raw | 24 | PASS |

**Note:** CUDA output is only supported for JPEG compression. Deflate and Raw use CPU decoding with optional GPU memory transfer.



## How to Run Tests

```bash
# Run all batch decoding tests
cd cucim
pytest python/cucim/tests/unit/clara/test_batch_decoding.py -v

# Run specific test categories
pytest python/cucim/tests/unit/clara/test_batch_decoding.py::TestBatchDecoding -v
pytest python/cucim/tests/unit/clara/test_batch_decoding.py::TestBatchDecodingCUDA -v
pytest python/cucim/tests/unit/clara/test_batch_decoding.py::TestBatchDecodingPerformance -v
```

## Example Usage

```python
from cucim import CuImage
import numpy as np

# Open TIFF file
img = CuImage("slide.tiff")

# Batch decode multiple locations
locations = [(0, 0), (256, 256), (512, 512), (768, 768)]
size = (256, 256)

# CPU output with parallel workers
for region in img.read_region(locations, size, level=0, num_workers=4):
    arr = np.asarray(region)
    print(f"Decoded: {arr.shape}")

# CUDA output (JPEG only)
import cupy as cp
for region in img.read_region(locations, size, level=0, num_workers=4, device="cuda"):
    arr = cp.asarray(region)
    print(f"GPU decoded: {arr.shape}")
```
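
A follow-up sketch for the loader parameters listed under Modified Files (`batch_size`, `drop_last`, etc.), assuming the existing cuCIM loader behavior where `batch_size > 1` makes each yielded item a stacked batch:

```python
from cucim import CuImage
import numpy as np

img = CuImage("slide.tiff")
locations = [(0, 0), (256, 256), (512, 512), (768, 768), (1024, 1024)]
size = (256, 256)

# batch_size=2 groups ROIs per yielded item; drop_last=True skips the
# final partial batch (here, the fifth location).
loader = img.read_region(
    locations, size, level=0, num_workers=4, batch_size=2, drop_last=True
)
for batch in loader:
    arr = np.asarray(batch)
    print(arr.shape)  # leading dimension is the batch size, e.g. (2, 256, 256, 3)
```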

Authors:
  - https://github.com/cdinea
  - https://github.com/jakirkham

Approvers:
  - Gregory Lee (https://github.com/grlee77)
  - Gigon Bae (https://github.com/gigony)
  - https://github.com/jakirkham

URL: #1007
@jakirkham

Fixing the forward-merger in PR #1019.

@jameslamb jameslamb merged commit 3fe2eed into main Feb 3, 2026
420 of 430 checks passed