Skip to content

morphology: add h_maxima, h_minima, local_maxima, local_minima. improve reconstruct#1061

Open
grlee77 wants to merge 5 commits intorapidsai:mainfrom
grlee77:grelee/morphology-extrema
Open

morphology: add h_maxima, h_minima, local_maxima, local_minima. improve reconstruct#1061
grlee77 wants to merge 5 commits intorapidsai:mainfrom
grlee77:grelee/morphology-extrema

Conversation

@grlee77
Copy link
Copy Markdown
Contributor

@grlee77 grlee77 commented Mar 19, 2026

local_minima and local_maxima

The local_minima and local_maxima functions are an implementation via a different GPU algorithm that gives equivalent results to the scikit-image Cython code.

Side-by-Side Comparison (CPU vs. GPU algorithm)

Aspect scikit-image (CPU) cuCIM (GPU)
Core strategy Sequential flood-fill per plateau Parallel filter + label + validate
Candidate detection 1D scan along last dim, then flood-fill maximum_filter (bulk)
Plateau discovery BFS flood-fill (Cython queue) ndi.label (connected components)
Plateau validation During flood-fill: check neighbors Post-hoc: maximum_filter on non-candidates + minimum_filter
Implementation language Cython (nogil) Pure Python + CuPy ndimage
Memory overhead O(1) extra (flags + queue) O(N) extra (multiple intermediate arrays)
Parallelism None (serial per pixel) Full GPU parallelism
Border handling Pad with min / crop Same approach
local_minima invert(image, signed_float=True) then local_maxima Same: invert(image, signed_float=True) then local_maxima

h_minima, h_maxima and reconstruction

The h_maxima and h_minima functions are trivial conversions of the corresponding scikit-image functions. The core of these is the cucim.skimage.morphology.reconstruct algorithm. I found that reconstruct still relied on CPU fallback for the core operation, so was still relatively inefficient. I implemented a simple iterative approach on the GPU that is much faster for typical cases, but would be slower in some pathological cases. There is a new keyword-only argument that allows optionally using the older CPU-based implementation, but the GPU case is now used by default.

Benchmarks

local_maxima benchmarks

shape dtype connectivity allow_borders CPU (ms) GPU (ms) speedup
(512, 512) uint8 1 True 4.198 2.161 1.9
(512, 512) uint8 1 False 3.858 2.554 1.5
(512, 512) float32 1 True 4.566 2.281 2.0
(512, 512) float32 1 False 4.304 2.229 1.9
(512, 512) float64 1 True 4.524 2.270 2.0
(512, 512) float64 1 False 4.380 2.235 2.0
(512, 512) uint8 2 True 4.337 2.406 1.8
(512, 512) uint8 2 False 4.082 2.725 1.5
(512, 512) float32 2 True 4.966 2.156 2.3
(512, 512) float32 2 False 4.512 2.436 1.9
(512, 512) float64 2 True 4.781 2.254 2.1
(512, 512) float64 2 False 4.469 2.408 1.9
(1024, 1024) uint8 1 True 15.212 2.336 6.5
(1024, 1024) uint8 1 False 15.586 2.625 5.9
(1024, 1024) float32 1 True 18.163 2.294 7.9
(1024, 1024) float32 1 False 16.433 2.289 7.2
(1024, 1024) float64 1 True 18.153 2.389 7.6
(1024, 1024) float64 1 False 16.419 2.372 6.9
(1024, 1024) uint8 2 True 16.157 2.493 6.5
(1024, 1024) uint8 2 False 15.473 2.860 5.4
(1024, 1024) float32 2 True 18.541 2.225 8.3
(1024, 1024) float32 2 False 18.007 2.538 7.1
(1024, 1024) float64 2 True 18.908 2.253 8.4
(1024, 1024) float64 2 False 17.069 2.554 6.7
(2048, 2048) uint8 1 True 63.228 3.284 19.3
(2048, 2048) uint8 1 False 64.046 3.643 17.6
(2048, 2048) float32 1 True 67.281 3.321 20.3
(2048, 2048) float32 1 False 63.733 3.304 19.3
(2048, 2048) float64 1 True 76.022 3.721 20.4
(2048, 2048) float64 1 False 64.955 3.675 17.7
(2048, 2048) uint8 2 True 67.681 3.986 17.0
(2048, 2048) uint8 2 False 63.733 4.306 14.8
(2048, 2048) float32 2 True 73.484 3.540 20.8
(2048, 2048) float32 2 False 70.727 3.945 17.9
(2048, 2048) float64 2 True 79.178 3.809 20.8
(2048, 2048) float64 2 False 67.721 3.961 17.1
(4096, 4096) uint8 1 True 243.888 7.959 30.6
(4096, 4096) uint8 1 False 237.681 8.479 28.0
(4096, 4096) float32 1 True 274.830 8.465 32.5
(4096, 4096) float32 1 False 252.003 8.310 30.3
(4096, 4096) float64 1 True 297.186 10.240 29.0
(4096, 4096) float64 1 False 257.524 9.720 26.5
(4096, 4096) uint8 2 True 258.160 9.968 25.9
(4096, 4096) uint8 2 False 250.471 10.494 23.9
(4096, 4096) float32 2 True 295.520 9.759 30.3
(4096, 4096) float32 2 False 276.706 10.137 27.3
(4096, 4096) float64 2 True 306.567 10.339 29.7
(4096, 4096) float64 2 False 269.581 10.469 25.7
(64, 64, 64) uint8 1 True 4.357 2.269 1.9
(64, 64, 64) uint8 1 False 3.938 2.592 1.5
(64, 64, 64) float32 1 True 5.007 2.342 2.1
(64, 64, 64) float32 1 False 4.195 2.308 1.8
(64, 64, 64) float64 1 True 5.332 2.342 2.3
(64, 64, 64) float64 1 False 4.390 2.309 1.9
(64, 64, 64) uint8 3 True 5.122 2.552 2.0
(64, 64, 64) uint8 3 False 4.555 2.913 1.6
(64, 64, 64) float32 3 True 6.680 2.335 2.9
(64, 64, 64) float32 3 False 6.249 2.628 2.4
(64, 64, 64) float64 3 True 6.367 2.356 2.7
(64, 64, 64) float64 3 False 5.215 2.635 2.0
(128, 128, 128) uint8 1 True 33.572 2.872 11.7
(128, 128, 128) uint8 1 False 30.980 3.193 9.7
(128, 128, 128) float32 1 True 36.483 2.945 12.4
(128, 128, 128) float32 1 False 33.374 2.975 11.2
(128, 128, 128) float64 1 True 38.752 3.464 11.2
(128, 128, 128) float64 1 False 33.869 3.345 10.1
(128, 128, 128) uint8 3 True 38.628 3.273 11.8
(128, 128, 128) uint8 3 False 36.215 3.678 9.8
(128, 128, 128) float32 3 True 50.714 3.012 16.8
(128, 128, 128) float32 3 False 46.849 3.299 14.2
(128, 128, 128) float64 3 True 47.963 3.063 15.7
(128, 128, 128) float64 3 False 47.073 3.335 14.1
(256, 256, 256) uint8 1 True 252.847 10.045 25.2
(256, 256, 256) uint8 1 False 243.610 10.343 23.6
(256, 256, 256) float32 1 True 290.053 10.530 27.5
(256, 256, 256) float32 1 False 265.375 10.675 24.9
(256, 256, 256) float64 1 True 312.965 13.910 22.5
(256, 256, 256) float64 1 False 268.880 13.145 20.5
(256, 256, 256) uint8 3 True 303.185 12.521 24.2
(256, 256, 256) uint8 3 False 291.095 12.845 22.7
(256, 256, 256) float32 3 True 407.507 12.249 33.3
(256, 256, 256) float32 3 False 375.998 12.474 30.1
(256, 256, 256) float64 3 True 380.317 12.622 30.1
(256, 256, 256) float64 3 False 333.183 12.288 27.1

h_maxima benchmarks (reconstruct_on_cpu=True)

shape dtype h CPU (ms) GPU (ms) speedup
(512, 512) uint8 10 64.612 27.607 2.3
(512, 512) uint8 40 63.330 26.163 2.4
(512, 512) float32 0.5 64.302 27.579 2.3
(512, 512) float32 2 62.958 26.504 2.4
(512, 512) float64 0.5 73.931 27.872 2.7
(512, 512) float64 2 73.955 26.853 2.8
(1024, 1024) uint8 10 292.537 134.151 2.2
(1024, 1024) uint8 40 372.474 138.855 2.7
(1024, 1024) float32 0.5 317.679 148.027 2.1
(1024, 1024) float32 2 314.687 133.266 2.4
(1024, 1024) float64 0.5 403.959 137.907 2.9
(1024, 1024) float64 2 372.604 128.456 2.9
(2048, 2048) uint8 10 2450.768 1057.142 2.3
(2048, 2048) uint8 40 2454.066 1009.442 2.4
(2048, 2048) float32 0.5 2317.305 1078.062 2.1
(2048, 2048) float32 2 2206.711 975.934 2.3
(2048, 2048) float64 0.5 2608.819 1068.013 2.4
(2048, 2048) float64 2 2468.483 985.687 2.5
(64, 64, 64) uint8 10 88.767 50.435 1.8
(64, 64, 64) uint8 40 101.429 49.159 2.1
(64, 64, 64) float32 0.5 97.956 50.777 1.9
(64, 64, 64) float32 2 96.890 49.342 2.0
(64, 64, 64) float64 0.5 105.754 52.766 2.0
(64, 64, 64) float64 2 109.363 49.444 2.2
(128, 128, 128) uint8 10 1358.238 704.423 1.9
(128, 128, 128) uint8 40 1369.407 616.142 2.2
(128, 128, 128) float32 0.5 1307.700 738.584 1.8
(128, 128, 128) float32 2 1219.374 721.620 1.7
(128, 128, 128) float64 0.5 1415.700 732.485 1.9
(128, 128, 128) float64 2 1346.298 688.349 2.0

h_maxima benchmarks (reconstruct_on_cpu=False)

shape dtype h CPU (ms) GPU (ms) speedup
(512, 512) uint8 10 66.099 4.295 15.4
(512, 512) uint8 40 67.482 8.787 7.7
(512, 512) float32 0.5 65.929 4.286 15.4
(512, 512) float32 2 66.191 17.703 3.7
(512, 512) float64 0.5 75.106 4.331 17.3
(512, 512) float64 2 73.122 17.858 4.1
(1024, 1024) uint8 10 299.417 4.306 69.5
(1024, 1024) uint8 40 376.800 8.861 42.5
(1024, 1024) float32 0.5 326.853 4.367 74.9
(1024, 1024) float32 2 313.616 35.335 8.9
(1024, 1024) float64 0.5 385.809 4.470 86.3
(1024, 1024) float64 2 365.304 35.985 10.2
(2048, 2048) uint8 10 2382.893 7.197 331.1
(2048, 2048) uint8 40 2431.437 15.035 161.7
(2048, 2048) float32 0.5 2323.522 7.802 297.8
(2048, 2048) float32 2 2107.158 64.964 32.4
(2048, 2048) float64 0.5 2563.756 8.449 303.4
(2048, 2048) float64 2 2375.402 65.127 36.5
(64, 64, 64) uint8 10 89.861 4.795 18.7
(64, 64, 64) uint8 40 103.720 9.891 10.5
(64, 64, 64) float32 0.5 96.680 4.861 19.9
(64, 64, 64) float32 2 96.879 39.907 2.4
(64, 64, 64) float64 0.5 108.979 4.859 22.4
(64, 64, 64) float64 2 105.044 40.333 2.6
(128, 128, 128) uint8 10 1341.580 6.101 219.9
(128, 128, 128) uint8 40 1367.311 12.668 107.9
(128, 128, 128) float32 0.5 1272.679 6.275 202.8
(128, 128, 128) float32 2 1211.710 51.905 23.3
(128, 128, 128) float64 0.5 1466.083 6.236 235.1
(128, 128, 128) float64 2 1364.406 50.118 27.2
Benchmark script ```py """Benchmark local_maxima: scikit-image (CPU) vs cucim.skimage (GPU)."""

import math
import time

import cupy as cp
import numpy as np
from cupyx.profiler import benchmark
from skimage.morphology import local_maxima as local_maxima_cpu
from skimage.morphology import h_maxima as h_maxima_cpu

from cucim.skimage.morphology import local_maxima as local_maxima_gpu
from cucim.skimage.morphology import h_maxima as h_maxima_gpu

---------------------------------------------------------------------------

Test configurations

---------------------------------------------------------------------------

shapes_2d = [
(512, 512),
(1024, 1024),
(2048, 2048),
(4096, 4096),
]

shapes_3d = [
(64, 64, 64),
(128, 128, 128),
(256, 256, 256),
]

dtypes = [np.uint8, np.float32, np.float64]
connectivities_2d = [1, 2] # 4-conn and 8-conn
connectivities_3d = [1, 3] # 6-conn and 26-conn

def make_image(shape, dtype, rng_seed=42):
"""Create a test image with some local structure."""
rng = np.random.default_rng(rng_seed)
if np.issubdtype(dtype, np.integer):
img = rng.integers(0, 256, size=shape, dtype=dtype)
else:
img = rng.standard_normal(shape).astype(dtype)
return img

def run_benchmarks():
header = (
"| shape | dtype | connectivity | allow_borders "
"| CPU (ms) | GPU (ms) | speedup |"
)
sep = (
"|-------|-------|--------------|---------------"
"|----------|----------|---------|"
)

print(header)
print(sep)

configs = []
for shape in shapes_2d:
    for conn in connectivities_2d:
        configs.append((shape, conn))
for shape in shapes_3d:
    for conn in connectivities_3d:
        configs.append((shape, conn))

for shape, conn in configs:
    for dtype in dtypes:
        for allow_borders in [True, False]:
            img_cpu = make_image(shape, dtype)
            img_gpu = cp.asarray(img_cpu)

            # --- CPU timing (single run, these are slow) ---
            t0 = time.perf_counter()
            result = local_maxima_cpu(
                img_cpu,
                connectivity=conn,
                allow_borders=allow_borders,
            )
            duration_cpu = (time.perf_counter() - t0) * 1000

            # --- GPU timing (via cupyx benchmark) ---
            perf = benchmark(
                local_maxima_gpu,
                (img_gpu,),
                dict(connectivity=conn, allow_borders=allow_borders),
                n_repeat=10000,
                n_warmup=10,
                max_duration=3,
            )
            duration_gpu = perf.gpu_times.mean() * 1000

            speedup = duration_cpu / duration_gpu if duration_gpu > 0 else float("inf")

            # Also validate that the result is equal
            gpu_result = local_maxima_gpu(
                img_gpu, connectivity=conn, allow_borders=allow_borders
            )
            cp.testing.assert_array_equal(result, gpu_result)

            print(
                f"| {str(shape):>17s} | {np.dtype(dtype).name:>7s} "
                f"| {conn:>12d} | {str(allow_borders):>13s} "
                f"| {duration_cpu:>8.3f} | {duration_gpu:>8.3f} "
                f"| {speedup:>7.1f} |"
            )

def _h_values_for_dtype(dtype):
"""Return h values that exercise the real code path for the given dtype.

For integer images (0-255 range), use h=10 and h=40.
For float images (standard_normal, range ~6), use h=0.5 and h=2.0.
"""
if np.issubdtype(dtype, np.integer):
    return [10, 40]
else:
    return [0.5, 2.0]

def run_h_maxima_benchmarks():
header = (
"| shape | dtype | h "
"| CPU (ms) | GPU (ms) | speedup |"
)
sep = (
"|-------|-------|------"
"|----------|----------|---------|"
)

print()
print("h_maxima benchmark")
print(header)
print(sep)

configs = []
for shape in shapes_2d:
    configs.append(shape)
for shape in shapes_3d:
    configs.append(shape)

for shape in configs:
    # skip large shapes where h_maxima is very slow
    if math.prod(shape) > 2048 * 2048:
        continue
    for dtype in dtypes:
        for h in _h_values_for_dtype(dtype):
            img_cpu = make_image(shape, dtype)
            img_gpu = cp.asarray(img_cpu)

            # --- CPU timing ---
            t0 = time.perf_counter()
            result = h_maxima_cpu(img_cpu, h)
            duration_cpu = (time.perf_counter() - t0) * 1000

            # --- GPU timing ---
            perf = benchmark(
                h_maxima_gpu,
                (img_gpu, h),
                n_repeat=10000,
                n_warmup=10,
                max_duration=3,
            )
            duration_gpu = perf.gpu_times.mean() * 1000

            speedup = duration_cpu / duration_gpu if duration_gpu > 0 else float("inf")

            # Validate
            gpu_result = h_maxima_gpu(img_gpu, h)
            cp.testing.assert_array_equal(result, gpu_result)

            print(
                f"| {str(shape):>17s} | {np.dtype(dtype).name:>7s} "
                f"| {h:>5g} "
                f"| {duration_cpu:>8.3f} | {duration_gpu:>8.3f} "
                f"| {speedup:>7.1f} |"
            )

if name == "main":
run_benchmarks()
run_h_maxima_benchmarks()

</details>





grlee77 added 4 commits March 19, 2026 14:20
This is not a port of the scikit-image Cython code, but an independent
implementation using already implemented GPU kernels.
These rely on cucim.skimage.morphology.grayreconstruct which is still largely
CPU-based, so acceleration is only around 2x vs. scikit-image for these two
functions.
@grlee77 grlee77 requested review from a team as code owners March 19, 2026 21:09
@grlee77 grlee77 requested a review from AyodeAwe March 19, 2026 21:09
@grlee77 grlee77 self-assigned this Mar 19, 2026
@grlee77 grlee77 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Mar 19, 2026
@grlee77 grlee77 added this to the v26.06.00 milestone Mar 19, 2026
@grlee77 grlee77 added this to cucim Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant