morphology: add `h_maxima`, `h_minima`, `local_maxima`, `local_minima`. improve `reconstruct` by grlee77 · Pull Request #1061 · rapidsai/cucim

grlee77 · 2026-03-19T21:09:27Z

local_minima and local_maxima

The local_minima and local_maxima functions are an implementation via a different GPU algorithm that gives equivalent results to the scikit-image Cython code.

Side-by-Side Comparison (CPU vs. GPU algorithm)

Aspect	scikit-image (CPU)	cuCIM (GPU)
Core strategy	Sequential flood-fill per plateau	Parallel filter + label + validate
Candidate detection	1D scan along last dim, then flood-fill	`maximum_filter` (bulk)
Plateau discovery	BFS flood-fill (Cython queue)	`ndi.label` (connected components)
Plateau validation	During flood-fill: check neighbors	Post-hoc: maximum_filter on non-candidates + minimum_filter
Implementation language	Cython (nogil)	Pure Python + CuPy ndimage
Memory overhead	O(1) extra (flags + queue)	O(N) extra (multiple intermediate arrays)
Parallelism	None (serial per pixel)	Full GPU parallelism
Border handling	Pad with min / crop	Same approach
`local_minima`	`invert(image, signed_float=True)` then `local_maxima`	Same: `invert(image, signed_float=True)` then `local_maxima`

h_minima, h_maxima and reconstruction

The h_maxima and h_minima functions are trivial conversions of the corresponding scikit-image functions. The core of these is the cucim.skimage.morphology.reconstruct algorithm. I found that reconstruct still relied on CPU fallback for the core operation, so was still relatively inefficient. I implemented a simple iterative approach on the GPU that is much faster for typical cases, but would be slower in some pathological cases. There is a new keyword-only argument that allows optionally using the older CPU-based implementation, but the GPU case is now used by default.

Benchmarks

local_maxima benchmarks

shape	dtype	connectivity	allow_borders	CPU (ms)	GPU (ms)	speedup
(512, 512)	uint8	1	True	4.198	2.161	1.9
(512, 512)	uint8	1	False	3.858	2.554	1.5
(512, 512)	float32	1	True	4.566	2.281	2.0
(512, 512)	float32	1	False	4.304	2.229	1.9
(512, 512)	float64	1	True	4.524	2.270	2.0
(512, 512)	float64	1	False	4.380	2.235	2.0
(512, 512)	uint8	2	True	4.337	2.406	1.8
(512, 512)	uint8	2	False	4.082	2.725	1.5
(512, 512)	float32	2	True	4.966	2.156	2.3
(512, 512)	float32	2	False	4.512	2.436	1.9
(512, 512)	float64	2	True	4.781	2.254	2.1
(512, 512)	float64	2	False	4.469	2.408	1.9
(1024, 1024)	uint8	1	True	15.212	2.336	6.5
(1024, 1024)	uint8	1	False	15.586	2.625	5.9
(1024, 1024)	float32	1	True	18.163	2.294	7.9
(1024, 1024)	float32	1	False	16.433	2.289	7.2
(1024, 1024)	float64	1	True	18.153	2.389	7.6
(1024, 1024)	float64	1	False	16.419	2.372	6.9
(1024, 1024)	uint8	2	True	16.157	2.493	6.5
(1024, 1024)	uint8	2	False	15.473	2.860	5.4
(1024, 1024)	float32	2	True	18.541	2.225	8.3
(1024, 1024)	float32	2	False	18.007	2.538	7.1
(1024, 1024)	float64	2	True	18.908	2.253	8.4
(1024, 1024)	float64	2	False	17.069	2.554	6.7
(2048, 2048)	uint8	1	True	63.228	3.284	19.3
(2048, 2048)	uint8	1	False	64.046	3.643	17.6
(2048, 2048)	float32	1	True	67.281	3.321	20.3
(2048, 2048)	float32	1	False	63.733	3.304	19.3
(2048, 2048)	float64	1	True	76.022	3.721	20.4
(2048, 2048)	float64	1	False	64.955	3.675	17.7
(2048, 2048)	uint8	2	True	67.681	3.986	17.0
(2048, 2048)	uint8	2	False	63.733	4.306	14.8
(2048, 2048)	float32	2	True	73.484	3.540	20.8
(2048, 2048)	float32	2	False	70.727	3.945	17.9
(2048, 2048)	float64	2	True	79.178	3.809	20.8
(2048, 2048)	float64	2	False	67.721	3.961	17.1
(4096, 4096)	uint8	1	True	243.888	7.959	30.6
(4096, 4096)	uint8	1	False	237.681	8.479	28.0
(4096, 4096)	float32	1	True	274.830	8.465	32.5
(4096, 4096)	float32	1	False	252.003	8.310	30.3
(4096, 4096)	float64	1	True	297.186	10.240	29.0
(4096, 4096)	float64	1	False	257.524	9.720	26.5
(4096, 4096)	uint8	2	True	258.160	9.968	25.9
(4096, 4096)	uint8	2	False	250.471	10.494	23.9
(4096, 4096)	float32	2	True	295.520	9.759	30.3
(4096, 4096)	float32	2	False	276.706	10.137	27.3
(4096, 4096)	float64	2	True	306.567	10.339	29.7
(4096, 4096)	float64	2	False	269.581	10.469	25.7
(64, 64, 64)	uint8	1	True	4.357	2.269	1.9
(64, 64, 64)	uint8	1	False	3.938	2.592	1.5
(64, 64, 64)	float32	1	True	5.007	2.342	2.1
(64, 64, 64)	float32	1	False	4.195	2.308	1.8
(64, 64, 64)	float64	1	True	5.332	2.342	2.3
(64, 64, 64)	float64	1	False	4.390	2.309	1.9
(64, 64, 64)	uint8	3	True	5.122	2.552	2.0
(64, 64, 64)	uint8	3	False	4.555	2.913	1.6
(64, 64, 64)	float32	3	True	6.680	2.335	2.9
(64, 64, 64)	float32	3	False	6.249	2.628	2.4
(64, 64, 64)	float64	3	True	6.367	2.356	2.7
(64, 64, 64)	float64	3	False	5.215	2.635	2.0
(128, 128, 128)	uint8	1	True	33.572	2.872	11.7
(128, 128, 128)	uint8	1	False	30.980	3.193	9.7
(128, 128, 128)	float32	1	True	36.483	2.945	12.4
(128, 128, 128)	float32	1	False	33.374	2.975	11.2
(128, 128, 128)	float64	1	True	38.752	3.464	11.2
(128, 128, 128)	float64	1	False	33.869	3.345	10.1
(128, 128, 128)	uint8	3	True	38.628	3.273	11.8
(128, 128, 128)	uint8	3	False	36.215	3.678	9.8
(128, 128, 128)	float32	3	True	50.714	3.012	16.8
(128, 128, 128)	float32	3	False	46.849	3.299	14.2
(128, 128, 128)	float64	3	True	47.963	3.063	15.7
(128, 128, 128)	float64	3	False	47.073	3.335	14.1
(256, 256, 256)	uint8	1	True	252.847	10.045	25.2
(256, 256, 256)	uint8	1	False	243.610	10.343	23.6
(256, 256, 256)	float32	1	True	290.053	10.530	27.5
(256, 256, 256)	float32	1	False	265.375	10.675	24.9
(256, 256, 256)	float64	1	True	312.965	13.910	22.5
(256, 256, 256)	float64	1	False	268.880	13.145	20.5
(256, 256, 256)	uint8	3	True	303.185	12.521	24.2
(256, 256, 256)	uint8	3	False	291.095	12.845	22.7
(256, 256, 256)	float32	3	True	407.507	12.249	33.3
(256, 256, 256)	float32	3	False	375.998	12.474	30.1
(256, 256, 256)	float64	3	True	380.317	12.622	30.1
(256, 256, 256)	float64	3	False	333.183	12.288	27.1

h_maxima benchmarks (reconstruct_on_cpu=True)

shape	dtype	h	CPU (ms)	GPU (ms)	speedup
(512, 512)	uint8	10	64.612	27.607	2.3
(512, 512)	uint8	40	63.330	26.163	2.4
(512, 512)	float32	0.5	64.302	27.579	2.3
(512, 512)	float32	2	62.958	26.504	2.4
(512, 512)	float64	0.5	73.931	27.872	2.7
(512, 512)	float64	2	73.955	26.853	2.8
(1024, 1024)	uint8	10	292.537	134.151	2.2
(1024, 1024)	uint8	40	372.474	138.855	2.7
(1024, 1024)	float32	0.5	317.679	148.027	2.1
(1024, 1024)	float32	2	314.687	133.266	2.4
(1024, 1024)	float64	0.5	403.959	137.907	2.9
(1024, 1024)	float64	2	372.604	128.456	2.9
(2048, 2048)	uint8	10	2450.768	1057.142	2.3
(2048, 2048)	uint8	40	2454.066	1009.442	2.4
(2048, 2048)	float32	0.5	2317.305	1078.062	2.1
(2048, 2048)	float32	2	2206.711	975.934	2.3
(2048, 2048)	float64	0.5	2608.819	1068.013	2.4
(2048, 2048)	float64	2	2468.483	985.687	2.5
(64, 64, 64)	uint8	10	88.767	50.435	1.8
(64, 64, 64)	uint8	40	101.429	49.159	2.1
(64, 64, 64)	float32	0.5	97.956	50.777	1.9
(64, 64, 64)	float32	2	96.890	49.342	2.0
(64, 64, 64)	float64	0.5	105.754	52.766	2.0
(64, 64, 64)	float64	2	109.363	49.444	2.2
(128, 128, 128)	uint8	10	1358.238	704.423	1.9
(128, 128, 128)	uint8	40	1369.407	616.142	2.2
(128, 128, 128)	float32	0.5	1307.700	738.584	1.8
(128, 128, 128)	float32	2	1219.374	721.620	1.7
(128, 128, 128)	float64	0.5	1415.700	732.485	1.9
(128, 128, 128)	float64	2	1346.298	688.349	2.0

h_maxima benchmarks (reconstruct_on_cpu=False)

shape	dtype	h	CPU (ms)	GPU (ms)	speedup
(512, 512)	uint8	10	66.099	4.295	15.4
(512, 512)	uint8	40	67.482	8.787	7.7
(512, 512)	float32	0.5	65.929	4.286	15.4
(512, 512)	float32	2	66.191	17.703	3.7
(512, 512)	float64	0.5	75.106	4.331	17.3
(512, 512)	float64	2	73.122	17.858	4.1
(1024, 1024)	uint8	10	299.417	4.306	69.5
(1024, 1024)	uint8	40	376.800	8.861	42.5
(1024, 1024)	float32	0.5	326.853	4.367	74.9
(1024, 1024)	float32	2	313.616	35.335	8.9
(1024, 1024)	float64	0.5	385.809	4.470	86.3
(1024, 1024)	float64	2	365.304	35.985	10.2
(2048, 2048)	uint8	10	2382.893	7.197	331.1
(2048, 2048)	uint8	40	2431.437	15.035	161.7
(2048, 2048)	float32	0.5	2323.522	7.802	297.8
(2048, 2048)	float32	2	2107.158	64.964	32.4
(2048, 2048)	float64	0.5	2563.756	8.449	303.4
(2048, 2048)	float64	2	2375.402	65.127	36.5
(64, 64, 64)	uint8	10	89.861	4.795	18.7
(64, 64, 64)	uint8	40	103.720	9.891	10.5
(64, 64, 64)	float32	0.5	96.680	4.861	19.9
(64, 64, 64)	float32	2	96.879	39.907	2.4
(64, 64, 64)	float64	0.5	108.979	4.859	22.4
(64, 64, 64)	float64	2	105.044	40.333	2.6
(128, 128, 128)	uint8	10	1341.580	6.101	219.9
(128, 128, 128)	uint8	40	1367.311	12.668	107.9
(128, 128, 128)	float32	0.5	1272.679	6.275	202.8
(128, 128, 128)	float32	2	1211.710	51.905	23.3
(128, 128, 128)	float64	0.5	1466.083	6.236	235.1
(128, 128, 128)	float64	2	1364.406	50.118	27.2

Benchmark script

```py """Benchmark local_maxima: scikit-image (CPU) vs cucim.skimage (GPU)."""

import math
import time

import cupy as cp
import numpy as np
from cupyx.profiler import benchmark
from skimage.morphology import local_maxima as local_maxima_cpu
from skimage.morphology import h_maxima as h_maxima_cpu

from cucim.skimage.morphology import local_maxima as local_maxima_gpu
from cucim.skimage.morphology import h_maxima as h_maxima_gpu

---------------------------------------------------------------------------

Test configurations

---------------------------------------------------------------------------

shapes_2d = [
(512, 512),
(1024, 1024),
(2048, 2048),
(4096, 4096),
]

shapes_3d = [
(64, 64, 64),
(128, 128, 128),
(256, 256, 256),
]

dtypes = [np.uint8, np.float32, np.float64]
connectivities_2d = [1, 2] # 4-conn and 8-conn
connectivities_3d = [1, 3] # 6-conn and 26-conn

def make_image(shape, dtype, rng_seed=42):
"""Create a test image with some local structure."""
rng = np.random.default_rng(rng_seed)
if np.issubdtype(dtype, np.integer):
img = rng.integers(0, 256, size=shape, dtype=dtype)
else:
img = rng.standard_normal(shape).astype(dtype)
return img

def run_benchmarks():
header = (
"| shape | dtype | connectivity | allow_borders "
"| CPU (ms) | GPU (ms) | speedup |"
)
sep = (
"|-------|-------|--------------|---------------"
"|----------|----------|---------|"
)

print(header)
print(sep)

configs = []
for shape in shapes_2d:
    for conn in connectivities_2d:
        configs.append((shape, conn))
for shape in shapes_3d:
    for conn in connectivities_3d:
        configs.append((shape, conn))

for shape, conn in configs:
    for dtype in dtypes:
        for allow_borders in [True, False]:
            img_cpu = make_image(shape, dtype)
            img_gpu = cp.asarray(img_cpu)

            # --- CPU timing (single run, these are slow) ---
            t0 = time.perf_counter()
            result = local_maxima_cpu(
                img_cpu,
                connectivity=conn,
                allow_borders=allow_borders,
            )
            duration_cpu = (time.perf_counter() - t0) * 1000

            # --- GPU timing (via cupyx benchmark) ---
            perf = benchmark(
                local_maxima_gpu,
                (img_gpu,),
                dict(connectivity=conn, allow_borders=allow_borders),
                n_repeat=10000,
                n_warmup=10,
                max_duration=3,
            )
            duration_gpu = perf.gpu_times.mean() * 1000

            speedup = duration_cpu / duration_gpu if duration_gpu > 0 else float("inf")

            # Also validate that the result is equal
            gpu_result = local_maxima_gpu(
                img_gpu, connectivity=conn, allow_borders=allow_borders
            )
            cp.testing.assert_array_equal(result, gpu_result)

            print(
                f"| {str(shape):>17s} | {np.dtype(dtype).name:>7s} "
                f"| {conn:>12d} | {str(allow_borders):>13s} "
                f"| {duration_cpu:>8.3f} | {duration_gpu:>8.3f} "
                f"| {speedup:>7.1f} |"
            )

def _h_values_for_dtype(dtype):
"""Return h values that exercise the real code path for the given dtype.

For integer images (0-255 range), use h=10 and h=40.
For float images (standard_normal, range ~6), use h=0.5 and h=2.0.
"""
if np.issubdtype(dtype, np.integer):
    return [10, 40]
else:
    return [0.5, 2.0]

def run_h_maxima_benchmarks():
header = (
"| shape | dtype | h "
"| CPU (ms) | GPU (ms) | speedup |"
)
sep = (
"|-------|-------|------"
"|----------|----------|---------|"
)

print()
print("h_maxima benchmark")
print(header)
print(sep)

configs = []
for shape in shapes_2d:
    configs.append(shape)
for shape in shapes_3d:
    configs.append(shape)

for shape in configs:
    # skip large shapes where h_maxima is very slow
    if math.prod(shape) > 2048 * 2048:
        continue
    for dtype in dtypes:
        for h in _h_values_for_dtype(dtype):
            img_cpu = make_image(shape, dtype)
            img_gpu = cp.asarray(img_cpu)

            # --- CPU timing ---
            t0 = time.perf_counter()
            result = h_maxima_cpu(img_cpu, h)
            duration_cpu = (time.perf_counter() - t0) * 1000

            # --- GPU timing ---
            perf = benchmark(
                h_maxima_gpu,
                (img_gpu, h),
                n_repeat=10000,
                n_warmup=10,
                max_duration=3,
            )
            duration_gpu = perf.gpu_times.mean() * 1000

            speedup = duration_cpu / duration_gpu if duration_gpu > 0 else float("inf")

            # Validate
            gpu_result = h_maxima_gpu(img_gpu, h)
            cp.testing.assert_array_equal(result, gpu_result)

            print(
                f"| {str(shape):>17s} | {np.dtype(dtype).name:>7s} "
                f"| {h:>5g} "
                f"| {duration_cpu:>8.3f} | {duration_gpu:>8.3f} "
                f"| {speedup:>7.1f} |"
            )

if name == "main":
run_benchmarks()
run_h_maxima_benchmarks()

</details>

This is not a port of the scikit-image Cython code, but an independent implementation using already implemented GPU kernels.

These rely on cucim.skimage.morphology.grayreconstruct which is still largely CPU-based, so acceleration is only around 2x vs. scikit-image for these two functions.

…on and make it the default

grlee77 added 4 commits March 19, 2026 14:20

Add local_maxima and local_minima implementations on GPU

1412948

This is not a port of the scikit-image Cython code, but an independent implementation using already implemented GPU kernels.

Copy h_minima and h_maxima implementations from scikit-image

8e6356d

These rely on cucim.skimage.morphology.grayreconstruct which is still largely CPU-based, so acceleration is only around 2x vs. scikit-image for these two functions.

update morphology benchmark scripts

4d4fac9

cucim.skimage.morphology.reconstruction with a GPU-based implementati…

2707393

…on and make it the default

grlee77 requested review from a team as code owners March 19, 2026 21:09

grlee77 requested a review from AyodeAwe March 19, 2026 21:09

grlee77 self-assigned this Mar 19, 2026

grlee77 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Mar 19, 2026

grlee77 added this to the v26.06.00 milestone Mar 19, 2026

grlee77 added this to cucim Mar 19, 2026

Merge branch 'main' into grelee/morphology-extrema

224543c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

morphology: add `h_maxima`, `h_minima`, `local_maxima`, `local_minima`. improve `reconstruct`#1061

morphology: add `h_maxima`, `h_minima`, `local_maxima`, `local_minima`. improve `reconstruct`#1061
grlee77 wants to merge 5 commits intorapidsai:mainfrom
grlee77:grelee/morphology-extrema

grlee77 commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

grlee77 commented Mar 19, 2026

local_minima and local_maxima

Side-by-Side Comparison (CPU vs. GPU algorithm)

h_minima, h_maxima and reconstruction

Benchmarks

local_maxima benchmarks

h_maxima benchmarks (reconstruct_on_cpu=True)

h_maxima benchmarks (reconstruct_on_cpu=False)

---------------------------------------------------------------------------

Test configurations

---------------------------------------------------------------------------

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant