Feature: nk.scale feature request: rounding mode + performance analysis across saturation regimes #327

@ternaus

Description

Describe what you are looking for

nk.scale feature request: rounding mode + performance analysis across saturation regimes

We're using both numkong and stringzilla in albucore, a low-level image processing library. Benchmarking affine uint8 scaling across two saturation regimes and a canonical shape grid reveals two actionable gaps.


Benchmark setup

Two non-trivial affine cases chosen to exercise saturation at different ends:

  • Case A alpha=1.3, beta=30: 83/256 output values clip to 255 (upper saturation)
  • Case B alpha=0.8, beta=-20: 26/256 output values clip to 0 (lower saturation)
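
The clip counts quoted above fall straight out of the round-then-clip LUT construction; here is a quick check in plain NumPy (same semantics as the `make_lut` helper in the reproduction script further down):

```python
import numpy as np

def saturation_counts(alpha: float, beta: float) -> tuple:
    # Build the 256-entry affine LUT with round-then-clip semantics
    # and count how many inputs clip to each extreme.
    x = np.arange(256, dtype=np.float32)
    lut = np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)
    return int((lut == 255).sum()), int((lut == 0).sum())

print(saturation_counts(1.3, 30.0))   # case A: (83, 0), 83 inputs clip to 255
print(saturation_counts(0.8, -20.0))  # case B: (0, 26), 26 inputs clip to 0
```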

Platform: macOS arm64 · cv2 4.13.0 · numpy 2.4.2 · numkong 7.0.0 · stringzilla 4.6.0
Repeats: 41, warmup: 12. Times in milliseconds (median). Layout: HWC = (H,W,C), DHWC = (D,H,W,C), NDHWC = (N,D,H,W,C).

Case A — alpha=1.3, beta=30 (83 values saturate at 255)

| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|--------|-------|-------|----------|------|---------|-------|---------|---------|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0181 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0033 | 0.0039 | 0.0114 | 0.0466 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0080 | 0.0103 | 0.0343 | 0.1545 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0060 | 0.0049 | 0.0151 | 0.0722 | sz | 1.22× |
| HWC | 256×256×3 | 196,608 | 0.0114 | 0.0135 | 0.0444 | 0.2230 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0312 | 0.0386 | 0.1385 | 0.9184 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0134 | 0.0177 | 0.0311 | 0.2936 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0390 | 0.0515 | 0.0685 | 1.1949 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2573 | 0.2206 | 0.2105 | 2.2635 | cv2 | 1.22× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0550 | 0.0677 | 0.1303 | 1.6142 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3429 | 0.5382 | 0.2063 | 3.0362 | cv2 | 1.66× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4700 | 0.6218 | 0.3089 | 8.7483 | cv2 | 1.52× |
| DHWC | 16×128×128×1 | 262,144 | 0.0180 | 0.0175 | 0.0632 | 0.2899 | sz | 1.03× |
| DHWC | 16×128×128×3 | 786,432 | 0.0499 | 0.0516 | 0.1758 | 1.2973 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0280 | 0.0343 | 0.1254 | 0.8759 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.1863 | 0.1419 | 0.3710 | 2.5942 | sz | 1.31× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3647 | 0.5462 | 0.8308 | 3.0274 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2258 | 0.4532 | 0.5568 | 3.4568 | nk.scale | 1.00× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4722 | 0.6246 | 2.2108 | 8.7893 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0548 | 0.0687 | 0.2450 | 1.6958 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.3717 | 0.2931 | 0.8331 | 3.0156 | sz | 1.27× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.6991 | 0.5868 | 1.6812 | 6.1590 | sz | 1.19× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3526 | 0.4462 | 0.8124 | 3.0149 | nk.scale | 1.00× |

Case B — alpha=0.8, beta=-20 (26 values saturate at 0)

| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|--------|-------|-------|----------|------|---------|-------|---------|---------|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0263 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0038 | 0.0042 | 0.0114 | 0.0595 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0091 | 0.0103 | 0.0359 | 0.1610 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0044 | 0.0049 | 0.0167 | 0.0772 | nk.scale | 1.00× |
| HWC | 256×256×3 | 196,608 | 0.0115 | 0.0135 | 0.0456 | 0.2277 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0333 | 0.0385 | 0.1382 | 0.8984 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0148 | 0.0176 | 0.0319 | 0.3167 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0463 | 0.0509 | 0.0697 | 1.2126 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2550 | 0.2122 | 0.2080 | 2.2503 | cv2 | 1.23× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0495 | 0.0676 | 0.0862 | 1.6403 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3514 | 0.3067 | 0.2023 | 3.0348 | cv2 | 1.74× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4733 | 0.6301 | 0.3272 | 8.7480 | cv2 | 1.45× |
| DHWC | 16×128×128×1 | 262,144 | 0.0220 | 0.0177 | 0.0637 | 0.2952 | sz | 1.24× |
| DHWC | 16×128×128×3 | 786,432 | 0.0416 | 0.0518 | 0.1889 | 1.3046 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0257 | 0.0345 | 0.1177 | 0.8647 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.2005 | 0.1537 | 0.3708 | 2.5196 | sz | 1.30× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3725 | 0.5380 | 0.8441 | 3.0171 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2537 | 0.2062 | 0.5667 | 3.7150 | sz | 1.23× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4756 | 0.6307 | 2.2156 | 8.7619 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0652 | 0.0676 | 0.2443 | 1.8025 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.4201 | 0.3139 | 0.8472 | 3.0572 | sz | 1.34× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.7009 | 1.2994 | 1.6702 | 6.1743 | nk.scale | 1.00× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3767 | 0.2905 | 0.8295 | 3.0647 | sz | 1.30× |

Analysis

The winner flips not only by shape/size but also by saturation regime, which is the key finding here:

nk.scale consistently wins:

  • Nearly all small buffers (< ~200 KB), any layout, any saturation (one exception in the tables: 256×256×1 in case A, where sz edges ahead by 1.22×)
  • Multi-channel volumes like 64×128×128×3 and 48×256×256×3, where sz is inexplicably slower despite the larger buffers

sz.translate wins for specific (size, layout, saturation) triples, e.g.:

  • 32×128×128×3 (1.5 MB): sz wins by 31% (case A) / 30% (case B), consistent across both
  • 2×32×128×128×3 (3 MB): sz wins by 27% (A) / 34% (B), also consistent
  • 128×128×128×1 (2 MB): sz is 2× slower than nk.scale in case A but wins by 23% in case B, i.e. saturation-sensitive!

cv2.LUT wins for large HWC multi-channel:

  • 1024×1024×3 (3 MB): cv2 is 1.66× (A) / 1.74× (B) faster than nk.scale
  • 1024×1024×9 (9 MB): cv2 is 1.52× (A) / 1.45× (B) faster

Critical observation: 128×128×128×1 (2 MB isotropic grayscale volume) shows sz at 0.45 ms (case A, upper sat) vs 0.21 ms (case B, lower sat), a 2× difference in sz performance across saturation cases, while nk.scale barely moves (0.23 vs 0.25 ms). This suggests sz.translate's SIMD path is sensitive to the distribution of table lookups (cache effects on the 256-byte LUT?), while nk.scale's arithmetic path is not.
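
For routing inside albucore, the findings above reduce to a small heuristic. The sketch below is read directly off the benchmark tables; the function name and the thresholds are illustrative assumptions for this machine (macOS arm64), not part of any library's API:

```python
def pick_backend(nbytes: int, channels: int, layout: str) -> str:
    """Illustrative backend choice for saturated uint8 affine scaling.

    Thresholds are read off the benchmark tables above and will
    differ on other hardware.
    """
    if nbytes < 200_000:
        return "nk.scale"   # small buffers: nk.scale wins almost everywhere
    if layout == "HWC" and channels >= 3 and nbytes >= 2_000_000:
        return "cv2.LUT"    # large multi-channel 2D images
    # sz.translate wins a few mid-size triples, but only by ~20-35%,
    # and its saturation sensitivity makes it a risky default.
    return "nk.scale"
```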


Feature requests

1. Rounding mode for nk.scale

nk.scale truncates (floor), but standard saturated uint8 arithmetic uses banker's rounding / round(). This prevents using nk.scale as a drop-in for LUT-based affine ops: outputs differ by ±1 at half-integer boundaries, breaking pixel-exact tests.

import numpy as np, numkong as nk

img = np.array([1, 3, 5, 7, 9], dtype=np.uint8)  # alpha*x = 1.5, 4.5, 7.5, 10.5, 13.5
flat = img.copy()
out_nk = np.frombuffer(nk.scale(nk.Tensor(flat), alpha=1.5, beta=0.0), dtype=np.uint8)
out_np = np.clip(np.round(1.5 * img.astype(np.float32)), 0, 255).astype(np.uint8)

print("nk:", out_nk)  # [1 4 7 10 13] (truncation)
print("np:", out_np)  # [2 4 8 10 14] (round)

A rounding='round' | 'floor' | 'trunc' kwarg would make nk.scale substitutable anywhere a LUT is used.
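
A NumPy reference for the proposed semantics (the rounding kwarg and this helper are the suggestion above, not an existing numkong API; 'round' here means ties-to-even, matching np.round and the LUT construction used in the benchmarks):

```python
import numpy as np

def scale_ref(img: np.ndarray, alpha: float, beta: float,
              rounding: str = "round") -> np.ndarray:
    # Reference semantics for saturated uint8 affine scaling.
    y = alpha * img.astype(np.float32) + beta
    if rounding == "round":
        y = np.round(y)   # ties-to-even (banker's rounding)
    elif rounding == "floor":
        y = np.floor(y)   # current nk.scale behavior for non-negative y
    elif rounding == "trunc":
        y = np.trunc(y)   # toward zero; differs from floor only for y < 0
    else:
        raise ValueError(f"unknown rounding mode: {rounding!r}")
    return np.clip(y, 0, 255).astype(np.uint8)

img = np.array([1, 3, 5, 7, 9], dtype=np.uint8)
print(scale_ref(img, 1.5, 0.0, "round"))  # [ 2  4  8 10 14]
print(scale_ref(img, 1.5, 0.0, "floor"))  # [ 1  4  7 10 13]
```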

2. Investigate the saturation-regime sensitivity

The 2× variance in sz.translate performance on 128×128×128×1 between upper- and lower-saturation cases (0.45 ms vs 0.21 ms) is surprising and worth understanding: is it a LUT cache effect, branch-predictor behavior on the clamp, or something else? If it's a known limitation, documenting it would help users choose the right tool.
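
One structural difference between the two LUTs that a vectorized translate kernel could plausibly be sensitive to is how much of the table is a constant saturated run versus distinct values. The helper below (name is mine) is purely descriptive and makes no root-cause claim:

```python
import numpy as np

def lut_profile(alpha: float, beta: float) -> dict:
    # Describe the 256-byte LUT: distinct output values and the
    # sizes of the constant saturated regions at each end.
    x = np.arange(256, dtype=np.float32)
    lut = np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)
    return {
        "distinct": int(np.unique(lut).size),
        "lo_run": int((lut == 0).sum()),    # entries clipped to 0
        "hi_run": int((lut == 255).sum()),  # entries clipped to 255
    }

print(lut_profile(1.3, 30.0))   # case A: long constant region at 255
print(lut_profile(0.8, -20.0))  # case B: shorter constant region at 0
```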


Reproduction

import numpy as np, numkong as nk, stringzilla as sz, cv2, time

def make_lut(alpha, beta):
    x = np.arange(256, dtype=np.float32)
    return np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)

def bench(fn, n=41, w=12):
    for _ in range(w): fn()
    t = []
    for _ in range(n):
        s = time.perf_counter(); fn(); t.append(time.perf_counter() - s)
    return float(np.median(t)) * 1e3

CASES = [("upper sat", 1.3, 30.0), ("lower sat", 0.8, -20.0)]
SHAPES = [(128,128,128,1), (32,128,128,3), (1024,1024,3)]
rng = np.random.default_rng(0)

for label, alpha, beta in CASES:
    lut = make_lut(alpha, beta)
    print(f"\n--- {label}: alpha={alpha}, beta={beta} ---")
    for sh in SHAPES:
        img = rng.integers(0, 256, size=sh, dtype=np.uint8)
        flat = np.ascontiguousarray(img).reshape(-1)
        t_nk = bench(lambda: nk.scale(nk.Tensor(flat), alpha=alpha, beta=beta))
        t_sz = bench(lambda: sz.translate(memoryview(flat.copy()), memoryview(lut), inplace=False))
        t_cv = bench(lambda: cv2.LUT(img, lut))
        print(f"  {'×'.join(map(str,sh)):22s}  nk={t_nk:.4f}  sz={t_sz:.4f}  cv2={t_cv:.4f} ms")

Thanks for both libraries; the combination covers almost all of our hot paths. The rounding mode would be the single most impactful addition for correctness; understanding the saturation sensitivity would help us route more reliably.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
