Description
Describe what you are looking for
nk.scale feature request: rounding mode + performance analysis across saturation regimes
We're using both numkong and stringzilla in albucore, a low-level image-processing library. Benchmarking affine uint8 scaling across two saturation regimes and a canonical shape grid reveals two actionable gaps.
Benchmark setup
Two non-trivial affine cases chosen to exercise saturation at different ends:
- Case A: alpha=1.3, beta=30 – 83/256 output values clip to 255 (upper saturation)
- Case B: alpha=0.8, beta=-20 – 26/256 output values clip to 0 (lower saturation)
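The clip counts above follow directly from the 256-entry LUT each case induces; a quick pure-NumPy check (no numkong required, `make_lut` is just the obvious helper):

```python
import numpy as np

def make_lut(alpha: float, beta: float) -> np.ndarray:
    """Build the 256-entry uint8 LUT for out = clip(round(alpha*x + beta), 0, 255)."""
    x = np.arange(256, dtype=np.float32)
    return np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)

# Case A: inputs x >= 173 map to 255, i.e. 83 of 256 table entries saturate high
print(np.count_nonzero(make_lut(1.3, 30.0) == 255))   # 83
# Case B: inputs x <= 25 map to 0, i.e. 26 of 256 table entries saturate low
print(np.count_nonzero(make_lut(0.8, -20.0) == 0))    # 26
```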
Platform: macOS arm64 · cv2 4.13.0 · numpy 2.4.2 · numkong 7.0.0 · stringzilla 4.6.0
Repeats: 41, warmup: 12. Times in milliseconds (median). Layout: HWC = (H,W,C), DHWC = (D,H,W,C), NDHWC = (N,D,H,W,C).
Case A – alpha=1.3, beta=30 (83 values saturate at 255)
| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|---|---|---|---|---|---|---|---|---|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0181 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0033 | 0.0039 | 0.0114 | 0.0466 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0080 | 0.0103 | 0.0343 | 0.1545 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0060 | 0.0049 | 0.0151 | 0.0722 | sz | 1.22× |
| HWC | 256×256×3 | 196,608 | 0.0114 | 0.0135 | 0.0444 | 0.2230 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0312 | 0.0386 | 0.1385 | 0.9184 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0134 | 0.0177 | 0.0311 | 0.2936 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0390 | 0.0515 | 0.0685 | 1.1949 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2573 | 0.2206 | 0.2105 | 2.2635 | cv2 | 1.22× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0550 | 0.0677 | 0.1303 | 1.6142 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3429 | 0.5382 | 0.2063 | 3.0362 | cv2 | 1.66× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4700 | 0.6218 | 0.3089 | 8.7483 | cv2 | 1.52× |
| DHWC | 16×128×128×1 | 262,144 | 0.0180 | 0.0175 | 0.0632 | 0.2899 | sz | 1.03× |
| DHWC | 16×128×128×3 | 786,432 | 0.0499 | 0.0516 | 0.1758 | 1.2973 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0280 | 0.0343 | 0.1254 | 0.8759 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.1863 | 0.1419 | 0.3710 | 2.5942 | sz | 1.31× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3647 | 0.5462 | 0.8308 | 3.0274 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2258 | 0.4532 | 0.5568 | 3.4568 | nk.scale | 1.00× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4722 | 0.6246 | 2.2108 | 8.7893 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0548 | 0.0687 | 0.2450 | 1.6958 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.3717 | 0.2931 | 0.8331 | 3.0156 | sz | 1.27× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.6991 | 0.5868 | 1.6812 | 6.1590 | sz | 1.19× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3526 | 0.4462 | 0.8124 | 3.0149 | nk.scale | 1.00× |
Case B – alpha=0.8, beta=-20 (26 values saturate at 0)
| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|---|---|---|---|---|---|---|---|---|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0263 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0038 | 0.0042 | 0.0114 | 0.0595 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0091 | 0.0103 | 0.0359 | 0.1610 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0044 | 0.0049 | 0.0167 | 0.0772 | nk.scale | 1.00× |
| HWC | 256×256×3 | 196,608 | 0.0115 | 0.0135 | 0.0456 | 0.2277 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0333 | 0.0385 | 0.1382 | 0.8984 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0148 | 0.0176 | 0.0319 | 0.3167 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0463 | 0.0509 | 0.0697 | 1.2126 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2550 | 0.2122 | 0.2080 | 2.2503 | cv2 | 1.23× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0495 | 0.0676 | 0.0862 | 1.6403 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3514 | 0.3067 | 0.2023 | 3.0348 | cv2 | 1.74× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4733 | 0.6301 | 0.3272 | 8.7480 | cv2 | 1.45× |
| DHWC | 16×128×128×1 | 262,144 | 0.0220 | 0.0177 | 0.0637 | 0.2952 | sz | 1.24× |
| DHWC | 16×128×128×3 | 786,432 | 0.0416 | 0.0518 | 0.1889 | 1.3046 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0257 | 0.0345 | 0.1177 | 0.8647 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.2005 | 0.1537 | 0.3708 | 2.5196 | sz | 1.30× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3725 | 0.5380 | 0.8441 | 3.0171 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2537 | 0.2062 | 0.5667 | 3.7150 | sz | 1.23× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4756 | 0.6307 | 2.2156 | 8.7619 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0652 | 0.0676 | 0.2443 | 1.8025 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.4201 | 0.3139 | 0.8472 | 3.0572 | sz | 1.34× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.7009 | 1.2994 | 1.6702 | 6.1743 | nk.scale | 1.00× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3767 | 0.2905 | 0.8295 | 3.0647 | sz | 1.30× |
Analysis
The winner flips not only by shape/size but also by saturation regime, which is the key finding here:
nk.scale consistently wins:
- All small buffers (< ~200 KB), any layout, any saturation
- Multi-channel volumes like `64×128×128×3` and `48×256×256×3`, where `sz` is inexplicably slower despite the larger buffers
sz.translate wins for specific (size, layout, saturation) triples, e.g.:
- `32×128×128×3` (1.5 MB): sz wins by 31% (case A) / 30% (case B) – consistent across both
- `2×32×128×128×3` (3 MB): sz wins by 27% (A) / 34% (B) – also consistent
- `128×128×128×1` (2 MB): sz loses by 2× in case A but wins by 23% in case B – saturation-sensitive!
cv2.LUT wins for large HWC multi-channel:
- `1024×1024×3` (3 MB): cv2 is 1.66× (A) / 1.74× (B) faster than nk.scale
- `1024×1024×9` (9 MB): cv2 is 1.52× (A) / 1.45× (B) faster
Critical observation: 128×128×128×1 (a 2 MB isotropic grayscale volume) shows sz at 0.45 ms (case A, upper saturation) vs 0.21 ms (case B, lower saturation) – a 2× difference in sz performance across saturation cases, while nk.scale barely moves (0.23 vs 0.25 ms). This suggests sz.translate's SIMD path is sensitive to the distribution of table lookups (cache effects on the 256-byte LUT?), while nk.scale's arithmetic path is not.
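To make the routing implications concrete, the medians above reduce to a simple size/layout heuristic. This is only an illustrative sketch distilled from this one machine's numbers; `pick_backend` and its thresholds are hypothetical and not part of any of the three libraries:

```python
# Hypothetical backend router distilled from the benchmark medians above.
# Thresholds are machine-specific (macOS arm64 here) and purely illustrative.

def pick_backend(shape: tuple) -> str:
    """Return 'nk', 'sz', or 'cv2' for an affine uint8 scale over `shape`."""
    nbytes = 1
    for dim in shape:
        nbytes *= dim                 # uint8: one byte per element
    is_hwc = len(shape) == 3
    channels = shape[-1]
    if nbytes < 200_000:              # small buffers: nk.scale wins everywhere
        return "nk"
    if is_hwc and channels >= 3 and nbytes >= 2_000_000:
        return "cv2"                  # large multi-channel HWC: cv2.LUT wins
    return "nk"                       # default; sz wins only on scattered triples

print(pick_backend((128, 128, 3)))      # 'nk'  (49 KB)
print(pick_backend((1024, 1024, 3)))    # 'cv2' (3 MB HWC multi-channel)
print(pick_backend((48, 256, 256, 3)))  # 'nk'  (9 MB DHWC)
```

Note that no threshold cleanly captures the sz wins, precisely because they depend on the saturation regime as well as size and layout.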
Feature requests
1. Rounding mode for nk.scale
nk.scale truncates (floor), but standard saturated uint8 arithmetic rounds to nearest (np.round's round-half-to-even). This prevents using nk.scale as a drop-in for LUT-based affine ops: outputs differ by ±1 at half-integer boundaries, breaking pixel-exact tests.
```python
import numpy as np, numkong as nk

img = np.array([1, 3, 5, 7, 9], dtype=np.uint8)  # alpha*x = 1.5, 4.5, 7.5, 10.5, 13.5
flat = img.copy()
out_nk = np.frombuffer(nk.scale(nk.Tensor(flat), alpha=1.5, beta=0.0), dtype=np.uint8)
out_np = np.clip(np.round(1.5 * img.astype(np.float32)), 0, 255).astype(np.uint8)
print("nk:", out_nk)  # [1 4 7 10 13] – truncation
print("np:", out_np)  # [2 4 8 10 14] – round
```

A `rounding='round' | 'floor' | 'trunc'` kwarg would make nk.scale substitutable anywhere a LUT is used.
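To pin down the requested semantics, here is a pure-NumPy reference sketch of the proposed kwarg. The `rounding` parameter and `scale_ref` are the proposal, not an existing numkong API, and this is a specification aid, not a performance suggestion:

```python
import numpy as np

def scale_ref(x: np.ndarray, alpha: float, beta: float,
              rounding: str = "round") -> np.ndarray:
    """Pure-NumPy reference for the proposed nk.scale(..., rounding=...) semantics."""
    y = alpha * x.astype(np.float32) + beta
    if rounding == "round":
        y = np.round(y)   # round-half-to-even, matching np.round-built LUTs
    elif rounding == "floor":
        y = np.floor(y)
    elif rounding == "trunc":
        y = np.trunc(y)   # current nk.scale behavior per the example above
    else:
        raise ValueError(f"unknown rounding mode: {rounding!r}")
    return np.clip(y, 0, 255).astype(np.uint8)

x = np.array([1, 3, 5, 7, 9], dtype=np.uint8)
print(scale_ref(x, 1.5, 0.0, "round").tolist())  # [2, 4, 8, 10, 14]
print(scale_ref(x, 1.5, 0.0, "trunc").tolist())  # [1, 4, 7, 10, 13]
```

Floor and trunc coincide here because the intermediate values are non-negative; they diverge only when alpha*x + beta goes below zero, where everything clips to 0 anyway for uint8 output.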
2. Investigate the saturation-regime sensitivity
The 2× variance in sz.translate performance on 128×128×128×1 between upper- and lower-saturation cases (0.45 ms vs 0.21 ms) is surprising and worth understanding – is it a LUT cache effect, branch predictor behavior on the clamp, or something else? If it's a known limitation, documenting it would help users choose the right tool.
Reproduction
```python
import numpy as np, numkong as nk, stringzilla as sz, cv2, time

def make_lut(alpha, beta):
    x = np.arange(256, dtype=np.float32)
    return np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)

def bench(fn, n=41, w=12):
    for _ in range(w):
        fn()
    t = []
    for _ in range(n):
        s = time.perf_counter(); fn(); t.append(time.perf_counter() - s)
    return float(np.median(t)) * 1e3

CASES = [("upper sat", 1.3, 30.0), ("lower sat", 0.8, -20.0)]
SHAPES = [(128, 128, 128, 1), (32, 128, 128, 3), (1024, 1024, 3)]
rng = np.random.default_rng(0)

for label, alpha, beta in CASES:
    lut = make_lut(alpha, beta)
    print(f"\n--- {label}: alpha={alpha}, beta={beta} ---")
    for sh in SHAPES:
        img = rng.integers(0, 256, size=sh, dtype=np.uint8)
        flat = np.ascontiguousarray(img).reshape(-1)
        t_nk = bench(lambda: nk.scale(nk.Tensor(flat), alpha=alpha, beta=beta))
        t_sz = bench(lambda: sz.translate(memoryview(flat.copy()), memoryview(lut), inplace=False))
        t_cv = bench(lambda: cv2.LUT(img, lut))
        print(f"  {'×'.join(map(str, sh)):22s} nk={t_nk:.4f} sz={t_sz:.4f} cv2={t_cv:.4f} ms")
```

Thanks for both libraries – the combination covers almost all of our hot paths. The rounding mode would be the single most impactful addition for correctness; understanding the saturation sensitivity would help us route more reliably.
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct