Feature: nk.scale feature request: rounding mode + performance analysis across saturation regimes #327

@ternaus

Description

Describe what you are looking for

nk.scale feature request: rounding mode + performance analysis across saturation regimes

We're using both numkong and stringzilla in albucore, a low-level image processing library. Benchmarking affine uint8 scaling across two saturation regimes and a canonical shape grid reveals two actionable gaps.


Benchmark setup

Two non-trivial affine cases chosen to exercise saturation at different ends:

  • Case A alpha=1.3, beta=30: 83/256 output values clip to 255 (upper saturation)
  • Case B alpha=0.8, beta=-20: 26/256 output values clip to 0 (lower saturation)
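
The clip counts quoted above fall straight out of the round-then-clip LUT construction; here is a quick check in plain NumPy (same semantics as the `make_lut` helper in the reproduction script further down):

```python
import numpy as np

def saturation_counts(alpha: float, beta: float) -> tuple:
    # Build the 256-entry affine LUT with round-then-clip semantics
    # and count how many inputs clip to each extreme.
    x = np.arange(256, dtype=np.float32)
    lut = np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)
    return int((lut == 255).sum()), int((lut == 0).sum())

print(saturation_counts(1.3, 30.0))   # case A: (83, 0), 83 inputs clip to 255
print(saturation_counts(0.8, -20.0))  # case B: (0, 26), 26 inputs clip to 0
```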

Platform: macOS arm64 · cv2 4.13.0 · numpy 2.4.2 · numkong 7.0.0 · stringzilla 4.6.0
Repeats: 41, warmup: 12. Times in milliseconds (median). Layout: HWC = (H,W,C), DHWC = (D,H,W,C), NDHWC = (N,D,H,W,C).

Case A — alpha=1.3, beta=30 (83 values saturate at 255)

| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|--------|-------|-------|----------|------|---------|-------|---------|---------|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0181 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0033 | 0.0039 | 0.0114 | 0.0466 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0080 | 0.0103 | 0.0343 | 0.1545 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0060 | 0.0049 | 0.0151 | 0.0722 | sz | 1.22× |
| HWC | 256×256×3 | 196,608 | 0.0114 | 0.0135 | 0.0444 | 0.2230 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0312 | 0.0386 | 0.1385 | 0.9184 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0134 | 0.0177 | 0.0311 | 0.2936 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0390 | 0.0515 | 0.0685 | 1.1949 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2573 | 0.2206 | 0.2105 | 2.2635 | cv2 | 1.22× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0550 | 0.0677 | 0.1303 | 1.6142 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3429 | 0.5382 | 0.2063 | 3.0362 | cv2 | 1.66× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4700 | 0.6218 | 0.3089 | 8.7483 | cv2 | 1.52× |
| DHWC | 16×128×128×1 | 262,144 | 0.0180 | 0.0175 | 0.0632 | 0.2899 | sz | 1.03× |
| DHWC | 16×128×128×3 | 786,432 | 0.0499 | 0.0516 | 0.1758 | 1.2973 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0280 | 0.0343 | 0.1254 | 0.8759 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.1863 | 0.1419 | 0.3710 | 2.5942 | sz | 1.31× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3647 | 0.5462 | 0.8308 | 3.0274 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2258 | 0.4532 | 0.5568 | 3.4568 | nk.scale | 1.00× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4722 | 0.6246 | 2.2108 | 8.7893 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0548 | 0.0687 | 0.2450 | 1.6958 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.3717 | 0.2931 | 0.8331 | 3.0156 | sz | 1.27× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.6991 | 0.5868 | 1.6812 | 6.1590 | sz | 1.19× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3526 | 0.4462 | 0.8124 | 3.0149 | nk.scale | 1.00× |

Case B — alpha=0.8, beta=-20 (26 values saturate at 0)

| layout | shape | bytes | nk.scale | sz | cv2.LUT | numpy | fastest | nk/best |
|--------|-------|-------|----------|------|---------|-------|---------|---------|
| HWC | 128×128×1 | 16,384 | 0.0014 | 0.0017 | 0.0041 | 0.0263 | nk.scale | 1.00× |
| HWC | 128×128×3 | 49,152 | 0.0038 | 0.0042 | 0.0114 | 0.0595 | nk.scale | 1.00× |
| HWC | 128×128×9 | 147,456 | 0.0091 | 0.0103 | 0.0359 | 0.1610 | nk.scale | 1.00× |
| HWC | 256×256×1 | 65,536 | 0.0044 | 0.0049 | 0.0167 | 0.0772 | nk.scale | 1.00× |
| HWC | 256×256×3 | 196,608 | 0.0115 | 0.0135 | 0.0456 | 0.2277 | nk.scale | 1.00× |
| HWC | 256×256×9 | 589,824 | 0.0333 | 0.0385 | 0.1382 | 0.8984 | nk.scale | 1.00× |
| HWC | 512×512×1 | 262,144 | 0.0148 | 0.0176 | 0.0319 | 0.3167 | nk.scale | 1.00× |
| HWC | 512×512×3 | 786,432 | 0.0463 | 0.0509 | 0.0697 | 1.2126 | nk.scale | 1.00× |
| HWC | 512×512×9 | 2,359,296 | 0.2550 | 0.2122 | 0.2080 | 2.2503 | cv2 | 1.23× |
| HWC | 1024×1024×1 | 1,048,576 | 0.0495 | 0.0676 | 0.0862 | 1.6403 | nk.scale | 1.00× |
| HWC | 1024×1024×3 | 3,145,728 | 0.3514 | 0.3067 | 0.2023 | 3.0348 | cv2 | 1.74× |
| HWC | 1024×1024×9 | 9,437,184 | 0.4733 | 0.6301 | 0.3272 | 8.7480 | cv2 | 1.45× |
| DHWC | 16×128×128×1 | 262,144 | 0.0220 | 0.0177 | 0.0637 | 0.2952 | sz | 1.24× |
| DHWC | 16×128×128×3 | 786,432 | 0.0416 | 0.0518 | 0.1889 | 1.3046 | nk.scale | 1.00× |
| DHWC | 32×128×128×1 | 524,288 | 0.0257 | 0.0345 | 0.1177 | 0.8647 | nk.scale | 1.00× |
| DHWC | 32×128×128×3 | 1,572,864 | 0.2005 | 0.1537 | 0.3708 | 2.5196 | sz | 1.30× |
| DHWC | 64×128×128×3 | 3,145,728 | 0.3725 | 0.5380 | 0.8441 | 3.0171 | nk.scale | 1.00× |
| DHWC | 128×128×128×1 | 2,097,152 | 0.2537 | 0.2062 | 0.5667 | 3.7150 | sz | 1.23× |
| DHWC | 48×256×256×3 | 9,437,184 | 0.4756 | 0.6307 | 2.2156 | 8.7619 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×1 | 1,048,576 | 0.0652 | 0.0676 | 0.2443 | 1.8025 | nk.scale | 1.00× |
| NDHWC | 2×32×128×128×3 | 3,145,728 | 0.4201 | 0.3139 | 0.8472 | 3.0572 | sz | 1.34× |
| NDHWC | 2×64×128×128×3 | 6,291,456 | 0.7009 | 1.2994 | 1.6702 | 6.1743 | nk.scale | 1.00× |
| NDHWC | 4×16×128×128×3 | 3,145,728 | 0.3767 | 0.2905 | 0.8295 | 3.0647 | sz | 1.30× |

Analysis

The winner flips not only by shape/size but also by saturation regime, which is the key finding here:

nk.scale consistently wins:

  • Nearly all small buffers (< ~200 KB), any layout, any saturation (one exception in the tables: 256×256×1 in case A, where sz edges ahead by 1.22×)
  • Multi-channel volumes like 64×128×128×3 and 48×256×256×3, where sz is inexplicably slower despite the larger buffers

sz.translate wins for specific (size, layout, saturation) triples, e.g.:

  • 32×128×128×3 (1.5 MB): sz wins by 31% (case A) / 30% (case B), consistent across both
  • 2×32×128×128×3 (3 MB): sz wins by 27% (A) / 34% (B), also consistent
  • 128×128×128×1 (2 MB): sz is 2× slower than nk.scale in case A but wins by 23% in case B, i.e. saturation-sensitive!

cv2.LUT wins for large HWC multi-channel:

  • 1024×1024×3 (3 MB): cv2 is 1.66× (A) / 1.74× (B) faster than nk.scale
  • 1024×1024×9 (9 MB): cv2 is 1.52× (A) / 1.45× (B) faster

Critical observation: 128×128×128×1 (2 MB isotropic grayscale volume) shows sz at 0.45 ms (case A, upper sat) vs 0.21 ms (case B, lower sat), a 2× difference in sz performance across saturation cases, while nk.scale barely moves (0.23 vs 0.25 ms). This suggests sz.translate's SIMD path is sensitive to the distribution of table lookups (cache effects on the 256-byte LUT?), while nk.scale's arithmetic path is not.
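
For routing inside albucore, the findings above reduce to a small heuristic. The sketch below is read directly off the benchmark tables; the function name and the thresholds are illustrative assumptions for this machine (macOS arm64), not part of any library's API:

```python
def pick_backend(nbytes: int, channels: int, layout: str) -> str:
    """Illustrative backend choice for saturated uint8 affine scaling.

    Thresholds are read off the benchmark tables above and will
    differ on other hardware.
    """
    if nbytes < 200_000:
        return "nk.scale"   # small buffers: nk.scale wins almost everywhere
    if layout == "HWC" and channels >= 3 and nbytes >= 2_000_000:
        return "cv2.LUT"    # large multi-channel 2D images
    # sz.translate wins a few mid-size triples, but only by ~20-35%,
    # and its saturation sensitivity makes it a risky default.
    return "nk.scale"
```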


Feature requests

1. Rounding mode for nk.scale

nk.scale truncates (floor), but standard saturated uint8 arithmetic uses banker's rounding / round(). This prevents using nk.scale as a drop-in for LUT-based affine ops: outputs differ by ±1 at half-integer boundaries, breaking pixel-exact tests.

import numpy as np, numkong as nk

img = np.array([1, 3, 5, 7, 9], dtype=np.uint8)  # alpha*x = 1.5, 4.5, 7.5, 10.5, 13.5
flat = img.copy()
out_nk = np.frombuffer(nk.scale(nk.Tensor(flat), alpha=1.5, beta=0.0), dtype=np.uint8)
out_np = np.clip(np.round(1.5 * img.astype(np.float32)), 0, 255).astype(np.uint8)

print("nk:", out_nk)  # [1 4 7 10 13] (truncation)
print("np:", out_np)  # [2 4 8 10 14] (round)

A rounding='round' | 'floor' | 'trunc' kwarg would make nk.scale substitutable anywhere a LUT is used.
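
A NumPy reference for the proposed semantics (the rounding kwarg and this helper are the suggestion above, not an existing numkong API; 'round' here means ties-to-even, matching np.round and the LUT construction used in the benchmarks):

```python
import numpy as np

def scale_ref(img: np.ndarray, alpha: float, beta: float,
              rounding: str = "round") -> np.ndarray:
    # Reference semantics for saturated uint8 affine scaling.
    y = alpha * img.astype(np.float32) + beta
    if rounding == "round":
        y = np.round(y)   # ties-to-even (banker's rounding)
    elif rounding == "floor":
        y = np.floor(y)   # current nk.scale behavior for non-negative y
    elif rounding == "trunc":
        y = np.trunc(y)   # toward zero; differs from floor only for y < 0
    else:
        raise ValueError(f"unknown rounding mode: {rounding!r}")
    return np.clip(y, 0, 255).astype(np.uint8)

img = np.array([1, 3, 5, 7, 9], dtype=np.uint8)
print(scale_ref(img, 1.5, 0.0, "round"))  # [ 2  4  8 10 14]
print(scale_ref(img, 1.5, 0.0, "floor"))  # [ 1  4  7 10 13]
```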

2. Investigate the saturation-regime sensitivity

The 2× variance in sz.translate performance on 128×128×128×1 between upper- and lower-saturation cases (0.45 ms vs 0.21 ms) is surprising and worth understanding: is it a LUT cache effect, branch-predictor behavior on the clamp, or something else? If it's a known limitation, documenting it would help users choose the right tool.
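
One structural difference between the two LUTs that a vectorized translate kernel could plausibly be sensitive to is how much of the table is a constant saturated run versus distinct values. The helper below (name is mine) is purely descriptive and makes no root-cause claim:

```python
import numpy as np

def lut_profile(alpha: float, beta: float) -> dict:
    # Describe the 256-byte LUT: distinct output values and the
    # sizes of the constant saturated regions at each end.
    x = np.arange(256, dtype=np.float32)
    lut = np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)
    return {
        "distinct": int(np.unique(lut).size),
        "lo_run": int((lut == 0).sum()),    # entries clipped to 0
        "hi_run": int((lut == 255).sum()),  # entries clipped to 255
    }

print(lut_profile(1.3, 30.0))   # case A: long constant region at 255
print(lut_profile(0.8, -20.0))  # case B: shorter constant region at 0
```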


Reproduction

import numpy as np, numkong as nk, stringzilla as sz, cv2, time

def make_lut(alpha, beta):
    x = np.arange(256, dtype=np.float32)
    return np.clip(np.round(alpha * x + beta), 0, 255).astype(np.uint8)

def bench(fn, n=41, w=12):
    for _ in range(w): fn()
    t = []
    for _ in range(n):
        s = time.perf_counter(); fn(); t.append(time.perf_counter() - s)
    return float(np.median(t)) * 1e3

CASES = [("upper sat", 1.3, 30.0), ("lower sat", 0.8, -20.0)]
SHAPES = [(128,128,128,1), (32,128,128,3), (1024,1024,3)]
rng = np.random.default_rng(0)

for label, alpha, beta in CASES:
    lut = make_lut(alpha, beta)
    print(f"\n--- {label}: alpha={alpha}, beta={beta} ---")
    for sh in SHAPES:
        img = rng.integers(0, 256, size=sh, dtype=np.uint8)
        flat = np.ascontiguousarray(img).reshape(-1)
        t_nk = bench(lambda: nk.scale(nk.Tensor(flat), alpha=alpha, beta=beta))
        t_sz = bench(lambda: sz.translate(memoryview(flat.copy()), memoryview(lut), inplace=False))
        t_cv = bench(lambda: cv2.LUT(img, lut))
        print(f"  {'×'.join(map(str,sh)):22s}  nk={t_nk:.4f}  sz={t_sz:.4f}  cv2={t_cv:.4f} ms")

Thanks for both libraries; the combination covers almost all of our hot paths. The rounding mode would be the single most impactful addition for correctness; understanding the saturation sensitivity would help us route more reliably.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
