perf: LUT precomputation for feComponentTransfer Gamma and Table functions #1026

Closed
wjc911 wants to merge 10 commits into linebender:main from wjc911:feComponentTransfer_perf_optimize
wjc911 commented Feb 22, 2026

Summary

  • Precompute a 256-entry lookup table (LUT) for Gamma transfer functions when the image has >= 256 pixels, replacing per-pixel powf calls with a single table lookup
  • Precompute a 256-entry LUT for Table transfer functions when the image has >= 1024 pixels, replacing per-pixel linear interpolation with a direct array index
  • LUT construction cost is amortized over the pixel count; thresholds are tuned to break even conservatively
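
As a rough illustration of the first bullet, the sketch below builds a 256-entry table for a Gamma function and falls back to per-value `powf` below the 256-pixel threshold. The names `Gamma`, `gamma_scalar`, `build_lut`, `apply_gamma`, and `GAMMA_LUT_MIN_PIXELS` are hypothetical, and the rounding and single-channel layout are assumptions; this is not the code from the PR.

```rust
/// Hypothetical gamma parameters, following the SVG definition
/// C' = amplitude * C^exponent + offset.
#[derive(Clone, Copy)]
pub struct Gamma {
    pub amplitude: f32,
    pub exponent: f32,
    pub offset: f32,
}

/// Per-value transfer: the expensive path, one `powf` call per input.
/// Rounding to nearest is an assumption of this sketch.
pub fn gamma_scalar(g: Gamma, c: u8) -> u8 {
    let x = c as f32 / 255.0;
    let y = g.amplitude * x.powf(g.exponent) + g.offset;
    (y.clamp(0.0, 1.0) * 255.0).round() as u8
}

/// Build the table by evaluating the scalar function once per possible u8 input.
pub fn build_lut(g: Gamma) -> [u8; 256] {
    let mut lut = [0u8; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = gamma_scalar(g, i as u8);
    }
    lut
}

/// Threshold taken from the summary: use the LUT at 256 pixels or more.
pub const GAMMA_LUT_MIN_PIXELS: usize = 256;

/// Apply the transfer to one channel in place (one byte per pixel here,
/// which simplifies the real RGBA layout).
pub fn apply_gamma(g: Gamma, channel: &mut [u8]) {
    if channel.len() >= GAMMA_LUT_MIN_PIXELS {
        // 256 powf calls up front, then one array index per pixel.
        let lut = build_lut(g);
        for c in channel.iter_mut() {
            *c = lut[*c as usize];
        }
    } else {
        // Too few pixels to amortize the table: compute directly.
        for c in channel.iter_mut() {
            *c = gamma_scalar(g, *c);
        }
    }
}
```

Because the table is filled by the same scalar function, both paths should produce the same bytes; only the number of `powf` calls differs.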

Benchmark Results

| Test case | Speedup |
| --- | --- |
| gamma-correct | 1.76x – 1.85x |

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

wjc911 and others added 10 commits February 21, 2026 16:11
Replace per-pixel transfer function computation with a pre-computed
256-entry lookup table (LUT) for each active channel. Since inputs are
always u8 (0-255), we can compute each transfer result once during LUT
construction and then apply it via a single table lookup per pixel per
channel.

This eliminates expensive per-pixel operations, especially f32::powf()
in the Gamma transfer function case. Benchmarks show 20-33x speedup for
Gamma, 5-8x for Linear, and 10-28x for Table/Discrete functions.

Key changes:
- Pre-compute 256-entry LUT per active channel before the pixel loop
- Use identity LUT for inactive channels to avoid branching in hot loop
- Preserve original implementation as apply_naive for correctness testing
- Add bit-exact tests verifying LUT output matches per-pixel output for
  all 256 input values across all transfer function types
- Add public ComponentTransfer::new() constructor to usvg
- Add standalone benchmark (benches/component_transfer_bench.rs)

The LUT approach is bit-exact with the original: build_lut calls the
same transfer_scalar function with the same u8 inputs that would be
encountered at runtime, producing identical f32 arithmetic and u8 output.
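
The bit-exactness argument can be sketched as follows for a Table function: the LUT is filled by the same scalar function the per-pixel path would call, so comparing all 256 inputs is an exhaustive check. `table_scalar`, `build_lut`, and `lut_matches_scalar` are hypothetical stand-ins (the commit's own `transfer_scalar` is not shown here), and the rounding mode is an assumption.

```rust
/// Piecewise-linear Table transfer for one 8-bit value, following the SVG
/// definition: with n + 1 table values,
/// C' = v[k] + (C - k/n) * n * (v[k+1] - v[k]).
pub fn table_scalar(values: &[f32], c: u8) -> u8 {
    // An empty table is treated as the identity function.
    if values.is_empty() {
        return c;
    }
    let n = values.len() - 1;
    let x = c as f32 / 255.0;
    let y = if n == 0 {
        // A single table value yields a constant output.
        values[0]
    } else {
        // Segment index; clamp so that x == 1.0 falls in the last segment.
        let k = ((x * n as f32) as usize).min(n - 1);
        let (k_f, n_f) = (k as f32, n as f32);
        values[k] + (x - k_f / n_f) * n_f * (values[k + 1] - values[k])
    };
    (y.clamp(0.0, 1.0) * 255.0).round() as u8
}

/// Fill a 256-entry table by calling `f` once for every possible u8 input.
pub fn build_lut(f: impl Fn(u8) -> u8) -> [u8; 256] {
    let mut lut = [0u8; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = f(i as u8);
    }
    lut
}

/// Exhaustive check: the LUT agrees with the scalar path on all 256 inputs.
pub fn lut_matches_scalar(values: &[f32]) -> bool {
    let lut = build_lut(|c| table_scalar(values, c));
    (0..=255u8).all(|c| lut[c as usize] == table_scalar(values, c))
}
```

Since the input domain has only 256 values, a test of this shape covers every case rather than a sample.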

- Add pixel count threshold: skip LUT build for images < 256 pixels
  and fall back to direct per-pixel transfer_scalar() calls, avoiding
  the fixed setup cost of up to 1024 scalar calls for tiny images
- Fix misleading comment that claimed LUT lookups are SIMD-friendly;
  table lookups are gather operations that compilers generally do not
  auto-vectorize
- Convert identity_lut() function to const IDENTITY_LUT item for
  guaranteed zero runtime cost (compile-time evaluation)
- apply_naive already correctly gated behind #[cfg(test)]
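
A compile-time identity table of the kind described above might look like the sketch below. The item name `IDENTITY_LUT` comes from the commit message, but its definition here is a guess, and `apply_luts` with its RGBA byte layout is a hypothetical helper added only to show how an identity table lets inactive channels share the same loop body.

```rust
/// Identity table evaluated at compile time: entry i maps to i.
/// `while` loops are permitted in const initializers, so no runtime work remains.
pub const IDENTITY_LUT: [u8; 256] = {
    let mut lut = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        lut[i] = i as u8;
        i += 1;
    }
    lut
};

/// Apply one table per channel to interleaved 4-byte pixels (assumed layout).
/// Inactive channels pass `&IDENTITY_LUT`, so the hot loop needs no branch.
pub fn apply_luts(pixels: &mut [u8], luts: [&[u8; 256]; 4]) {
    for px in pixels.chunks_exact_mut(4) {
        for (c, lut) in px.iter_mut().zip(luts.iter()) {
            *c = lut[*c as usize];
        }
    }
}
```

With four active channels the table build costs 4 × 256 = 1024 scalar calls, which is the fixed setup cost the 256-pixel threshold avoids for tiny images.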

The previous fixed threshold of 256 pixels caused regressions for cheap
transfer functions (Linear, Table, Discrete) where LUT build cost exceeded
per-pixel savings at small image sizes.

New per-function thresholds based on comprehensive benchmarking:
- Gamma (powf): 256 pixels (unchanged, LUT build ~30 µs amortized quickly)
- Table/Discrete: 1024 pixels (LUT build ~18-25 µs, cheap per-pixel ops)
- Linear: 2048 pixels (LUT build ~6 µs, but multiply-add is very cheap)

Also adds hybrid path for mixed-channel cases (e.g. R=Gamma, G=Linear)
where different channels may cross their threshold at different image sizes.

Includes comprehensive benchmark example testing 11 image sizes, 8 transfer
function types, 4 input patterns, and validates threshold correctness.
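
One way to picture the per-function thresholds and the hybrid path is a per-channel decision, sketched below. The threshold values are the ones listed in this commit message; `TransferKind`, `lut_threshold`, `use_lut`, and `plan_channels` are hypothetical names, and the real code may organize this decision differently.

```rust
/// Hypothetical classification of a channel's transfer function.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TransferKind {
    Identity,
    Table,
    Discrete,
    Linear,
    Gamma,
}

/// Minimum pixel count at which building a LUT is assumed to pay off.
/// `None` means the channel never needs a table.
pub fn lut_threshold(kind: TransferKind) -> Option<usize> {
    match kind {
        TransferKind::Identity => None,
        TransferKind::Gamma => Some(256),
        TransferKind::Table | TransferKind::Discrete => Some(1024),
        TransferKind::Linear => Some(2048),
    }
}

/// Decide for a single channel whether to use a LUT at this image size.
pub fn use_lut(kind: TransferKind, pixel_count: usize) -> bool {
    match lut_threshold(kind) {
        Some(min_pixels) => pixel_count >= min_pixels,
        None => false,
    }
}

/// Hybrid path: each of the four channels gets its own decision, so
/// R=Gamma can use a table while G=Linear is still computed per pixel.
pub fn plan_channels(kinds: [TransferKind; 4], pixel_count: usize) -> [bool; 4] {
    kinds.map(|kind| use_lut(kind, pixel_count))
}
```

For example, at 512 pixels a Gamma channel would use a table while Linear and Table channels in the same filter would not.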

Use scoped threads and AtomicUsize progress counter to run benchmark
configurations in parallel across all available CPU cores.
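
The pattern named in this commit can be sketched with `std::thread::scope` and two `AtomicUsize` counters, one handing out work and one tracking progress. `run_parallel` and its signature are hypothetical; the benchmark's actual structure is not shown in this thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `f` over every configuration on `workers` scoped threads and
/// return the results in the original order.
pub fn run_parallel<T, R, F>(configs: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed configuration
    let done = AtomicUsize::new(0); // progress counter shared by all workers
    let workers = workers.max(1);

    let mut results: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            // Scoped threads may borrow `configs`, `f`, and the counters directly.
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= configs.len() {
                        break;
                    }
                    local.push((i, f(&configs[i])));
                    let finished = done.fetch_add(1, Ordering::Relaxed) + 1;
                    eprintln!("progress: {}/{}", finished, configs.len());
                }
                local
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Workers finish in arbitrary order; restore the input order.
    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, r)| r).collect()
}
```

A caller could pass `std::thread::available_parallelism()` as the worker count to use all available cores.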

- Remove #[inline(never)] from apply and build_lut (benchmarking artifacts)
- Fix misleading apply_naive comment ("preserved verbatim" was inaccurate)
- Remove benchmark files (component_transfer_bench,
  bench_component_transfer_comprehensive)
- Remove [[bench]] section from Cargo.toml

Replaces the parallel bench_e2e.rs with a sequential single-threaded
version that uses per-resolution iteration counts (2000 for 16px,
down to 100 for 1024px+), a probe-then-scale budget cap (30s total
per case, skip if single probe > 10s), and --compare for TSV baseline
comparison. Allows CPU-pinned reproducible measurements.
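
A probe-then-scale cap of this kind could work roughly as sketched below: time one probe run, skip the case if the probe exceeds 10 s, and otherwise run as many iterations as fit in 30 s, up to the nominal count for that resolution. The function `planned_iterations` is hypothetical, and the exact scaling rule in bench_e2e.rs is an assumption.

```rust
use std::time::Duration;

/// Total time budget per benchmark case (from the commit message).
const BUDGET: Duration = Duration::from_secs(30);
/// A single probe slower than this causes the case to be skipped.
const MAX_PROBE: Duration = Duration::from_secs(10);

/// Given the nominal iteration count for a resolution and the measured
/// duration of one probe run, return how many iterations to run, or
/// `None` if the case should be skipped.
pub fn planned_iterations(nominal: u32, probe: Duration) -> Option<u32> {
    if probe > MAX_PROBE {
        return None;
    }
    if probe.is_zero() {
        // Too fast to measure: the budget cannot be the limiting factor.
        return Some(nominal);
    }
    // How many probe-length runs fit in the budget (assumed scaling rule).
    let affordable = (BUDGET.as_secs_f64() / probe.as_secs_f64()).floor();
    // Never exceed the nominal count, and always run at least once.
    let capped = affordable.min(nominal as f64).max(1.0);
    Some(capped as u32)
}
```

With a nominal count of 100 and a 1 s probe, this rule would run 30 iterations; with a 1 ms probe, the nominal count is used unchanged.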

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wjc911 closed this Feb 23, 2026