perf: LUT precomputation for feComponentTransfer Gamma and Table functions by wjc911 · Pull Request #1026 · linebender/resvg

wjc911 · 2026-02-22T18:31:26Z

Summary

- Precompute a 256-entry lookup table (LUT) for Gamma transfer functions when the image has >= 256 pixels, replacing per-pixel powf calls with a single table lookup
- Precompute a 256-entry LUT for Table transfer functions when the image has >= 1024 pixels, replacing per-pixel linear interpolation with a direct array index
- LUT construction cost is amortized over the pixel count; thresholds are tuned to break even conservatively
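
The Gamma path described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `gamma_scalar`, `apply_gamma`, and `GAMMA_LUT_MIN_PIXELS` are hypothetical names, and the input is assumed to be a single channel stored as a flat `u8` slice.

```rust
/// Hypothetical per-value Gamma transfer on a normalized u8 input:
/// C' = amplitude * C^exponent + offset, clamped to [0, 1].
fn gamma_scalar(c: u8, amplitude: f32, exponent: f32, offset: f32) -> u8 {
    let x = c as f32 / 255.0;
    let y = amplitude * x.powf(exponent) + offset;
    (y.clamp(0.0, 1.0) * 255.0).round() as u8
}

/// Assumed threshold, taken from the 256-pixel figure in the summary.
const GAMMA_LUT_MIN_PIXELS: usize = 256;

/// Applies the Gamma transfer to one channel (one u8 per pixel).
fn apply_gamma(channel: &mut [u8], amplitude: f32, exponent: f32, offset: f32) {
    if channel.len() >= GAMMA_LUT_MIN_PIXELS {
        // Large image: 256 powf calls up front, then one lookup per pixel.
        let mut lut = [0u8; 256];
        for (i, entry) in lut.iter_mut().enumerate() {
            *entry = gamma_scalar(i as u8, amplitude, exponent, offset);
        }
        for v in channel.iter_mut() {
            *v = lut[*v as usize];
        }
    } else {
        // Small image: building the table would cost more than it saves.
        for v in channel.iter_mut() {
            *v = gamma_scalar(*v, amplitude, exponent, offset);
        }
    }
}
```

Both branches call the same scalar function on the same `u8` inputs, so in this sketch the output does not depend on which branch runs.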

Benchmark Results

Test case	Speedup
gamma-correct	1.76x – 1.85x

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

Replace per-pixel transfer function computation with a pre-computed 256-entry lookup table (LUT) for each active channel. Since inputs are always u8 (0-255), we can compute each transfer result once during LUT construction and then apply it via a single table lookup per pixel per channel. This eliminates expensive per-pixel operations, especially f32::powf() in the Gamma transfer function case. Benchmarks show 20-33x speedup for Gamma, 5-8x for Linear, and 10-28x for Table/Discrete functions.

Key changes:
- Pre-compute 256-entry LUT per active channel before the pixel loop
- Use identity LUT for inactive channels to avoid branching in hot loop
- Preserve original implementation as apply_naive for correctness testing
- Add bit-exact tests verifying LUT output matches per-pixel output for all 256 input values across all transfer function types
- Add public ComponentTransfer::new() constructor to usvg
- Add standalone benchmark (benches/component_transfer_bench.rs)

The LUT approach is bit-exact with the original: build_lut calls the same transfer_scalar function with the same u8 inputs that would be encountered at runtime, producing identical f32 arithmetic and u8 output.
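
As a rough sketch of the per-channel scheme in this commit message: the names `transfer_scalar` and `build_lut` come from the message, but their signatures, the `Transfer` enum, and the `[u8; 4]` pixel layout below are assumptions made for illustration and may not match the actual resvg code.

```rust
/// Hypothetical subset of transfer functions, for illustration only.
#[derive(Clone, Copy)]
enum Transfer {
    Linear { slope: f32, intercept: f32 },
    Gamma { amplitude: f32, exponent: f32, offset: f32 },
}

/// Per-value transfer: normalize, apply the function, clamp, quantize.
fn transfer_scalar(f: Transfer, c: u8) -> u8 {
    let x = c as f32 / 255.0;
    let y = match f {
        Transfer::Linear { slope, intercept } => slope * x + intercept,
        Transfer::Gamma { amplitude, exponent, offset } => {
            amplitude * x.powf(exponent) + offset
        }
    };
    (y.clamp(0.0, 1.0) * 255.0).round() as u8
}

/// Builds a 256-entry table; `None` (inactive channel) yields the identity.
fn build_lut(f: Option<Transfer>) -> [u8; 256] {
    let mut lut = [0u8; 256];
    for (i, entry) in lut.iter_mut().enumerate() {
        *entry = match f {
            Some(f) => transfer_scalar(f, i as u8),
            None => i as u8,
        };
    }
    lut
}

/// Applies one table per channel; the inner loop has no per-channel branch.
fn apply(pixels: &mut [[u8; 4]], funcs: [Option<Transfer>; 4]) {
    let luts = funcs.map(build_lut);
    for px in pixels.iter_mut() {
        for (c, lut) in px.iter_mut().zip(luts.iter()) {
            *c = lut[*c as usize];
        }
    }
}
```

Because every table entry is produced by `transfer_scalar` on the exact `u8` value it indexes, a bit-exact test only has to compare the two paths over the 256 possible inputs.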

- Add pixel count threshold: skip LUT build for images < 256 pixels and fall back to direct per-pixel transfer_scalar() calls, avoiding the fixed setup cost of up to 1024 scalar calls for tiny images
- Fix misleading comment that claimed LUT lookups are SIMD-friendly; table lookups are gather operations that cannot be auto-vectorized
- Convert identity_lut() function to const IDENTITY_LUT item for guaranteed zero runtime cost (compile-time evaluation)
- apply_naive already correctly gated behind #[cfg(test)]
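
One way a compile-time identity table like the one mentioned above could be written is shown below. The name `IDENTITY_LUT` comes from the commit message; the body is an assumption, not the PR's actual definition.

```rust
/// Identity table evaluated at compile time: IDENTITY_LUT[i] == i.
const IDENTITY_LUT: [u8; 256] = {
    let mut lut = [0u8; 256];
    let mut i = 0usize;
    // `while` is used because `for` loops are not allowed in const contexts.
    while i < 256 {
        lut[i] = i as u8;
        i += 1;
    }
    lut
};
```

Since the initializer is a const expression, the table is computed by the compiler and no construction code runs when it is used.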

The previous fixed threshold of 256 pixels caused regressions for cheap transfer functions (Linear, Table, Discrete) where LUT build cost exceeded per-pixel savings at small image sizes.

New per-function thresholds based on comprehensive benchmarking:
- Gamma (powf): 256 pixels (unchanged, LUT build ~30us amortized quickly)
- Table/Discrete: 1024 pixels (LUT build ~18-25us, cheap per-pixel ops)
- Linear: 2048 pixels (LUT build ~6us, but multiply-add is very cheap)

Also adds hybrid path for mixed-channel cases (e.g. R=Gamma, G=Linear) where different channels may cross their threshold at different image sizes.

Includes comprehensive benchmark example testing 11 image sizes, 8 transfer function types, 4 input patterns, and validates threshold correctness.
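
The per-function thresholds and the hybrid decision can be sketched like this. The threshold values are the ones quoted in the commit message; the `Kind` enum and the `lut_threshold` and `lut_plan` helpers are hypothetical names used only for this example.

```rust
/// Hypothetical tag for the transfer function assigned to a channel.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Identity,
    Linear,
    Table,
    Discrete,
    Gamma,
}

/// Minimum pixel count at which building a LUT is assumed to pay off.
/// `None` means the channel never needs a table.
fn lut_threshold(kind: Kind) -> Option<usize> {
    match kind {
        Kind::Gamma => Some(256),
        Kind::Table | Kind::Discrete => Some(1024),
        Kind::Linear => Some(2048),
        Kind::Identity => None,
    }
}

/// Hybrid plan: decide per channel whether to use a LUT or the scalar path.
fn lut_plan(channels: [Kind; 4], pixel_count: usize) -> [bool; 4] {
    channels.map(|k| lut_threshold(k).map_or(false, |t| pixel_count >= t))
}
```

For a 512-pixel image with R=Gamma and G=Linear, this plan would use a table for R only, which is the mixed-channel case the commit describes.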

Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores.
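
A minimal sketch of this pattern, assuming a generic list of benchmark configurations: `run_parallel` and its parameters are hypothetical and not taken from the benchmark in this PR. It uses only `std::thread::scope`, `std::thread::available_parallelism`, and `AtomicUsize`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `work` on every config across all available cores and returns the
/// results in the original config order.
fn run_parallel<T, R, F>(configs: &[T], work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0); // index of the next unclaimed config
    let done = AtomicUsize::new(0); // progress counter for reporting

    let mut results: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            // Scoped threads may borrow `configs`, `work`, and the counters.
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= configs.len() {
                        break;
                    }
                    local.push((i, work(&configs[i])));
                    let finished = done.fetch_add(1, Ordering::Relaxed) + 1;
                    eprintln!("progress: {finished}/{}", configs.len());
                }
                local
            }));
        }
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, r)| r).collect()
}
```

Timings measured this way share cores and caches between configurations, which is one reason a later commit in this PR switches to a sequential runner for reproducible numbers.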

- Remove #[inline(never)] from apply and build_lut (benchmarking artifacts)
- Fix misleading apply_naive comment ("preserved verbatim" was inaccurate)
- Remove benchmark files (component_transfer_bench, bench_component_transfer_comprehensive)
- Remove [[bench]] section from Cargo.toml

Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
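
The probe-then-scale budget cap might look roughly like the sketch below. The 30s and 10s limits are the figures from the commit message; the function `plan_iterations` and the exact scaling rule are assumptions, since the message does not spell out how bench_e2e.rs computes the count.

```rust
use std::time::Duration;

/// Total time budget per benchmark case (from the commit message).
const TOTAL_BUDGET: Duration = Duration::from_secs(30);
/// A case is skipped if a single probe run takes longer than this.
const MAX_PROBE: Duration = Duration::from_secs(10);

/// Given the duration of one probe run and the iteration count requested
/// for this resolution, returns how many iterations to run, or `None`
/// if the case should be skipped.
fn plan_iterations(probe: Duration, requested: u32) -> Option<u32> {
    if probe > MAX_PROBE {
        return None;
    }
    if probe.is_zero() {
        // Probe too fast to measure: the budget is not a constraint.
        return Some(requested);
    }
    // How many runs of this length fit into the total budget.
    let affordable = TOTAL_BUDGET.as_nanos() / probe.as_nanos();
    let capped = affordable.min(requested as u128) as u32;
    Some(capped.max(1))
}
```

With these assumptions, a 1 ms probe at a requested 2000 iterations runs all 2000, while a 1 s probe at a requested 100 iterations is cut down to 30.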

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 and others added 10 commits February 21, 2026 16:11

bench: parallelize feComponentTransfer benchmark with std::thread::scope

dc05351

Add inline(never) annotations to component_transfer optimized functions

506054a

Add component_transfer benchmarks for real-world usage patterns

833234a

Apply cargo fmt to bench_e2e.rs

2bc35e2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply cargo fmt --all to fix CI formatting check

e4b525a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 closed this Feb 23, 2026

wjc911 wants to merge 10 commits into linebender:main from wjc911:feComponentTransfer_perf_optimize
