perf: LUT precomputation for feComponentTransfer Gamma and Table functions #1026
Closed
wjc911 wants to merge 10 commits into linebender:main
Replace per-pixel transfer function computation with a pre-computed 256-entry lookup table (LUT) for each active channel. Since inputs are always u8 (0-255), we can compute each transfer result once during LUT construction and then apply it via a single table lookup per pixel per channel. This eliminates expensive per-pixel operations, especially f32::powf() in the Gamma transfer function case.

Benchmarks show 20-33x speedup for Gamma, 5-8x for Linear, and 10-28x for Table/Discrete functions.

Key changes:
- Pre-compute 256-entry LUT per active channel before the pixel loop
- Use identity LUT for inactive channels to avoid branching in hot loop
- Preserve original implementation as apply_naive for correctness testing
- Add bit-exact tests verifying LUT output matches per-pixel output for all 256 input values across all transfer function types
- Add public ComponentTransfer::new() constructor to usvg
- Add standalone benchmark (benches/component_transfer_bench.rs)

The LUT approach is bit-exact with the original: build_lut calls the same transfer_scalar function with the same u8 inputs that would be encountered at runtime, producing identical f32 arithmetic and u8 output.
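The build-once, look-up-per-pixel idea can be sketched as follows. This is a minimal illustration, not the PR's code: the `transfer_scalar` signature, the rounding rule, and the `apply_lut` helper are assumptions made for the example, and only the Gamma case is shown.

```rust
/// Hypothetical stand-in for the PR's `transfer_scalar`: maps one u8 channel
/// value through a gamma curve (amplitude * c^exponent + offset).
/// The rounding and clamping here are assumptions, not resvg's exact rules.
fn transfer_scalar(c: u8, amplitude: f32, exponent: f32, offset: f32) -> u8 {
    let x = f32::from(c) / 255.0;
    let y = amplitude * x.powf(exponent) + offset;
    // Clamp to [0, 1], then round to the nearest u8.
    (y.clamp(0.0, 1.0) * 255.0).round() as u8
}

/// Evaluate the transfer function once for every possible u8 input.
fn build_lut(amplitude: f32, exponent: f32, offset: f32) -> [u8; 256] {
    let mut lut = [0u8; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = transfer_scalar(i as u8, amplitude, exponent, offset);
    }
    lut
}

/// Apply a LUT to one channel of interleaved RGBA data (hypothetical helper).
fn apply_lut(pixels: &mut [u8], channel: usize, lut: &[u8; 256]) {
    for px in pixels.chunks_exact_mut(4) {
        px[channel] = lut[usize::from(px[channel])];
    }
}
```

Because the table is filled by calling the same scalar function on the same u8 inputs, a lookup returns exactly what a per-pixel call would have returned, which is what makes a bit-exact comparison test possible.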
- Add pixel count threshold: skip LUT build for images < 256 pixels and fall back to direct per-pixel transfer_scalar() calls, avoiding the fixed setup cost of up to 1024 scalar calls for tiny images
- Fix misleading comment that claimed LUT lookups are SIMD-friendly; table lookups are gather operations that cannot be auto-vectorized
- Convert identity_lut() function to const IDENTITY_LUT item for guaranteed zero runtime cost (compile-time evaluation)
- apply_naive already correctly gated behind #[cfg(test)]
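A compile-time identity table and a size cutoff might look roughly like the sketch below. The `identity_lut` helper, the `LUT_MIN_PIXELS` name, and `should_use_lut` are hypothetical; only the 256-pixel figure and the `IDENTITY_LUT` name come from the commit message.

```rust
/// Build the identity mapping in a `const fn` so it can be evaluated at
/// compile time (assumed shape; the PR's real definition may differ).
const fn identity_lut() -> [u8; 256] {
    let mut lut = [0u8; 256];
    let mut i = 0;
    // `for` loops are not allowed in const fn, so use `while`.
    while i < 256 {
        lut[i] = i as u8;
        i += 1;
    }
    lut
}

/// Evaluated by the compiler; no table is built at runtime.
const IDENTITY_LUT: [u8; 256] = identity_lut();

/// Hypothetical name for the 256-pixel cutoff described above.
const LUT_MIN_PIXELS: usize = 256;

/// Below the cutoff, building up to 4 x 256 entries costs more than
/// calling the scalar function directly on each pixel.
fn should_use_lut(pixel_count: usize) -> bool {
    pixel_count >= LUT_MIN_PIXELS
}
```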
The previous fixed threshold of 256 pixels caused regressions for cheap transfer functions (Linear, Table, Discrete) where LUT build cost exceeded per-pixel savings at small image sizes.

New per-function thresholds based on comprehensive benchmarking:
- Gamma (powf): 256 pixels (unchanged, LUT build ~30us amortized quickly)
- Table/Discrete: 1024 pixels (LUT build ~18-25us, cheap per-pixel ops)
- Linear: 2048 pixels (LUT build ~6us, but multiply-add is very cheap)

Also adds hybrid path for mixed-channel cases (e.g. R=Gamma, G=Linear) where different channels may cross their threshold at different image sizes.

Includes comprehensive benchmark example testing 11 image sizes, 8 transfer function types, 4 input patterns, and validates threshold correctness.
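The per-function thresholds and the per-channel (hybrid) decision could be expressed along these lines. The `TransferKind` enum, `lut_threshold`, and `plan_channels` are illustrative names, not the PR's API; the threshold values are the ones listed above, and treating them as inclusive is an assumption.

```rust
/// Hypothetical tag for the kind of transfer function on a channel.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TransferKind {
    Gamma,
    Table,
    Discrete,
    Linear,
}

/// Minimum pixel count at which building a LUT is assumed to pay off.
fn lut_threshold(kind: TransferKind) -> usize {
    match kind {
        TransferKind::Gamma => 256,
        TransferKind::Table | TransferKind::Discrete => 1024,
        TransferKind::Linear => 2048,
    }
}

/// Decide per channel (R, G, B, A) whether to use a LUT. A mixed result
/// such as [true, false, ..] corresponds to the hybrid path.
fn plan_channels(kinds: [TransferKind; 4], pixel_count: usize) -> [bool; 4] {
    kinds.map(|kind| pixel_count >= lut_threshold(kind))
}
```

For example, with R=Gamma and G=Linear on a 512-pixel image, only the red channel would take the LUT path under these assumptions.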
Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores.
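One way to combine scoped threads with an `AtomicUsize` progress counter is sketched here. The `run_parallel` function and its signature are hypothetical, and the second atomic used to hand out work items is an addition for the example rather than something the commit message describes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `work` on every config using all available cores and return the
/// results in input order. `progress` counts finished configurations.
fn run_parallel<T, R, F>(configs: &[T], progress: &AtomicUsize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Shared index of the next unclaimed configuration.
    let next = AtomicUsize::new(0);
    let workers = thread::available_parallelism().map_or(1, |n| n.get());

    let mut results: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            // Scoped threads may borrow `next`, `configs`, and `work`.
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= configs.len() {
                        break;
                    }
                    local.push((i, work(&configs[i])));
                    progress.fetch_add(1, Ordering::Relaxed);
                }
                local
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Restore input order, since workers finish in arbitrary order.
    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, r)| r).collect()
}
```

Parallel runs like this maximize throughput but let benchmark cases compete for cores and caches, so timings taken this way are noisier than single-threaded ones.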
- Remove #[inline(never)] from apply and build_lut (benchmarking artifacts)
- Fix misleading apply_naive comment ("preserved verbatim" was inaccurate)
- Remove benchmark files (component_transfer_bench,
bench_component_transfer_comprehensive)
- Remove [[bench]] section from Cargo.toml
Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
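The probe-then-scale budget cap could work roughly as in this sketch: time one probe run, skip the case if the probe alone is too slow, and otherwise cap the iteration count so the case fits the budget. `plan_iterations` and the constant names are hypothetical; the 30s and 10s limits are the ones stated above.

```rust
use std::time::Duration;

/// Total time budget per benchmark case (from the commit message).
const CASE_BUDGET: Duration = Duration::from_secs(30);
/// A single probe slower than this causes the case to be skipped.
const PROBE_LIMIT: Duration = Duration::from_secs(10);

/// Given the measured time of one probe run and the iteration count
/// requested for this resolution, return `None` to skip the case or the
/// iteration count scaled down to fit the budget.
fn plan_iterations(probe: Duration, requested: u32) -> Option<u32> {
    if probe > PROBE_LIMIT {
        return None;
    }
    if probe.is_zero() {
        // Too fast to measure; the budget cannot be exceeded meaningfully.
        return Some(requested);
    }
    let affordable = (CASE_BUDGET.as_secs_f64() / probe.as_secs_f64()).floor();
    let capped = affordable.min(f64::from(requested)) as u32;
    // Always run at least once if the probe itself was acceptable.
    Some(capped.max(1))
}
```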
Summary
- Use a precomputed LUT for Gamma transfer functions when the image has >= 256 pixels, replacing per-pixel powf calls with a single table lookup
- Use a precomputed LUT for Table transfer functions when the image has >= 1024 pixels, replacing per-pixel linear interpolation with a direct array index

Benchmark Results
Test Results
All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code