perf: document feColorMatrix f32_bound NaN behavior#1024
Closed
wjc911 wants to merge 8 commits intolinebender:mainfrom
Closed
perf: document feColorMatrix f32_bound NaN behavior#1024wjc911 wants to merge 8 commits intolinebender:mainfrom
wjc911 wants to merge 8 commits intolinebender:mainfrom
Conversation
Transpose the 4x5 row-major color matrix into five column vectors of [f32; 4] so LLVM can auto-vectorize the per-pixel matrix multiply into packed SIMD instructions. Benchmarks show ~1.5-1.7x throughput improvement for the full Matrix variant (the most expensive path). Saturate, HueRotate, and LuminanceToAlpha are left unchanged since their compact 3x3 / scalar loops are already well-optimized by the compiler. The original naive implementation is preserved as apply_naive for correctness testing. Ten bit-exact tests verify identical output across all matrix types, boundary angles, extreme coefficients, and all 256 possible channel values. A standalone benchmark (benches/color_matrix_bench.rs) covers all four matrix types at 64x64 through 4096x4096 resolutions.
Replace f32_bound(0.0, c, 1.0) with c.clamp(0.0, 1.0) in the color_matrix module for branchless SIMD-friendly clamping. The rust-lang/rust#44095 issue was resolved in Rust 1.35, so remove the outdated TODO comment. Gate the naive reference implementation behind #[cfg(test)] instead of #[allow(dead_code)] so it is only compiled for testing.
The column-major matrix transposition in the full 4x5 Matrix path was counterproductive: LLVM already auto-vectorizes the row-major scalar loop across 4 pixels simultaneously (processing 4 RGBA pixels at once), while the column-major approach vectorized across 4 channels of 1 pixel with heavy register spilling to stack memory (~422 instructions vs ~227). Revert the Matrix path to the original row-major scalar loop while keeping the f32::clamp() change (replacing manual f32_bound), which provides 1.1x-2.0x improvement across all matrix types. Add comprehensive benchmark (examples/bench_colormatrix_comprehensive.rs) testing 7 image sizes, 16 matrix types, 3 input patterns with median-of-7 interleaved measurement methodology. Update existing bench to compare old f32_bound vs new f32::clamp.
Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores.
- Remove misleading auto-vectorization comment (apply was not changed) - Remove apply_naive and tests that compared identical implementations - Remove benchmark files (color_matrix_bench, bench_colormatrix_comprehensive) - Remove [[bench]] section from Cargo.toml - Add rationale comment for f32_bound explaining why it is preferred over f32::clamp (avoids NaN-propagation overhead that inhibits SIMD)
Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
f32_boundexplaining that the clamping order(v.max(0.0)).min(1.0)correctly propagates NaN as 0.0, matching the SVG specification's out-of-range clamping semanticsNotes
This is a documentation/clarity change for an already-optimized code path. The comment ensures the non-obvious NaN propagation behavior is not accidentally "fixed" in a way that would regress correctness or performance.
Test Results
All 1723/1723 integration tests pass (
cargo test --release -p resvg --test integration).🤖 Generated with Claude Code