perf: tiled vertical pass for feGaussianBlur on large images #1025
Closed
wjc911 wants to merge 13 commits into linebender:main
Box blur: Use [i32; 4] SIMD-friendly accumulators and process vertical columns in tiles of 16 for better cache locality. ~1.1-1.4x speedup.

IIR blur: Process all 4 RGBA channels simultaneously using [f64; 4] arrays instead of 4 separate passes, and tile the vertical pass in strips of 32 columns. ~2.9-3.8x speedup.

Both optimizations are bit-exact with the original implementations. Original functions preserved as *_naive for correctness verification.
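To make the tiling idea concrete, here is a minimal sketch of a vertical box-blur pass that keeps one [i32; 4] running sum per column and walks a strip of columns row by row. It is not the PR's code: the function names, the zero-padded edge handling, and the rounding rule are assumptions chosen for illustration. Passing tile = 1 gives the column-at-a-time order, so the two orders can be compared for bit-exactness.

```rust
/// Adds (sign = 1) or removes (sign = -1) one row of RGBA8 pixels from the
/// per-column running sums. `acc.len()` is the number of columns in the strip.
fn accumulate_row(acc: &mut [[i32; 4]], src: &[u8], width: usize, x0: usize, y: usize, sign: i32) {
    let start = (y * width + x0) * 4;
    let row = &src[start..start + acc.len() * 4];
    for (a, px) in acc.iter_mut().zip(row.chunks_exact(4)) {
        for c in 0..4 {
            a[c] += sign * px[c] as i32;
        }
    }
}

/// Vertical box blur over an RGBA8 buffer, `tile` columns at a time.
/// Assumptions (not from the PR): rows outside the image count as zero,
/// and the mean is rounded to nearest.
fn box_blur_vertical(src: &[u8], dst: &mut [u8], width: usize, height: usize, radius: usize, tile: usize) {
    assert!(tile > 0);
    assert_eq!(src.len(), width * height * 4);
    assert_eq!(dst.len(), src.len());
    let window = (2 * radius + 1) as i32;
    let mut acc: Vec<[i32; 4]> = vec![[0; 4]; tile];

    let mut x0 = 0;
    while x0 < width {
        let cols = tile.min(width - x0);
        for a in acc.iter_mut() {
            *a = [0; 4];
        }
        // Prime the window with the rows that precede the first incoming row.
        for y in 0..radius.min(height) {
            accumulate_row(&mut acc[..cols], src, width, x0, y, 1);
        }
        for y in 0..height {
            // Row entering the window at the bottom.
            if y + radius < height {
                accumulate_row(&mut acc[..cols], src, width, x0, y + radius, 1);
            }
            // Each row touches `cols` adjacent pixels, i.e. contiguous memory.
            let start = (y * width + x0) * 4;
            let out = &mut dst[start..start + cols * 4];
            for (a, px) in acc[..cols].iter().zip(out.chunks_exact_mut(4)) {
                for c in 0..4 {
                    px[c] = ((a[c] + window / 2) / window) as u8;
                }
            }
            // Row leaving the window at the top.
            if y >= radius {
                accumulate_row(&mut acc[..cols], src, width, x0, y - radius, -1);
            }
        }
        x0 += cols;
    }
}
```

Because both orders perform the same integer additions per column, calling the function with tile = 1 and tile = 16 on the same input should yield identical buffers, which is the kind of check the preserved *_naive functions make possible.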
- box_blur/iir_blur: fall back to naive implementation for images with width < 16 or height < 16 where tiling/interleaving overhead exceeds the benefit
- Remove #[allow(dead_code)] from apply_naive since it is now used as a runtime fallback in release builds
- Correct SIMD comments: [i32;4] enables 128-bit within-pixel SIMD but loop-carried val dependency prevents across-pixel vectorization; [f64;4] enables AVX 256-bit within-pixel but IIR serial dependency limits gains to reducing 16 passes to 4
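The small-image fallback in the first bullet amounts to a dimension check before dispatch. A hedged sketch follows; the enum, constant, and function names are hypothetical and only the 16-pixel limit comes from the commit message.

```rust
/// Minimum width and height for the tiled/interleaved path (from the commit message).
const MIN_TILED_DIM: usize = 16;

/// Hypothetical tag for which implementation to run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlurPath {
    Naive,
    Tiled,
}

/// Picks the naive path when either dimension is below the limit.
fn choose_blur_path(width: usize, height: usize) -> BlurPath {
    if width < MIN_TILED_DIM || height < MIN_TILED_DIM {
        BlurPath::Naive
    } else {
        BlurPath::Tiled
    }
}
```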
Tests box blur (sigma >= 2.0) and IIR blur (sigma < 2.0) across 11 image sizes (4x4 to 1024x1024), 8 box sigma values, 5 IIR sigma values, 3 input patterns (opaque, gradient alpha, random alpha), asymmetric sigmas, and threshold boundary cases around 16x16.
Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores.
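One way to combine scoped threads with an AtomicUsize is sketched below. It is an assumption about the general shape, not the PR's harness: here the counter doubles as the index of the next unclaimed configuration, and `run_parallel` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `work` on every config using `threads` scoped worker threads.
/// Results are returned in the same order as `configs`.
fn run_parallel<T, R, F>(configs: &[T], threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Shared counter: each fetch_add claims one configuration.
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<(usize, R)>> = Mutex::new(Vec::with_capacity(configs.len()));

    // Scoped threads may borrow `configs`, `next`, and `results` directly,
    // because the scope joins every worker before returning.
    thread::scope(|s| {
        for _ in 0..threads.max(1) {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= configs.len() {
                    break;
                }
                let r = work(&configs[i]);
                results.lock().unwrap().push((i, r));
            });
        }
    });

    let mut out = results.into_inner().unwrap();
    out.sort_by_key(|(i, _)| *i);
    out.into_iter().map(|(_, r)| r).collect()
}
```

The thread count could be taken from `std::thread::available_parallelism()`. Note that timing measurements taken this way share caches and memory bandwidth between workers, which can add noise to per-configuration results.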
… cold early-return
The tiled vertical pass showed consistent performance regressions for 1024x768 images (786k pixels) with sigma >= 10, believed to be caused by power-of-2 stride cache set conflicts at 4096-byte row stride.

Raise the pixel threshold from 250,000 to 1,000,000 so the tiled path only activates for images larger than ~1M pixels (e.g. 1500x1000), where benchmarks confirm consistent speedups of 1.2x+.

Also update both benchmark harnesses to mirror this threshold:
- box_blur_opt::apply() now delegates to the naive path for images below the threshold, accurately reflecting production behavior
- bench_blur_comprehensive runs sequentially (1 thread) to eliminate CPU cache contention between naive and optimized measurements
- Added #[inline(always)] on the naive apply/box_blur_inner fallback path to prevent cross-module call overhead

All cargo bench and bench_blur_comprehensive results now show >= 0.97x with no regressions below 0.95x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
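The power-of-2 stride hypothesis in this commit message can be illustrated with a little arithmetic. The sketch below assumes a cache with 64-byte lines and 64 sets (for example a 32 KiB, 8-way L1); that geometry and the function names are assumptions, not something stated in the PR. A 1024-pixel RGBA row is 1024 * 4 = 4096 bytes, so consecutive pixels in one column are exactly 4096 bytes apart.

```rust
/// Assumed cache geometry: 64-byte lines, 64 sets (e.g. 32 KiB, 8-way).
const LINE_BYTES: usize = 64;
const NUM_SETS: usize = 64;

/// Set index of a byte address under the assumed geometry.
fn cache_set(addr: usize) -> usize {
    (addr / LINE_BYTES) % NUM_SETS
}

/// Counts how many distinct sets the first pixel of each of `rows` rows maps to
/// when walking down one column with the given row stride in bytes.
fn distinct_sets_in_column(row_stride_bytes: usize, rows: usize) -> usize {
    let mut seen = [false; NUM_SETS];
    for y in 0..rows {
        seen[cache_set(y * row_stride_bytes)] = true;
    }
    seen.iter().filter(|&&s| s).count()
}
```

With a 4096-byte stride every row of the column lands in the same set, so a vertical walk can evict its own lines once it exceeds the associativity. A 1000-pixel row (4000-byte stride) spreads the same walk over many sets. This only shows why the hypothesis is plausible; it does not confirm that the measured regression had this cause.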
Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
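A probe-then-scale budget cap could look roughly like the following sketch. The 30s and 10s limits come from the commit message; the function names, the exact scaling rule, and the choice to always run at least one iteration are assumptions.

```rust
use std::time::{Duration, Instant};

/// Total time budget per benchmark case (from the commit message).
const BUDGET: Duration = Duration::from_secs(30);
/// A case is skipped when a single probe run exceeds this (from the commit message).
const PROBE_LIMIT: Duration = Duration::from_secs(10);

/// Decides how many iterations to run given one probe timing.
/// Returns None when the case should be skipped.
fn plan_iterations(probe: Duration, requested: u32) -> Option<u32> {
    if probe > PROBE_LIMIT {
        return None;
    }
    if probe.is_zero() {
        return Some(requested);
    }
    // Largest iteration count whose estimated total stays within the budget.
    let fit = (BUDGET.as_secs_f64() / probe.as_secs_f64()).floor();
    Some(fit.min(requested as f64).max(1.0) as u32)
}

/// Probes `f` once, then runs the planned number of iterations and
/// returns the mean time per iteration, or None if the case was skipped.
fn bench<F: FnMut()>(mut f: F, requested: u32) -> Option<Duration> {
    let t = Instant::now();
    f();
    let probe = t.elapsed();

    let iters = plan_iterations(probe, requested)?;
    let t = Instant::now();
    for _ in 0..iters {
        f();
    }
    Some(t.elapsed() / iters)
}
```

Under these assumptions a case that takes 1s per run with 100 requested iterations is cut down to 30, while a fast case keeps its full per-resolution count.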
Sequential CPU-pinned benchmarks show the tiled path is beneficial for images starting at ~250k pixels (e.g. 600x400), not just 1M+:

backdrop-blur 600x400 (240k px): 1.01x -> 1.13x
backdrop-blur 800x600 (480k px): 1.01x -> 1.13x
backdrop-blur 1024x768 (786k px): 1.16x -> 1.16x (unchanged)
backdrop-blur 1500x1000 (1.5M px): 1.10x -> 1.14x

The earlier 1M threshold was set based on noisy parallel benchmark measurements that incorrectly showed a regression at 786k pixels. Sequential benchmarks with proper CPU pinning confirm no regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Benchmark Results
Test Results
All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code