perf: tiled vertical pass for feGaussianBlur on large images by wjc911 · Pull Request #1025 · linebender/resvg

wjc911 · 2026-02-22T18:31:22Z

Summary

- Add a tiled vertical blur pass for images larger than 1 million pixels with radius >= 8
- Process columns in cache-friendly tiles (64-column blocks) to improve L1/L2 cache utilization during the vertical pass, which accesses memory with large strides
- The horizontal pass is unaffected; tiling is applied only where stride-based access causes cache thrashing
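To make the tiling idea concrete, here is a minimal sketch, not the code from this PR. It assumes a single-channel `u32` image and a plain sliding-window sum (no averaging, window truncated at the image edges). The names `TILE_COLS`, `vertical_box_naive`, and `vertical_box_tiled` are hypothetical. The naive version walks one column at a time, so consecutive reads are a full row apart; the tiled version keeps one running sum per column of a 64-column block and reads contiguous row segments.

```rust
/// Hypothetical tile width, taken from the "64-column blocks" in the summary.
const TILE_COLS: usize = 64;

/// Reference: vertical window sum over rows [y - radius, y + radius],
/// truncated at the image edges, visiting one column at a time.
fn vertical_box_naive(src: &[u32], dst: &mut [u32], width: usize, height: usize, radius: usize) {
    for x in 0..width {
        for y in 0..height {
            let y0 = y.saturating_sub(radius);
            let y1 = (y + radius).min(height - 1);
            let mut sum = 0u32;
            for yy in y0..=y1 {
                sum += src[yy * width + x]; // stride of `width` elements per step
            }
            dst[y * width + x] = sum;
        }
    }
}

/// Same result, but columns are processed in blocks of TILE_COLS so that
/// every memory access touches a contiguous segment of a row.
fn vertical_box_tiled(src: &[u32], dst: &mut [u32], width: usize, height: usize, radius: usize) {
    if width == 0 || height == 0 {
        return;
    }
    let mut x0 = 0;
    while x0 < width {
        let x1 = (x0 + TILE_COLS).min(width);
        let cols = x1 - x0;
        let mut acc = [0u32; TILE_COLS]; // one running sum per column in the tile

        // Prime the window for y = 0.
        for yy in 0..=radius.min(height - 1) {
            let row = &src[yy * width + x0..yy * width + x1];
            for (a, &v) in acc[..cols].iter_mut().zip(row) {
                *a += v;
            }
        }
        for y in 0..height {
            dst[y * width + x0..y * width + x1].copy_from_slice(&acc[..cols]);
            // Slide the window down by one row.
            let enter = y + 1 + radius;
            if enter < height {
                let row = &src[enter * width + x0..enter * width + x1];
                for (a, &v) in acc[..cols].iter_mut().zip(row) {
                    *a += v;
                }
            }
            if y >= radius {
                let leave = y - radius;
                let row = &src[leave * width + x0..leave * width + x1];
                for (a, &v) in acc[..cols].iter_mut().zip(row) {
                    *a -= v;
                }
            }
        }
        x0 = x1;
    }
}
```

Both functions produce identical output; only the memory access order differs, which is where any cache benefit would come from.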

Benchmark Results

Test case	Speedup
backdrop-blur (1500x1000, radius 8)	1.16x

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

Box blur: Use [i32; 4] SIMD-friendly accumulators and process vertical columns in tiles of 16 for better cache locality. ~1.1-1.4x speedup.

IIR blur: Process all 4 RGBA channels simultaneously using [f64; 4] arrays instead of 4 separate passes, and tile the vertical pass in strips of 32 columns. ~2.9-3.8x speedup.

Both optimizations are bit-exact with the original implementations. Original functions preserved as *_naive for correctness verification.
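As an illustration of the `[i32; 4]` accumulator idea (a sketch under stated assumptions, not the PR's implementation), the following computes a sliding-window sum along one row of RGBA8 pixels. The function names are hypothetical, and the window is simply truncated at the row ends. The first function handles one channel per call, so covering RGBA takes four passes; the second carries all four channels in a single accumulator and makes one pass.

```rust
/// Reference: window sum for a single channel `c`; four calls cover RGBA.
fn row_box_sum_channel(row: &[[u8; 4]], radius: usize, c: usize) -> Vec<i32> {
    (0..row.len())
        .map(|x| {
            let lo = x.saturating_sub(radius);
            let hi = (x + radius).min(row.len() - 1);
            row[lo..=hi].iter().map(|p| p[c] as i32).sum()
        })
        .collect()
}

/// One pass over the row with all four channels in one [i32; 4] accumulator.
fn row_box_sum_rgba(row: &[[u8; 4]], radius: usize) -> Vec<[i32; 4]> {
    let n = row.len();
    let mut out = Vec::with_capacity(n);
    if n == 0 {
        return out;
    }
    let mut acc = [0i32; 4];
    // Prime the window for x = 0.
    for px in &row[..=radius.min(n - 1)] {
        for c in 0..4 {
            acc[c] += px[c] as i32;
        }
    }
    for x in 0..n {
        out.push(acc);
        // Slide right: add the entering pixel, drop the leaving one.
        let enter = x + 1 + radius;
        if enter < n {
            for c in 0..4 {
                acc[c] += row[enter][c] as i32;
            }
        }
        if x >= radius {
            for c in 0..4 {
                acc[c] -= row[x - radius][c] as i32;
            }
        }
    }
    out
}
```

The accumulator update for pixel `x + 1` depends on the value at pixel `x`, so the four lanes can be updated together but successive pixels cannot.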

- box_blur/iir_blur: fall back to naive implementation for images with width < 16 or height < 16, where tiling/interleaving overhead exceeds the benefit
- Remove #[allow(dead_code)] from apply_naive since it is now used as a runtime fallback in release builds
- Correct SIMD comments: [i32;4] enables 128-bit within-pixel SIMD, but the loop-carried val dependency prevents across-pixel vectorization; [f64;4] enables AVX 256-bit within-pixel SIMD, but the IIR serial dependency limits gains to reducing 16 passes to 4

Tests box blur (sigma >= 2.0) and IIR blur (sigma < 2.0) across 11 image sizes (4x4 to 1024x1024), 8 box sigma values, 5 IIR sigma values, 3 input patterns (opaque, gradient alpha, random alpha), asymmetric sigmas, and threshold boundary cases around 16x16.

Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores.
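A minimal sketch of this pattern, assuming nothing about the actual benchmark code: `run_configs_parallel` and its `u64` configuration type are hypothetical stand-ins. One `AtomicUsize` hands out the next configuration index to whichever worker is free, and a second one counts completed configurations for progress reporting. `std::thread::scope` lets the workers borrow `configs` and the counters without `Arc`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `work` on every entry of `configs` across all available cores.
/// Returns the results in input order plus the final progress count.
fn run_configs_parallel<F>(configs: &[u64], work: F) -> (Vec<u64>, usize)
where
    F: Fn(u64) -> u64 + Sync,
{
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0); // index of the next unclaimed config
    let done = AtomicUsize::new(0); // progress counter
    let mut results = vec![0u64; configs.len()];

    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        let i = next.fetch_add(1, Ordering::Relaxed);
                        if i >= configs.len() {
                            break;
                        }
                        local.push((i, work(configs[i])));
                        done.fetch_add(1, Ordering::Relaxed);
                    }
                    local
                })
            })
            .collect();
        // Merge per-thread results back into input order.
        for h in handles {
            for (i, r) in h.join().unwrap() {
                results[i] = r;
            }
        }
    });

    (results, done.load(Ordering::Relaxed))
}
```

Running measurements concurrently like this shortens wall-clock time, but the workers share caches and memory bandwidth, which can distort per-configuration timings.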

… cold early-return

The tiled vertical pass showed consistent performance regressions for 1024x768 images (786k pixels) with sigma >= 10, believed to be caused by power-of-2 stride cache set conflicts at 4096-byte row stride. Raise the pixel threshold from 250,000 to 1,000,000 so the tiled path only activates for images larger than ~1M pixels (e.g. 1500x1000), where benchmarks confirm consistent speedups of 1.2x+.

Also update both benchmark harnesses to mirror this threshold:
- box_blur_opt::apply() now delegates to the naive path for images below the threshold, accurately reflecting production behavior
- bench_blur_comprehensive runs sequentially (1 thread) to eliminate CPU cache contention between naive and optimized measurements
- Added #[inline(always)] on the naive apply/box_blur_inner fallback path to prevent cross-module call overhead

All cargo bench and bench_blur_comprehensive results now show >= 0.97x with no regressions below 0.95x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
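The stride hypothesis in this commit message can be illustrated with a little arithmetic. The sketch below is not from the PR and assumes a cache geometry that the PR does not state: 64-byte lines and 64 sets (for example, a 32 KiB 8-way L1 data cache). With RGBA8 pixels, a 1024-pixel row is 1024 * 4 = 4096 bytes, which equals 64 lines * 64 sets, so every row of a given column maps to the same set. The helper name `distinct_sets_in_column` is hypothetical. Note that a later commit in this PR attributes the measured regression to benchmark noise, so this shows only what the hypothesis means, not that it was the cause.

```rust
use std::collections::HashSet;

// Assumed cache geometry (hypothetical, not stated in the PR).
const LINE_BYTES: usize = 64;
const NUM_SETS: usize = 64;

/// Counts how many distinct cache sets are touched when walking down the
/// first column of an RGBA8 image for `rows` rows.
fn distinct_sets_in_column(width_px: usize, rows: usize) -> usize {
    let stride = width_px * 4; // bytes per row, 4 bytes per RGBA8 pixel
    (0..rows)
        .map(|y| (y * stride / LINE_BYTES) % NUM_SETS)
        .collect::<HashSet<_>>()
        .len()
}
```

Under these assumptions, `distinct_sets_in_column(1024, 64)` is 1, because every row lands in the same set, while a 1500-pixel-wide image (6000-byte stride) spreads its column accesses over several sets.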

Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
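One way to read "probe-then-scale" is sketched below; this is an assumption about the logic, not the code in bench_e2e.rs. The function `plan_iterations` is hypothetical. It takes the duration of a single probe run and the per-resolution iteration count (passed in by the caller, since the message only gives the endpoints 2000 and 100), skips the case if the probe exceeded 10 s, and otherwise caps the iteration count so the estimated total stays within 30 s.

```rust
use std::time::Duration;

const BUDGET: Duration = Duration::from_secs(30); // total time allowed per case
const SKIP_ABOVE: Duration = Duration::from_secs(10); // skip if one probe is slower

/// Returns how many iterations to run, or None if the case should be skipped.
fn plan_iterations(probe: Duration, base_iters: u32) -> Option<u32> {
    if probe > SKIP_ABOVE {
        return None;
    }
    if probe.is_zero() {
        // Probe too fast to measure; fall back to the configured count.
        return Some(base_iters);
    }
    // Number of probe-length runs that fit in the budget.
    let affordable = (BUDGET.as_secs_f64() / probe.as_secs_f64()).floor();
    let capped = affordable.min(base_iters as f64) as u32;
    Some(capped.max(1))
}
```

For example, a 1 ms probe with a base count of 2000 keeps all 2000 iterations, while a 1 s probe with a base count of 100 is scaled down to 30.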

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Sequential CPU-pinned benchmarks show the tiled path is beneficial for images starting at ~250k pixels (e.g. 600x400), not just 1M+:
- backdrop-blur 600x400 (240k px): 1.01x -> 1.13x
- backdrop-blur 800x600 (480k px): 1.01x -> 1.13x
- backdrop-blur 1024x768 (786k px): 1.16x -> 1.16x (unchanged)
- backdrop-blur 1500x1000 (1.5M px): 1.10x -> 1.14x

The earlier 1M threshold was set based on noisy parallel benchmark measurements that incorrectly showed a regression at 786k pixels. Sequential benchmarks with proper CPU pinning confirm no regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 and others added 13 commits February 21, 2026 16:22

bench: parallelize feGaussianBlur benchmark with std::thread::scope

9e69ea1

Fix gaussian blur regression: restore original hot paths, optimize as…

892c26b

… cold early-return

Fix box_blur register spill: move cold dispatch outside hot loop

c7df0ea

Clean up feGaussianBlur optimization code

c1cc5b4

Rewrite blur benchmarks for real-world usage patterns

7e04336

Apply cargo fmt to bench_e2e.rs

5af3e75

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply cargo fmt --all to fix CI formatting check

7fd5d53

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 closed this Feb 23, 2026

wjc911 wants to merge 13 commits into linebender:main from wjc911:feGaussianBlur_perf_optimize
