
perf: tiled vertical pass for feGaussianBlur on large images #1025

Closed
wjc911 wants to merge 13 commits into linebender:main from wjc911:feGaussianBlur_perf_optimize

wjc911 commented Feb 22, 2026

Summary

  • Add a tiled vertical blur pass for images larger than 1 million pixels with radius >= 8
  • Process columns in cache-friendly tiles (64-column blocks) to improve L1/L2 cache utilization during the vertical pass, which accesses memory with large strides
  • The horizontal pass is unaffected; tiling is applied only where stride-based access causes cache thrashing
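
The tiling idea in the bullets above can be sketched as follows. This is a minimal illustration, not the code from this PR: it uses a single-channel, unnormalized sliding box sum as a stand-in for the real blur kernel, and the names `use_tiled_path`, `vertical_box_sum_naive`, `vertical_box_sum_tiled`, and `row_segment` are hypothetical. Only the 64-column block width and the dispatch thresholds come from the summary.

```rust
/// Column-block width taken from the summary above.
const TILE_COLS: usize = 64;

/// Hypothetical dispatch check using the thresholds stated in the summary.
fn use_tiled_path(width: usize, height: usize, radius: usize) -> bool {
    width * height > 1_000_000 && radius >= 8
}

/// Contiguous slice of row `y` covering columns `x0..x1`.
fn row_segment(src: &[u8], width: usize, y: usize, x0: usize, x1: usize) -> &[u8] {
    &src[y * width + x0..y * width + x1]
}

/// Reference version: one column at a time, so consecutive reads are
/// `width` elements apart (the large-stride pattern the PR describes).
fn vertical_box_sum_naive(src: &[u8], width: usize, height: usize, radius: usize) -> Vec<u32> {
    let mut dst = vec![0u32; width * height];
    for x in 0..width {
        for y in 0..height {
            let lo = y.saturating_sub(radius);
            let hi = (y + radius).min(height - 1);
            dst[y * width + x] = (lo..=hi).map(|yy| u32::from(src[yy * width + x])).sum();
        }
    }
    dst
}

/// Tiled version: keep one running sum per column of the current tile and
/// slide the window down row by row, so every inner loop reads a contiguous
/// run of at most `TILE_COLS` elements.
fn vertical_box_sum_tiled(src: &[u8], width: usize, height: usize, radius: usize) -> Vec<u32> {
    let mut dst = vec![0u32; width * height];
    if height == 0 {
        return dst;
    }
    for x0 in (0..width).step_by(TILE_COLS) {
        let x1 = (x0 + TILE_COLS).min(width);
        let n = x1 - x0;
        let mut acc = [0u32; TILE_COLS];
        // Prime the window for y = 0: rows 0..=radius, clamped to the image.
        for y in 0..=radius.min(height - 1) {
            for (a, &v) in acc[..n].iter_mut().zip(row_segment(src, width, y, x0, x1)) {
                *a += u32::from(v);
            }
        }
        for y in 0..height {
            dst[y * width + x0..y * width + x1].copy_from_slice(&acc[..n]);
            // Row entering the window for y + 1.
            if y + 1 + radius < height {
                let enter = row_segment(src, width, y + 1 + radius, x0, x1);
                for (a, &v) in acc[..n].iter_mut().zip(enter) {
                    *a += u32::from(v);
                }
            }
            // Row leaving the window for y + 1.
            if y >= radius {
                let leave = row_segment(src, width, y - radius, x0, x1);
                for (a, &v) in acc[..n].iter_mut().zip(leave) {
                    *a -= u32::from(v);
                }
            }
        }
    }
    dst
}
```

Both functions return the same sums; the tiled one only changes the traversal order, which is why a tiled pass can stay bit-exact with an untiled one.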

Benchmark Results

| Test case | Speedup |
| --- | --- |
| backdrop-blur (1500x1000, radius 8) | 1.16x |

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

wjc911 and others added 13 commits February 21, 2026 16:22
Box blur: Use [i32; 4] SIMD-friendly accumulators and process vertical
columns in tiles of 16 for better cache locality. ~1.1-1.4x speedup.

IIR blur: Process all 4 RGBA channels simultaneously using [f64; 4]
arrays instead of 4 separate passes, and tile the vertical pass in
strips of 32 columns. ~2.9-3.8x speedup.

Both optimizations are bit-exact with the original implementations.
Original functions preserved as *_naive for correctness verification.
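
To illustrate how processing all four channels in one pass can remain bit-exact with four separate passes, here is a small sketch. It assumes a first-order recursive filter, `y[n] = a * x[n] + (1 - a) * y[n-1]`, purely as a stand-in; the actual IIR kernel in the codebase is not shown in this PR text, and both function names are hypothetical.

```rust
/// Four separate passes over the row, one per RGBA channel.
fn iir_separate(row: &[[u8; 4]], a: f64) -> Vec<[f64; 4]> {
    let mut out = vec![[0.0f64; 4]; row.len()];
    for c in 0..4 {
        let mut prev = 0.0f64;
        for (o, px) in out.iter_mut().zip(row) {
            prev = a * f64::from(px[c]) + (1.0 - a) * prev;
            o[c] = prev;
        }
    }
    out
}

/// One pass over the row with a `[f64; 4]` state, one lane per channel.
/// Each lane evaluates the same expression in the same order as above,
/// so the results are identical bit for bit.
fn iir_interleaved(row: &[[u8; 4]], a: f64) -> Vec<[f64; 4]> {
    let mut out = Vec::with_capacity(row.len());
    let mut prev = [0.0f64; 4];
    for px in row {
        for c in 0..4 {
            prev[c] = a * f64::from(px[c]) + (1.0 - a) * prev[c];
        }
        out.push(prev);
    }
    out
}
```

The serial dependency on `prev` remains in both versions; the interleaved form only reduces the number of passes over memory.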

- box_blur/iir_blur: fall back to naive implementation for images with
  width < 16 or height < 16 where tiling/interleaving overhead exceeds
  the benefit
- Remove #[allow(dead_code)] from apply_naive since it is now used as a
  runtime fallback in release builds
- Correct SIMD comments: [i32;4] enables 128-bit within-pixel SIMD but
  loop-carried val dependency prevents across-pixel vectorization;
  [f64;4] enables AVX 256-bit within-pixel but IIR serial dependency
  limits gains to reducing 16 passes to 4

Tests box blur (sigma >= 2.0) and IIR blur (sigma < 2.0) across 11 image
sizes (4x4 to 1024x1024), 8 box sigma values, 5 IIR sigma values, 3 input
patterns (opaque, gradient alpha, random alpha), asymmetric sigmas, and
threshold boundary cases around 16x16.

Use scoped threads and AtomicUsize progress counter to run benchmark
configurations in parallel across all available CPU cores.
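
A rough sketch of that pattern, using only `std::thread::scope`, `std::thread::available_parallelism`, and `AtomicUsize`. The function name `run_all` and the use of a second atomic as a shared work index are assumptions for illustration; the benchmark harness in this PR may be organised differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `run` on every configuration across the available cores.
/// Returns the results in input order plus the final progress count.
fn run_all<T: Sync, R: Send>(configs: &[T], run: impl Fn(&T) -> R + Sync) -> (Vec<R>, usize) {
    let workers = thread::available_parallelism()
        .map_or(1, |n| n.get())
        .min(configs.len())
        .max(1);
    let next = AtomicUsize::new(0); // index of the next unclaimed configuration
    let done = AtomicUsize::new(0); // progress counter

    let mut indexed: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    // Claim one configuration; fetch_add hands out each index once.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= configs.len() {
                        break;
                    }
                    local.push((i, run(&configs[i])));
                    done.fetch_add(1, Ordering::Relaxed);
                }
                local
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("benchmark worker panicked"))
            .collect()
    });

    // Workers finish in arbitrary order; restore input order.
    indexed.sort_by_key(|&(i, _)| i);
    let results = indexed.into_iter().map(|(_, r)| r).collect();
    (results, done.load(Ordering::Relaxed))
}
```

Scoped threads let the workers borrow `configs` and the counters directly, without `Arc`. Running timing-sensitive cases concurrently can make them contend for shared caches, which is worth keeping in mind when reading numbers from a harness like this.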

The tiled vertical pass showed consistent performance regressions for
1024x768 images (786k pixels) with sigma >= 10, believed to be caused
by power-of-2 stride cache set conflicts at 4096-byte row stride.
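
The arithmetic behind that hypothesis can be sketched as below. A 1024-pixel RGBA row is 1024 × 4 = 4096 bytes. The cache geometry here is an assumption and does not come from the PR: a 32 KiB, 8-way L1 data cache with 64-byte lines, which gives 64 sets. The function name is hypothetical.

```rust
/// Assumed L1d geometry (not from the PR): 32 KiB / 8 ways / 64-byte lines = 64 sets.
const LINE_BYTES: usize = 64;
const NUM_SETS: usize = 64;

/// Counts how many distinct cache sets are touched when walking down one
/// column, i.e. reading byte offset `y * row_stride_bytes` for each row `y`.
fn distinct_sets_in_column_walk(row_stride_bytes: usize, rows: usize) -> usize {
    let mut seen = [false; NUM_SETS];
    for y in 0..rows {
        let set = (y * row_stride_bytes / LINE_BYTES) % NUM_SETS;
        seen[set] = true;
    }
    seen.iter().filter(|&&s| s).count()
}
```

Under these assumptions a 4096-byte stride maps every row of a column to the same set, so an 8-way cache can hold only 8 rows of that column at a time, while a 6000-byte stride (1500 pixels wide) spreads the rows over all 64 sets. This shows why the stride looked suspicious; it does not by itself establish that set conflicts caused the measured regression.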

Raise the pixel threshold from 250,000 to 1,000,000 so the tiled path
only activates for images larger than ~1M pixels (e.g. 1500x1000),
where benchmarks confirm consistent speedups of 1.2x+.

Also update both benchmark harnesses to mirror this threshold:
- box_blur_opt::apply() now delegates to the naive path for images
  below the threshold, accurately reflecting production behavior
- bench_blur_comprehensive runs sequentially (1 thread) to eliminate
  CPU cache contention between naive and optimized measurements
- Added #[inline(always)] on the naive apply/box_blur_inner fallback
  path to prevent cross-module call overhead

All cargo bench and bench_blur_comprehensive results now show >= 0.97x
with no regressions below 0.95x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces the parallel bench_e2e.rs with a sequential single-threaded
version that uses per-resolution iteration counts (2000 for 16px,
down to 100 for 1024px+), a probe-then-scale budget cap (30s total
per case, skip if single probe > 10s), and --compare for TSV baseline
comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
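
One way a probe-then-scale budget cap like the one described above could work is sketched here. The 30 s budget and 10 s skip limit come from the commit message; the function name `plan_iterations`, the rounding, and the choice not to count the probe itself against the budget are assumptions.

```rust
use std::time::Duration;

/// Total time budget per benchmark case (from the commit message).
const BUDGET: Duration = Duration::from_secs(30);
/// Cases whose single probe run exceeds this are skipped (from the commit message).
const MAX_PROBE: Duration = Duration::from_secs(10);

/// Given the duration of one probe run and the nominal iteration count for
/// the resolution, returns how many iterations to run, or `None` to skip.
fn plan_iterations(probe: Duration, nominal_iters: u32) -> Option<u32> {
    if probe > MAX_PROBE {
        return None;
    }
    if probe.is_zero() {
        return Some(nominal_iters);
    }
    // Largest iteration count that fits in the budget; the float-to-int
    // cast saturates at u32::MAX for very short probes.
    let fit = (BUDGET.as_secs_f64() / probe.as_secs_f64()).floor() as u32;
    Some(nominal_iters.min(fit).max(1))
}
```

For example, with a nominal count of 100, a 1 s probe would be scaled down to 30 iterations, and an 11 s probe would cause the case to be skipped.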
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Sequential CPU-pinned benchmarks show the tiled path is beneficial for
images starting at ~250k pixels (e.g. 600x400), not just 1M+:

  backdrop-blur 600x400  (240k px): 1.01x -> 1.13x
  backdrop-blur 800x600  (480k px): 1.01x -> 1.13x
  backdrop-blur 1024x768 (786k px): 1.16x -> 1.16x (unchanged)
  backdrop-blur 1500x1000 (1.5M px): 1.10x -> 1.14x

The earlier 1M threshold was set based on noisy parallel benchmark
measurements that incorrectly showed a regression at 786k pixels.
Sequential benchmarks with proper CPU pinning confirm no regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 closed this Feb 23, 2026