Speed up baseline JPEG entropy encoding, which is currently ~4.7x slower than C mozjpeg.
- Entropy encoding is 59% of baseline encoding time
- C mozjpeg uses SSE2/AVX2 intrinsics for entropy encoding
- Rust encoder uses standard iteration with Write trait
Hypothesis: Write trait overhead and bit buffer structure are bottlenecks.
Changes:
- Owned `Vec<u8>` instead of the `Write` trait
- 64-bit bit buffer with flush at 32+ bits
- SWAR 0xFF detection
- `#[cold]` annotation on the flush path
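As a sketch, the variant tried here looked roughly like this (`BitBuf`, `put_bits`, and `flush` are illustrative names, not the crate's actual API):

```rust
/// Sketch of the "fast" writer tried here: owned output vector,
/// 64-bit accumulator filled from the high end, cold flush path.
struct BitBuf {
    out: Vec<u8>, // owned output: no `Write` trait indirection
    acc: u64,     // bit accumulator, filled from the high end
    free: u32,    // free bits remaining in `acc`
}

impl BitBuf {
    fn new() -> Self {
        BitBuf { out: Vec::new(), acc: 0, free: 64 }
    }

    #[inline]
    fn put_bits(&mut self, code: u32, nbits: u32) {
        debug_assert!(nbits <= 32);
        self.free -= nbits;
        self.acc |= (code as u64) << self.free;
        if self.free < 32 {
            self.flush(); // rare relative to put_bits calls
        }
    }

    /// `#[cold]` keeps the flush out of the inlined hot path.
    #[cold]
    fn flush(&mut self) {
        while self.free < 32 {
            let byte = (self.acc >> 56) as u8;
            self.out.push(byte);
            if byte == 0xFF {
                self.out.push(0x00); // JPEG byte stuffing
            }
            self.acc <<= 8;
            self.free += 8;
        }
    }

    /// Pad the final partial byte with 1-bits (as JPEG requires)
    /// and drain the accumulator.
    fn finish(mut self) -> Vec<u8> {
        let pad = self.free % 8;
        if pad != 0 {
            self.put_bits((1 << pad) - 1, pad);
        }
        while self.free < 64 {
            let byte = (self.acc >> 56) as u8;
            self.out.push(byte);
            if byte == 0xFF {
                self.out.push(0x00);
            }
            self.acc <<= 8;
            self.free += 8;
        }
        self.out
    }
}

fn main() {
    let mut b = BitBuf::new();
    b.put_bits(0b101, 3);
    b.put_bits(0b11111, 5);
    assert_eq!(b.finish(), vec![0b1011_1111]);

    let mut b = BitBuf::new();
    b.put_bits(0xFF, 8); // a 0xFF byte must be followed by a stuffed 0x00
    assert_eq!(b.finish(), vec![0xFF, 0x00]);
}
```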
Results (rough, noisy benchmark):
| Config | Standard | Fast | Speedup |
|---|---|---|---|
| Q50 sparse | 0.14ms | 0.14ms | 1.0x |
| Q75 medium | 0.22ms | 0.24ms | 0.9x |
Conclusion: Write trait overhead is NOT the bottleneck. The standard BitWriter is already efficient. This approach added code complexity without benefit.
Hypothesis: Iterating through all 63 AC coefficients is slow; jumping to non-zero positions via tzcnt would be faster.
Changes:
- Build 64-bit mask of non-zero coefficients in zigzag order
- Use `trailing_zeros()` (compiles to `tzcnt`) to find the next non-zero
- Calculate run length from position difference
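The sparse path can be sketched like so, assuming `zz` is already in zigzag order (`ac_runs` is an illustrative name; the real encoder's ZRL/EOB handling is omitted):

```rust
/// Sketch of the tzcnt path: build a bitmask of non-zero ACs, then
/// jump between set bits instead of scanning all 63 positions.
fn ac_runs(zz: &[i16; 64]) -> Vec<(u32, i16)> {
    // Bit i set <=> zz[i] != 0, for AC positions 1..=63.
    let mut mask: u64 = 0;
    for i in 1..64 {
        mask |= ((zz[i] != 0) as u64) << i;
    }
    let mut runs = Vec::new();
    let mut prev = 0u32; // position of the last emitted coefficient (DC)
    while mask != 0 {
        let pos = mask.trailing_zeros(); // compiles to tzcnt/bsf
        runs.push((pos - prev - 1, zz[pos as usize])); // (zero-run, value)
        prev = pos;
        mask &= mask - 1; // clear lowest set bit: jump to next non-zero
    }
    runs
}

fn main() {
    let mut zz = [0i16; 64];
    zz[1] = 5;
    zz[4] = -2;
    // Run of 0 zeros before 5, then 2 zeros before -2.
    assert_eq!(ac_runs(&zz), vec![(0, 5), (2, -2)]);
}
```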
Results:
| Config | Standard | Fast | Speedup |
|---|---|---|---|
| Q50 sparse | 0.15ms | 0.10ms | 1.5x |
| Q75 medium | 0.23ms | 0.23ms | 1.0x |
| Q90 dense | 0.29ms | 0.33ms | 0.88x |
Conclusion: tzcnt helps for SPARSE blocks but HURTS dense blocks. The zigzag mask building overhead (~14ns/block) dominates for dense blocks.
Hypothesis: Use popcount to detect density, choose algorithm accordingly.
Changes:
- Build zigzag mask
- If popcount < 20, use tzcnt path
- Otherwise, use linear iteration
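A minimal sketch of the dispatch, with an illustrative `AcPath` enum and the threshold of 20 from this experiment:

```rust
/// Density dispatch as tried above: popcount on the non-zero mask
/// picks the iteration strategy.
#[derive(Debug, PartialEq)]
enum AcPath {
    Tzcnt,  // sparse: jump between set bits
    Linear, // dense: plain scan over all 63 ACs
}

fn choose_ac_path(zz: &[i16; 64]) -> AcPath {
    let mut mask: u64 = 0;
    for i in 1..64 {
        mask |= ((zz[i] != 0) as u64) << i;
    }
    if mask.count_ones() < 20 { // compiles to popcnt
        AcPath::Tzcnt
    } else {
        AcPath::Linear
    }
}

fn main() {
    assert_eq!(choose_ac_path(&[0i16; 64]), AcPath::Tzcnt);

    let mut dense = [0i16; 64];
    for i in 1..32 {
        dense[i] = 1; // 31 non-zero ACs
    }
    assert_eq!(choose_ac_path(&dense), AcPath::Linear);
}
```

Note the mask still has to be built before the popcount, which is why this kept the overhead that hurt the tzcnt-only version.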
Results: Similar to tzcnt-only. The mask building overhead still hurts.
Hypothesis: leading_zeros() calls are slow.
Changes:
- 256-entry lookup table for values 0-255
- Fallback to leading_zeros for larger values
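A sketch of the lookup, with an illustrative `nbits` helper (the table is built in a const block; `leading_zeros()` handles magnitudes above 255):

```rust
/// 256-entry table for the JPEG magnitude category ("nbits") of
/// |v| <= 255, with a bit-scan fallback for larger magnitudes.
const NBITS_LUT: [u8; 256] = {
    let mut t = [0u8; 256];
    let mut i = 1usize;
    while i < 256 {
        t[i] = (8 - (i as u8).leading_zeros()) as u8;
        i += 1;
    }
    t
};

#[inline]
fn nbits(v: i32) -> u32 {
    let m = v.unsigned_abs();
    if m < 256 {
        NBITS_LUT[m as usize] as u32
    } else {
        32 - m.leading_zeros() // a single lzcnt/bsr on modern x86
    }
}

fn main() {
    assert_eq!(nbits(0), 0);
    assert_eq!(nbits(1), 1);
    assert_eq!(nbits(-3), 2);
    assert_eq!(nbits(255), 8);
    assert_eq!(nbits(256), 9);
}
```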
Results: Minimal improvement. leading_zeros() is already single-cycle on modern CPUs.
- The Write trait is NOT the bottleneck - jpegli-rs approach didn't help
- Bit buffer operations are already efficient - standard 64-bit buffer is fine
- Sparse blocks benefit from tzcnt - but most real images have mixed density
- Mask building has overhead - ~14ns/block for zigzag-order mask
- The standard encoder is already well-optimized - simple linear iteration with SIMD early-exit for all-zero blocks
From jchuff-sse2.asm:
- Reorders coefficients into temp array first (zigzag order)
- Builds zero-mask during reorder (amortizes cost)
- Uses 64KB lookup table for jpeg_nbits (trades memory for speed)
- Speculative writes - writes 8 bytes, fixes up if 0xFF found
- Tight assembly loop with register allocation optimized
The key insight: mozjpeg's approach works because the coefficient reordering and mask building are fused into a single SIMD pass over the data.
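The SWAR 0xFF detection tried earlier is the portable cousin of mozjpeg's SIMD check: invert the word so 0xFF bytes become 0x00, then apply the classic SWAR zero-byte detector. A hedged sketch (`has_ff_byte` is an illustrative name):

```rust
/// Portable stand-in for the `pcmpeqb` + `pmovmskb` check: detect
/// whether any byte of an 8-byte word is 0xFF, so the common case
/// can be one speculative 8-byte write with no per-byte test.
#[inline]
fn has_ff_byte(w: u64) -> bool {
    // Inverting turns 0xFF bytes into 0x00; the classic SWAR
    // zero-byte detector then finds them: (x - 0x01..) & !x & 0x80..
    let x = !w;
    (x.wrapping_sub(0x0101_0101_0101_0101) & !x & 0x8080_8080_8080_8080) != 0
}

fn main() {
    assert!(has_ff_byte(0x00FF_0000_0000_0000));
    assert!(has_ff_byte(u64::MAX));
    assert!(!has_ff_byte(0x1234_5678_90AB_CDEF));
    assert!(!has_ff_byte(0xFEFE_FEFE_FEFE_FEFE)); // 0xFE is not 0xFF
}
```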
- Fused zigzag reorder + mask build - do both in one SIMD pass
- Larger nbits lookup table - match mozjpeg's 64KB table
- Profile the actual hot path - use `perf` or `flamegraph` to identify the real bottleneck
- Compare against C mozjpeg's non-SIMD path - see if we're competitive there
Configuration: 4096 blocks, varying density
| Density | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| 10% sparse | 199 | 198 | 1.01x (tie) |
| 20% sparse | 211 | 230 | 0.92x |
| 40% medium | 301 | 342 | 0.88x |
| 60% dense | 382 | 434 | 0.88x |
| 80% dense | 458 | 525 | 0.87x |
| 2k image (40%) | 4.65 ms | 5.40 ms | 0.86x |
The "fast" encoder is 12-14% SLOWER than standard!
- Write trait is NOT the bottleneck - VecBitWriter is already efficient
- Bit buffer ops already optimized - 64-bit buffer with good flush logic
- Missing SIMD early-exit - Fast encoder doesn't check for all-zero blocks
- Extra function call overhead - Separate encode_ac_linear adds indirection
- SIMD mask for early all-zero AC detection - Huge win for sparse blocks
- Inline everything - No function call overhead in hot path
- Simple linear iteration - Branch predictor-friendly
- Combined code+extra writes - Already has `put_bits_combined`
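In scalar form, that standard path looks roughly like this (`encode_ac_symbols` is an illustrative name; the real early-exit is a SIMD compare rather than this scalar OR, and the Huffman/bit output is elided):

```rust
/// Sketch of the standard path: early-exit for all-zero AC, then
/// one linear scan emitting (run, value) symbol pairs.
fn encode_ac_symbols(zz: &[i16; 64]) -> Vec<(u8, i16)> {
    // Early exit: OR all 63 AC coefficients together.
    let mut any: i16 = 0;
    for &c in &zz[1..] {
        any |= c;
    }
    if any == 0 {
        return vec![(0x00, 0)]; // EOB only
    }

    let mut out = Vec::new();
    let mut run = 0u8;
    for &c in &zz[1..] {
        if c == 0 {
            run += 1;
        } else {
            while run >= 16 {
                out.push((0xF0, 0)); // ZRL: a run of 16 zeros
                run -= 16;
            }
            out.push((run, c)); // (zero-run, value)
            run = 0;
        }
    }
    if run > 0 {
        out.push((0x00, 0)); // EOB for trailing zeros
    }
    out
}

fn main() {
    assert_eq!(encode_ac_symbols(&[0i16; 64]), vec![(0x00, 0)]);

    let mut zz = [0i16; 64];
    zz[1] = 3;
    zz[4] = -1;
    assert_eq!(encode_ac_symbols(&zz), vec![(0, 3), (2, -1), (0x00, 0)]);
}
```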
Synthetic benchmark results were MISLEADING!
| Test Type | ns/block | Notes |
|---|---|---|
| Synthetic (Criterion) | 49-73 | Uniform coefficient distribution |
| Real image (timing_breakdown) | 220 | After DCT+quantization |
The 3x difference is because:
- Real DCT output has natural sparsity patterns
- Coefficient magnitudes follow power-law distribution
- Non-zero clustering in low frequencies
- More complex run-length encoding patterns
The fast encoder approaches didn't work because they were tested on unrealistic data.
- Use REAL image data through DCT+quantization pipeline
- Test multiple images with varying content
- Compare against C mozjpeg's actual entropy encoding time
Using Criterion for reliable measurements on busy system:
- Warm-up: 2-3 seconds
- Measurement: 5-8 seconds
- Sample size: 50-100
- Confidence interval: 95%
- MUST use real DCT/quant pipeline data
Using real PNG image (tests/images/1.png, 512x512) through DCT+quantization pipeline.
| Quality | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| Q50 | 742 | 869 | 0.85x (17% slower) |
| Q75 | 877 | 1057 | 0.83x (20% slower) |
| Q85 | 976 | 1189 | 0.82x (22% slower) |
| Q95 | 1178 | 1432 | 0.82x (22% slower) |
| Quality | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| Q50 | 500 | 549 | 0.91x (10% slower) |
| Q85 | 734 | 732 | 1.00x (tie) |
- Real image data is harder to encode - the standard encoder takes 742µs on real data vs 500µs on synthetic at Q50 (48% more time), because real images have more complex coefficient distributions.
- Fast encoder is consistently slower on real images - 17-22% slower across all quality levels.
- Synthetic data was misleading - at Q85 synthetic, the fast encoder ties; on real Q85 data it is 22% slower.
- ns/block comparison:
  - Standard: 181-288 ns/block on real data
  - Fast: 212-350 ns/block on real data
The jpegli-rs approach and tzcnt-based optimizations have FAILED.
The "fast" entropy encoder is 17-22% SLOWER than the standard encoder on real image data. Both approaches:
- jpegli-rs BitWriter style - Write trait is not the bottleneck
- tzcnt zero-run skipping - Mask building overhead dominates
The standard encoder is already well-optimized:
- SIMD check for all-zero AC blocks (huge win for sparse blocks)
- Everything inlined (no function call overhead)
- Simple linear iteration (branch predictor friendly)
- Combined code+extra bit writes
- Fused zigzag reorder + mask build in single SIMD pass - This is what C mozjpeg does
- Speculative 8-byte writes with 0xFF fixup - Avoids per-byte checking
- Direct port of jchuff-sse2.asm - Match C mozjpeg's exact approach
- Profile with `perf` - Identify actual hot spots in the encoding loop
However, given that:
- Trellis mode (where most encoding time is spent) already beats C mozjpeg by 10%
- Baseline mode gap is 4.7x, entropy is only part of that
- Further optimization may have diminishing returns
It may be more productive to focus on other bottlenecks (color conversion, DCT) or accept that the current entropy encoder is "good enough" for production use.
| Component | Rust Cycles | C Cycles | Slowdown |
|---|---|---|---|
| Total | 6.1B | 1.25B | 4.9x |
| Color conversion | 1.7B (28%) | 0.17B (14%) | 10x |
| Entropy encoding | 2.2B (36%) | 0.46B (37%) | 4.8x |
| DCT | 0.24B (4%) | 0.11B (9%) | 2.2x |
The encoder uses `simd/scalar.rs` for color conversion, NOT the yuv crate!
The `fast-yuv` feature only affects `color.rs`, which isn't used by the encoder.
This alone accounts for 10x of the slowdown.
| Function | % Time | Description |
|---|---|---|
| `jsimd_huff_encode_one_block_sse2` | 37% | SIMD entropy encoding |
| `jsimd_rgb_ycc_convert_avx2` | 14% | AVX2 color conversion |
| `jsimd_fdct_islow_avx2` | 9% | AVX2 DCT |
| `jsimd_quantize_avx2` | 6% | AVX2 quantization |
| `jsimd_convsamp_avx2` | 7% | AVX2 sample conversion |
From /home/lilith/work/jpegli-rs/internal/jpegli-cpp/third_party/libjpeg-turbo/simd/x86_64/jchuff-sse2.asm:
- Fused zigzag reorder + sign handling
  - SSE2 shuffles (`punpckldq`, `pshuflw`, `pinsrw`) for zigzag
  - `pcmpgtw` + `paddw` for sign handling in the same pass
- 64KB lookup table for nbits
  - Direct mapping: value → bit count
  - Avoids `leading_zeros()` at runtime
- Speculative 8-byte writes with SIMD 0xFF detection
  - Write 8 bytes first (optimistic)
  - `pcmpeqb` + `pmovmskb` to find 0xFF bytes
  - Only fix up when stuffing is needed
- tzcnt for zero-run detection
  - Uses trailing zero count to skip zeros efficiently
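The speculative-write idea can be sketched portably (`write_speculative` is an illustrative name; mozjpeg does the detection with `pcmpeqb` + `pmovmskb` rather than a byte loop):

```rust
/// Portable sketch of the speculative write: emit 8 output bytes in
/// one shot, and only take the per-byte stuffing path when a 0xFF
/// byte is present.
fn write_speculative(out: &mut Vec<u8>, word: u64) {
    let bytes = word.to_be_bytes();
    if bytes.iter().all(|&b| b != 0xFF) {
        // Common case: one 8-byte append, no per-byte checks.
        out.extend_from_slice(&bytes);
    } else {
        // Fixup path: stuff a 0x00 after every 0xFF byte.
        for &b in &bytes {
            out.push(b);
            if b == 0xFF {
                out.push(0x00);
            }
        }
    }
}

fn main() {
    let mut out = Vec::new();
    write_speculative(&mut out, 0x0102_0304_0506_0708);
    assert_eq!(out.len(), 8);

    let mut out = Vec::new();
    write_speculative(&mut out, 0x01FF_0000_0000_0000);
    assert_eq!(&out[..3], &[0x01, 0xFF, 0x00]); // 0x00 stuffed after 0xFF
    assert_eq!(out.len(), 9);
}
```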
- Fix color conversion (10x potential) - Make encoder use yuv crate ✅ DONE
- Port jchuff-sse2 (5.6x potential) - SIMD entropy encoding
- Improve DCT (2.2x potential) - Better AVX2 intrinsics ✅ DONE (1.6x remaining)
| Component | Rust Cycles | C Cycles | Slowdown |
|---|---|---|---|
| Total | 4.46B | 1.72B | 2.59x (was 4.9x) |
| Color conversion | 0.14B (3%) | 0.22B (13%) | Rust 1.6x faster! |
| Entropy encoding | 3.57B (80%) | 0.64B (37%) | 5.6x |
| DCT | 0.24B (5%) | 0.15B (9%) | 1.6x |
- Color conversion now uses the yuv crate - `SimdOps::detect()` routes to the yuv crate's AVX2/SSE/NEON implementation when the `fast-yuv` feature is enabled (default). Result: Rust is now 1.6x FASTER than C mozjpeg for color conversion!
- DCT uses hand-written AVX2 intrinsics - With the `simd-intrinsics` feature, uses `simd::x86_64::avx2::forward_dct_8x8` instead of the multiversion scalar path. Result: DCT improved from 2.2x to 1.6x slower.
The 2.59x slowdown is now 80% entropy encoding. The jchuff-sse2 port is the only remaining optimization with significant impact potential (5.6x improvement).
- `fast-yuv` (default): Uses the yuv crate for color conversion - essential, 10x speedup
- `simd-intrinsics` (optional): Uses hand-written DCT intrinsics - 3% overall improvement
The simd-intrinsics feature is NOT in defaults because the 3% improvement is marginal
and the multiversion autovectorization is almost as good (93% of intrinsics perf).
Ported key techniques from jchuff-sse2.asm to Rust:
- Fused zigzag reorder + sign handling - SSE2 shuffles for zigzag order, pcmpgtw + paddw for sign handling in same pass
- 64KB nbits lookup table - Direct mapping: value → bit count
- 64-bit non-zero mask - Built during zigzag reorder
- tzcnt-based iteration - `trailing_zeros()` to skip zeros efficiently
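A scalar sketch of the fused pass (the actual port uses SSE2 shuffles; `zigzag_and_mask` is an illustrative name):

```rust
/// Fused pass: reorder natural-order coefficients into zigzag order
/// and build the non-zero bitmask in the same loop, so the mask
/// cost is amortized into the reorder.
/// Standard JPEG zigzag order: natural index for each zigzag position.
const ZIGZAG: [usize; 64] = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
];

fn zigzag_and_mask(block: &[i16; 64]) -> ([i16; 64], u64) {
    let mut zz = [0i16; 64];
    let mut mask = 0u64;
    for (i, &src) in ZIGZAG.iter().enumerate() {
        let c = block[src];
        zz[i] = c;
        mask |= ((c != 0) as u64) << i;
    }
    (zz, mask)
}

fn main() {
    let mut block = [0i16; 64];
    block[8] = 7; // natural index 8 is zigzag position 2
    let (zz, mask) = zigzag_and_mask(&block);
    assert_eq!(zz[2], 7);
    assert_eq!(mask, 1 << 2);
}
```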
| Quality | Standard (µs) | SIMD (µs) | Speedup |
|---|---|---|---|
| Q50 | 775.81 | 330.10 | 2.35x |
| Q75 | ~880 | ~370 | 2.38x |
| Q85 | ~980 | ~420 | 2.33x |
| Q95 | ~1180 | ~550 | 2.15x |
- Safe wrapper - `encode_block()` calls unsafe SSE2 intrinsics internally
- `cfg(target_arch = "x86_64")` with fallback to the standard encoder
- All tests pass, produces valid JPEG files
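The wrapper shape, sketched with placeholder inner functions (the real SSE2 body uses `core::arch` intrinsics; these stand-ins just show the dispatch):

```rust
/// Placeholder: the real encoder emits Huffman-coded bits here.
fn encode_block_standard(zz: &[i16; 64], out: &mut Vec<u8>) {
    out.push((zz[0] != 0) as u8);
}

/// Placeholder SSE2 path; a real port would use core::arch intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn encode_block_sse2(zz: &[i16; 64], out: &mut Vec<u8>) {
    encode_block_standard(zz, out);
}

/// Safe entry point: SSE2 path on x86_64, standard path elsewhere.
pub fn encode_block(zz: &[i16; 64], out: &mut Vec<u8>) {
    #[cfg(target_arch = "x86_64")]
    {
        // SSE2 is part of the x86_64 baseline, so this runtime check
        // always passes there; shown for the general pattern.
        if std::arch::is_x86_feature_detected!("sse2") {
            return unsafe { encode_block_sse2(zz, out) };
        }
    }
    encode_block_standard(zz, out);
}

fn main() {
    let zz = [1i16; 64];
    let mut out = Vec::new();
    encode_block(&zz, &mut out);
    assert_eq!(out, vec![1]);
}
```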