Speed up baseline JPEG entropy encoding, which is currently ~4.7x slower than C mozjpeg.
- Entropy encoding is 59% of baseline encoding time
- C mozjpeg uses SSE2/AVX2 intrinsics for entropy encoding
- Rust encoder uses standard iteration with Write trait
Hypothesis: Write trait overhead and bit buffer structure are bottlenecks.
Changes:
- Owned `Vec<u8>` instead of the `Write` trait
- 64-bit bit buffer with flush at 32+ bits
- SWAR 0xFF detection
- `#[cold]` annotation on the flush path
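As a sketch, the variant tried here looked roughly like this (`BitBuf`, `put_bits`, and `flush` are illustrative names, not the crate's actual API):

```rust
/// Sketch of the "fast" writer tried here: owned output vector,
/// 64-bit accumulator filled from the high end, cold flush path.
struct BitBuf {
    out: Vec<u8>, // owned output: no `Write` trait indirection
    acc: u64,     // bit accumulator, filled from the high end
    free: u32,    // free bits remaining in `acc`
}

impl BitBuf {
    fn new() -> Self {
        BitBuf { out: Vec::new(), acc: 0, free: 64 }
    }

    #[inline]
    fn put_bits(&mut self, code: u32, nbits: u32) {
        debug_assert!(nbits <= 32);
        self.free -= nbits;
        self.acc |= (code as u64) << self.free;
        if self.free < 32 {
            self.flush(); // rare relative to put_bits calls
        }
    }

    /// `#[cold]` keeps the flush out of the inlined hot path.
    #[cold]
    fn flush(&mut self) {
        while self.free < 32 {
            let byte = (self.acc >> 56) as u8;
            self.out.push(byte);
            if byte == 0xFF {
                self.out.push(0x00); // JPEG byte stuffing
            }
            self.acc <<= 8;
            self.free += 8;
        }
    }

    /// Pad the final partial byte with 1-bits (as JPEG requires)
    /// and drain the accumulator.
    fn finish(mut self) -> Vec<u8> {
        let pad = self.free % 8;
        if pad != 0 {
            self.put_bits((1 << pad) - 1, pad);
        }
        while self.free < 64 {
            let byte = (self.acc >> 56) as u8;
            self.out.push(byte);
            if byte == 0xFF {
                self.out.push(0x00);
            }
            self.acc <<= 8;
            self.free += 8;
        }
        self.out
    }
}

fn main() {
    let mut b = BitBuf::new();
    b.put_bits(0b101, 3);
    b.put_bits(0b11111, 5);
    assert_eq!(b.finish(), vec![0b1011_1111]);

    let mut b = BitBuf::new();
    b.put_bits(0xFF, 8); // a 0xFF byte must be followed by a stuffed 0x00
    assert_eq!(b.finish(), vec![0xFF, 0x00]);
}
```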
Results (rough, noisy benchmark):
| Config | Standard | Fast | Speedup |
|---|---|---|---|
| Q50 sparse | 0.14ms | 0.14ms | 1.0x |
| Q75 medium | 0.22ms | 0.24ms | 0.9x |
Conclusion: Write trait overhead is NOT the bottleneck. The standard BitWriter is already efficient. This approach added code complexity without benefit.
Hypothesis: Iterating through all 63 AC coefficients is slow; jumping to non-zero positions via tzcnt would be faster.
Changes:
- Build 64-bit mask of non-zero coefficients in zigzag order
- Use `trailing_zeros()` (compiles to `tzcnt`) to find the next non-zero
- Calculate run length from position difference
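The sparse path can be sketched like so, assuming `zz` is already in zigzag order (`ac_runs` is an illustrative name; the real encoder's ZRL/EOB handling is omitted):

```rust
/// Sketch of the tzcnt path: build a bitmask of non-zero ACs, then
/// jump between set bits instead of scanning all 63 positions.
fn ac_runs(zz: &[i16; 64]) -> Vec<(u32, i16)> {
    // Bit i set <=> zz[i] != 0, for AC positions 1..=63.
    let mut mask: u64 = 0;
    for i in 1..64 {
        mask |= ((zz[i] != 0) as u64) << i;
    }
    let mut runs = Vec::new();
    let mut prev = 0u32; // position of the last emitted coefficient (DC)
    while mask != 0 {
        let pos = mask.trailing_zeros(); // compiles to tzcnt/bsf
        runs.push((pos - prev - 1, zz[pos as usize])); // (zero-run, value)
        prev = pos;
        mask &= mask - 1; // clear lowest set bit: jump to next non-zero
    }
    runs
}

fn main() {
    let mut zz = [0i16; 64];
    zz[1] = 5;
    zz[4] = -2;
    // Run of 0 zeros before 5, then 2 zeros before -2.
    assert_eq!(ac_runs(&zz), vec![(0, 5), (2, -2)]);
}
```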
Results:
| Config | Standard | Fast | Speedup |
|---|---|---|---|
| Q50 sparse | 0.15ms | 0.10ms | 1.5x |
| Q75 medium | 0.23ms | 0.23ms | 1.0x |
| Q90 dense | 0.29ms | 0.33ms | 0.88x |
Conclusion: tzcnt helps for SPARSE blocks but HURTS dense blocks. The zigzag mask building overhead (~14ns/block) dominates for dense blocks.
Hypothesis: Use popcount to detect density, choose algorithm accordingly.
Changes:
- Build zigzag mask
- If popcount < 20, use tzcnt path
- Otherwise, use linear iteration
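A minimal sketch of the dispatch, with an illustrative `AcPath` enum and the threshold of 20 from this experiment:

```rust
/// Density dispatch as tried above: popcount on the non-zero mask
/// picks the iteration strategy.
#[derive(Debug, PartialEq)]
enum AcPath {
    Tzcnt,  // sparse: jump between set bits
    Linear, // dense: plain scan over all 63 ACs
}

fn choose_ac_path(zz: &[i16; 64]) -> AcPath {
    let mut mask: u64 = 0;
    for i in 1..64 {
        mask |= ((zz[i] != 0) as u64) << i;
    }
    if mask.count_ones() < 20 { // compiles to popcnt
        AcPath::Tzcnt
    } else {
        AcPath::Linear
    }
}

fn main() {
    assert_eq!(choose_ac_path(&[0i16; 64]), AcPath::Tzcnt);

    let mut dense = [0i16; 64];
    for i in 1..32 {
        dense[i] = 1; // 31 non-zero ACs
    }
    assert_eq!(choose_ac_path(&dense), AcPath::Linear);
}
```

Note the mask still has to be built before the popcount, which is why this kept the overhead that hurt the tzcnt-only version.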
Results: Similar to tzcnt-only. The mask building overhead still hurts.
Hypothesis: leading_zeros() calls are slow.
Changes:
- 256-entry lookup table for values 0-255
- Fallback to leading_zeros for larger values
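A sketch of the lookup, with an illustrative `nbits` helper (the table is built in a const block; `leading_zeros()` handles magnitudes above 255):

```rust
/// 256-entry table for the JPEG magnitude category ("nbits") of
/// |v| <= 255, with a bit-scan fallback for larger magnitudes.
const NBITS_LUT: [u8; 256] = {
    let mut t = [0u8; 256];
    let mut i = 1usize;
    while i < 256 {
        t[i] = (8 - (i as u8).leading_zeros()) as u8;
        i += 1;
    }
    t
};

#[inline]
fn nbits(v: i32) -> u32 {
    let m = v.unsigned_abs();
    if m < 256 {
        NBITS_LUT[m as usize] as u32
    } else {
        32 - m.leading_zeros() // a single lzcnt/bsr on modern x86
    }
}

fn main() {
    assert_eq!(nbits(0), 0);
    assert_eq!(nbits(1), 1);
    assert_eq!(nbits(-3), 2);
    assert_eq!(nbits(255), 8);
    assert_eq!(nbits(256), 9);
}
```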
Results: Minimal improvement. leading_zeros() is already single-cycle on modern CPUs.
- The Write trait is NOT the bottleneck - jpegli-rs approach didn't help
- Bit buffer operations are already efficient - standard 64-bit buffer is fine
- Sparse blocks benefit from tzcnt - but most real images have mixed density
- Mask building has overhead - ~14ns/block for zigzag-order mask
- The standard encoder is already well-optimized - simple linear iteration with SIMD early-exit for all-zero blocks
From jchuff-sse2.asm:
- Reorders coefficients into temp array first (zigzag order)
- Builds zero-mask during reorder (amortizes cost)
- Uses 64KB lookup table for jpeg_nbits (trades memory for speed)
- Speculative writes - writes 8 bytes, fixes up if 0xFF found
- Tight assembly loop with register allocation optimized
The key insight: mozjpeg's approach works because the coefficient reordering and mask building are fused into a single SIMD pass over the data.
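The SWAR 0xFF detection tried earlier is the portable cousin of mozjpeg's SIMD check: invert the word so 0xFF bytes become 0x00, then apply the classic SWAR zero-byte detector. A hedged sketch (`has_ff_byte` is an illustrative name):

```rust
/// Portable stand-in for the `pcmpeqb` + `pmovmskb` check: detect
/// whether any byte of an 8-byte word is 0xFF, so the common case
/// can be one speculative 8-byte write with no per-byte test.
#[inline]
fn has_ff_byte(w: u64) -> bool {
    // Inverting turns 0xFF bytes into 0x00; the classic SWAR
    // zero-byte detector then finds them: (x - 0x01..) & !x & 0x80..
    let x = !w;
    (x.wrapping_sub(0x0101_0101_0101_0101) & !x & 0x8080_8080_8080_8080) != 0
}

fn main() {
    assert!(has_ff_byte(0x00FF_0000_0000_0000));
    assert!(has_ff_byte(u64::MAX));
    assert!(!has_ff_byte(0x1234_5678_90AB_CDEF));
    assert!(!has_ff_byte(0xFEFE_FEFE_FEFE_FEFE)); // 0xFE is not 0xFF
}
```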
- Fused zigzag reorder + mask build - do both in one SIMD pass
- Larger nbits lookup table - match mozjpeg's 64KB table
- Profile the actual hot path - use `perf` or `flamegraph` to identify the real bottleneck
- Compare against C mozjpeg's non-SIMD path - see if we're competitive there
Configuration: 4096 blocks, varying density
| Density | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| 10% sparse | 199 | 198 | 1.01x (tie) |
| 20% sparse | 211 | 230 | 0.92x |
| 40% medium | 301 | 342 | 0.88x |
| 60% dense | 382 | 434 | 0.88x |
| 80% dense | 458 | 525 | 0.87x |
| 2k image (40%) | 4.65 ms | 5.40 ms | 0.86x |
The "fast" encoder is 12-14% SLOWER than standard!
- Write trait is NOT the bottleneck - VecBitWriter is already efficient
- Bit buffer ops already optimized - 64-bit buffer with good flush logic
- Missing SIMD early-exit - Fast encoder doesn't check for all-zero blocks
- Extra function call overhead - Separate encode_ac_linear adds indirection
- SIMD mask for early all-zero AC detection - Huge win for sparse blocks
- Inline everything - No function call overhead in hot path
- Simple linear iteration - Branch predictor-friendly
- Combined code+extra writes - Already has `put_bits_combined`
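In scalar form, that standard path looks roughly like this (`encode_ac_symbols` is an illustrative name; the real early-exit is a SIMD compare rather than this scalar OR, and the Huffman/bit output is elided):

```rust
/// Sketch of the standard path: early-exit for all-zero AC, then
/// one linear scan emitting (run, value) symbol pairs.
fn encode_ac_symbols(zz: &[i16; 64]) -> Vec<(u8, i16)> {
    // Early exit: OR all 63 AC coefficients together.
    let mut any: i16 = 0;
    for &c in &zz[1..] {
        any |= c;
    }
    if any == 0 {
        return vec![(0x00, 0)]; // EOB only
    }

    let mut out = Vec::new();
    let mut run = 0u8;
    for &c in &zz[1..] {
        if c == 0 {
            run += 1;
        } else {
            while run >= 16 {
                out.push((0xF0, 0)); // ZRL: a run of 16 zeros
                run -= 16;
            }
            out.push((run, c)); // (zero-run, value)
            run = 0;
        }
    }
    if run > 0 {
        out.push((0x00, 0)); // EOB for trailing zeros
    }
    out
}

fn main() {
    assert_eq!(encode_ac_symbols(&[0i16; 64]), vec![(0x00, 0)]);

    let mut zz = [0i16; 64];
    zz[1] = 3;
    zz[4] = -1;
    assert_eq!(encode_ac_symbols(&zz), vec![(0, 3), (2, -1), (0x00, 0)]);
}
```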
Synthetic benchmark results were MISLEADING!
| Test Type | ns/block | Notes |
|---|---|---|
| Synthetic (Criterion) | 49-73 | Uniform coefficient distribution |
| Real image (timing_breakdown) | 220 | After DCT+quantization |
The 3x difference is because:
- Real DCT output has natural sparsity patterns
- Coefficient magnitudes follow power-law distribution
- Non-zero clustering in low frequencies
- More complex run-length encoding patterns
The fast encoder approaches didn't work because they were tested on unrealistic data.
- Use REAL image data through DCT+quantization pipeline
- Test multiple images with varying content
- Compare against C mozjpeg's actual entropy encoding time
Using Criterion for reliable measurements on busy system:
- Warm-up: 2-3 seconds
- Measurement: 5-8 seconds
- Sample size: 50-100
- Confidence interval: 95%
- MUST use real DCT/quant pipeline data
Using real PNG image (tests/images/1.png, 512x512) through DCT+quantization pipeline.
| Quality | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| Q50 | 742 | 869 | 0.85x (17% slower) |
| Q75 | 877 | 1057 | 0.83x (20% slower) |
| Q85 | 976 | 1189 | 0.82x (22% slower) |
| Q95 | 1178 | 1432 | 0.82x (22% slower) |
| Quality | Standard (µs) | Fast (µs) | Ratio |
|---|---|---|---|
| Q50 | 500 | 549 | 0.91x (10% slower) |
| Q85 | 734 | 732 | 1.00x (tie) |
- Real image data is harder to encode - the standard encoder takes 742µs on real data vs 500µs on synthetic at Q50 (48% more time), because real images have more complex coefficient distributions.
- Fast encoder is consistently slower on real images - 17-22% slower across all quality levels.
- Synthetic data was misleading - at Q85 synthetic, the fast encoder ties; on real Q85 data it is 22% slower.
- ns/block comparison:
  - Standard: 181-288 ns/block on real data
  - Fast: 212-350 ns/block on real data
The jpegli-rs approach and tzcnt-based optimizations have FAILED.
The "fast" entropy encoder is 17-22% SLOWER than the standard encoder on real image data. Both approaches:
- jpegli-rs BitWriter style - Write trait is not the bottleneck
- tzcnt zero-run skipping - Mask building overhead dominates
The standard encoder is already well-optimized:
- SIMD check for all-zero AC blocks (huge win for sparse blocks)
- Everything inlined (no function call overhead)
- Simple linear iteration (branch predictor friendly)
- Combined code+extra bit writes
- Fused zigzag reorder + mask build in single SIMD pass - This is what C mozjpeg does
- Speculative 8-byte writes with 0xFF fixup - Avoids per-byte checking
- Direct port of jchuff-sse2.asm - Match C mozjpeg's exact approach
- Profile with `perf` - Identify actual hot spots in the encoding loop
However, given that:
- Trellis mode (where most encoding time is spent) already beats C mozjpeg by 10%
- Baseline mode gap is 4.7x, entropy is only part of that
- Further optimization may have diminishing returns
It may be more productive to focus on other bottlenecks (color conversion, DCT) or accept that the current entropy encoder is "good enough" for production use.
| Component | Rust Cycles | C Cycles | Slowdown |
|---|---|---|---|
| Total | 6.1B | 1.25B | 4.9x |
| Color conversion | 1.7B (28%) | 0.17B (14%) | 10x |
| Entropy encoding | 2.2B (36%) | 0.46B (37%) | 4.8x |
| DCT | 0.24B (4%) | 0.11B (9%) | 2.2x |
The encoder uses `simd/scalar.rs` for color conversion, NOT the yuv crate!
The `fast-yuv` feature only affects `color.rs`, which isn't used by the encoder.
This alone accounts for 10x of the slowdown.
| Function | % Time | Description |
|---|---|---|
| `jsimd_huff_encode_one_block_sse2` | 37% | SIMD entropy encoding |
| `jsimd_rgb_ycc_convert_avx2` | 14% | AVX2 color conversion |
| `jsimd_fdct_islow_avx2` | 9% | AVX2 DCT |
| `jsimd_quantize_avx2` | 6% | AVX2 quantization |
| `jsimd_convsamp_avx2` | 7% | AVX2 sample conversion |
From /home/lilith/work/jpegli-rs/internal/jpegli-cpp/third_party/libjpeg-turbo/simd/x86_64/jchuff-sse2.asm:
- Fused zigzag reorder + sign handling
  - SSE2 shuffles (`punpckldq`, `pshuflw`, `pinsrw`) for zigzag
  - `pcmpgtw` + `paddw` for sign handling in the same pass
- 64KB lookup table for nbits
  - Direct mapping: value → bit count
  - Avoids `leading_zeros()` at runtime
- Speculative 8-byte writes with SIMD 0xFF detection
  - Write 8 bytes first (optimistic)
  - `pcmpeqb` + `pmovmskb` to find 0xFF bytes
  - Only fix up when stuffing is needed
- tzcnt for zero-run detection
  - Uses trailing zero count to skip zeros efficiently
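The speculative-write idea can be sketched portably (`write_speculative` is an illustrative name; mozjpeg does the detection with `pcmpeqb` + `pmovmskb` rather than a byte loop):

```rust
/// Portable sketch of the speculative write: emit 8 output bytes in
/// one shot, and only take the per-byte stuffing path when a 0xFF
/// byte is present.
fn write_speculative(out: &mut Vec<u8>, word: u64) {
    let bytes = word.to_be_bytes();
    if bytes.iter().all(|&b| b != 0xFF) {
        // Common case: one 8-byte append, no per-byte checks.
        out.extend_from_slice(&bytes);
    } else {
        // Fixup path: stuff a 0x00 after every 0xFF byte.
        for &b in &bytes {
            out.push(b);
            if b == 0xFF {
                out.push(0x00);
            }
        }
    }
}

fn main() {
    let mut out = Vec::new();
    write_speculative(&mut out, 0x0102_0304_0506_0708);
    assert_eq!(out.len(), 8);

    let mut out = Vec::new();
    write_speculative(&mut out, 0x01FF_0000_0000_0000);
    assert_eq!(&out[..3], &[0x01, 0xFF, 0x00]); // 0x00 stuffed after 0xFF
    assert_eq!(out.len(), 9);
}
```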
- Fix color conversion (10x potential) - Make encoder use yuv crate ✅ DONE
- Port jchuff-sse2 (5.6x potential) - SIMD entropy encoding
- Improve DCT (2.2x potential) - Better AVX2 intrinsics ✅ DONE (1.6x remaining)
| Component | Rust Cycles | C Cycles | Slowdown |
|---|---|---|---|
| Total | 4.46B | 1.72B | 2.59x (was 4.9x) |
| Color conversion | 0.14B (3%) | 0.22B (13%) | Rust 1.6x faster! |
| Entropy encoding | 3.57B (80%) | 0.64B (37%) | 5.6x |
| DCT | 0.24B (5%) | 0.15B (9%) | 1.6x |
- Color conversion now uses the yuv crate - `SimdOps::detect()` routes to the yuv crate's AVX2/SSE/NEON implementation when the `fast-yuv` feature is enabled (default). Result: Rust is now 1.6x FASTER than C mozjpeg for color conversion!
- DCT uses hand-written AVX2 intrinsics - With the `simd-intrinsics` feature, uses `simd::x86_64::avx2::forward_dct_8x8` instead of the multiversion scalar path. Result: DCT improved from 2.2x to 1.6x slower.
The 2.59x slowdown is now 80% entropy encoding. The jchuff-sse2 port is the only remaining optimization with significant impact potential (5.6x improvement).
- `fast-yuv` (default): Uses the yuv crate for color conversion - essential, 10x speedup
- `simd-intrinsics` (optional): Uses hand-written DCT intrinsics - 3% overall improvement
The simd-intrinsics feature is NOT in defaults because the 3% improvement is marginal
and the multiversion autovectorization is almost as good (93% of intrinsics perf).
Ported key techniques from jchuff-sse2.asm to Rust:
- Fused zigzag reorder + sign handling - SSE2 shuffles for zigzag order, pcmpgtw + paddw for sign handling in same pass
- 64KB nbits lookup table - Direct mapping: value → bit count
- 64-bit non-zero mask - Built during zigzag reorder
- tzcnt-based iteration - `trailing_zeros()` to skip zeros efficiently
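A scalar sketch of the fused pass (the actual port uses SSE2 shuffles; `zigzag_and_mask` is an illustrative name):

```rust
/// Fused pass: reorder natural-order coefficients into zigzag order
/// and build the non-zero bitmask in the same loop, so the mask
/// cost is amortized into the reorder.
/// Standard JPEG zigzag order: natural index for each zigzag position.
const ZIGZAG: [usize; 64] = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
];

fn zigzag_and_mask(block: &[i16; 64]) -> ([i16; 64], u64) {
    let mut zz = [0i16; 64];
    let mut mask = 0u64;
    for (i, &src) in ZIGZAG.iter().enumerate() {
        let c = block[src];
        zz[i] = c;
        mask |= ((c != 0) as u64) << i;
    }
    (zz, mask)
}

fn main() {
    let mut block = [0i16; 64];
    block[8] = 7; // natural index 8 is zigzag position 2
    let (zz, mask) = zigzag_and_mask(&block);
    assert_eq!(zz[2], 7);
    assert_eq!(mask, 1 << 2);
}
```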
| Quality | Standard (µs) | SIMD (µs) | Speedup |
|---|---|---|---|
| Q50 | 775.81 | 330.10 | 2.35x |
| Q75 | ~880 | ~370 | 2.38x |
| Q85 | ~980 | ~420 | 2.33x |
| Q95 | ~1180 | ~550 | 2.15x |
- Safe wrapper - `encode_block()` calls unsafe SSE2 intrinsics internally
- `cfg(target_arch = "x86_64")` with fallback to the standard encoder
- All tests pass, produces valid JPEG files
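The wrapper shape, sketched with placeholder inner functions (the real SSE2 body uses `core::arch` intrinsics; these stand-ins just show the dispatch):

```rust
/// Placeholder: the real encoder emits Huffman-coded bits here.
fn encode_block_standard(zz: &[i16; 64], out: &mut Vec<u8>) {
    out.push((zz[0] != 0) as u8);
}

/// Placeholder SSE2 path; a real port would use core::arch intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn encode_block_sse2(zz: &[i16; 64], out: &mut Vec<u8>) {
    encode_block_standard(zz, out);
}

/// Safe entry point: SSE2 path on x86_64, standard path elsewhere.
pub fn encode_block(zz: &[i16; 64], out: &mut Vec<u8>) {
    #[cfg(target_arch = "x86_64")]
    {
        // SSE2 is part of the x86_64 baseline, so this runtime check
        // always passes there; shown for the general pattern.
        if std::arch::is_x86_feature_detected!("sse2") {
            return unsafe { encode_block_sse2(zz, out) };
        }
    }
    encode_block_standard(zz, out);
}

fn main() {
    let zz = [1i16; 64];
    let mut out = Vec::new();
    encode_block(&zz, &mut out);
    assert_eq!(out, vec![1]);
}
```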