Draft
4 changes: 4 additions & 0 deletions encodings/fastlanes/Cargo.toml
@@ -62,3 +62,7 @@ required-features = ["_test-harness"]
name = "compute_between"
harness = false
required-features = ["_test-harness"]

[[bench]]
name = "transpose_bench"
harness = false
82 changes: 82 additions & 0 deletions encodings/fastlanes/TRANSPOSE_OPTIMIZATION.md
@@ -0,0 +1,82 @@
# FastLanes 1024-bit Transpose Optimization Plan

## Summary

Optimized the FastLanes 1024-bit transpose operation from ~3700 cycles per call (baseline) to ~13 cycles per call (287x speedup) using AVX-512 VBMI instructions.

## Performance Results

| Implementation | Cycles/Call | Speedup vs Baseline |
|----------------|-------------|---------------------|
| baseline | 3734 | 1x |
| scalar | 1374 | 2.7x |
| scalar_fast | 215 | 17x |
| bmi2 | 129 | 29x |
| avx2 | 450 | 8x |
| avx2_gfni | 175 | 21x |
| avx512_gfni | 149 | 25x |
| **avx512_vbmi**| **13** | **287x** |
| vbmi_dual | 12.5 | 299x |
| vbmi_quad | 12.4 | 301x |

## Key Optimizations

### 1. AVX-512 VBMI Vectorized Gather/Scatter
- `vpermi2b` (`_mm512_permutex2var_epi8`) for gathering bytes from stride-8 positions
- `vpermb` (`_mm512_permutexvar_epi8`) for scattering transposed bytes
- Replaces scalar loops that were the main bottleneck
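
To make the two permutes concrete, here is a portable scalar model of what `vpermb` and `vpermi2b` compute on 64-byte registers. The function names are hypothetical and this is not the code from `mod.rs`; it only mirrors the documented instruction semantics (the low 6 index bits select a byte, and for `vpermi2b` index bit 6 selects the second source).

```rust
/// Scalar model of `vpermb` (`_mm512_permutexvar_epi8`):
/// dst[i] = src[idx[i] & 63].
fn permute_bytes(idx: &[u8; 64], src: &[u8; 64]) -> [u8; 64] {
    let mut dst = [0u8; 64];
    for i in 0..64 {
        dst[i] = src[(idx[i] & 63) as usize];
    }
    dst
}

/// Scalar model of `vpermi2b` (`_mm512_permutex2var_epi8`):
/// bit 6 of each index picks source `a` (0) or `b` (1),
/// and the low 6 bits pick the byte within that source.
fn permute2_bytes(a: &[u8; 64], idx: &[u8; 64], b: &[u8; 64]) -> [u8; 64] {
    let mut dst = [0u8; 64];
    for i in 0..64 {
        let sel = (idx[i] & 63) as usize;
        dst[i] = if idx[i] & 64 != 0 { b[sel] } else { a[sel] };
    }
    dst
}
```

Because one `vpermi2b` can read from 128 source bytes, index values above 63 are meaningful, which is why a gather table for it can contain entries up to 127.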

### 2. XOR/Shift Butterfly for 8x8 Bit Transpose
- 3-step butterfly algorithm using masks `0x00AA...`, `0x0000CCCC...`, `0x00000000F0F0F0F0`
- Transposes 8x8 bit matrix within each u64 in ~9 instructions per step
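
A minimal scalar sketch of the butterfly, assuming the elided masks are the ones from the widely used 8x8 bit-matrix transpose (shifts of 7, 14 and 28). The function name is hypothetical, and the actual implementation may apply the same steps lane-wise on 512-bit registers rather than on a single `u64`.

```rust
/// Transposes an 8x8 bit matrix stored row-major in a u64
/// (bit index = 8 * row + col). Each step swaps the off-diagonal
/// quadrants of progressively larger tiles.
fn transpose_8x8_bits(mut x: u64) -> u64 {
    // Step 1: swap single bits inside every 2x2 tile.
    let t = (x ^ (x >> 7)) & 0x00AA_00AA_00AA_00AA;
    x ^= t ^ (t << 7);
    // Step 2: swap 2x2 tiles inside every 4x4 tile.
    let t = (x ^ (x >> 14)) & 0x0000_CCCC_0000_CCCC;
    x ^= t ^ (t << 14);
    // Step 3: swap the two off-diagonal 4x4 quadrants.
    let t = (x ^ (x >> 28)) & 0x0000_0000_F0F0_F0F0;
    x ^= t ^ (t << 28);
    x
}
```

Each step moves a bit at `(row, col)` toward `(col, row)`, so applying the function twice returns the original word.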

### 3. Multi-block Processing for ILP
- Dual-block (`transpose_1024x2_vbmi`): ~5% improvement over single-block
- Quad-block (`transpose_1024x4_vbmi`): ~7% improvement over single-block
- Diminishing returns beyond 4 blocks
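
One way a caller might feed a multi-block kernel is sketched below. The `Block` alias, the `transpose_blocks` driver and both kernel signatures are assumptions for illustration; the real signatures of `transpose_1024_vbmi` and `transpose_1024x2_vbmi` are not shown in this document.

```rust
/// One FastLanes block: 1024 bits = 128 bytes.
type Block = [u8; 128];

/// Runs a dual-block kernel over pairs of blocks and falls back to a
/// single-block kernel for an odd trailing block.
/// Both kernels are passed in, so this driver makes no claim about
/// how the real kernels are declared.
fn transpose_blocks(
    input: &[Block],
    output: &mut [Block],
    single: impl Fn(&Block, &mut Block),
    dual: impl Fn(&[Block], &mut [Block]),
) {
    assert_eq!(input.len(), output.len());
    let mut in_pairs = input.chunks_exact(2);
    let mut out_pairs = output.chunks_exact_mut(2);
    // Main loop: two independent blocks per kernel call.
    for (src, dst) in in_pairs.by_ref().zip(out_pairs.by_ref()) {
        dual(src, dst);
    }
    // Tail: at most one leftover block.
    for (src, dst) in in_pairs.remainder().iter().zip(out_pairs.into_remainder()) {
        single(src, dst);
    }
}
```

The same shape extends to a quad-block kernel by chunking in fours and handling a tail of up to three blocks.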

## Static Permutation Tables

```rust
// GATHER_FIRST: Collects bytes 0,8,16,24,32,40,48,56 from each group
static GATHER_FIRST: [u8; 64] = [
    0, 16, 32, 48, 64, 80, 96, 112, // Group 0
    8, 24, 40, 56, 72, 88, 104, 120, // Group 1
    // ... etc
];

// SCATTER_8X8: 8x8 byte transpose pattern
static SCATTER_8X8: [u8; 64] = [
    0, 8, 16, 24, 32, 40, 48, 56, // byte 0 from each group
    1, 9, 17, 25, 33, 41, 49, 57, // byte 1 from each group
    // ... etc
];
```
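
The two `SCATTER_8X8` rows shown follow the rule "entry `row * 8 + col` holds `col * 8 + row`". Assuming the elided rows continue that pattern, the table could be generated at compile time instead of written out by hand. The names below are hypothetical.

```rust
/// Builds an 8x8 byte-transpose index table:
/// entry `row * 8 + col` holds `col * 8 + row`.
const fn scatter_8x8() -> [u8; 64] {
    let mut table = [0u8; 64];
    let mut i = 0;
    while i < 64 {
        // i % 8 is the column, i / 8 is the row.
        table[i] = ((i % 8) * 8 + i / 8) as u8;
        i += 1;
    }
    table
}

/// Evaluated at compile time, so there is no runtime setup cost.
static SCATTER_8X8_GENERATED: [u8; 64] = scatter_8x8();
```

Generating the table avoids transcription errors in 64 hand-typed indices, at the cost of making the literal values less visible in the source.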

## Files Modified

- `encodings/fastlanes/src/transpose/mod.rs` - Main implementations
- `encodings/fastlanes/examples/perf_transpose.rs` - Benchmark

## CPU Feature Requirements

| Implementation | Required Features |
|----------------|-------------------|
| baseline/scalar| None |
| bmi2 | BMI2 |
| avx2 | AVX2 |
| avx2_gfni | AVX2 + GFNI |
| avx512_gfni | AVX-512F/BW + GFNI|
| avx512_vbmi | AVX-512F/BW + VBMI|

## Recommendations

1. **Default**: Use `transpose_1024_vbmi` when VBMI is available (~13 cycles)
2. **Batch processing**: Use `transpose_1024x2_vbmi` or `transpose_1024x4_vbmi` for bulk operations
3. **Fallback chain**: VBMI → AVX-512+GFNI → BMI2 → scalar_fast → baseline
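
The fallback chain could be selected at runtime roughly as follows. This is a sketch under assumptions: the `Kernel` enum, `CpuFeatures` struct and function names are hypothetical, and only the feature requirements and ordering come from this document. Feature detection uses the standard `is_x86_feature_detected!` macro.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Vbmi,
    Avx512Gfni,
    Bmi2,
    ScalarFast,
}

#[derive(Debug, Clone, Copy, Default)]
struct CpuFeatures {
    avx512_vbmi: bool,
    avx512_gfni: bool,
    bmi2: bool,
}

/// Queries the CPU once; on non-x86_64 targets every flag stays false.
fn detect() -> CpuFeatures {
    #[allow(unused_mut)]
    let mut f = CpuFeatures::default();
    #[cfg(target_arch = "x86_64")]
    {
        let avx512 = is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512bw");
        f.avx512_vbmi = avx512 && is_x86_feature_detected!("avx512vbmi");
        f.avx512_gfni = avx512 && is_x86_feature_detected!("gfni");
        f.bmi2 = is_x86_feature_detected!("bmi2");
    }
    f
}

/// Applies the order VBMI -> AVX-512+GFNI -> BMI2 -> scalar_fast.
fn pick(f: CpuFeatures) -> Kernel {
    if f.avx512_vbmi {
        Kernel::Vbmi
    } else if f.avx512_gfni {
        Kernel::Avx512Gfni
    } else if f.bmi2 {
        Kernel::Bmi2
    } else {
        Kernel::ScalarFast
    }
}
```

Since `scalar_fast` needs no CPU features, feature detection alone never reaches `baseline`. Keeping `pick` separate from `detect` lets the ordering be unit-tested on any machine, and the result of `pick(detect())` can be cached so the check runs once rather than per call.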

## Future Work

- ARM NEON implementation (currently only a placeholder)
- Streaming stores for large array processing
- Integration with bitpacking encode/decode paths