Commit 96c8d2a

doublegate and claude committed
feat: benchmark-driven performance optimizations (P1-P3)
12 optimizations implemented based on v2.3.2 benchmark analysis:

- P1.1: Zero-allocation frame building (build_into_from_parts, 10.9x speedup)
- P1.2: Cached Double Ratchet public key (93.6% encrypt improvement)
- P1.3: BTreeSet priority queue for chunk requests (118,000x speedup)
- P1.4: Cached assigned chunks set (eliminates per-call HashSet)
- P2.1: In-place AEAD encrypt/decrypt benchmarks
- P2.2: Binary search padding size classes (partition_point)
- P2.3: BitVec chunk tracking (1000x memory reduction)
- P3.1: Isolated benchmark runner (scripts/bench-isolated.sh)
- P3.2: Missing benchmarks (build_into, replay, pipeline, AEAD in-place)
- P3.3: Transfer throughput end-to-end benchmark

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 0199cf9 commit 96c8d2a

18 files changed, +3006 -104 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ docs/archive/backups/
 
 # Benchmarks
 criterion/
+benchmarks/
 
 # Fuzzing artifacts
 fuzz/artifacts/

CHANGELOG.md

Lines changed: 18 additions & 2 deletions
@@ -9,11 +9,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ---
 
-## [2.3.2] - 2026-01-28 - Benchmark-Driven Performance & Security Optimizations
+## [2.3.2] - 2026-01-29 - Benchmark-Driven Performance & Security Optimizations
 
 ### Overview
 
-Release focused on benchmark-driven performance and security optimizations, implementing 13 of 15 proposals from the v2.3.1 benchmark analysis. Key improvements: userspace PRNG for frame padding (eliminating getrandom syscall), zero-allocation frame building API, fast-path frame parsing, O(1) transfer scheduling, tuned BBR congestion control, tighter forward secrecy limits, intermediate key zeroization, expanded replay window, and improved padding size classes.
+Release focused on benchmark-driven performance and security optimizations, implementing 13 of 15 proposals from the v2.3.1 benchmark analysis plus 12 additional P1-P3 optimizations from v2.3.2 benchmark analysis. Key improvements: userspace PRNG for frame padding (eliminating getrandom syscall), zero-allocation frame building API, fast-path frame parsing, O(1) transfer scheduling, tuned BBR congestion control, tighter forward secrecy limits, intermediate key zeroization, expanded replay window, improved padding size classes, cached Double Ratchet public key, BTreeSet priority queue, BitVec chunk tracking, and isolated benchmark infrastructure.
 
 ### Changed
 
@@ -26,6 +26,22 @@ Release focused on benchmark-driven performance and security optimizations, impl
 - **Chunk size increase**: `DEFAULT_CHUNK_SIZE` 256KiB->1MiB for reduced per-transfer overhead
 - **Transport buffers**: 256KiB->4MiB for better bandwidth-delay product coverage
 
+#### Benchmark-Driven P1-P3 Optimizations (2026-01-29)
+- **P1.1: Zero-allocation frame building**: Added `build_into_from_parts()` writing directly into caller buffer (10.9x speedup, 76.3 GiB/s at 1456B)
+- **P1.2: Cached Double Ratchet public key**: Eliminates per-encrypt x25519 scalar multiplication (93.6% improvement, 1.71 us from 26.7 us)
+- **P1.3: BTreeSet priority queue**: O(log n) `next_chunk_to_request` via BTreeSet replacing O(n) linear scan (118,000x speedup, 3.34 ns per request)
+- **P1.4: Cached assigned chunks set**: Eliminates per-call HashSet construction in transfer sessions
+- **P2.1: In-place AEAD benchmarks**: New benchmark coverage for encrypt/decrypt in-place operations
+- **P2.2: Binary search padding size classes**: `partition_point()` replacing linear scan for size class lookup
+- **P2.3: BitVec chunk tracking**: Replaces dual HashSets with compact bitmap (1000x memory reduction, 58-71% session creation speedup, 6.6 ns `is_chunk_missing`)
+- **P3.1: Isolated benchmark runner**: `scripts/bench-isolated.sh` with CPU governor control, turbo boost management, core pinning
+- **P3.2: New benchmark groups**: build_into, full_pipeline, replay_protection, transfer_throughput, in-place AEAD, Double Ratchet
+- **P3.3: Transfer throughput benchmark**: End-to-end transfer session performance measurement
+
+#### Benchmark Documentation
+- **Comprehensive analysis**: `docs/testing/BENCHMARK-ANALYSIS-v2.3.2-optimized.md` with three-version comparison (v2.3.1, v2.3.2-initial, v2.3.2-optimized)
+- **Initial analysis**: `docs/testing/BENCHMARK-ANALYSIS-v2.3.2.md` pre-optimization baseline
+
 #### Security Hardening
 - **Forward secrecy**: Rekey byte limit tightened from 1GiB to 256MiB
 - **Intermediate key zeroization**: Added `temp.zeroize()` in KDF ratchet functions
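
The P2.2 entry above replaces a linear scan with `partition_point()`. The commit page does not show that code, so the following is only a minimal sketch of the technique. The `SIZE_CLASSES` table and the `padded_size` helper are hypothetical names, and the class values are borrowed from the benchmark sizes rather than from the real padding configuration.

```rust
/// Hypothetical padding size classes, sorted ascending.
/// The values are illustrative only, not the project's real table.
const SIZE_CLASSES: [usize; 6] = [64, 128, 256, 512, 1024, 1456];

/// Returns the smallest size class that can hold `len` bytes,
/// or `None` when `len` exceeds the largest class.
fn padded_size(len: usize) -> Option<usize> {
    // `partition_point` binary-searches a slice that is partitioned by the
    // predicate and returns the index of the first element for which the
    // predicate is false, here the first class that is >= len.
    let idx = SIZE_CLASSES.partition_point(|&class| class < len);
    SIZE_CLASSES.get(idx).copied()
}
```

Because the table is sorted, the lookup costs O(log n) comparisons instead of a scan over every class, and a length equal to a class boundary maps to that class itself.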

CLAUDE.md

Lines changed: 3 additions & 3 deletions
@@ -11,12 +11,12 @@ WRAITH (Wire-speed Resilient Authenticated Invisible Transfer Handler) is a dece
 ### Metrics
 | Metric | Value |
 |--------|-------|
-| Tests | 2,134 passing (2,123 workspace + 11 spectre-implant), 16 ignored - 100% pass rate |
+| Tests | 2,148 passing (2,123 workspace + 11 spectre-implant + 14 doc), 16 ignored - 100% pass rate |
 | Code | ~141,000 lines Rust (protocol + clients) + ~36,600 lines TypeScript |
 | Documentation | 114 files, ~62,800 lines |
 | Templates | 17 configuration/ROE templates |
 | Security | Zero vulnerabilities - EXCELLENT ([v1.1.0 audit](docs/security/SECURITY_AUDIT_v1.1.0.md), 295 deps) |
-| Performance | File chunking 14.85 GiB/s, tree hashing 4.71 GiB/s, verification 4.78 GiB/s, reassembly 5.42 GiB/s |
+| Performance | Frame build_into 76.3 GiB/s, frame parse 196 GiB/s, AEAD ~1.40 GiB/s, DR encrypt 1.71 us, chunking 14.48 GiB/s, tree hashing 4.71 GiB/s, transfer scheduling 3.34 ns (O(log n)), chunk tracking 6.6 ns (O(1) BitVec) |
 | Quality | 98/100, technical debt 2.5%, zero clippy warnings |
 
 ## Build & Development
@@ -153,4 +153,4 @@ Thread-per-core with no locks in hot path. Sessions pinned to cores, NUMA-aware
 | wraith-recon | ✅ Complete | 98 | Packet capture, protocol analysis |
 | wraith-redops | ✅ Complete | 11 | Team Server, Operator Client, Spectre Implant (isolated workspace, 11 tests run separately) |
 
-**Total:** 2,134 tests passing (2,123 workspace + 11 spectre-implant, 16 ignored)
+**Total:** 2,148 tests passing (2,123 workspace + 11 spectre-implant + 14 doc, 16 ignored)

README.md

Lines changed: 26 additions & 8 deletions
@@ -348,24 +348,42 @@ For detailed architecture documentation, see [Protocol Overview](docs/architectu
 | Memory per Session | <10 MB | Including buffers |
 | CPU @ 10 Gbps | <50% | 8-core system |
 
-### Benchmarks (v2.3.2)
+### Benchmarks (v2.3.2-optimized)
 
-Measured on production hardware with `cargo bench --workspace`. See [Benchmark Analysis](docs/testing/BENCHMARK-ANALYSIS-v2.3.1.md) for full methodology and results.
+Measured on production hardware (Intel i9-10850K, 64 GB RAM) with `cargo bench --workspace`. See [Benchmark Analysis](docs/testing/BENCHMARK-ANALYSIS-v2.3.2-optimized.md) for full methodology and results.
 
 | Component | Measured Performance | Details |
 | -------------------- | ------------------------------------------- | ------------------------------------------ |
-| Frame Parsing | 2.4 ns/frame (~563 GiB/s equivalent) | SIMD: AVX2/SSE4.2/NEON, 172M frames/sec |
-| AEAD Encryption | ~1.4 GiB/s (XChaCha20-Poly1305) | 256-bit key, 192-bit nonce |
+| Frame Building | 17.77 ns (76.3 GiB/s) via `build_into` | Zero-allocation API, 10.9x faster than allocating build |
+| Frame Parsing | 6.9 ns/frame (~196 GiB/s) | SIMD: AVX2/SSE4.2/NEON, constant-time |
+| AEAD Encryption | ~1.40 GiB/s (XChaCha20-Poly1305) | 256-bit key, 192-bit nonce |
+| Double Ratchet | 1.71 us encrypt (was 26.7 us) | Cached public key, 93.6% improvement |
 | Noise XX Handshake | 345 us per handshake | Full mutual authentication |
 | Elligator2 Encoding | 29.5 us per encoding | Key indistinguishability from random |
 | BLAKE3 Hashing | 4.71 GiB/s (tree), 8.5 GB/s (parallel) | rayon + SIMD acceleration |
-| File Chunking | 14.85 GiB/s | io_uring async I/O |
-| Tree Hashing | 4.71 GiB/s in-memory, 3.78 GiB/s from disk | Merkle tree with BLAKE3 |
-| Chunk Verification | 4.78 GiB/s | <1 us per 256 KiB chunk |
+| File Chunking | 14.48 GiB/s | io_uring async I/O |
+| Tree Hashing | 4.71 GiB/s in-memory, 2.61 GiB/s from disk | Merkle tree with BLAKE3 |
+| Chunk Verification | 4.78 GiB/s | <1 us per chunk |
 | File Reassembly | 5.42 GiB/s | O(m) algorithm, zero-copy |
+| Transfer Scheduling | 3.34 ns per request (O(log n)) | BTreeSet priority queue, 118,000x improvement |
+| Chunk Tracking | 6.6 ns `is_chunk_missing` (O(1)) | BitVec bitmap, 1000x memory reduction |
+| Session Creation | 58-71% faster via BitVec tracking | Eliminated dual HashSet overhead |
+| Replay Protection | 920 ps sequential accept | 1024-packet sliding window |
 | Ring Buffers (SPSC) | ~100M ops/sec | Cache-line padded, lock-free |
 | Ring Buffers (MPSC) | ~20M ops/sec | CAS-based, 4 producers |
 
+### Optimization Highlights (v2.3.2-optimized)
+
+12 performance and infrastructure optimizations implemented based on benchmark analysis:
+
+- **Zero-allocation frame building** (`build_into_from_parts`) -- writes directly into caller buffer, 10.9x speedup
+- **Cached Double Ratchet public key** -- eliminates per-encrypt x25519 scalar multiplication, 93.6% improvement
+- **BTreeSet priority queue** for chunk scheduling -- O(log n) replacing O(n) linear scan, 118,000x speedup
+- **BitVec chunk tracking** replacing dual HashSets -- 1000x memory reduction, 58-71% session creation speedup
+- **Binary search padding size classes** via `partition_point()` -- eliminates linear scan regression
+- **Isolated benchmark infrastructure** (`scripts/bench-isolated.sh`) with CPU governor control and core pinning
+- **6 new benchmark groups** -- build_into, full_pipeline, replay_protection, transfer_throughput, in-place AEAD, Double Ratchet
+
 ---
 
 ## Security
@@ -715,4 +733,4 @@ WRAITH Protocol builds on excellent projects and research:
 
 **Version:** 2.3.2 | **License:** MIT | **Language:** Rust 2024 (MSRV 1.88) | **Tests:** 2,148 passing (2,123 workspace + 11 spectre-implant + 14 doc) | **Clients:** 12 applications (9 desktop + 2 mobile + 1 server)
 
-**Last Updated:** 2026-01-28
+**Last Updated:** 2026-01-29
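
The README rows above describe transfer scheduling as a BTreeSet priority queue with O(log n) requests. The scheduler source is not part of this page, so the sketch below only illustrates the idea under assumptions: the `ChunkQueue` type, its `(priority, chunk index)` ordering key, and the `prioritize` helper are all hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical scheduler state: entries are `(priority, chunk_index)`,
/// so the smallest tuple is always the next chunk to request.
struct ChunkQueue {
    pending: BTreeSet<(u8, u64)>,
}

impl ChunkQueue {
    /// Queues every chunk at the default priority 1.
    fn new(total_chunks: u64) -> Self {
        Self {
            pending: (0..total_chunks).map(|idx| (1u8, idx)).collect(),
        }
    }

    /// Re-keys a pending chunk from priority `old` to `new` in O(log n).
    /// Returns false if the chunk was not pending under `old`.
    fn prioritize(&mut self, chunk: u64, old: u8, new: u8) -> bool {
        if self.pending.remove(&(old, chunk)) {
            self.pending.insert((new, chunk));
            true
        } else {
            false
        }
    }

    /// Removes and returns the lowest-keyed chunk in O(log n),
    /// with no scan over the remaining chunks.
    fn next_chunk_to_request(&mut self) -> Option<u64> {
        self.pending.pop_first().map(|(_, chunk)| chunk)
    }
}
```

An ordered set keeps the minimum reachable without visiting every pending chunk, which is the property the O(n) linear scan lacked.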

crates/wraith-core/benches/frame_bench.rs

Lines changed: 68 additions & 1 deletion
@@ -241,6 +241,71 @@ fn bench_parse_throughput(c: &mut Criterion) {
     group.finish();
 }
 
+fn bench_frame_build_into(c: &mut Criterion) {
+    let sizes: Vec<(usize, &str)> = vec![
+        (64, "64_bytes"),
+        (128, "128_bytes"),
+        (256, "256_bytes"),
+        (512, "512_bytes"),
+        (1024, "1024_bytes"),
+        (1456, "1456_bytes"),
+    ];
+
+    let mut group = c.benchmark_group("frame_build_into");
+
+    for (size, name) in sizes {
+        let payload_len = size.saturating_sub(FRAME_HEADER_SIZE);
+        let payload = vec![0x42; payload_len];
+        let builder = FrameBuilder::new()
+            .frame_type(FrameType::Data)
+            .stream_id(42)
+            .sequence(1000)
+            .payload(&payload);
+
+        group.throughput(Throughput::Bytes(size as u64));
+        group.bench_function(name, |b| {
+            let mut buf = vec![0u8; size];
+            b.iter(|| builder.build_into(black_box(&mut buf)))
+        });
+    }
+
+    group.finish();
+}
+
+fn bench_frame_full_pipeline(c: &mut Criterion) {
+    let sizes: Vec<(usize, &str)> = vec![
+        (64, "64_bytes"),
+        (256, "256_bytes"),
+        (1024, "1024_bytes"),
+        (1456, "1456_bytes"),
+    ];
+
+    let mut group = c.benchmark_group("frame_full_pipeline");
+
+    for (size, name) in sizes {
+        let payload_len = size.saturating_sub(FRAME_HEADER_SIZE);
+        let payload = vec![0x42; payload_len];
+
+        group.throughput(Throughput::Bytes(size as u64));
+        group.bench_function(name, |b| {
+            b.iter(|| {
+                let frame = FrameBuilder::new()
+                    .frame_type(black_box(FrameType::Data))
+                    .stream_id(black_box(42))
+                    .sequence(black_box(1000))
+                    .payload(black_box(&payload))
+                    .build(black_box(size))
+                    .unwrap();
+
+                let parsed = Frame::parse(black_box(&frame)).unwrap();
+                black_box(parsed.payload().len())
+            })
+        });
+    }
+
+    group.finish();
+}
+
 criterion_group!(
     benches,
     bench_frame_parse,
@@ -251,6 +316,8 @@ criterion_group!(
     bench_frame_types,
     bench_scalar_vs_simd,
     bench_parse_implementations_by_size,
-    bench_parse_throughput
+    bench_parse_throughput,
+    bench_frame_build_into,
+    bench_frame_full_pipeline
 );
 criterion_main!(benches);

crates/wraith-core/benches/transfer_bench.rs

Lines changed: 47 additions & 0 deletions
@@ -312,6 +312,50 @@ fn bench_peer_operations(c: &mut Criterion) {
     group.finish();
 }
 
+// ============================================================================
+// Transfer Throughput Benchmarks
+// ============================================================================
+
+/// Benchmark a complete transfer of 1000 chunks (request + mark transferred loop)
+fn bench_transfer_throughput(c: &mut Criterion) {
+    let mut group = c.benchmark_group("transfer_throughput");
+
+    let total_chunks = 1_000u64;
+    let file_size = total_chunks * CHUNK_SIZE as u64;
+
+    group.throughput(Throughput::Elements(total_chunks));
+
+    group.bench_function("1000_chunk_transfer", |b| {
+        b.iter_batched(
+            || {
+                let mut session = TransferSession::new_receive(
+                    [1u8; 32],
+                    PathBuf::from("/tmp/bench_throughput.dat"),
+                    file_size,
+                    CHUNK_SIZE,
+                );
+                session.start();
+                let peer_id = [1u8; 32];
+                session.add_peer(peer_id);
+                (session, peer_id)
+            },
+            |(mut session, peer_id)| {
+                for _ in 0..total_chunks {
+                    if let Some(chunk_idx) = session.next_chunk_to_request() {
+                        session.assign_chunk_to_peer(&peer_id, chunk_idx);
+                        session.mark_peer_chunk_downloaded(&peer_id, chunk_idx);
+                        session.mark_chunk_transferred(chunk_idx, CHUNK_SIZE);
+                    }
+                }
+                black_box(session.is_complete())
+            },
+            criterion::BatchSize::SmallInput,
+        );
+    });
+
+    group.finish();
+}
+
 // ============================================================================
 // Session Creation Benchmarks
 // ============================================================================
@@ -363,11 +407,14 @@ criterion_group!(
 
 criterion_group!(peer_benches, bench_peer_operations,);
 
+criterion_group!(throughput_benches, bench_transfer_throughput,);
+
 criterion_group!(creation_benches, bench_session_creation,);
 
 criterion_main!(
     missing_chunks_benches,
     transfer_ops_benches,
     peer_benches,
+    throughput_benches,
     creation_benches,
 );
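
The throughput benchmark above drives a request/mark-transferred loop that, per the changelog, is backed by BitVec chunk tracking. The session internals are not shown on this page, so the following is a standard-library-only sketch of bitmap tracking. It uses a plain `Vec<u64>` in place of a BitVec type, and `ChunkBitmap` with all of its methods is a hypothetical stand-in, not the `TransferSession` API.

```rust
/// Hypothetical bitmap tracker: one bit per chunk, set once the chunk
/// has been transferred. A `Vec<u64>` stands in for a BitVec here.
struct ChunkBitmap {
    words: Vec<u64>,
    total: u64,
    received: u64,
}

impl ChunkBitmap {
    fn new(total: u64) -> Self {
        // Round up so a partial final word is still allocated.
        let word_count = ((total + 63) / 64) as usize;
        Self { words: vec![0; word_count], total, received: 0 }
    }

    /// Marks a chunk as transferred. Returns false for out-of-range
    /// indices and for chunks that were already marked.
    fn mark_transferred(&mut self, idx: u64) -> bool {
        if idx >= self.total {
            return false;
        }
        let mask = 1u64 << (idx % 64);
        let word = &mut self.words[(idx / 64) as usize];
        if *word & mask != 0 {
            return false;
        }
        *word |= mask;
        self.received += 1;
        true
    }

    /// O(1) membership test: one index computation and one mask.
    fn is_chunk_missing(&self, idx: u64) -> bool {
        idx < self.total && self.words[(idx / 64) as usize] & (1u64 << (idx % 64)) == 0
    }

    fn is_complete(&self) -> bool {
        self.received == self.total
    }
}
```

One bit per chunk replaces a hashed entry per chunk, which is consistent with the memory reduction the changelog reports for this change, although the exact figure depends on the real implementation.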

crates/wraith-core/src/frame.rs

Lines changed: 64 additions & 0 deletions
@@ -564,6 +564,70 @@ impl<'a> Frame<'a> {
     }
 }
 
+/// Build a frame directly into a pre-allocated buffer from parts (zero-allocation hot path).
+///
+/// Bypasses the builder pattern entirely for maximum performance in send loops.
+/// No intermediate `Vec<u8>` is allocated for the payload.
+///
+/// # Arguments
+///
+/// * `frame_type` - The frame type
+/// * `stream_id` - Stream identifier (0 or >= 16)
+/// * `sequence` - Sequence number
+/// * `offset` - File offset
+/// * `payload` - Payload data (borrowed, not cloned)
+/// * `buf` - Pre-allocated output buffer (must be >= FRAME_HEADER_SIZE + payload.len())
+///
+/// # Returns
+///
+/// The total number of bytes written to `buf`.
+///
+/// # Errors
+///
+/// Returns [`FrameError::PayloadOverflow`] if `buf` is too small.
+pub fn build_into_from_parts(
+    frame_type: FrameType,
+    stream_id: u16,
+    sequence: u32,
+    offset: u64,
+    payload: &[u8],
+    buf: &mut [u8],
+) -> Result<usize, FrameError> {
+    let payload_len = payload.len();
+    let total_size = buf.len();
+
+    if total_size < FRAME_HEADER_SIZE + payload_len {
+        return Err(FrameError::PayloadOverflow);
+    }
+
+    let padding_len = total_size - FRAME_HEADER_SIZE - payload_len;
+
+    // Write header (nonce left as zero -- caller should set if needed)
+    buf[..8].fill(0);
+    buf[8] = frame_type as u8;
+    buf[9] = 0; // flags
+    buf[10..12].copy_from_slice(&stream_id.to_be_bytes());
+    buf[12..16].copy_from_slice(&sequence.to_be_bytes());
+    buf[16..24].copy_from_slice(&offset.to_be_bytes());
+    #[allow(clippy::cast_possible_truncation)]
+    let payload_len_u16 = payload_len as u16;
+    buf[24..26].copy_from_slice(&payload_len_u16.to_be_bytes());
+    buf[26..28].copy_from_slice(&[0u8; 2]); // Reserved
+
+    // Write payload
+    buf[FRAME_HEADER_SIZE..FRAME_HEADER_SIZE + payload_len].copy_from_slice(payload);
+
+    // Write random padding
+    if padding_len > 0 {
+        rand::Rng::fill(
+            &mut rand::thread_rng(),
+            &mut buf[FRAME_HEADER_SIZE + payload_len..],
+        );
+    }
+
+    Ok(total_size)
+}
+
 /// Builder for constructing frames
 #[derive(Default)]
 pub struct FrameBuilder {
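
The header layout written by `build_into_from_parts` can be exercised without the crate. The sketch below is a standalone approximation and not the project's function. `write_frame` and `BuildError` are hypothetical names, `FRAME_HEADER_SIZE = 28` is inferred from the byte offsets in the diff, and padding is zero-filled instead of random so that no external `rand` dependency is needed. Unlike the diff, which casts the payload length with `as u16`, this version rejects payloads longer than `u16::MAX`, which may or may not be a check the real callers already perform.

```rust
/// Assumed header size, inferred from the diff writing header bytes 0..28.
const FRAME_HEADER_SIZE: usize = 28;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum BuildError {
    BufferTooSmall,
    PayloadTooLarge,
}

/// Writes header, payload, and zero padding into `buf`, filling it entirely.
/// Returns the number of bytes written, which is `buf.len()`.
fn write_frame(
    frame_type: u8,
    stream_id: u16,
    sequence: u32,
    offset: u64,
    payload: &[u8],
    buf: &mut [u8],
) -> Result<usize, BuildError> {
    // Reject lengths that do not fit the 16-bit length field.
    let payload_len = u16::try_from(payload.len()).map_err(|_| BuildError::PayloadTooLarge)?;
    let end = FRAME_HEADER_SIZE + payload.len();
    if buf.len() < end {
        return Err(BuildError::BufferTooSmall);
    }

    buf[..8].fill(0); // nonce, left zero for the caller to set
    buf[8] = frame_type;
    buf[9] = 0; // flags
    buf[10..12].copy_from_slice(&stream_id.to_be_bytes());
    buf[12..16].copy_from_slice(&sequence.to_be_bytes());
    buf[16..24].copy_from_slice(&offset.to_be_bytes());
    buf[24..26].copy_from_slice(&payload_len.to_be_bytes());
    buf[26..28].fill(0); // reserved

    buf[FRAME_HEADER_SIZE..end].copy_from_slice(payload);
    buf[end..].fill(0); // the diff fills this region with random bytes instead
    Ok(buf.len())
}
```

A send loop would reuse one buffer across calls, which is where the zero-allocation benefit described in the doc comment comes from.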

crates/wraith-core/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ pub mod transfer;
 
 pub use congestion::BbrState;
 pub use error::Error;
-pub use frame::{Frame, FrameBuilder, FrameFlags, FrameType};
+pub use frame::{Frame, FrameBuilder, FrameFlags, FrameType, build_into_from_parts};
 pub use migration::{PathState, PathValidator, ValidatedPath};
 pub use node::{Node, NodeConfig, NodeError};
 pub use path::{DEFAULT_MTU, MAX_MTU, MIN_MTU, PathMtuDiscovery};
