doublegate
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 18 additions & 2 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 3 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 26 additions & 8 deletions
diff --git a/crates/wraith-core/benches/frame_bench.rs b/crates/wraith-core/benches/frame_bench.rs
Lines changed: 68 additions & 1 deletion
diff --git a/crates/wraith-core/benches/transfer_bench.rs b/crates/wraith-core/benches/transfer_bench.rs
Lines changed: 47 additions & 0 deletions
diff --git a/crates/wraith-core/src/frame.rs b/crates/wraith-core/src/frame.rs
Lines changed: 64 additions & 0 deletions
diff --git a/crates/wraith-core/src/lib.rs b/crates/wraith-core/src/lib.rs
Lines changed: 1 addition & 1 deletion
@@ -57,6 +57,7 @@ docs/archive/backups/
 
 # Benchmarks
 criterion/
+benchmarks/
 
 # Fuzzing artifacts
 fuzz/artifacts/
 
@@ -9,11 +9,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ---
 
-## [2.3.2] - 2026-01-28 - Benchmark-Driven Performance & Security Optimizations
+## [2.3.2] - 2026-01-29 - Benchmark-Driven Performance & Security Optimizations
 
 ### Overview
 
-Release focused on benchmark-driven performance and security optimizations, implementing 13 of 15 proposals from the v2.3.1 benchmark analysis. Key improvements: userspace PRNG for frame padding (eliminating getrandom syscall), zero-allocation frame building API, fast-path frame parsing, O(1) transfer scheduling, tuned BBR congestion control, tighter forward secrecy limits, intermediate key zeroization, expanded replay window, and improved padding size classes.
+Release focused on benchmark-driven performance and security optimizations, implementing 13 of 15 proposals from the v2.3.1 benchmark analysis plus 12 additional P1-P3 optimizations from the v2.3.2 benchmark analysis. Key improvements: userspace PRNG for frame padding (eliminating getrandom syscall), zero-allocation frame building API, fast-path frame parsing, O(1) transfer scheduling, tuned BBR congestion control, tighter forward secrecy limits, intermediate key zeroization, expanded replay window, improved padding size classes, cached Double Ratchet public key, BTreeSet priority queue, BitVec chunk tracking, and isolated benchmark infrastructure.
 
 ### Changed
 
@@ -26,6 +26,22 @@ Release focused on benchmark-driven performance and security optimizations, impl
 - **Chunk size increase**: `DEFAULT_CHUNK_SIZE` 256KiB->1MiB for reduced per-transfer overhead
 - **Transport buffers**: 256KiB->4MiB for better bandwidth-delay product coverage
 
+#### Benchmark-Driven P1-P3 Optimizations (2026-01-29)
+- **P1.1: Zero-allocation frame building**: Added `build_into_from_parts()` writing directly into caller buffer (10.9x speedup, 76.3 GiB/s at 1456B)
+- **P1.2: Cached Double Ratchet public key**: Eliminates per-encrypt x25519 scalar multiplication (93.6% improvement, 1.71 us from 26.7 us)
+- **P1.3: BTreeSet priority queue**: O(log n) `next_chunk_to_request` via BTreeSet replacing O(n) linear scan (118,000x speedup, 3.34 ns per request)
+- **P1.4: Cached assigned chunks set**: Eliminates per-call HashSet construction in transfer sessions
+- **P2.1: In-place AEAD benchmarks**: New benchmark coverage for encrypt/decrypt in-place operations
+- **P2.2: Binary search padding size classes**: `partition_point()` replacing linear scan for size class lookup
+- **P2.3: BitVec chunk tracking**: Replaces dual HashSets with compact bitmap (1000x memory reduction, 58-71% session creation speedup, 6.6 ns `is_chunk_missing`)
+- **P3.1: Isolated benchmark runner**: `scripts/bench-isolated.sh` with CPU governor control, turbo boost management, core pinning
+- **P3.2: New benchmark groups**: build_into, full_pipeline, replay_protection, transfer_throughput, in-place AEAD, Double Ratchet
+- **P3.3: Transfer throughput benchmark**: End-to-end transfer session performance measurement
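
Note on P2.2: the entry names `partition_point()` as the replacement for the linear size class scan. A minimal sketch of that lookup follows. The `SIZE_CLASSES` values and the `padded_size` helper are hypothetical and chosen only for illustration; they are not taken from the WRAITH source.

```rust
/// Hypothetical padding size classes in ascending order (illustrative values only).
const SIZE_CLASSES: [usize; 6] = [64, 128, 256, 512, 1024, 1456];

/// Returns the smallest size class that can hold `len` bytes,
/// or `None` when `len` exceeds the largest class.
fn padded_size(len: usize) -> Option<usize> {
    // `partition_point` binary-searches a sorted slice and returns the index
    // of the first element for which the predicate is false, which here is
    // the first class that is >= len.
    let idx = SIZE_CLASSES.partition_point(|&class| class < len);
    SIZE_CLASSES.get(idx).copied()
}
```

With these assumed classes, `padded_size(65)` selects the 128-byte class and any length above 1456 yields `None`. The lookup only works if the class table stays sorted.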
+
+#### Benchmark Documentation
+- **Comprehensive analysis**: `docs/testing/BENCHMARK-ANALYSIS-v2.3.2-optimized.md` with three-version comparison (v2.3.1, v2.3.2-initial, v2.3.2-optimized)
+- **Initial analysis**: `docs/testing/BENCHMARK-ANALYSIS-v2.3.2.md` pre-optimization baseline
+
 #### Security Hardening
 - **Forward secrecy**: Rekey byte limit tightened from 1GiB to 256MiB
 - **Intermediate key zeroization**: Added `temp.zeroize()` in KDF ratchet functions
 
@@ -11,12 +11,12 @@ WRAITH (Wire-speed Resilient Authenticated Invisible Transfer Handler) is a dece
 ### Metrics
 | Metric | Value |
 |--------|-------|
-| Tests | 2,134 passing (2,123 workspace + 11 spectre-implant), 16 ignored - 100% pass rate |
+| Tests | 2,148 passing (2,123 workspace + 11 spectre-implant + 14 doc), 16 ignored - 100% pass rate |
 | Code | ~141,000 lines Rust (protocol + clients) + ~36,600 lines TypeScript |
 | Documentation | 114 files, ~62,800 lines |
 | Templates | 17 configuration/ROE templates |
 | Security | Zero vulnerabilities - EXCELLENT ([v1.1.0 audit](docs/security/SECURITY_AUDIT_v1.1.0.md), 295 deps) |
-| Performance | File chunking 14.85 GiB/s, tree hashing 4.71 GiB/s, verification 4.78 GiB/s, reassembly 5.42 GiB/s |
+| Performance | Frame build_into 76.3 GiB/s, frame parse 196 GiB/s, AEAD ~1.40 GiB/s, DR encrypt 1.71 us, chunking 14.48 GiB/s, tree hashing 4.71 GiB/s, transfer scheduling 3.34 ns (O(log n)), chunk tracking 6.6 ns (O(1) BitVec) |
 | Quality | 98/100, technical debt 2.5%, zero clippy warnings |
 
 ## Build & Development
@@ -153,4 +153,4 @@ Thread-per-core with no locks in hot path. Sessions pinned to cores, NUMA-aware
 | wraith-recon | ✅ Complete | 98 | Packet capture, protocol analysis |
 | wraith-redops | ✅ Complete | 11 | Team Server, Operator Client, Spectre Implant (isolated workspace, 11 tests run separately) |
 
-**Total:** 2,134 tests passing (2,123 workspace + 11 spectre-implant, 16 ignored)
+**Total:** 2,148 tests passing (2,123 workspace + 11 spectre-implant + 14 doc, 16 ignored)
@@ -348,24 +348,42 @@ For detailed architecture documentation, see [Protocol Overview](docs/architectu
 | Memory per Session  | <10 MB    | Including buffers     |
 | CPU @ 10 Gbps       | <50%      | 8-core system         |
 
-### Benchmarks (v2.3.2)
+### Benchmarks (v2.3.2-optimized)
 
-Measured on production hardware with `cargo bench --workspace`. See [Benchmark Analysis](docs/testing/BENCHMARK-ANALYSIS-v2.3.1.md) for full methodology and results.
+Measured on production hardware (Intel i9-10850K, 64 GB RAM) with `cargo bench --workspace`. See [Benchmark Analysis](docs/testing/BENCHMARK-ANALYSIS-v2.3.2-optimized.md) for full methodology and results.
 
 | Component            | Measured Performance                        | Details                                    |
 | -------------------- | ------------------------------------------- | ------------------------------------------ |
-| Frame Parsing        | 2.4 ns/frame (~563 GiB/s equivalent)       | SIMD: AVX2/SSE4.2/NEON, 172M frames/sec   |
-| AEAD Encryption      | ~1.4 GiB/s (XChaCha20-Poly1305)            | 256-bit key, 192-bit nonce                 |
+| Frame Building       | 17.77 ns (76.3 GiB/s) via `build_into`     | Zero-allocation API, 10.9x faster than allocating build |
+| Frame Parsing        | 6.9 ns/frame (~196 GiB/s)                  | SIMD: AVX2/SSE4.2/NEON, constant-time     |
+| AEAD Encryption      | ~1.40 GiB/s (XChaCha20-Poly1305)           | 256-bit key, 192-bit nonce                 |
+| Double Ratchet       | 1.71 us encrypt (was 26.7 us)              | Cached public key, 93.6% improvement       |
 | Noise XX Handshake   | 345 us per handshake                        | Full mutual authentication                 |
 | Elligator2 Encoding  | 29.5 us per encoding                        | Key indistinguishability from random       |
 | BLAKE3 Hashing       | 4.71 GiB/s (tree), 8.5 GB/s (parallel)     | rayon + SIMD acceleration                  |
-| File Chunking        | 14.85 GiB/s                                 | io_uring async I/O                         |
-| Tree Hashing         | 4.71 GiB/s in-memory, 3.78 GiB/s from disk | Merkle tree with BLAKE3                    |
-| Chunk Verification   | 4.78 GiB/s                                  | <1 us per 256 KiB chunk                    |
+| File Chunking        | 14.48 GiB/s                                 | io_uring async I/O                         |
+| Tree Hashing         | 4.71 GiB/s in-memory, 2.61 GiB/s from disk | Merkle tree with BLAKE3                    |
+| Chunk Verification   | 4.78 GiB/s                                  | <1 us per chunk                            |
 | File Reassembly      | 5.42 GiB/s                                  | O(m) algorithm, zero-copy                  |
+| Transfer Scheduling  | 3.34 ns per request (O(log n))              | BTreeSet priority queue, 118,000x improvement |
+| Chunk Tracking       | 6.6 ns `is_chunk_missing` (O(1))            | BitVec bitmap, 1000x memory reduction      |
+| Session Creation     | 58-71% faster via BitVec tracking           | Eliminated dual HashSet overhead           |
+| Replay Protection    | 920 ps sequential accept                    | 1024-packet sliding window                 |
 | Ring Buffers (SPSC)  | ~100M ops/sec                               | Cache-line padded, lock-free               |
 | Ring Buffers (MPSC)  | ~20M ops/sec                                | CAS-based, 4 producers                     |
 
+### Optimization Highlights (v2.3.2-optimized)
+
+12 performance and infrastructure optimizations implemented based on benchmark analysis:
+
+- **Zero-allocation frame building** (`build_into_from_parts`) -- writes directly into caller buffer, 10.9x speedup
+- **Cached Double Ratchet public key** -- eliminates per-encrypt x25519 scalar multiplication, 93.6% improvement
+- **BTreeSet priority queue** for chunk scheduling -- O(log n) replacing O(n) linear scan, 118,000x speedup
+- **BitVec chunk tracking** replacing dual HashSets -- 1000x memory reduction, 58-71% session creation speedup
+- **Binary search padding size classes** via `partition_point()` -- eliminates linear scan regression
+- **Isolated benchmark infrastructure** (`scripts/bench-isolated.sh`) with CPU governor control and core pinning
+- **6 new benchmark groups** -- build_into, full_pipeline, replay_protection, transfer_throughput, in-place AEAD, Double Ratchet
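
Note on the BitVec chunk tracking item: the idea of replacing hash sets with one bit per chunk can be sketched as below. This is an assumption-laden illustration that uses a plain `Vec<u64>` rather than whatever bit vector type the crate actually uses; `ChunkBitmap` and `mark_received` are hypothetical names, and only `is_chunk_missing` appears in the text above.

```rust
/// Hypothetical bitmap recording which chunks have been received (one bit per chunk).
struct ChunkBitmap {
    words: Vec<u64>,
    total_chunks: usize,
    received: usize,
}

impl ChunkBitmap {
    fn new(total_chunks: usize) -> Self {
        Self {
            // One u64 word covers 64 chunks; round up for the last partial word.
            words: vec![0; total_chunks.div_ceil(64)],
            total_chunks,
            received: 0,
        }
    }

    /// O(1): one bounds check, one word load, one shift and mask.
    fn is_chunk_missing(&self, idx: usize) -> bool {
        idx < self.total_chunks && (self.words[idx / 64] >> (idx % 64)) & 1 == 0
    }

    /// Marks a chunk as received. Returns `false` if the index is out of
    /// range or the chunk was already marked.
    fn mark_received(&mut self, idx: usize) -> bool {
        if !self.is_chunk_missing(idx) {
            return false;
        }
        self.words[idx / 64] |= 1u64 << (idx % 64);
        self.received += 1;
        true
    }

    fn is_complete(&self) -> bool {
        self.received == self.total_chunks
    }
}
```

Keeping a running `received` count makes the completion check constant time as well, at the cost of one extra field.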
+
 ---
 
 ## Security
@@ -715,4 +733,4 @@ WRAITH Protocol builds on excellent projects and research:
 
 **Version:** 2.3.2 | **License:** MIT | **Language:** Rust 2024 (MSRV 1.88) | **Tests:** 2,148 passing (2,123 workspace + 11 spectre-implant + 14 doc) | **Clients:** 12 applications (9 desktop + 2 mobile + 1 server)
 
-**Last Updated:** 2026-01-28
+**Last Updated:** 2026-01-29
@@ -241,6 +241,71 @@ fn bench_parse_throughput(c: &mut Criterion) {
     group.finish();
 }
 
+fn bench_frame_build_into(c: &mut Criterion) {
+    let sizes: Vec<(usize, &str)> = vec![
+        (64, "64_bytes"),
+        (128, "128_bytes"),
+        (256, "256_bytes"),
+        (512, "512_bytes"),
+        (1024, "1024_bytes"),
+        (1456, "1456_bytes"),
+    ];
+
+    let mut group = c.benchmark_group("frame_build_into");
+
+    for (size, name) in sizes {
+        let payload_len = size.saturating_sub(FRAME_HEADER_SIZE);
+        let payload = vec![0x42; payload_len];
+        let builder = FrameBuilder::new()
+            .frame_type(FrameType::Data)
+            .stream_id(42)
+            .sequence(1000)
+            .payload(&payload);
+
+        group.throughput(Throughput::Bytes(size as u64));
+        group.bench_function(name, |b| {
+            let mut buf = vec![0u8; size];
+            b.iter(|| builder.build_into(black_box(&mut buf)))
+        });
+    }
+
+    group.finish();
+}
+
+fn bench_frame_full_pipeline(c: &mut Criterion) {
+    let sizes: Vec<(usize, &str)> = vec![
+        (64, "64_bytes"),
+        (256, "256_bytes"),
+        (1024, "1024_bytes"),
+        (1456, "1456_bytes"),
+    ];
+
+    let mut group = c.benchmark_group("frame_full_pipeline");
+
+    for (size, name) in sizes {
+        let payload_len = size.saturating_sub(FRAME_HEADER_SIZE);
+        let payload = vec![0x42; payload_len];
+
+        group.throughput(Throughput::Bytes(size as u64));
+        group.bench_function(name, |b| {
+            b.iter(|| {
+                let frame = FrameBuilder::new()
+                    .frame_type(black_box(FrameType::Data))
+                    .stream_id(black_box(42))
+                    .sequence(black_box(1000))
+                    .payload(black_box(&payload))
+                    .build(black_box(size))
+                    .unwrap();
+
+                let parsed = Frame::parse(black_box(&frame)).unwrap();
+                black_box(parsed.payload().len())
+            })
+        });
+    }
+
+    group.finish();
+}
+
 criterion_group!(
     benches,
     bench_frame_parse,
@@ -251,6 +316,8 @@ criterion_group!(
     bench_frame_types,
     bench_scalar_vs_simd,
     bench_parse_implementations_by_size,
-    bench_parse_throughput
+    bench_parse_throughput,
+    bench_frame_build_into,
+    bench_frame_full_pipeline
 );
 criterion_main!(benches);
@@ -312,6 +312,50 @@ fn bench_peer_operations(c: &mut Criterion) {
     group.finish();
 }
 
+// ============================================================================
+// Transfer Throughput Benchmarks
+// ============================================================================
+
+/// Benchmark a complete transfer of 1000 chunks (request + mark transferred loop)
+fn bench_transfer_throughput(c: &mut Criterion) {
+    let mut group = c.benchmark_group("transfer_throughput");
+
+    let total_chunks = 1_000u64;
+    let file_size = total_chunks * CHUNK_SIZE as u64;
+
+    group.throughput(Throughput::Elements(total_chunks));
+
+    group.bench_function("1000_chunk_transfer", |b| {
+        b.iter_batched(
+            || {
+                let mut session = TransferSession::new_receive(
+                    [1u8; 32],
+                    PathBuf::from("/tmp/bench_throughput.dat"),
+                    file_size,
+                    CHUNK_SIZE,
+                );
+                session.start();
+                let peer_id = [1u8; 32];
+                session.add_peer(peer_id);
+                (session, peer_id)
+            },
+            |(mut session, peer_id)| {
+                for _ in 0..total_chunks {
+                    if let Some(chunk_idx) = session.next_chunk_to_request() {
+                        session.assign_chunk_to_peer(&peer_id, chunk_idx);
+                        session.mark_peer_chunk_downloaded(&peer_id, chunk_idx);
+                        session.mark_chunk_transferred(chunk_idx, CHUNK_SIZE);
+                    }
+                }
+                black_box(session.is_complete())
+            },
+            criterion::BatchSize::SmallInput,
+        );
+    });
+
+    group.finish();
+}
+
 // ============================================================================
 // Session Creation Benchmarks
 // ============================================================================
@@ -363,11 +407,14 @@ criterion_group!(
 
 criterion_group!(peer_benches, bench_peer_operations,);
 
+criterion_group!(throughput_benches, bench_transfer_throughput,);
+
 criterion_group!(creation_benches, bench_session_creation,);
 
 criterion_main!(
     missing_chunks_benches,
     transfer_ops_benches,
     peer_benches,
+    throughput_benches,
     creation_benches,
 );
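
Note on `next_chunk_to_request`: the throughput benchmark above calls it once per chunk, and the changelog attributes its O(log n) cost to a BTreeSet priority queue. The session internals are not part of this diff, so the following is only a sketch of how such a queue could behave. `ChunkQueue`, `prioritize`, and the `(priority, index)` key are hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical scheduler: pending chunks kept ordered by (priority, chunk index).
/// A lower priority value is served first.
struct ChunkQueue {
    pending: BTreeSet<(u8, u64)>,
}

impl ChunkQueue {
    /// Starts with every chunk pending at the default priority 1.
    fn new(total_chunks: u64) -> Self {
        Self {
            pending: (0..total_chunks).map(|idx| (1u8, idx)).collect(),
        }
    }

    /// Promotes a pending default-priority chunk to priority 0. O(log n).
    fn prioritize(&mut self, idx: u64) {
        if self.pending.remove(&(1, idx)) {
            self.pending.insert((0, idx));
        }
    }

    /// Removes and returns the chunk with the smallest (priority, index) key.
    /// O(log n), with no scan over the pending set.
    fn next_chunk_to_request(&mut self) -> Option<u64> {
        self.pending.pop_first().map(|(_, idx)| idx)
    }
}
```

Because the set is ordered, ties within one priority fall back to ascending chunk index, which keeps requests roughly sequential.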
@@ -564,6 +564,70 @@ impl<'a> Frame<'a> {
     }
 }
 
+/// Build a frame directly into a pre-allocated buffer from parts (zero-allocation hot path).
+///
+/// Bypasses the builder pattern entirely for maximum performance in send loops.
+/// No intermediate `Vec<u8>` is allocated for the payload.
+///
+/// # Arguments
+///
+/// * `frame_type` - The frame type
+/// * `stream_id` - Stream identifier (0 or >= 16)
+/// * `sequence` - Sequence number
+/// * `offset` - File offset
+/// * `payload` - Payload data (borrowed, not cloned)
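+/// * `buf` - Pre-allocated output buffer; its full length becomes the frame size (must be >= FRAME_HEADER_SIZE + payload.len(), remainder is filled with random padding)
+///
+/// # Returns
+///
+/// The total number of bytes written to `buf`, which is always `buf.len()`.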
+///
+/// # Errors
+///
+/// Returns [`FrameError::PayloadOverflow`] if `buf` is too small.
+pub fn build_into_from_parts(
+    frame_type: FrameType,
+    stream_id: u16,
+    sequence: u32,
+    offset: u64,
+    payload: &[u8],
+    buf: &mut [u8],
+) -> Result<usize, FrameError> {
+    let payload_len = payload.len();
+    let total_size = buf.len();
+
+    if total_size < FRAME_HEADER_SIZE + payload_len {
+        return Err(FrameError::PayloadOverflow);
+    }
+
+    let padding_len = total_size - FRAME_HEADER_SIZE - payload_len;
+
+    // Write header (nonce left as zero -- caller should set if needed)
+    buf[..8].fill(0);
+    buf[8] = frame_type as u8;
+    buf[9] = 0; // flags
+    buf[10..12].copy_from_slice(&stream_id.to_be_bytes());
+    buf[12..16].copy_from_slice(&sequence.to_be_bytes());
+    buf[16..24].copy_from_slice(&offset.to_be_bytes());
+    #[allow(clippy::cast_possible_truncation)]
+    let payload_len_u16 = payload_len as u16;
+    buf[24..26].copy_from_slice(&payload_len_u16.to_be_bytes());
+    buf[26..28].copy_from_slice(&[0u8; 2]); // Reserved
+
+    // Write payload
+    buf[FRAME_HEADER_SIZE..FRAME_HEADER_SIZE + payload_len].copy_from_slice(payload);
+
+    // Write random padding
+    if padding_len > 0 {
+        rand::Rng::fill(
+            &mut rand::thread_rng(),
+            &mut buf[FRAME_HEADER_SIZE + payload_len..],
+        );
+    }
+
+    Ok(total_size)
+}
+
 /// Builder for constructing frames
 #[derive(Default)]
 pub struct FrameBuilder {
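
Note on `build_into_from_parts`: the header layout it writes can be exercised in isolation with the standalone sketch below. It is not the crate's function. `build_frame` and `BuildError` are hypothetical names, `FRAME_HEADER_SIZE = 28` is inferred from the byte offsets in the hunk above, and padding is zero-filled as a stand-in for the random fill. The sketch also rejects payloads longer than `u16::MAX` instead of casting, since the `as u16` cast in the diff would truncate the length field if a caller ever passed such a payload with a large enough buffer.

```rust
/// Header size inferred from the offsets written in the diff (assumption).
const FRAME_HEADER_SIZE: usize = 28;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum BuildError {
    BufferTooSmall,
    PayloadTooLarge,
}

/// Writes header and payload into `buf`. The whole buffer is the frame, so
/// the return value is always `buf.len()`.
fn build_frame(
    frame_type: u8,
    stream_id: u16,
    sequence: u32,
    offset: u64,
    payload: &[u8],
    buf: &mut [u8],
) -> Result<usize, BuildError> {
    // Reject payloads whose length does not fit the 16-bit length field.
    let payload_len =
        u16::try_from(payload.len()).map_err(|_| BuildError::PayloadTooLarge)?;
    let payload_end = FRAME_HEADER_SIZE + payload.len();
    if buf.len() < payload_end {
        return Err(BuildError::BufferTooSmall);
    }

    buf[..8].fill(0); // nonce placeholder, as in the diff
    buf[8] = frame_type;
    buf[9] = 0; // flags
    buf[10..12].copy_from_slice(&stream_id.to_be_bytes());
    buf[12..16].copy_from_slice(&sequence.to_be_bytes());
    buf[16..24].copy_from_slice(&offset.to_be_bytes());
    buf[24..26].copy_from_slice(&payload_len.to_be_bytes());
    buf[26..28].fill(0); // reserved
    buf[FRAME_HEADER_SIZE..payload_end].copy_from_slice(payload);
    buf[payload_end..].fill(0); // stand-in for the random padding in the diff
    Ok(buf.len())
}
```

A send loop would reuse one buffer sized to the chosen padding class and call the function once per frame, so no per-frame allocation occurs.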
 
@@ -88,7 +88,7 @@ pub mod transfer;
 
 pub use congestion::BbrState;
 pub use error::Error;
-pub use frame::{Frame, FrameBuilder, FrameFlags, FrameType};
+pub use frame::{Frame, FrameBuilder, FrameFlags, FrameType, build_into_from_parts};
 pub use migration::{PathState, PathValidator, ValidatedPath};
 pub use node::{Node, NodeConfig, NodeError};
 pub use path::{DEFAULT_MTU, MAX_MTU, MIN_MTU, PathMtuDiscovery};