Commit 644847c

iakovenkos and claude authored

chore: pippenger int audit (#19302)

clean up + docs + a couple of edge case tests. Closes AztecProtocol/barretenberg#486

---------

Co-authored-by: Claude Sonnet 4.5 <[email protected]>

1 parent 7ea151a commit 644847c

File tree: 9 files changed, +1322 −1144 lines

barretenberg/cpp/CLAUDE.md

Lines changed: 8 additions & 0 deletions

@@ -2,6 +2,14 @@ succint aztec-packages cheat sheet.
 
 THE PROJECT ROOT IS AT TWO LEVELS ABOVE THIS FOLDER. Typically, the repository is at ~/aztec-packages. all advice is from the root.
 
+# Git workflow for barretenberg
+
+**IMPORTANT**: When comparing branches or looking at diffs for barretenberg work, use `merge-train/barretenberg` as the base branch, NOT `master`. The master branch is often outdated for barretenberg development.
+
+Examples:
+- `git diff merge-train/barretenberg...HEAD` (not `git diff master...HEAD`)
+- `git log merge-train/barretenberg..HEAD` (not `git log master..HEAD`)
+
 Run ./bootstrap.sh at the top-level to be sure the repo fully builds.
 Bootstrap scripts can be called with relative paths e.g. ../barretenberg/bootstrap.sh

barretenberg/cpp/scripts/compare_branch_vs_baseline_remote.sh

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ PRESET=${3:-clang20}
 BUILD_DIR=${4:-build}
 HARDWARE_CONCURRENCY=${HARDWARE_CONCURRENCY:-16}
 
-BASELINE_BRANCH="master"
+BASELINE_BRANCH="${BASELINE_BRANCH:-merge-train/barretenberg}"
 BENCH_TOOLS_DIR="$BUILD_DIR/_deps/benchmark-src/tools"
 
 if [ ! -z "$(git status --untracked-files=no --porcelain)" ]; then
barretenberg/cpp/src/barretenberg/ecc/scalar_multiplication/README.md

Lines changed: 183 additions & 0 deletions

@@ -0,0 +1,183 @@
# Pippenger Multi-Scalar Multiplication (MSM)

## Overview
The Pippenger algorithm computes multi-scalar multiplications:

$$\text{MSM}(\vec{s}, \vec{P}) = \sum_{i=0}^{n-1} s_i \cdot P_i$$

**Complexity**: Let $q = \lceil \log_2(\text{field modulus}) \rceil$ be the scalar bit-length, $|A|$ the cost of a group addition, and $|D|$ the cost of a doubling.

- **Pippenger**: $O\left(\frac{q}{c} \cdot \left((n + 2^c) \cdot |A| + c \cdot |D|\right)\right)$
- **Naive**: $O(n \cdot q \cdot |D| + n \cdot q \cdot |A| / 2)$

With $c \approx \frac{1}{2} \log_2 n$, Pippenger achieves roughly $O(n \cdot q / \log n)$ vs $O(n \cdot q)$ for naive scalar multiplication.
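As a sanity check, the two cost formulas above can be evaluated directly. The sketch below is illustrative only: it assumes unit operation costs $|A| = |D| = 1$, and the function names are not from the codebase.

```cpp
#include <cassert>
#include <cmath>

// Evaluates the Pippenger cost formula with unit costs |A| = |D| = 1.
// Illustrative model only, not the library's cost model.
double pippenger_cost(double n, double q, double c)
{
    const double rounds = std::ceil(q / c);
    // Per round: (n + 2^c) additions plus c doublings.
    return rounds * ((n + std::pow(2.0, c)) + c);
}

// Naive per-point double-and-add: q doublings plus ~q/2 additions per scalar.
double naive_cost(double n, double q)
{
    return n * q + n * q / 2.0;
}
```

For large $n$ (say $n = 2^{20}$, $q = 254$, $c = 10$) the Pippenger estimate is tens of millions of operations versus hundreds of millions for the naive sum, while for tiny $n$ the naive method wins, which is why a `PIPPENGER_THRESHOLD` exists.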
## Algorithm

### Step 1: Scalar Decomposition

**Implementation**: `get_scalar_slice(scalar, round_index, bits_per_slice)`

Each scalar $s_i$ is decomposed into $r$ slices of $c$ bits each, processed **MSB-first**:

$$s_i = \sum_{j=0}^{r-1} s_i^{(j)} \cdot 2^{c(r-1-j)}$$

- $c$ = bits per slice (from `get_optimal_log_num_buckets`, which brute-force searches for the minimum-cost slice width)
- $r = \lceil$ `NUM_BITS_IN_FIELD` $/ c \rceil$ = number of rounds
- Round 0 extracts the most significant bits
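A minimal sketch of MSB-first slice extraction, shrunk to a 64-bit scalar for illustration (the real `get_scalar_slice` operates on 254-bit field elements, so this toy version only mirrors the indexing):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Extract the c-bit slice for a given round, MSB-first: round 0 returns the
// most significant slice. Toy version over uint64_t, not the library routine.
uint64_t get_scalar_slice(uint64_t scalar, size_t round_index, size_t bits_per_slice, size_t num_bits)
{
    const size_t num_rounds = (num_bits + bits_per_slice - 1) / bits_per_slice; // r = ceil(q / c)
    const size_t shift = bits_per_slice * (num_rounds - 1 - round_index);       // c * (r - 1 - j)
    const uint64_t mask = (1ULL << bits_per_slice) - 1;
    return (scalar >> shift) & mask;
}
```

For example, with $c = 8$ and a 64-bit scalar, round 0 reads bits 56..63 and round 7 reads bits 0..7, matching the decomposition above.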
### Step 2: Bucket Accumulation

For each round $j$, points are added into **buckets** based on their scalar slice. Bucket $k$ accumulates all points whose slice value equals $k$:

$$B_k^{(j)} = \sum_{\{i : s_i^{(j)} = k\}} P_i$$

**Two implementation paths:**

- **Affine**: Sorts points by bucket and uses batched affine additions
- **Jacobian**: Direct bucket accumulation in Jacobian coordinates
### Step 3: Bucket Reduction

**Implementation**: `accumulate_buckets(bucket_accumulators)`

Computes the weighted sum using a suffix sum (high to low):

$$R^{(j)} = \sum_{k=1}^{2^c - 1} k \cdot B_k^{(j)} = \sum_{k=1}^{2^c - 1} \left( \sum_{m=k}^{2^c - 1} B_m^{(j)} \right)$$

An offset generator is added and subtracted to avoid rare accumulator edge cases, a probabilistic mitigation that simplifies the accumulation logic.
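The suffix-sum identity can be checked with an integer stand-in for the group: each bucket holds an `int64_t` instead of a curve point, and `+` models group addition. This is a model for the identity only, not the library code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes sum_k k * B_k using only additions: a running suffix sum plus an
// accumulator, two group additions per bucket instead of a scalar mul each.
int64_t accumulate_buckets(const std::vector<int64_t>& buckets)
{
    int64_t running_sum = 0; // B_k + B_{k+1} + ... + B_{2^c - 1}
    int64_t result = 0;      // accumulates the running sums
    for (size_t k = buckets.size() - 1; k >= 1; --k) {
        running_sum += buckets[k];
        result += running_sum;
    }
    return result; // bucket 0 is skipped since 0 * B_0 contributes nothing
}
```

For buckets `{B_0, B_1, B_2, B_3} = {0, 3, 5, 2}` the weighted sum is $1 \cdot 3 + 2 \cdot 5 + 3 \cdot 2 = 19$, and the suffix-sum pass produces the same value without any multiplications.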
### Step 4: Round Combination

Combines all rounds using Horner's method (MSB-first):

```cpp
msm_accumulator = point_at_infinity
for j = 0 to r-1:
    repeat c doublings (or fewer for the final round)
    msm_accumulator += bucket_result[j]
```
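Putting the four steps together over the same integer stand-in for the group (points are `int64_t`, addition is `+`, doubling is `*2`); `pippenger_toy` and `naive_msm` are hypothetical helpers for illustration, not the library API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference result: direct sum of s_i * P_i.
int64_t naive_msm(const std::vector<uint64_t>& scalars, const std::vector<int64_t>& points)
{
    int64_t sum = 0;
    for (size_t i = 0; i < scalars.size(); ++i) {
        sum += static_cast<int64_t>(scalars[i]) * points[i];
    }
    return sum;
}

// Toy Pippenger: slice scalars MSB-first, accumulate buckets, reduce with a
// suffix sum, then combine rounds Horner-style with c doublings per round.
int64_t pippenger_toy(const std::vector<uint64_t>& scalars,
                      const std::vector<int64_t>& points,
                      size_t c,
                      size_t num_bits)
{
    const size_t num_rounds = (num_bits + c - 1) / c;
    int64_t accumulator = 0;
    for (size_t j = 0; j < num_rounds; ++j) {
        // Steps 1-2: slice extraction + bucket accumulation for this round.
        std::vector<int64_t> buckets(1ULL << c, 0);
        const size_t shift = c * (num_rounds - 1 - j);
        for (size_t i = 0; i < scalars.size(); ++i) {
            buckets[(scalars[i] >> shift) & ((1ULL << c) - 1)] += points[i];
        }
        // Step 3: suffix-sum bucket reduction, sum_k k * B_k.
        int64_t running_sum = 0;
        int64_t round_result = 0;
        for (size_t k = buckets.size() - 1; k >= 1; --k) {
            running_sum += buckets[k];
            round_result += running_sum;
        }
        // Step 4: Horner combination, c doublings then add this round's result.
        for (size_t d = 0; d < c; ++d) {
            accumulator *= 2;
        }
        accumulator += round_result;
    }
    return accumulator;
}
```

The toy version agrees with the naive sum for any inputs that fit `num_bits`, which is exactly the correctness property the real implementation's tests check against curve points.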
## Algorithm Variants

### Entry Points and Safety

| Entry Point | Default | Safety |
|-------------|---------|--------|
| `msm()` | `handle_edge_cases=false` | ⚠️ **Unsafe** |
| `pippenger()` | `handle_edge_cases=true` | ✓ Safe |
| `pippenger_unsafe()` | `handle_edge_cases=false` | ⚠️ Unsafe |
| `batch_multi_scalar_mul()` | `handle_edge_cases=true` | ✓ Safe |

### Edge Cases

Affine addition fails for **P = Q** (doubling), **P = −Q** (inverse), and **P = O** (identity). Jacobian coordinates handle these correctly at higher cost (~2-3× slower).

⚠️ **Use `msm()` or `pippenger_unsafe()` only when points are guaranteed linearly independent** (e.g., SRS points). For user-controlled or potentially duplicate points, use `pippenger()`.
### Affine Pippenger (`handle_edge_cases=false`)

Uses affine coordinates with Montgomery's batch inversion trick: replaces $m$ inversions with **1 inversion + O(m) multiplications**, yielding a ~2-3× speedup over Jacobian.
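A sketch of Montgomery's batch inversion over a toy prime field (the real code inverts base-field elements for batched affine point addition; `P_TOY`, `pow_mod`, and `batch_invert` are illustrative names, not the library API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t P_TOY = 1000003; // small prime, stand-in for the base field modulus

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
uint64_t pow_mod(uint64_t base, uint64_t exp)
{
    uint64_t result = 1;
    base %= P_TOY;
    while (exp > 0) {
        if (exp & 1) { result = result * base % P_TOY; }
        base = base * base % P_TOY;
        exp >>= 1;
    }
    return result;
}

// Montgomery's trick: m inversions -> 1 inversion + O(m) multiplications.
std::vector<uint64_t> batch_invert(std::vector<uint64_t> values)
{
    const size_t m = values.size();
    std::vector<uint64_t> prefix(m);
    uint64_t running = 1;
    for (size_t i = 0; i < m; ++i) {
        prefix[i] = running; // product of values[0..i-1]
        running = running * values[i] % P_TOY;
    }
    uint64_t inv = pow_mod(running, P_TOY - 2); // one inversion of the full product
    for (size_t i = m; i-- > 0;) {
        const uint64_t tmp = inv * prefix[i] % P_TOY; // = values[i]^{-1}
        inv = inv * values[i] % P_TOY;                // strip values[i] from the inverse
        values[i] = tmp;
    }
    return values;
}
```

The speedup comes from inversion being far more expensive than multiplication in the base field, so amortizing one inversion across a whole batch of affine additions is a large win.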
### Jacobian Pippenger (`handle_edge_cases=true`)

Uses Jacobian coordinates for bucket accumulators. Handles all edge cases correctly.

## Tuning Constants

| Constant | Value | Purpose |
|----------|-------|---------|
| `PIPPENGER_THRESHOLD` | 16 | Below this, use naive scalar multiplication |
| `AFFINE_TRICK_THRESHOLD` | 128 | Below this, batch inversion overhead exceeds its savings |
| `MAX_SLICE_BITS` | 20 | Upper bound on the bucket count exponent |
| `BATCH_SIZE` | 2048 | Points per batch inversion (fits L2 cache) |
| `RADIX_BITS` | 8 | Bits per radix sort pass |
<details>
<summary>Cost model constants and derivations</summary>

| Constant | Value | Derivation |
|----------|-------|------------|
| `BUCKET_ACCUMULATION_COST` | 5 | 2 Jacobian adds/bucket × 2.5× cost ratio |
| `AFFINE_TRICK_SAVINGS_PER_OP` | 5 | ~10 muls saved − ~3 muls for the product tree |
| `JACOBIAN_Z_NOT_ONE_PENALTY` | 5 | Extra field ops when Z ≠ 1 |
| `INVERSION_TABLE_COST` | 14 | 4-bit lookup table for modular exponentiation |

**BATCH_SIZE=2048**: Each `AffineElement` is 64 bytes, so 2048 points = 128 KB, fitting in L2 cache.

**RADIX_BITS=8**: 256 radix buckets × 4 bytes = 1 KB counting array, which fits in L1 cache.

</details>
## Implementation Notes

### Zero Scalar Filtering

`transform_scalar_and_get_nonzero_scalar_indices` filters out zero scalars before processing (since $0 \cdot P_i = \mathcal{O}$). Scalars are converted from Montgomery form in-place to avoid doubling memory usage.
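The filtering step reduces to collecting the indices of nonzero scalars. A simplified sketch (a hypothetical helper with a reduced signature; the real function also converts scalars out of Montgomery form in place):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the indices of scalars that actually contribute to the MSM;
// zero scalars contribute only the group identity and are skipped.
std::vector<uint32_t> get_nonzero_scalar_indices(const std::vector<uint64_t>& scalars)
{
    std::vector<uint32_t> indices;
    for (uint32_t i = 0; i < static_cast<uint32_t>(scalars.size()); ++i) {
        if (scalars[i] != 0) {
            indices.push_back(i);
        }
    }
    return indices;
}
```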
### Bucket Existence Tracking

A `BitVector` bitmap tracks which buckets are populated, avoiding expensive full-array clears between rounds. Clearing the bitmap costs $O(2^c / 64)$ word writes vs $O(2^c)$ for the full bucket array.
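The idea behind the bitmap can be sketched in a few lines (a toy `ToyBitVector`, not the library's `BitVector`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per bucket: clearing between rounds touches 2^c / 64 machine words
// instead of 2^c bucket entries (each a multi-word curve point).
struct ToyBitVector {
    std::vector<uint64_t> words;
    explicit ToyBitVector(size_t num_bits) : words((num_bits + 63) / 64, 0) {}
    void set(size_t i) { words[i / 64] |= (1ULL << (i % 64)); }
    bool get(size_t i) const { return ((words[i / 64] >> (i % 64)) & 1) != 0; }
    void clear() { std::fill(words.begin(), words.end(), 0); }
};
```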
### Point Scheduling (Affine Variant Only)

Entries are packed as `(point_index << 32) | bucket_index` into 64-bit values. Since bucket indices fit in $c$ bits (typically 8-16), they occupy only the lowest bits of the packed entry. An **in-place MSD radix sort** on the low $c$ bits groups points by bucket for efficient batch processing. The sort also detects entries with `bucket_index == 0` during the final radix pass, allowing zero-bucket entries to be skipped without a separate scan.
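The packing and the sort key can be sketched as follows; `std::stable_sort` stands in for the in-place MSD radix sort used by the real code, and the helper names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Pack a schedule entry: point index in the high 32 bits, bucket index low.
uint64_t pack_entry(uint32_t point_index, uint32_t bucket_index)
{
    return (static_cast<uint64_t>(point_index) << 32) | bucket_index;
}

uint32_t bucket_of(uint64_t entry) { return static_cast<uint32_t>(entry & 0xffffffffULL); }
uint32_t point_of(uint64_t entry) { return static_cast<uint32_t>(entry >> 32); }

// Grouping points by bucket only requires ordering on the low c bits of each
// packed entry; the point index rides along in the high bits.
std::vector<uint64_t> group_by_bucket(std::vector<uint64_t> schedule, uint32_t bucket_bits)
{
    const uint64_t mask = (1ULL << bucket_bits) - 1;
    std::stable_sort(schedule.begin(), schedule.end(),
                     [mask](uint64_t a, uint64_t b) { return (a & mask) < (b & mask); });
    return schedule;
}
```

After grouping, all points destined for the same bucket are adjacent, which is what makes the batched affine addition in the next section effective.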
### Batched Affine Addition

`batch_accumulate_points_into_buckets` processes sorted points iteratively:

- Same-bucket pairs → queue for batch addition
- Different buckets → cache in bucket or queue with the existing accumulator
- Uses branchless conditional moves to minimize pipeline stalls
- Prefetches future points to hide memory latency
- Recirculates results to maximize batch efficiency before writing to buckets
<details>
<summary>Batch accumulation case analysis</summary>

| Condition | Action | Iterator Update |
|-----------|--------|-----------------|
| `bucket[i] == bucket[i+1]` | Queue both points for batch add | `point_it += 2` |
| Different buckets, accumulator exists | Queue point + accumulator | `point_it += 1` |
| Different buckets, no accumulator | Cache point into bucket | `point_it += 1` |

After batch addition, results targeting the same bucket are paired again before writing to bucket accumulators, reducing random memory access by ~50%.

</details>
## Parallelization

Uses **per-thread buffers** (bucket accumulators, scratch space) to eliminate contention.

For `batch_multi_scalar_mul()`, work is distributed via `MSMWorkUnit` structures that can split a single MSM across multiple threads. Each thread computes partial results on point subsets, which are combined in a final reduction.
<details>
<summary>Per-call buffer sizes</summary>

| Buffer | Size | Purpose |
|--------|------|---------|
| `BucketAccumulators` (affine) | $2^c \times 64$ bytes | Affine bucket array + bitmap |
| `JacobianBucketAccumulators` | $2^c \times 96$ bytes | Jacobian bucket array + bitmap |
| `AffineAdditionData` | ~400 KB | Scratch for batch inversion |
| `point_schedule` | $n \times 8$ bytes | Per-MSM point schedule |

Buffers are allocated per-call for WASM compatibility. Memory scales with thread count during parallel execution.

</details>
## File Structure

```
scalar_multiplication/
├── scalar_multiplication.hpp   # MSM class, data structures
├── scalar_multiplication.cpp   # Core algorithm
├── process_buckets.hpp/cpp     # Radix sort
├── bitvector.hpp               # Bit vector for bucket tracking
└── README.md                   # This file
```
## References

1. Pippenger, N. (1976). "On the evaluation of powers and related problems"
2. Bernstein, D. J. et al. "Faster batch forgery identification" (batch inversion)

barretenberg/cpp/src/barretenberg/ecc/scalar_multiplication/process_buckets.cpp

Lines changed: 50 additions & 42 deletions
@@ -10,89 +10,97 @@
 
 namespace bb::scalar_multiplication {
 
-// NOLINTNEXTLINE(misc-no-recursion) recursion is fine here, max recursion depth is 8 (64 bit int / 8 bits per call)
+// NOLINTNEXTLINE(misc-no-recursion) recursion is fine here, max depth is 4 (32-bit bucket index / 8 bits per call)
 void radix_sort_count_zero_entries(uint64_t* keys,
                                    const size_t num_entries,
                                    const uint32_t shift,
                                    size_t& num_zero_entries,
-                                   const uint32_t total_bits,
-                                   const uint64_t* start_pointer) noexcept
+                                   const uint32_t bucket_index_bits,
+                                   const uint64_t* top_level_keys) noexcept
 {
-    constexpr size_t num_bits = 8;
-    constexpr size_t num_buckets = 1UL << num_bits;
-    constexpr uint32_t mask = static_cast<uint32_t>(num_buckets) - 1U;
-    std::array<uint32_t, num_buckets> bucket_counts{};
+    constexpr size_t NUM_RADIX_BUCKETS = 1UL << RADIX_BITS;
+    constexpr uint32_t RADIX_MASK = static_cast<uint32_t>(NUM_RADIX_BUCKETS) - 1U;
 
+    // Step 1: Count entries in each radix bucket
+    std::array<uint32_t, NUM_RADIX_BUCKETS> bucket_counts{};
     for (size_t i = 0; i < num_entries; ++i) {
-        bucket_counts[(keys[i] >> shift) & mask]++;
+        bucket_counts[(keys[i] >> shift) & RADIX_MASK]++;
     }
 
-    std::array<uint32_t, num_buckets + 1> offsets;
-    std::array<uint32_t, num_buckets + 1> offsets_copy;
+    // Step 2: Convert counts to cumulative offsets (prefix sum)
+    std::array<uint32_t, NUM_RADIX_BUCKETS + 1> offsets;
+    std::array<uint32_t, NUM_RADIX_BUCKETS + 1> offsets_copy;
     offsets[0] = 0;
-
-    for (size_t i = 0; i < num_buckets - 1; ++i) {
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS - 1; ++i) {
         bucket_counts[i + 1] += bucket_counts[i];
     }
-    if ((shift == 0) && (keys == start_pointer)) {
+
+    // Count zero entries only at the final recursion level (shift == 0) and only for the full array
+    if ((shift == 0) && (keys == top_level_keys)) {
         num_zero_entries = bucket_counts[0];
     }
-    for (size_t i = 1; i < num_buckets + 1; ++i) {
+
+    for (size_t i = 1; i < NUM_RADIX_BUCKETS + 1; ++i) {
         offsets[i] = bucket_counts[i - 1];
     }
-    for (size_t i = 0; i < num_buckets + 1; ++i) {
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS + 1; ++i) {
         offsets_copy[i] = offsets[i];
     }
-    uint64_t* start = &keys[0];
 
-    for (size_t i = 0; i < num_buckets; ++i) {
+    // Step 3: In-place permutation using cycle sort
+    // For each radix bucket, repeatedly swap elements to their correct positions until all elements
+    // in that bucket's range belong there. The offsets array tracks the next write position for each bucket.
+    uint64_t* start = &keys[0];
+    for (size_t i = 0; i < NUM_RADIX_BUCKETS; ++i) {
         uint64_t* bucket_start = &keys[offsets[i]];
         const uint64_t* bucket_end = &keys[offsets_copy[i + 1]];
         while (bucket_start != bucket_end) {
             for (uint64_t* it = bucket_start; it < bucket_end; ++it) {
-                const size_t value = (*it >> shift) & mask;
+                const size_t value = (*it >> shift) & RADIX_MASK;
                 const uint64_t offset = offsets[value]++;
                 std::iter_swap(it, start + offset);
             }
             bucket_start = &keys[offsets[i]];
         }
     }
+
+    // Step 4: Recursively sort each bucket by the next less-significant byte
     if (shift > 0) {
-        for (size_t i = 0; i < num_buckets; ++i) {
-            if (offsets_copy[i + 1] - offsets_copy[i] > 1) {
-                radix_sort_count_zero_entries(&keys[offsets_copy[i]],
-                                              offsets_copy[i + 1] - offsets_copy[i],
-                                              shift - 8,
-                                              num_zero_entries,
-                                              total_bits,
-                                              keys);
+        for (size_t i = 0; i < NUM_RADIX_BUCKETS; ++i) {
+            const size_t bucket_size = offsets_copy[i + 1] - offsets_copy[i];
+            if (bucket_size > 1) {
+                radix_sort_count_zero_entries(
+                    &keys[offsets_copy[i]], bucket_size, shift - RADIX_BITS, num_zero_entries, bucket_index_bits, keys);
             }
         }
     }
 }
 
-size_t process_buckets_count_zero_entries(uint64_t* wnaf_entries,
-                                          const size_t num_entries,
-                                          const uint32_t num_bits) noexcept
+size_t sort_point_schedule_and_count_zero_buckets(uint64_t* point_schedule,
+                                                  const size_t num_entries,
+                                                  const uint32_t bucket_index_bits) noexcept
 {
     if (num_entries == 0) {
         return 0;
     }
-    const uint32_t bits_per_round = 8;
-    const uint32_t base = num_bits & 7;
-    const uint32_t total_bits = (base == 0) ? num_bits : num_bits - base + 8;
-    const uint32_t shift = total_bits - bits_per_round;
+
+    // Round bucket_index_bits up to next multiple of RADIX_BITS for proper MSD radix sort alignment.
+    // E.g., if bucket_index_bits=10, we need to start sorting from bit 16 (2 bytes) not bit 10.
+    const uint32_t remainder = bucket_index_bits % RADIX_BITS;
+    const uint32_t padded_bits = (remainder == 0) ? bucket_index_bits : bucket_index_bits - remainder + RADIX_BITS;
+    const uint32_t initial_shift = padded_bits - RADIX_BITS;
+
     size_t num_zero_entries = 0;
-    radix_sort_count_zero_entries(wnaf_entries, num_entries, shift, num_zero_entries, num_bits, wnaf_entries);
-
-    // inside radix_sort_count_zero_entries, if the least significant *byte* of `wnaf_entries[0] == 0`,
-    // then num_nonzero_entries = number of entries that share the same value as wnaf_entries[0].
-    // If wnaf_entries[0] != 0, we must manually set num_zero_entries = 0
-    if (num_entries > 0) {
-        if ((wnaf_entries[0] & 0xffffffff) != 0) {
-            num_zero_entries = 0;
-        }
+    radix_sort_count_zero_entries(
+        point_schedule, num_entries, initial_shift, num_zero_entries, bucket_index_bits, point_schedule);
+
+    // The radix sort counts entries where the least significant BYTE is zero, but we need entries where
+    // the entire bucket_index (lower 32 bits) is zero. Verify the first entry after sorting.
+    if ((point_schedule[0] & BUCKET_INDEX_MASK) != 0) {
+        num_zero_entries = 0;
     }
+
     return num_zero_entries;
 }
+
 } // namespace bb::scalar_multiplication
