# Pippenger Multi-Scalar Multiplication (MSM)

## Overview

The Pippenger algorithm computes multi-scalar multiplications:

$$\text{MSM}(\vec{s}, \vec{P}) = \sum_{i=0}^{n-1} s_i \cdot P_i$$

**Complexity**: Let $q = \lceil \log_2(\text{field modulus}) \rceil$ be the scalar bit-length, $|A|$ the cost of a group addition, and $|D|$ the cost of a doubling.

- **Pippenger**: $O\left(\frac{q}{c} \cdot \left((n + 2^c) \cdot |A| + c \cdot |D|\right)\right)$
- **Naive**: $O(n \cdot q \cdot |D| + n \cdot q \cdot |A| / 2)$

With $c \approx \frac{1}{2} \log_2 n$, Pippenger achieves roughly $O(n \cdot q / \log n)$ vs $O(n \cdot q)$ for naive scalar multiplication.

## Algorithm

### Step 1: Scalar Decomposition

**Implementation**: `get_scalar_slice(scalar, round_index, bits_per_slice)`

Each scalar $s_i$ is decomposed into $r$ slices of $c$ bits each, processed **MSB-first**:

$$s_i = \sum_{j=0}^{r-1} s_i^{(j)} \cdot 2^{c(r-1-j)}$$

- $c$ = bits per slice (from `get_optimal_log_num_buckets`, which brute-force searches for the minimum-cost slice width)
- $r = \lceil q / c \rceil$ = number of rounds, where $q$ = `NUM_BITS_IN_FIELD`
- Round 0 extracts the most significant bits

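The MSB-first extraction can be sketched as follows. This is a hypothetical illustration, not the library's code: it assumes a 256-bit scalar stored as four little-endian 64-bit limbs, with the final round possibly returning fewer than $c$ bits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative only: 256-bit scalars as four little-endian 64-bit limbs.
constexpr size_t NUM_BITS_IN_FIELD = 256;

// Returns the c-bit slice for the given round; round 0 is the most significant.
uint64_t get_scalar_slice(const std::array<uint64_t, 4>& scalar, size_t round_index, size_t bits_per_slice)
{
    const int hi = static_cast<int>(NUM_BITS_IN_FIELD - round_index * bits_per_slice);
    int lo = hi - static_cast<int>(bits_per_slice);
    if (lo < 0) { lo = 0; } // final round may cover fewer than c bits
    uint64_t slice = 0;
    for (int b = hi - 1; b >= lo; --b) {
        const uint64_t bit = (scalar[static_cast<size_t>(b) / 64] >> (static_cast<size_t>(b) % 64)) & 1;
        slice = (slice << 1) | bit; // assemble the slice MSB-first
    }
    return slice;
}
```

A production implementation would read whole limbs instead of looping over bits; the bit loop just keeps the sketch obviously correct.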
### Step 2: Bucket Accumulation

For each round $j$, points are added into **buckets** based on their scalar slice. Bucket $k$ accumulates all points whose slice value equals $k$:

$$B_k^{(j)} = \sum_{\{i : s_i^{(j)} = k\}} P_i$$

**Two implementation paths:**

- **Affine**: Sorts points by bucket and uses batched affine additions
- **Jacobian**: Direct bucket accumulation in Jacobian coordinates

### Step 3: Bucket Reduction

**Implementation**: `accumulate_buckets(bucket_accumulators)`

Computes the weighted sum using a suffix sum (high to low):

$$R^{(j)} = \sum_{k=1}^{2^c - 1} k \cdot B_k^{(j)} = \sum_{k=1}^{2^c - 1} \left( \sum_{m=k}^{2^c - 1} B_m^{(j)} \right)$$

An offset generator is added and subtracted to avoid rare accumulator edge cases—a probabilistic mitigation that simplifies the accumulation logic.

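The suffix-sum identity above can be sketched over a toy group, where `int64_t` stands in for a curve point and integer `+` for point addition (an assumption purely for illustration). Two group additions per bucket replace the $k$ additions a naive weighted sum would need:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy-group sketch of the suffix-sum bucket reduction (illustrative types).
int64_t accumulate_buckets(const std::vector<int64_t>& buckets) // buckets[k] = B_k
{
    int64_t running_sum = 0; // holds sum_{m >= k} B_m while scanning down
    int64_t result = 0;      // accumulates one copy of each suffix sum
    for (size_t k = buckets.size() - 1; k >= 1; --k) {
        running_sum += buckets[k];
        result += running_sum; // after the loop: result = sum_k k * B_k
    }
    return result;
}
```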
### Step 4: Round Combination

Combines all rounds using Horner's method (MSB-first):

```cpp
msm_accumulator = point_at_infinity
for j = 0 to r-1:
    repeat c doublings (or fewer for the final round)
    msm_accumulator += bucket_result[j]
```

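Steps 1-4 compose into the sketch below, again over a toy group (`int64_t` models a point, integer addition models point addition, 64-bit scalars keep the arithmetic small). Everything here is illustrative; none of the names or types are the library's:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// End-to-end toy model of Pippenger MSM: returns sum_i scalars[i] * points[i].
int64_t pippenger_model(const std::vector<uint64_t>& scalars,
                        const std::vector<int64_t>& points,
                        size_t c) // bits per slice, 1 <= c < 64
{
    const size_t num_rounds = (64 + c - 1) / c;
    int64_t acc = 0; // "point at infinity"
    for (size_t round = 0; round < num_rounds; ++round) {
        const int hi = 64 - static_cast<int>(round * c);
        const int lo = hi > static_cast<int>(c) ? hi - static_cast<int>(c) : 0;
        // Step 4a: Horner shift — one doubling per bit in this round's slice.
        for (int d = 0; d < hi - lo; ++d) { acc += acc; }
        // Steps 1 + 2: slice each scalar (MSB-first) and accumulate buckets.
        std::vector<int64_t> buckets(size_t(1) << c, 0);
        for (size_t i = 0; i < scalars.size(); ++i) {
            const uint64_t slice = (scalars[i] >> lo) & ((uint64_t(1) << (hi - lo)) - 1);
            buckets[slice] += points[i]; // bucket 0 never contributes below
        }
        // Step 3: suffix-sum reduction, round_result = sum_k k * B_k.
        int64_t running = 0, round_result = 0;
        for (size_t k = buckets.size() - 1; k >= 1; --k) {
            running += buckets[k];
            round_result += running;
        }
        acc += round_result; // Step 4b
    }
    return acc;
}
```

Checking the result against a naive dot product for a few scalars is a useful sanity test when experimenting with slice widths.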
## Algorithm Variants

### Entry Points and Safety

| Entry Point | Default | Safety |
|-------------|---------|--------|
| `msm()` | `handle_edge_cases=false` | ⚠️ **Unsafe** |
| `pippenger()` | `handle_edge_cases=true` | ✓ Safe |
| `pippenger_unsafe()` | `handle_edge_cases=false` | ⚠️ Unsafe |
| `batch_multi_scalar_mul()` | `handle_edge_cases=true` | ✓ Safe |

### Edge Cases

Affine addition fails for **P = Q** (doubling), **P = −Q** (inverse), and **P = O** (identity). Jacobian coordinates handle these correctly at higher cost (~2-3× slower).

⚠️ **Use `msm()` or `pippenger_unsafe()` only when points are guaranteed linearly independent** (e.g., SRS points). For user-controlled or potentially duplicate points, use `pippenger()`.

### Affine Pippenger (`handle_edge_cases=false`)

Uses affine coordinates with Montgomery's batch inversion trick: replaces $m$ inversions with **1 inversion + O(m) multiplications**, yielding ~2-3× speedup over Jacobian.

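The batch inversion trick can be sketched in a small prime field. The modulus `P = 65537` and the Fermat-exponentiation inverse are assumptions chosen purely to keep the example self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t P = 65537; // toy prime; products of two residues fit in 64 bits

uint64_t mulmod(uint64_t a, uint64_t b) { return (a * b) % P; }

// Single inversion via a^(P-2) mod P — the one expensive operation.
uint64_t invmod(uint64_t a)
{
    uint64_t result = 1, base = a % P, e = P - 2;
    while (e) {
        if (e & 1) { result = mulmod(result, base); }
        base = mulmod(base, base);
        e >>= 1;
    }
    return result;
}

// Montgomery's trick: invert every element with 1 inversion + ~3m multiplications.
void batch_invert(std::vector<uint64_t>& vals)
{
    std::vector<uint64_t> prefix(vals.size());
    uint64_t acc = 1;
    for (size_t i = 0; i < vals.size(); ++i) {
        prefix[i] = acc;            // product of vals[0..i-1]
        acc = mulmod(acc, vals[i]); // running product of all elements
    }
    uint64_t inv = invmod(acc);     // inverse of the full product
    for (size_t i = vals.size(); i-- > 0;) {
        const uint64_t tmp = vals[i];
        vals[i] = mulmod(inv, prefix[i]); // = vals[i]^-1
        inv = mulmod(inv, tmp);           // peel vals[i] off the running inverse
    }
}
```

In the MSM, the inverted values are the $x_2 - x_1$ denominators of a batch of pending affine additions.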
### Jacobian Pippenger (`handle_edge_cases=true`)

Uses Jacobian coordinates for the bucket accumulators. Handles all edge cases correctly.

## Tuning Constants

| Constant | Value | Purpose |
|----------|-------|---------|
| `PIPPENGER_THRESHOLD` | 16 | Below this, use naive scalar multiplication |
| `AFFINE_TRICK_THRESHOLD` | 128 | Below this, batch inversion overhead exceeds savings |
| `MAX_SLICE_BITS` | 20 | Upper bound on bucket count exponent |
| `BATCH_SIZE` | 2048 | Points per batch inversion (fits L2 cache) |
| `RADIX_BITS` | 8 | Bits per radix sort pass |

<details>
<summary>Cost model constants and derivations</summary>

| Constant | Value | Derivation |
|----------|-------|------------|
| `BUCKET_ACCUMULATION_COST` | 5 | 2 Jacobian adds/bucket × 2.5× cost ratio |
| `AFFINE_TRICK_SAVINGS_PER_OP` | 5 | ~10 muls saved − ~3 muls for product tree |
| `JACOBIAN_Z_NOT_ONE_PENALTY` | 5 | Extra field ops when Z ≠ 1 |
| `INVERSION_TABLE_COST` | 14 | 4-bit lookup table for modular exponentiation |

**BATCH_SIZE=2048**: Each `AffineElement` is 64 bytes, so 2048 points = 128 KB, fitting in L2 cache.

**RADIX_BITS=8**: 256 radix buckets × 4 bytes = 1 KB counting array, which fits in L1 cache.

</details>

## Implementation Notes

### Zero Scalar Filtering

`transform_scalar_and_get_nonzero_scalar_indices` filters out zero scalars before processing (since $0 \cdot P_i = \mathcal{O}$). Scalars are converted from Montgomery form in-place to avoid doubling memory usage.

### Bucket Existence Tracking

A `BitVector` bitmap tracks which buckets are populated, avoiding expensive full-array clears between rounds. Clearing the bitmap costs $O(2^c / 64)$ words vs $O(2^c)$ for the full bucket array.

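A minimal sketch of such a bitmap (illustrative, not the contents of `bitvector.hpp`) makes the cost asymmetry concrete: one bit per bucket, so the inter-round clear touches $2^c / 64$ machine words instead of $2^c$ bucket entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative bucket-occupancy bitmap: one bit per bucket.
struct BitVector {
    std::vector<uint64_t> words;
    explicit BitVector(size_t num_buckets) : words((num_buckets + 63) / 64, 0) {}
    void set(size_t i) { words[i / 64] |= uint64_t(1) << (i % 64); }
    bool get(size_t i) const { return (words[i / 64] >> (i % 64)) & 1; }
    void clear() { std::fill(words.begin(), words.end(), 0); } // O(2^c / 64) words
};
```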
### Point Scheduling (Affine Variant Only)

Entries are packed as `(point_index << 32) | bucket_index` into 64-bit values. Since bucket indices fit in $c$ bits (typically 8-16), they occupy only the lowest bits of the packed entry. An **in-place MSD radix sort** on the low $c$ bits groups points by bucket for efficient batch processing. The sort also detects entries with `bucket_index == 0` during the final radix pass, allowing zero-bucket entries to be skipped without a separate scan.

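The packing scheme can be sketched as below. `std::stable_sort` on the low $c$ bits stands in for the in-place MSD radix sort (same grouping, different mechanics), and the function names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a schedule entry: point index in the high 32 bits, bucket in the low bits.
uint64_t pack_schedule_entry(uint32_t point_index, uint32_t bucket_index)
{
    return (uint64_t(point_index) << 32) | bucket_index;
}

// Group entries by bucket via the low c bits (stand-in for the radix sort).
void sort_by_bucket(std::vector<uint64_t>& schedule, size_t c)
{
    const uint64_t mask = (uint64_t(1) << c) - 1;
    std::stable_sort(schedule.begin(), schedule.end(),
                     [mask](uint64_t a, uint64_t b) { return (a & mask) < (b & mask); });
}
```

After sorting, entries with bucket index 0 sit at the front of the schedule, which is what makes skipping them cheap.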
### Batched Affine Addition

`batch_accumulate_points_into_buckets` processes sorted points iteratively:

- Same-bucket pairs → queued for batch addition
- Different buckets → cached in the bucket or queued with the existing accumulator
- Uses branchless conditional moves to minimize pipeline stalls
- Prefetches future points to hide memory latency
- Recirculates results to maximize batch efficiency before writing to buckets

<details>
<summary>Batch accumulation case analysis</summary>

| Condition | Action | Iterator Update |
|-----------|--------|-----------------|
| `bucket[i] == bucket[i+1]` | Queue both points for batch add | `point_it += 2` |
| Different buckets, accumulator exists | Queue point + accumulator | `point_it += 1` |
| Different buckets, no accumulator | Cache point into bucket | `point_it += 1` |

After batch addition, results targeting the same bucket are paired again before being written to bucket accumulators, reducing random memory access by ~50%.

</details>

## Parallelization

Uses **per-thread buffers** (bucket accumulators, scratch space) to eliminate contention.

For `batch_multi_scalar_mul()`, work is distributed via `MSMWorkUnit` structures that can split a single MSM across multiple threads. Each thread computes partial results on point subsets, which are combined in a final reduction.

<details>
<summary>Per-call buffer sizes</summary>

| Buffer | Size | Purpose |
|--------|------|---------|
| `BucketAccumulators` (affine) | $2^c \times 64$ bytes | Affine bucket array + bitmap |
| `JacobianBucketAccumulators` | $2^c \times 96$ bytes | Jacobian bucket array + bitmap |
| `AffineAdditionData` | ~400 KB | Scratch for batch inversion |
| `point_schedule` | $n \times 8$ bytes | Per-MSM point schedule |

Buffers are allocated per-call for WASM compatibility. Memory scales with thread count during parallel execution.

</details>

## File Structure

```
scalar_multiplication/
├── scalar_multiplication.hpp    # MSM class, data structures
├── scalar_multiplication.cpp    # Core algorithm
├── process_buckets.hpp/cpp      # Radix sort
├── bitvector.hpp                # Bit vector for bucket tracking
└── README.md                    # This file
```

## References

1. Pippenger, N. (1976). "On the evaluation of powers and related problems"
2. Bernstein, D.J. et al. "Faster batch forgery identification" (batch inversion)