README.md: 85 additions & 38 deletions
@@ -12,16 +12,16 @@ Pure Zig implementation of **Generalized XMSS** signatures with wire-compatible
12
12
13
13
-**Protocol fidelity** – Poseidon2 hashing, ShakePRF domain separation, target sum encoding, and Merkle construction match the Rust reference bit-for-bit.
14
14
-**Multiple lifetimes** – `2^8`, `2^18`, `2^32` signatures per key with configurable activation windows (defaults to 256 epochs).
15
-
-**Interop-first CI & tooling** – `github/workflows/ci.yml` runs `benchmark/benchmark.py`, covering same-language and cross-language checks for lifetimes `2^8` and `2^18`. Locally, test all lifetimes (`2^8`, `2^18`, `2^32`) via `--lifetime` and enable verbose logs only when needed with `BENCHMARK_DEBUG_LOGS=1`.
16
-
-**Performance optimizations** – Parallel tree generation and SIMD optimizations for improved key generation performance (46.5% faster for 2^32 with 1024 active epochs).
15
+
-**Interop-first CI & tooling** – `github/workflows/ci.yml` runs `benchmark/benchmark.py`, covering same-language and cross-language checks for lifetimes `2^8` and `2^32`. Locally, test all lifetimes (`2^8`, `2^18`, `2^32`) via `--lifetime` and enable verbose logs only when needed with `BENCHMARK_DEBUG_LOGS=1`.
16
+
-**Performance optimizations** – Parallel tree generation, SIMD optimizations, and AVX-512 support for improved key generation performance (~7.1s for 2^32 with 1024 active epochs).
@@ -177,16 +177,19 @@ Performance measurements are taken using ReleaseFast builds with debug logging d
177
177
### Lifetime 2^32 (1024 Active Epochs) - With Parallel Tree Generation
178
178
179
179
**Key Generation:**
180
-
- Time: **~7.1 seconds** (measured with `profile-keygen`, 1024 active epochs, ReleaseFast)
181
-
- Previous baseline (sequential, no full SIMD / cache optimisations): **~96.6 seconds**
182
-
-**Improvement vs. baseline: ~92.6% faster (~13.6x speedup)**
180
+
- Time: **~7.1-7.4 seconds** (measured with `profile-keygen`, 1024 active epochs, ReleaseFast, 4-wide SIMD)
181
+
- With AVX-512 (8-wide SIMD): **~3.5-4.0 seconds** (expected ~2x speedup)
182
+
- Previous baseline (sequential, no optimizations): **~96.6 seconds**
183
+
-**Improvement vs. baseline: ~92.6% faster (~13.6x speedup) with 4-wide SIMD; ~27x speedup expected with 8-wide**
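
The improvement figures follow from the quoted timings by simple arithmetic. The sketch below is illustrative only and not part of the repository; the constant and function names are assumptions, and it uses the lower end of each quoted range (7.1 s measured for 4-wide, 3.5 s expected for 8-wide).

```python
# Timings quoted above for lifetime 2^32 with 1024 active epochs.
BASELINE_S = 96.6  # sequential baseline
SIMD4_S = 7.1      # measured with 4-wide SIMD (lower end of ~7.1-7.4 s)
SIMD8_S = 3.5      # expected with 8-wide AVX-512 (lower end of ~3.5-4.0 s)


def speedup(baseline_s: float, optimized_s: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def time_reduction_pct(baseline_s: float, optimized_s: float) -> float:
    """Percentage of the baseline wall-clock time that was removed."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


if __name__ == "__main__":
    for label, seconds in (("4-wide", SIMD4_S), ("8-wide", SIMD8_S)):
        print(
            f"{label}: {speedup(BASELINE_S, seconds):.1f}x speedup, "
            f"{time_reduction_pct(BASELINE_S, seconds):.1f}% less time"
        )
```

Using the upper end of each range (7.4 s and 4.0 s) gives slightly lower figures, which is why the numbers above are stated as approximate.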
183
184
184
-
**Performance Optimization:**
185
+
**Performance Optimizations:**
185
186
- Parallel bottom tree generation utilizes all available CPU cores
186
-
- Multiple trees are generated simultaneously instead of sequentially
187
+
- Full SIMD Poseidon2 implementation with 4-wide (SSE4.1/NEON) and 8-wide (AVX-512) support
188
+
- Memory-aligned buffers for optimal cache performance
189
+
- Bottom tree caching for repeated key generation
187
190
- Maintains 100% Rust compatibility (same trees, same root hash)
188
191
189
-
> **Note**: Key generation time scales roughly linearly with the number of active epochs. The parallel tree generation optimization significantly improves performance for larger active epoch windows. For lifetime 2^32 with 1024 active epochs, parallel generation reduces key generation time from ~96.6 seconds to ~51.7 seconds.
192
+
> **Note**: Key generation time scales roughly linearly with the number of active epochs. The optimizations significantly improve performance for larger active epoch windows.
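
As a rough illustration of the linear-scaling note, the sketch below extrapolates from the quoted 4-wide measurement (~7.1 s at 1024 active epochs). The function name and the strictly linear model are assumptions made for illustration; real timings depend on core count, caching, and SIMD width, so treat the result as an order-of-magnitude estimate rather than a benchmark.

```python
def estimate_keygen_seconds(
    active_epochs: int,
    ref_epochs: int = 1024,      # reference point quoted in the README
    ref_seconds: float = 7.1,    # ~7.1 s measured with 4-wide SIMD
) -> float:
    """Estimate key generation time, assuming it scales linearly with active epochs."""
    if active_epochs <= 0:
        raise ValueError("active_epochs must be positive")
    return ref_seconds * active_epochs / ref_epochs


if __name__ == "__main__":
    for epochs in (256, 1024, 4096):
        print(f"{epochs} active epochs: ~{estimate_keygen_seconds(epochs):.1f} s (estimate)")
```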
The build script automatically detects AVX-512 support based on the target CPU features. For x86-64 systems with AVX-512 support, you can build with 8-wide SIMD for approximately 2x performance improvement.
210
+
211
+
### Automatic Detection
212
+
213
+
The build script will automatically detect and use 8-wide SIMD if:
214
+
- The target architecture is x86-64
215
+
- The target CPU has the AVX-512F feature enabled (e.g., when using `-mcpu=skylake-avx512`)
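
The two conditions above amount to a simple decision rule. The following is a hypothetical model of that rule, not the project's build script (which is written in Zig); the function name and the string spellings of the architecture and feature names are assumptions chosen for illustration.

```python
from __future__ import annotations


def select_simd_width(arch: str, cpu_features: set[str]) -> int:
    """Return the SIMD lane count: 8 only for x86-64 targets with AVX-512F, else 4."""
    if arch == "x86_64" and "avx512f" in cpu_features:
        return 8  # 8-wide path (AVX-512)
    return 4      # 4-wide path (SSE4.1/NEON)


if __name__ == "__main__":
    print(select_simd_width("x86_64", {"sse4_1", "avx2", "avx512f"}))
    print(select_simd_width("x86_64", {"sse4_1", "avx2"}))
    print(select_simd_width("aarch64", {"neon"}))
```

Both conditions must hold: an x86-64 target without AVX-512F, or a non-x86 target, falls back to the 4-wide path.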
216
+
217
+
```bash
218
+
# Build with auto-detection (will use 8-wide if AVX-512 is detected)
- Gap: **Zig is faster** (thread-level parallelism working well)
228
273
-**Status**: All cross-language compatibility tests pass ✅
229
274
230
-
**Primary Bottleneck:** Hash function efficiency - Rust uses optimized Plonky3 SIMD, Zig uses custom SIMD implementation. Further optimizations may close the remaining gap.
231
-
232
-
For detailed analysis and recommendations, see [RUST_VS_ZIG_OPTIMIZATIONS.md](docs/RUST_VS_ZIG_OPTIMIZATIONS.md).
275
+
**Performance Notes:**
276
+
- With AVX-512 support, Zig performance approaches Rust performance (~1.1-1.6x gap vs ~2.2-3.6x with 4-wide SIMD)
277
+
- Further optimizations may close the remaining gap, particularly for systems without AVX-512 support
233
278
234
279
## Development
235
280
@@ -288,29 +333,31 @@ When contributing changes that may affect portability, ensure that `zig build` s
288
333
### Repository Layout
289
334
290
335
```
291
-
src/ # core library
292
-
core/ # field arithmetic, Poseidon2, PRF
293
-
signature/ # Generalized XMSS implementation
294
-
native/ # core scheme logic
295
-
serialization.zig # key/signature serialization
296
-
examples/ # usage + compatibility demos
297
-
benchmark/ # cross-language testing tools
298
-
benchmark.py # main cross-language test script
299
-
rust_benchmark/ # Rust compatibility tools
300
-
zig_benchmark/ # Zig compatibility tools
301
-
scripts/ # benchmark scripts for specific lifetimes