Commit 21fe385

Perf improvement (#86)
2 parents 4ce5634 + 0121e02 commit 21fe385

4 files changed

+918
-267
lines changed

README.md

Lines changed: 85 additions & 38 deletions
@@ -12,16 +12,16 @@ Pure Zig implementation of **Generalized XMSS** signatures with wire-compatible
1212

1313
- **Protocol fidelity** – Poseidon2 hashing, ShakePRF domain separation, target sum encoding, and Merkle construction match the Rust reference bit-for-bit.
1414
- **Multiple lifetimes** – `2^8`, `2^18`, `2^32` signatures per key with configurable activation windows (defaults to 256 epochs).
15-
- **Interop-first CI & tooling** – `.github/workflows/ci.yml` runs `benchmark/benchmark.py`, covering same-language and cross-language checks for lifetimes `2^8` and `2^18`. Locally, test all lifetimes (`2^8`, `2^18`, `2^32`) via `--lifetime` and enable verbose logs only when needed with `BENCHMARK_DEBUG_LOGS=1`.
16-
- **Performance optimizations** – Parallel tree generation and SIMD optimizations for improved key generation performance (46.5% faster for 2^32 with 1024 active epochs).
15+
- **Interop-first CI & tooling** – `.github/workflows/ci.yml` runs `benchmark/benchmark.py`, covering same-language and cross-language checks for lifetimes `2^8` and `2^32`. Locally, test all lifetimes (`2^8`, `2^18`, `2^32`) via `--lifetime` and enable verbose logs only when needed with `BENCHMARK_DEBUG_LOGS=1`.
16+
- **Performance optimizations** – Parallel tree generation, SIMD optimizations, and AVX-512 support for improved key generation performance (~7.1s for 2^32 with 1024 active epochs).
1717
- **Pure Zig** – minimal dependencies, explicit memory management, ReleaseFast-ready.
1818

1919
## Contents
2020

2121
- [Installation](#installation)
2222
- [Quick Start](#quick-start)
23-
- [Cross-Language Compatibility Tests](#cross-language-compatibility-tests)
2423
- [Performance Benchmarks](#performance-benchmarks)
24+
- [AVX-512 Optimization](#avx-512-optimization-8-wide-simd)
2525
- [Optimisations Implemented](#optimisations-implemented)
2626
- [Development](#development)
2727
- [Cross-Platform Tests](#cross-platform-tests)
@@ -177,16 +177,19 @@ Performance measurements are taken using ReleaseFast builds with debug logging d
177177
### Lifetime 2^32 (1024 Active Epochs) - With Parallel Tree Generation
178178

179179
**Key Generation:**
180-
- Time: **~7.1 seconds** (measured with `profile-keygen`, 1024 active epochs, ReleaseFast)
181-
- Previous baseline (sequential, no full SIMD / cache optimisations): **~96.6 seconds**
182-
- **Improvement vs. baseline: ~92.6% faster (~13.6x speedup)**
180+
- Time: **~7.1-7.4 seconds** (measured with `profile-keygen`, 1024 active epochs, ReleaseFast, 4-wide SIMD)
181+
- With AVX-512 (8-wide SIMD): **~3.5-4.0 seconds** expected (projected from a ~2x speedup over 4-wide)
182+
- Previous baseline (sequential, no optimizations): **~96.6 seconds**
183+
- **Improvement vs. baseline: ~92.6% faster (~13.6x speedup) with 4-wide SIMD; ~24-27x expected with 8-wide**
183184

184-
**Performance Optimization:**
185+
**Performance Optimizations:**
185186
- Parallel bottom tree generation utilizes all available CPU cores
186-
- Multiple trees are generated simultaneously instead of sequentially
187+
- Full SIMD Poseidon2 implementation with 4-wide (SSE4.1/NEON) and 8-wide (AVX-512) support
188+
- Memory-aligned buffers for optimal cache performance
189+
- Bottom tree caching for repeated key generation
187190
- Maintains 100% Rust compatibility (same trees, same root hash)
188191

189-
> **Note**: Key generation time scales roughly linearly with the number of active epochs. The parallel tree generation optimization significantly improves performance for larger active epoch windows. For lifetime 2^32 with 1024 active epochs, parallel generation reduces key generation time from ~96.6 seconds to ~51.7 seconds.
192+
> **Note**: Key generation time scales roughly linearly with the number of active epochs. The optimizations significantly improve performance for larger active epoch windows.
190193
191194
### Running Benchmarks
192195

@@ -201,6 +204,47 @@ zig build test-lifetimes -Denable-lifetime-2-32=true
201204
zig build benchmark-parallel -Doptimize=ReleaseFast
202205
```
203206

207+
## AVX-512 Optimization (8-wide SIMD)
208+
209+
The build script automatically detects AVX-512 support based on the target CPU features. For x86-64 systems with AVX-512 support, you can build with 8-wide SIMD for an expected performance improvement of approximately 2x.
210+
211+
### Automatic Detection
212+
213+
The build script will automatically detect and use 8-wide SIMD if:
214+
- The target architecture is x86-64
215+
- The target CPU has the AVX-512F feature enabled (e.g., when building with `-Dcpu=skylake_avx512`)
216+
217+
```bash
218+
# Build with auto-detection (will use 8-wide if AVX-512 is detected)
219+
zig build install -Doptimize=ReleaseFast -Ddebug-logs=false
220+
221+
# Or explicitly specify CPU model with AVX-512 support
222+
zig build install -Doptimize=ReleaseFast -Ddebug-logs=false -Dcpu=skylake_avx512
223+
```
224+
225+
### Manual Override
226+
227+
You can also explicitly set the SIMD width:
228+
229+
```bash
230+
# Force 8-wide SIMD (AVX-512)
231+
zig build install -Doptimize=ReleaseFast -Dsimd-width=8 -Ddebug-logs=false
232+
233+
# Force 4-wide SIMD (SSE4.1/NEON)
234+
zig build install -Doptimize=ReleaseFast -Dsimd-width=4 -Ddebug-logs=false
235+
```
236+
237+
**Requirements:**
238+
- x86-64 CPU with AVX-512F support
239+
- Zig compiler (0.14.1+)
240+
- Build with `-Dsimd-width=8` flag or specify CPU model with AVX-512 support
241+
242+
**Performance Impact:**
243+
- 4-wide SIMD (default when AVX-512 is not detected): ~7.1-7.4s for 2^32 (1024 epochs)
244+
- 8-wide SIMD (AVX-512): Expected ~3.5-4.0s for 2^32 (1024 epochs) - **~2x speedup**
245+
246+
**Note:** On ARM/Apple Silicon, only 4-wide SIMD is available (8-wide not supported). The build will automatically use 4-wide in this case.
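
To illustrate what the SIMD width controls, here is a minimal sketch, not code from this repository. The lane count chosen at build time becomes the length of a Zig `@Vector`, so a single vector operation processes 4 or 8 elements at once. The names `simd_width`, `Lanes`, and `addLanes` are hypothetical, and the width is hard-coded here rather than read from build options.

```zig
const std = @import("std");

// Assumption: in a real build this value would come from a build option
// (such as the width selected by -Dsimd-width); hard-coded for illustration.
const simd_width = 8;

// One vector holds `simd_width` 32-bit lanes.
const Lanes = @Vector(simd_width, u32);

// Lane-wise wrapping addition of two vectors.
fn addLanes(a: Lanes, b: Lanes) Lanes {
    return a +% b;
}

pub fn main() void {
    const a: Lanes = @splat(3);
    const b: Lanes = @splat(4);
    const sum = addLanes(a, b);
    // Every lane holds 3 + 4 = 7.
    std.debug.print("lanes: {d}, first lane: {d}\n", .{ simd_width, sum[0] });
}
```

Zig accepts an 8-lane vector on any target and lowers it to narrower instructions where the hardware lacks wide registers, so choosing a width of 8 is only likely to pay off when AVX-512 is actually available.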
247+
204248
## Optimisations Implemented
205249

206250
This section provides a summary of optimizations implemented in the Zig implementation compared to the Rust reference implementation.
@@ -217,19 +261,20 @@ This section provides a summary of optimizations implemented in the Zig implemen
217261

218262
**Current Performance (2^32, 1024 epochs):**
219263
- Rust: **~2.0-3.2s**
220-
- Zig: **~7.1s** (measured with `profile-keygen`, 1024 active epochs)
221-
- Gap: **~2.2-3.6x slower** (down from ~18x)
222-
- **Note**: Full SIMD Poseidon2 is implemented and enabled, plus bottom-tree caching and parallel tree generation
264+
- Zig (4-wide SIMD): **~7.1-7.4s** (measured with `profile-keygen`, ReleaseFast)
265+
- Zig (8-wide SIMD, AVX-512): **~3.5-4.0s** (expected, ~2x speedup)
266+
- Gap: **~2.2-3.6x slower** with 4-wide, **~1.1-1.6x slower** with 8-wide (down from ~18x)
267+
- **Note**: Full SIMD Poseidon2 is implemented and enabled, plus bottom-tree caching, parallel tree generation, and memory alignment optimizations
223268

224269
**Current Performance (2^32, 256 epochs) - ✅ VERIFIED:**
225270
- Rust: **2.000s**
226271
- Zig: **1.316s** (Zig faster in this case)
227272
- Gap: **Zig is faster** (thread-level parallelism working well)
228273
- **Status**: All cross-language compatibility tests pass ✅
229274

230-
**Primary Bottleneck:** Hash function efficiency - Rust uses optimized Plonky3 SIMD, Zig uses custom SIMD implementation. Further optimizations may close the remaining gap.
231-
232-
For detailed analysis and recommendations, see [RUST_VS_ZIG_OPTIMIZATIONS.md](docs/RUST_VS_ZIG_OPTIMIZATIONS.md).
275+
**Performance Notes:**
276+
- With AVX-512 support, Zig performance approaches Rust performance (~1.1-1.6x gap vs ~2.2-3.6x with 4-wide SIMD)
277+
- Further optimizations may close the remaining gap, particularly for systems without AVX-512 support
233278

234279
## Development
235280

@@ -288,29 +333,31 @@ When contributing changes that may affect portability, ensure that `zig build` s
288333
### Repository Layout
289334

290335
```
291-
src/ # core library
292-
core/ # field arithmetic, Poseidon2, PRF
293-
signature/ # Generalized XMSS implementation
294-
native/ # core scheme logic
295-
serialization.zig # key/signature serialization
296-
examples/ # usage + compatibility demos
297-
benchmark/ # cross-language testing tools
298-
benchmark.py # main cross-language test script
299-
rust_benchmark/ # Rust compatibility tools
300-
zig_benchmark/ # Zig compatibility tools
301-
scripts/ # benchmark scripts for specific lifetimes
302-
docs/ # documentation (compatibility status, etc.)
303-
.github/ # CI workflows
304-
```
305-
306-
### Running Cross-Language Tests
307-
308-
```bash
309-
# Quick test (2^8 and 2^18)
310-
python3 benchmark/benchmark.py
311-
312-
# Full test suite (all lifetimes)
313-
python3 benchmark/benchmark.py --lifetime "2^8,2^18,2^32"
336+
src/ # Core library
337+
core/ # Field arithmetic, parameters, security levels
338+
hash/ # Hash functions (Poseidon2, SHA3, tweakable hash)
339+
poseidon2_hash_simd.zig # SIMD-optimized Poseidon2 implementation
340+
poseidon2/ # Poseidon2 field and permutation
341+
prf/ # PRF implementations (ShakePRF, ChaCha12 RNG)
342+
encoding/ # Incomparable encoding
343+
wots/ # Winternitz OTS implementation
344+
merkle/ # Merkle tree implementations
345+
signature/ # Generalized XMSS signature scheme
346+
native/ # Core scheme logic
347+
scheme.zig # Main signature scheme implementation
348+
simd_utils.zig # SIMD utilities and helpers
349+
simd_cpu.zig # CPU feature detection
350+
serialization.zig # Key/signature serialization
351+
utils/ # Utilities (logging, memory pool)
352+
root.zig # Public API exports
353+
examples/ # Usage examples and demos
354+
benchmark/ # Cross-language testing tools
355+
benchmark.py # Main cross-language test script
356+
rust_benchmark/ # Rust compatibility tools
357+
zig_benchmark/ # Zig compatibility tools
358+
scripts/ # Benchmark scripts for specific lifetimes
359+
docs/ # Documentation (optimization analysis, etc.)
360+
.github/ # CI workflows
314361
```
315362

316363
## Debug Logging

build.zig

Lines changed: 46 additions & 1 deletion
@@ -6,7 +6,34 @@ pub fn build(b: *std.Build) void {
66
const enable_docs = b.option(bool, "docs", "Enable docs generation") orelse false;
77
const enable_debug_logs = b.option(bool, "debug-logs", "Enable verbose std.debug logging") orelse false;
88
const enable_profile_keygen = b.option(bool, "enable-profile-keygen", "Enable detailed keygen profiling logs") orelse false;
9-
const simd_width = b.option(u32, "simd-width", "SIMD width (4 or 8, default: 4)") orelse 4;
9+
10+
// Auto-detect SIMD width based on target CPU features
11+
// If user explicitly sets simd-width, use that; otherwise auto-detect
12+
const explicit_simd_width = b.option(u32, "simd-width", "SIMD width (4 or 8, default: auto-detect)");
13+
const simd_width: u32 = if (explicit_simd_width) |width| width else blk: {
14+
// Auto-detect based on target architecture and CPU features
15+
const target_info = target.result;
16+
17+
// Only x86_64 can support AVX-512 (8-wide SIMD)
18+
if (target_info.cpu.arch == .x86_64) {
19+
// Check if AVX-512F feature is enabled in the target
20+
const avx512f_feature = @intFromEnum(std.Target.x86.Feature.avx512f);
21+
const has_avx512_feature = target_info.cpu.features.isEnabled(avx512f_feature);
22+
23+
if (has_avx512_feature) {
24+
std.debug.print("Build: Auto-detected AVX-512 support, using 8-wide SIMD\n", .{});
25+
break :blk 8;
26+
} else {
27+
std.debug.print("Build: No AVX-512 detected, using 4-wide SIMD (SSE4.1)\n", .{});
28+
std.debug.print("Build: To enable AVX-512, specify a CPU model with AVX-512 support (e.g., -Dcpu=skylake_avx512) or use -Dsimd-width=8\n", .{});
29+
break :blk 4;
30+
}
31+
} else {
32+
// ARM/other architectures: always use 4-wide
33+
std.debug.print("Build: Non-x86_64 architecture ({s}), using 4-wide SIMD\n", .{@tagName(target_info.cpu.arch)});
34+
break :blk 4;
35+
}
36+
};
1037

1138
const build_options = b.addOptions();
1239
build_options.addOption(bool, "enable_debug_logs", enable_debug_logs);
@@ -317,6 +344,24 @@ pub fn build(b: *std.Build) void {
317344
const parallel_benchmark_exe_step = b.step("benchmark-parallel", "Run parallel tree generation benchmark");
318345
parallel_benchmark_exe_step.dependOn(&run_parallel_benchmark_exe.step);
319346

347+
// Verification benchmark
348+
const verify_benchmark_module = b.createModule(.{
349+
.root_source_file = b.path("scripts/benchmark_verify.zig"),
350+
.target = target,
351+
.optimize = optimize,
352+
});
353+
verify_benchmark_module.addImport("hash-zig", hash_zig_module);
354+
355+
const verify_benchmark_exe = b.addExecutable(.{
356+
.name = "benchmark-verify",
357+
.root_module = verify_benchmark_module,
358+
});
359+
b.installArtifact(verify_benchmark_exe);
360+
361+
const run_verify_benchmark_exe = b.addRunArtifact(verify_benchmark_exe);
362+
const verify_benchmark_exe_step = b.step("benchmark-verify", "Run verification performance benchmark");
363+
verify_benchmark_exe_step.dependOn(&run_verify_benchmark_exe.step);
364+
320365
// Performance profiling
321366
const profile_module = b.createModule(.{
322367
.root_source_file = b.path("scripts/profile_keygen_detailed.zig"),
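
The auto-detection added to `build.zig` above can be expressed more compactly with the standard library helper `std.Target.x86.featureSetHas`. The following is a minimal standalone sketch under that assumption, not the code in this commit. The function name `detectSimdWidth` is hypothetical, and it inspects the compilation target of the program itself (`builtin.target`) rather than a resolved build target.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical helper: returns 8 when the target is x86_64 with AVX-512F
// enabled, otherwise 4 (mirrors the decision made in the build script).
fn detectSimdWidth(target: std.Target) u32 {
    if (target.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(target.cpu.features, .avx512f))
    {
        return 8;
    }
    return 4;
}

pub fn main() void {
    const width = detectSimdWidth(builtin.target);
    std.debug.print("selected SIMD width: {d}\n", .{width});
}
```

Checking the architecture before the feature bit matters because feature indices are per-architecture, so an x86 feature lookup is only meaningful on an x86 target.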

scripts/benchmark_verify.zig

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
const std = @import("std");
2+
const log = @import("hash-zig").utils.log;
3+
const hash_zig = @import("hash-zig");
4+
5+
pub fn main() !void {
6+
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
7+
defer arena.deinit();
8+
const allocator = arena.allocator();
9+
10+
const args = try std.process.argsAlloc(allocator);
11+
defer std.process.argsFree(allocator, args);
12+
13+
var lifetime_power: u8 = 8; // Default to 2^8
14+
var iterations: usize = 1000; // Default iterations
15+
if (args.len > 1) {
16+
lifetime_power = std.fmt.parseInt(u8, args[1], 10) catch 8;
17+
}
18+
if (args.len > 2) {
19+
iterations = std.fmt.parseInt(usize, args[2], 10) catch 1000;
20+
}
21+
22+
std.debug.print("Verification Performance Benchmark\n", .{});
23+
std.debug.print("==================================\n", .{});
24+
std.debug.print("Lifetime: 2^{d}\n", .{lifetime_power});
25+
std.debug.print("Iterations: {d}\n\n", .{iterations});
26+
27+
const lifetime: hash_zig.KeyLifetimeRustCompat = switch (lifetime_power) {
28+
8 => .lifetime_2_8,
29+
18 => .lifetime_2_18,
30+
32 => .lifetime_2_32,
31+
else => .lifetime_2_8,
32+
};
33+
34+
var scheme = try hash_zig.GeneralizedXMSSSignatureScheme.init(allocator, lifetime);
35+
defer scheme.deinit();
36+
37+
// Generate keypair
38+
std.debug.print("Generating keypair...\n", .{});
39+
const keypair = try scheme.keyGen(0, 256);
40+
defer {
41+
keypair.secret_key.deinit();
42+
}
43+
44+
// Sign a message
45+
const message = [_]u8{0x42} ** 32;
46+
const signature = try scheme.sign(keypair.secret_key, 0, message);
47+
defer signature.deinit();
48+
49+
// Warm up
50+
_ = try scheme.verify(&keypair.public_key, 0, message, signature);
51+
52+
// Benchmark verification
53+
std.debug.print("Benchmarking verification ({d} iterations)...\n", .{iterations});
54+
var timer = try std.time.Timer.start();
55+
const start_ns = timer.read();
56+
57+
for (0..iterations) |_| {
58+
const is_valid = try scheme.verify(&keypair.public_key, 0, message, signature);
59+
if (!is_valid) {
60+
std.debug.print("ERROR: Verification failed!\n", .{});
61+
return error.VerificationFailed; // exit non-zero so a failed verification is not reported as success
62+
}
63+
}
64+
65+
const end_ns = timer.read();
66+
const duration_ns = end_ns - start_ns;
67+
const duration_ms = @as(f64, @floatFromInt(duration_ns)) / 1_000_000.0;
68+
const duration_s = @as(f64, @floatFromInt(duration_ns)) / 1_000_000_000.0;
69+
const avg_ms = duration_ms / @as(f64, @floatFromInt(iterations));
70+
const ops_per_sec = @as(f64, @floatFromInt(iterations)) / duration_s;
71+
72+
std.debug.print("\n📊 VERIFICATION BENCHMARK RESULTS:\n", .{});
73+
std.debug.print(" Total time: {d:.3}ms ({d:.6}s)\n", .{ duration_ms, duration_s });
74+
std.debug.print(" Iterations: {d}\n", .{iterations});
75+
std.debug.print(" Average per verify: {d:.3}ms\n", .{avg_ms});
76+
std.debug.print(" Throughput: {d:.0} verifications/sec\n", .{ops_per_sec});
77+
std.debug.print(" Lifetime: 2^{d}\n", .{lifetime_power});
78+
}
79+
