-
Notifications
You must be signed in to change notification settings - Fork 57
Description
After the recent NTT speedups introduced upstream in twenty-first, I had expected to see an at least 10 % drop in proving time with triton-vm v0.49.0. Instead I observe a 7 % drop in proving times.
To get to the bottom of this, I wanted to compare a representative NTT benchmark with the "evaluation" steps in the triton-vm prover. I found that there is a 58 % overhead for BFieldElement NTT and a 30 % overhead for XFieldElement NTT. To be fair, there is more going on in the "evaluate" step than just NTT, as the function scales a polynomial (to be able to do co-set evaluation). But I think those 58/30 % overheads are worth looking at.
Benchmarks & proving were done on Mjolnir, a Threadripper 5995wx with 64 cores and 512GB RAM, on twenty-first v0.49.0 and triton-vm v0.49.0.
Proving, where evaluation takes 32.64 seconds for b-field, and 16.84 seconds for x-field.
$ triton-cli --profile prove --program fib.tasm --input 204800
### Triton VM – Prove 134.42s #Reps Share Category 105.2 GiB
├─trace execution 721.29ms 1 0.54% (gen – 29.36%) +648.0 MiB
├─Fiat-Shamir: claim 6.81µs 1 0.00% (hash – 0.00%) ±0 B
├─derive additional parameters 20.17µs 1 0.00% ±0 B
├─main tables 53.63s 1 39.90% +64.9 GiB
│ ├─create 659.47ms 1 0.49% (gen – 26.84%) +5.9 GiB
│ ├─pad 519.88ms 1 0.39% (gen – 21.16%) +25.9 MiB
│ │ ├─pad original tables 308.33ms 1 0.23% +17.9 MiB
│ │ └─fill degree-lowering table 211.40ms 1 0.16% +8.0 MiB
│ ├─LDE 37.89s 1 28.19% (LDE – 57.63%) +59.7 GiB
│ │ ├─polynomial zero-initialization 1.75µs 1 0.00% ±0 B
│ │ ├─interpolation 3.32s 1 2.47% +12.1 GiB
│ │ ├─resize 1.94s 1 1.44% +47.3 GiB
│ │ ├─evaluation 32.64s 1 24.28% +275.2 MiB
│ │ └─memoize 1.37µs 1 0.00% ±0 B
│ ├─Merkle tree 12.57s 1 9.35% +1.2 GiB
│ │ ├─leafs 11.54s 1 8.59% +628.0 MiB
│ │ │ └─hash rows 11.54s 1 8.59% (hash – 50.58%) +628.0 MiB
│ │ └─Merkle tree 930.81ms 1 0.69% (hash – 4.08%) +1.2 GiB
│ ├─Fiat-Shamir 26.20µs 1 0.00% (hash – 0.00%) ±0 B
│ └─extend 556.30ms 1 0.41% (gen – 22.64%) +4.1 GiB
│ ├─initialize master table 221.29ms 1 0.16% +4.1 GiB
│ ├─slice master table 4.60µs 1 0.00% ±0 B
│ ├─all tables 217.36ms 1 0.16% +10.5 MiB
│ └─fill degree lowering table 117.44ms 1 0.09% ±0 B
├─aux tables 33.90s 1 25.22% +34.2 GiB
│ ├─LDE 21.91s 1 16.30% (LDE – 33.32%) +37.2 GiB
│ │ ├─polynomial zero-initialization 1.38µs 1 0.00% ±0 B
│ │ ├─interpolation 3.66s 1 2.72% +4.2 GiB
│ │ ├─resize 1.39s 1 1.04% +32.1 GiB
│ │ ├─evaluation 16.86s 1 12.54% +40.7 MiB
│ │ └─memoize 871.00ns 1 0.00% ±0 B
│ ├─Merkle tree 9.54s 1 7.10% +1.2 GiB
│ │ ├─leafs 9.18s 1 6.83% +628.0 MiB
│ │ │ └─hash rows 9.18s 1 6.83% (hash – 40.24%) +628.0 MiB
│ │ └─Merkle tree 319.95ms 1 0.24% (hash – 1.40%) +1.2 GiB
│ └─Fiat-Shamir 114.48µs 1 0.00% (hash – 0.00%) ±0 B
├─quotient calculation (cached) 13.36s 1 9.94% (CC – 72.74%) +381.1 MiB
│ ├─zerofier inverse 3.88s 1 2.89% +510.7 MiB
│ └─evaluate AIR, compute quotient codeword 9.44s 1 7.02% +383.0 MiB
├─quotient LDE 5.95s 1 4.43% (LDE – 9.05%) +1.4 GiB
├─hash rows of quotient segments 524.60ms 1 0.39% (hash – 2.30%) +642.0 MiB
├─Merkle tree 319.13ms 1 0.24% (hash – 1.40%) +1.3 GiB
├─out-of-domain rows 8.21s 1 6.11% +92.5 MiB
├─Fiat-Shamir 81.54µs 1 0.00% (hash – 0.00%) ±0 B
├─linear combination 6.02s 1 4.48% +624.0 MiB
│ ├─main 483.22ms 1 0.36% (CC – 2.63%) +48.1 MiB
│ ├─aux 2.27s 1 1.69% (CC – 12.34%) +48.4 MiB
│ └─quotient 1.77s 1 1.31% (CC – 9.62%) +239.1 MiB
├─DEEP 2.68s 1 2.00% +1.1 GiB
│ ├─main&aux curr row 585.61ms 1 0.44% +372.2 MiB
│ ├─main&aux next row 812.93ms 1 0.60% +384.7 MiB
│ └─segmented quotient 1.28s 1 0.96% +379.7 MiB
├─combined DEEP polynomial 490.81ms 1 0.37% -815.9 MiB
│ └─sum 490.77ms 1 0.37% (CC – 2.67%) -815.9 MiB
├─FRI 1.89s 1 1.40% +96.3 MiB
└─open trace leafs 709.99µs 1 0.00% ±0 B
### Categories
LDE 65.75s 48.92%
hash 22.82s 16.97%
CC 18.36s 13.66%
gen 2.46s 1.83%
Clock frequency is 15236 Hz (2048009 clock cycles / (134418 ms / 1 iterations))
Optimal clock frequency is 15601 Hz (2097152 padded height / (134418 ms / 1 iterations))
FRI domain length is 2^24Compared to a benchmark of parallel NTTs performed over the same domain as the FRI domain of the prover, thv/ntt-par-benchmarks branch to the twenty-first repo.
$ cargo criterion --bench ntt
Finished `bench` profile [optimized] target(s) in 3.48s
Gnuplot not found, using plotters backend
Benchmarking bfe_ntt/len/24: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.7s.
bfe_ntt/len/24 time: [537.82 ms 539.78 ms 542.01 ms]
thrpt: [30.954 Melem/s 31.082 Melem/s 31.195 Melem/s]
Benchmarking par_bfe_ntt/len/width: /24/379: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 206.4s.
par_bfe_ntt/len/width: /24/379
time: [20.636 s 20.668 s 20.699 s]
thrpt: [307.19 Melem/s 307.66 Melem/s 308.13 Melem/s]
Benchmarking xfe_ntt/len/24: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 14.5s.
xfe_ntt/len/24 time: [1.4251 s 1.4309 s 1.4375 s]
thrpt: [11.671 Melem/s 11.725 Melem/s 11.773 Melem/s]
Benchmarking par_xfe_ntt/len/width: /24/88: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 130.4s.
par_xfe_ntt/len/width: /24/88
time: [12.958 s 12.980 s 13.002 s]
thrpt: [113.55 Melem/s 113.75 Melem/s 113.94 Melem/s]fib.tasm program used to benchmark prover:
// initialize stack: ⊥ 0 1 i
push 0
push 1
read_io 1
// is any looping necessary?
dup 0
skiz
call fib_loop
// pop zero, write result
pop 1
write_io 1
halt
// before: ⊥ 0 1 i
// after: ⊥ fib(i-1) fib(i) 0
fib_loop:
push -1 // ⊥ a b j -1
add // ⊥ a b (j-1)
swap 2 // ⊥ (j-1) b a
dup 1 // ⊥ (j-1) b a b
add // ⊥ (j-1) b (a+b)
swap 1 // ⊥ (j-1) (a+b) b
swap 2 // ⊥ b (a+b) (j-1)
dup 0 // ⊥ b (a+b) (j-1) (j-1)
skiz // ⊥ b (a+b) (j-1)
recurse
return