Investigate evaluation overhead

After the recent [NTT speedups](https://github.com/Neptune-Crypto/twenty-first/commit/a72ff37928e43e0d5a061b1bbc6e0bd21f51100f) introduced upstream in twenty-first, I had expected to see an at least 10 % drop in proving time with triton-vm v0.49.0. Instead I observe a 7 % drop in proving times.

To get to the bottom of this, I wanted to compare a representative NTT benchmark with the "evaluation" steps in the triton-vm prover. I found that there is a 58 % overhead for `BFieldElement` NTT and a 30 % overhead for `XFieldElement` NTT. To be fair, there is more going on in the "evaluate" step than just NTT, as the function scales a polynomial (to be able to do co-set evaluation). But I think those 58/30 % overheads are worth looking at.

Benchmarks & proving were done on Mjolnir, a Threadripper 5995wx with 64 cores and 512GB RAM, on twenty-first v0.49.0 and triton-vm v0.49.0. 

Proving, where evaluation takes 32.64 seconds for b-field, and 16.84 seconds for x-field.
```bash
$ triton-cli --profile prove --program fib.tasm --input 204800
### Triton VM – Prove                         134.42s    #Reps   Share  Category          105.2 GiB
├─trace execution                             721.29ms       1   0.54%  (gen  – 29.36%)  +648.0 MiB
├─Fiat-Shamir: claim                            6.81µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─derive additional parameters                 20.17µs       1   0.00%                       ±0 B  
├─main tables                                  53.63s        1  39.90%                    +64.9 GiB
│ ├─create                                    659.47ms       1   0.49%  (gen  – 26.84%)    +5.9 GiB
│ ├─pad                                       519.88ms       1   0.39%  (gen  – 21.16%)   +25.9 MiB
│ │ ├─pad original tables                     308.33ms       1   0.23%                    +17.9 MiB
│ │ └─fill degree-lowering table              211.40ms       1   0.16%                     +8.0 MiB
│ ├─LDE                                        37.89s        1  28.19%  (LDE  – 57.63%)   +59.7 GiB
│ │ ├─polynomial zero-initialization            1.75µs       1   0.00%                       ±0 B  
│ │ ├─interpolation                             3.32s        1   2.47%                    +12.1 GiB
│ │ ├─resize                                    1.94s        1   1.44%                    +47.3 GiB
│ │ ├─evaluation                               32.64s        1  24.28%                   +275.2 MiB
│ │ └─memoize                                   1.37µs       1   0.00%                       ±0 B  
│ ├─Merkle tree                                12.57s        1   9.35%                     +1.2 GiB
│ │ ├─leafs                                    11.54s        1   8.59%                   +628.0 MiB
│ │ │ └─hash rows                              11.54s        1   8.59%  (hash – 50.58%)  +628.0 MiB
│ │ └─Merkle tree                             930.81ms       1   0.69%  (hash –  4.08%)    +1.2 GiB
│ ├─Fiat-Shamir                                26.20µs       1   0.00%  (hash –  0.00%)      ±0 B  
│ └─extend                                    556.30ms       1   0.41%  (gen  – 22.64%)    +4.1 GiB
│   ├─initialize master table                 221.29ms       1   0.16%                     +4.1 GiB
│   ├─slice master table                        4.60µs       1   0.00%                       ±0 B  
│   ├─all tables                              217.36ms       1   0.16%                    +10.5 MiB
│   └─fill degree lowering table              117.44ms       1   0.09%                       ±0 B  
├─aux tables                                   33.90s        1  25.22%                    +34.2 GiB
│ ├─LDE                                        21.91s        1  16.30%  (LDE  – 33.32%)   +37.2 GiB
│ │ ├─polynomial zero-initialization            1.38µs       1   0.00%                       ±0 B  
│ │ ├─interpolation                             3.66s        1   2.72%                     +4.2 GiB
│ │ ├─resize                                    1.39s        1   1.04%                    +32.1 GiB
│ │ ├─evaluation                               16.86s        1  12.54%                    +40.7 MiB
│ │ └─memoize                                 871.00ns       1   0.00%                       ±0 B  
│ ├─Merkle tree                                 9.54s        1   7.10%                     +1.2 GiB
│ │ ├─leafs                                     9.18s        1   6.83%                   +628.0 MiB
│ │ │ └─hash rows                               9.18s        1   6.83%  (hash – 40.24%)  +628.0 MiB
│ │ └─Merkle tree                             319.95ms       1   0.24%  (hash –  1.40%)    +1.2 GiB
│ └─Fiat-Shamir                               114.48µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─quotient calculation (cached)                13.36s        1   9.94%  (CC   – 72.74%)  +381.1 MiB
│ ├─zerofier inverse                            3.88s        1   2.89%                   +510.7 MiB
│ └─evaluate AIR, compute quotient codeword     9.44s        1   7.02%                   +383.0 MiB
├─quotient LDE                                  5.95s        1   4.43%  (LDE  –  9.05%)    +1.4 GiB
├─hash rows of quotient segments              524.60ms       1   0.39%  (hash –  2.30%)  +642.0 MiB
├─Merkle tree                                 319.13ms       1   0.24%  (hash –  1.40%)    +1.3 GiB
├─out-of-domain rows                            8.21s        1   6.11%                    +92.5 MiB
├─Fiat-Shamir                                  81.54µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─linear combination                            6.02s        1   4.48%                   +624.0 MiB
│ ├─main                                      483.22ms       1   0.36%  (CC   –  2.63%)   +48.1 MiB
│ ├─aux                                         2.27s        1   1.69%  (CC   – 12.34%)   +48.4 MiB
│ └─quotient                                    1.77s        1   1.31%  (CC   –  9.62%)  +239.1 MiB
├─DEEP                                          2.68s        1   2.00%                     +1.1 GiB
│ ├─main&aux curr row                         585.61ms       1   0.44%                   +372.2 MiB
│ ├─main&aux next row                         812.93ms       1   0.60%                   +384.7 MiB
│ └─segmented quotient                          1.28s        1   0.96%                   +379.7 MiB
├─combined DEEP polynomial                    490.81ms       1   0.37%                   -815.9 MiB
│ └─sum                                       490.77ms       1   0.37%  (CC   –  2.67%)  -815.9 MiB
├─FRI                                           1.89s        1   1.40%                    +96.3 MiB
└─open trace leafs                            709.99µs       1   0.00%                       ±0 B  

### Categories
LDE     65.75s  48.92%
hash    22.82s  16.97%
CC      18.36s  13.66%
gen      2.46s   1.83%

Clock frequency is 15236 Hz (2048009 clock cycles / (134418 ms / 1 iterations))
Optimal clock frequency is 15601 Hz (2097152 padded height / (134418 ms / 1 iterations))
FRI domain length is 2^24
```

Compared to a benchmark of parallel NTTs performed over the same domain as the FRI domain of the prover, $2^{24}$, where the NTTs take 20.668 seconds and 12.980 seconds, respectively. The benchmarks can be found on the `thv/ntt-par-benchmarks` branch to the twenty-first repo.
```bash
$ cargo criterion --bench ntt

    Finished `bench` profile [optimized] target(s) in 3.48s
Gnuplot not found, using plotters backend
Benchmarking bfe_ntt/len/24: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.7s.
bfe_ntt/len/24          time:   [537.82 ms 539.78 ms 542.01 ms]                         
                        thrpt:  [30.954 Melem/s 31.082 Melem/s 31.195 Melem/s]

Benchmarking par_bfe_ntt/len/width: /24/379: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 206.4s.
par_bfe_ntt/len/width: /24/379                                                                          
                        time:   [20.636 s 20.668 s 20.699 s]
                        thrpt:  [307.19 Melem/s 307.66 Melem/s 308.13 Melem/s]

Benchmarking xfe_ntt/len/24: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 14.5s.
xfe_ntt/len/24          time:   [1.4251 s 1.4309 s 1.4375 s]                            
                        thrpt:  [11.671 Melem/s 11.725 Melem/s 11.773 Melem/s]

Benchmarking par_xfe_ntt/len/width: /24/88: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 130.4s.
par_xfe_ntt/len/width: /24/88                                                                          
                        time:   [12.958 s 12.980 s 13.002 s]
                        thrpt:  [113.55 Melem/s 113.75 Melem/s 113.94 Melem/s]
```


`fib.tasm` program used to benchmark prover:

```tasm
// initialize stack: ⊥ 0 1 i
        push 0
        push 1
        read_io 1

        // is any looping necessary?
        dup 0
        skiz
        call fib_loop

        // pop zero, write result
        pop 1
        write_io 1
        halt

        // before: ⊥ 0 1 i
        // after:  ⊥ fib(i-1) fib(i) 0
        fib_loop:
            push -1   // ⊥ a b j -1
            add       // ⊥ a b (j-1)
            swap 2    // ⊥ (j-1) b a
            dup 1     // ⊥ (j-1) b a b
            add       // ⊥ (j-1) b (a+b)
            swap 1    // ⊥ (j-1) (a+b) b
            swap 2    // ⊥ b (a+b) (j-1)
            dup 0     // ⊥ b (a+b) (j-1) (j-1)
            skiz      // ⊥ b (a+b) (j-1)
            recurse
            return

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate evaluation overhead #360

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Investigate evaluation overhead #360

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions