internal/field: Think about using fiat-crypto

[fiat-crypto][1] has neat formally verified/auto-generated field arithmetic that can be used to replace the fiddly bits of the internal/field package.  It would be nice to be able to use it, since it would let me feel less scared of exposing the package, and it's a nice checkbox to have.

Actually doing the switch requires the fiat-crypto performance to not be horrifically bad, since the current code works (and has been reviewed externally for correctness).  Benchmarks from [Novi's curve25519-dalek fork][2] (that has been upstreamed) documentation indicate that a 8-15% performance degradation is to be expected, but the actual observed results are significantly worse.

 * [x] Write an initial naive experimental branch to get ballpark benchmarks.
 * [x] Be really really really sad about performance.
 * [x] Try to figure out why performance is, significantly worse, when the rust code is just "slightly worse".
   * `CarryMul` is ridiculously slow because of `addcarryxU64`
   * `CarrySquare` is ridiculously slow because of `addcarryxU64`
   * Go's shit inliner strikes again.  The rust code `#[inline]`s `fiat_25519_carry_mul` among other things.  `Carry` is over the inliner budget. 
 * [x] Try to reduce the performance penalty further
   * Switch `CarrySquare` over to `CarryPow2k`
   * Add `Add`/`Sub`/`Opp` + `Carry`
   * Add `Add`/`Sub` + `CarryMul`/`CarrySquare`
   * Fix X25519 performance
 * [ ] Benchmark on 32-bit ARM or something.

##### Current status (2021/08/09)

The upstream fiat-crypto developers were kind enough to fold in some performance related changes (`CarryMul`/`CarrySquare` performance fixed, `Add`/`Sub`/`Opp` + `Carry` added).

My branch still adds `CarryPow2k` and `Add`/`Sub` + `CarryMul`/`CarrySquare`, which I suppose I can ask about.

Since the branch is close enough to what I envision a switch would look like, these are the current rough benchmarks:

With AVX2 used where it exists:
```
name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    103µs ± 0%     109µs ± 0%   +5.67%
VerifyBatchOnly/2-8                    146µs ± 0%     157µs ± 0%   +7.56%
VerifyBatchOnly/4-8                    232µs ± 0%     252µs ± 0%   +8.59%
VerifyBatchOnly/8-8                    402µs ± 0%     444µs ± 0%  +10.49%
VerifyBatchOnly/16-8                   742µs ± 0%     826µs ± 0%  +11.31%
VerifyBatchOnly/32-8                  1.43ms ± 0%    1.58ms ± 0%  +10.91%
VerifyBatchOnly/64-8                  2.76ms ± 0%    3.11ms ± 0%  +12.55%
VerifyBatchOnly/128-8                 5.24ms ± 0%    5.95ms ± 0%  +13.61%
VerifyBatchOnly/256-8                 9.52ms ± 0%   10.91ms ± 0%  +14.57%
VerifyBatchOnly/384-8                 13.5ms ± 0%    15.6ms ± 0%  +15.35%
VerifyBatchOnly/512-8                 17.7ms ± 0%    20.4ms ± 0%  +15.28%
VerifyBatchOnly/768-8                 25.4ms ± 0%    29.6ms ± 0%  +16.32%
VerifyBatchOnly/1024-8                32.8ms ± 0%    38.4ms ± 0%  +17.11%
GenerateKey/voi-8                     25.5µs ± 0%    27.9µs ± 0%   +9.33%
GenerateKey/stdlib-8                  43.3µs ± 0%    43.1µs ± 0%   -0.41%
NewKeyFromSeed/voi-8                  25.3µs ± 0%    27.7µs ± 0%   +9.43%
NewKeyFromSeed/stdlib-8               42.7µs ± 0%    42.8µs ± 0%   +0.07%
Signing/voi-8                         28.0µs ± 0%    30.3µs ± 0%   +8.21%
Signing/stdlib-8                      52.2µs ± 0%    52.2µs ± 0%   -0.04%
Verification/voi-8                    72.9µs ± 0%    79.4µs ± 0%   +8.94%
Verification/voi_stdlib-8             83.3µs ± 0%    90.9µs ± 0%   +9.13%
Verification/stdlib-8                  124µs ± 0%     124µs ± 0%   +0.15%
Expanded/NewExpandedPublicKey-8       11.4µs ± 0%    14.2µs ± 0%  +24.98%
Expanded/Verification/voi-8           62.6µs ± 0%    65.5µs ± 0%   +4.65%
Expanded/Verification/voi_stdlib-8    74.9µs ± 0%    78.9µs ± 0%   +5.39%
Expanded/VerifyBatchOnly/1-8          91.9µs ± 0%    96.7µs ± 0%   +5.18%
Expanded/VerifyBatchOnly/2-8           123µs ± 0%     128µs ± 0%   +4.02%
Expanded/VerifyBatchOnly/4-8           184µs ± 0%     197µs ± 0%   +6.76%
Expanded/VerifyBatchOnly/8-8           308µs ± 0%     331µs ± 0%   +7.54%
Expanded/VerifyBatchOnly/16-8          556µs ± 0%     597µs ± 0%   +7.37%
Expanded/VerifyBatchOnly/32-8         1.05ms ± 0%    1.13ms ± 0%   +7.70%
Expanded/VerifyBatchOnly/64-8         2.02ms ± 0%    2.20ms ± 0%   +8.56%
Expanded/VerifyBatchOnly/128-8        3.87ms ± 0%    4.24ms ± 0%   +9.47%
Expanded/VerifyBatchOnly/256-8        7.03ms ± 0%    7.77ms ± 0%  +10.38%
Expanded/VerifyBatchOnly/384-8        10.0ms ± 0%    11.1ms ± 0%  +10.93%
Expanded/VerifyBatchOnly/512-8        13.1ms ± 0%    14.5ms ± 0%  +11.00%
Expanded/VerifyBatchOnly/768-8        18.6ms ± 0%    20.9ms ± 0%  +12.10%
Expanded/VerifyBatchOnly/1024-8       24.0ms ± 0%    26.9ms ± 0%  +11.68%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        24.5µs ± 0%    26.9µs ± 0%   +9.54%
ScalarMult/voi-8            80.2µs ± 0%   113.0µs ± 0%  +40.98%
```

Note: The massive regression for X25519 ScalarMult is due to the removal of assembly.  Sufficiently recent `x/crypto/curve25519` does away with it as well, and clocks in at ~107us/op (~161us/op purego).

purego:
```
name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    164µs ± 0%     179µs ± 0%   +9.15%
VerifyBatchOnly/2-8                    238µs ± 0%     247µs ± 0%   +3.79%
VerifyBatchOnly/4-8                    351µs ± 0%     383µs ± 0%   +9.18%
VerifyBatchOnly/8-8                    603µs ± 0%     657µs ± 0%   +9.03%
VerifyBatchOnly/16-8                  1.10ms ± 0%    1.21ms ± 0%   +9.29%
VerifyBatchOnly/32-8                  2.09ms ± 0%    2.29ms ± 0%   +9.60%
VerifyBatchOnly/64-8                  4.09ms ± 0%    4.48ms ± 0%   +9.36%
VerifyBatchOnly/128-8                 8.04ms ± 0%    8.93ms ± 0%  +11.05%
VerifyBatchOnly/256-8                 14.3ms ± 0%    15.9ms ± 0%  +11.09%
VerifyBatchOnly/384-8                 20.2ms ± 0%    22.5ms ± 0%  +11.43%
VerifyBatchOnly/512-8                 26.3ms ± 0%    29.3ms ± 0%  +11.46%
VerifyBatchOnly/768-8                 37.3ms ± 0%    41.4ms ± 0%  +11.10%
VerifyBatchOnly/1024-8                48.1ms ± 0%    53.4ms ± 0%  +11.20%
GenerateKey/voi-8                     48.8µs ± 0%    52.6µs ± 0%   +7.80%
GenerateKey/stdlib-8                  58.3µs ± 0%    58.2µs ± 0%   -0.02%
NewKeyFromSeed/voi-8                  48.5µs ± 0%    52.8µs ± 0%   +8.84%
NewKeyFromSeed/stdlib-8               58.2µs ± 0%    58.1µs ± 0%   -0.12%
Signing/voi-8                         51.3µs ± 0%    55.0µs ± 0%   +7.20%
Signing/stdlib-8                      72.4µs ± 0%    72.2µs ± 0%   -0.27%
Verification/voi-8                     108µs ± 0%     117µs ± 0%   +8.55%
Verification/voi_stdlib-8              131µs ± 0%     145µs ± 0%  +10.17%
Verification/stdlib-8                  181µs ± 0%     181µs ± 0%   -0.22%
Expanded/NewExpandedPublicKey-8       14.5µs ± 0%    16.0µs ± 0%  +10.85%
Expanded/Verification/voi-8           93.4µs ± 0%   101.9µs ± 0%   +9.16%
Expanded/Verification/voi_stdlib-8     119µs ± 0%     132µs ± 0%  +11.36%
Expanded/VerifyBatchOnly/1-8           149µs ± 0%     163µs ± 0%   +9.48%
Expanded/VerifyBatchOnly/2-8           196µs ± 0%     215µs ± 0%   +9.80%
Expanded/VerifyBatchOnly/4-8           293µs ± 0%     319µs ± 0%   +9.03%
Expanded/VerifyBatchOnly/8-8           484µs ± 0%     530µs ± 0%   +9.44%
Expanded/VerifyBatchOnly/16-8          868µs ± 0%     943µs ± 0%   +8.67%
Expanded/VerifyBatchOnly/32-8         1.62ms ± 0%    1.77ms ± 0%   +9.44%
Expanded/VerifyBatchOnly/64-8         3.13ms ± 0%    3.43ms ± 0%   +9.75%
Expanded/VerifyBatchOnly/128-8        6.33ms ± 0%    7.00ms ± 0%  +10.72%
Expanded/VerifyBatchOnly/256-8        11.3ms ± 0%    12.6ms ± 0%  +11.41%
Expanded/VerifyBatchOnly/384-8        15.9ms ± 0%    17.7ms ± 0%  +11.16%
Expanded/VerifyBatchOnly/512-8        20.7ms ± 0%    23.0ms ± 0%  +11.17%
Expanded/VerifyBatchOnly/768-8        29.1ms ± 0%    32.4ms ± 0%  +11.34%
Expanded/VerifyBatchOnly/1024-8       37.4ms ± 0%    41.5ms ± 0%  +10.98%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        47.8µs ± 0%    51.7µs ± 0%  +8.18%
ScalarMult/voi-8             118µs ± 0%     113µs ± 0%  -4.05%
```

This is getting to "an acceptable slowdown" if people think that using the fiat code is better over the code that came from dalek, but the issue of having to ship modified routines from the `64/curve25519` implementation remains.

[1]: https://github.com/mit-plv/fiat-crypto
[2]: https://github.com/novifinancial/curve25519-dalek-fiat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internal/field: Think about using fiat-crypto #78

Current status (2021/08/09)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

internal/field: Think about using fiat-crypto #78

Description

Current status (2021/08/09)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions