fiat-crypto has neat formally verified/auto-generated field arithmetic that can be used to replace the fiddly bits of the internal/field package. It would be nice to be able to use it, since it would let me feel less scared of exposing the package, and it's a nice checkbox to have.
Actually doing the switch requires the fiat-crypto performance to not be horrifically bad, since the current code works (and has been reviewed externally for correctness). The documentation for Novi's curve25519-dalek fork (which has since been upstreamed) indicates that an 8-15% performance degradation is to be expected, but the actual observed results are significantly worse.
Current status (2021/08/09)
The upstream fiat-crypto developers were kind enough to fold in some performance-related changes (CarryMul/CarrySquare performance fixed, Add/Sub/Opp + Carry added).
My branch still adds CarryPow2k and Add/Sub + CarryMul/CarrySquare, which I suppose I can ask about.
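For readers unfamiliar with the name: a Pow2k-style routine (as in dalek's pow2k) computes x^(2^k) by squaring k times, so the loop lives inside one call rather than k separate CarrySquare calls. A minimal sketch on a toy field, assuming a small modulus in place of 2^255-19; toyP, square, and pow2k are hypothetical names for illustration, not the fiat-crypto generated code or the routine in the branch:

```go
package main

// toyP is a small modulus standing in for 2^255-19.
// It is an assumption made purely for illustration.
const toyP uint64 = 1_000_003

// square returns x*x mod toyP.
// x must be < toyP so that the product fits in a uint64.
func square(x uint64) uint64 {
	return (x * x) % toyP
}

// pow2k returns x^(2^k) mod toyP by squaring k times.
// With k == 0 it returns x reduced mod toyP.
func pow2k(x uint64, k uint) uint64 {
	r := x % toyP
	for i := uint(0); i < k; i++ {
		r = square(r)
	}
	return r
}
```

In a real field implementation the gain would presumably come from keeping the limbs in locals across iterations and avoiding per-call overhead, which a toy uint64 version cannot show.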
Since the branch is close enough to what I envision a switch would look like, these are the current rough benchmarks:
With AVX2 used where it exists:
name                                old time/op   new time/op   delta
VerifyBatchOnly/1-8                 103µs ± 0%    109µs ± 0%    +5.67%
VerifyBatchOnly/2-8                 146µs ± 0%    157µs ± 0%    +7.56%
VerifyBatchOnly/4-8                 232µs ± 0%    252µs ± 0%    +8.59%
VerifyBatchOnly/8-8                 402µs ± 0%    444µs ± 0%    +10.49%
VerifyBatchOnly/16-8                742µs ± 0%    826µs ± 0%    +11.31%
VerifyBatchOnly/32-8                1.43ms ± 0%   1.58ms ± 0%   +10.91%
VerifyBatchOnly/64-8                2.76ms ± 0%   3.11ms ± 0%   +12.55%
VerifyBatchOnly/128-8               5.24ms ± 0%   5.95ms ± 0%   +13.61%
VerifyBatchOnly/256-8               9.52ms ± 0%   10.91ms ± 0%  +14.57%
VerifyBatchOnly/384-8               13.5ms ± 0%   15.6ms ± 0%   +15.35%
VerifyBatchOnly/512-8               17.7ms ± 0%   20.4ms ± 0%   +15.28%
VerifyBatchOnly/768-8               25.4ms ± 0%   29.6ms ± 0%   +16.32%
VerifyBatchOnly/1024-8              32.8ms ± 0%   38.4ms ± 0%   +17.11%
GenerateKey/voi-8                   25.5µs ± 0%   27.9µs ± 0%   +9.33%
GenerateKey/stdlib-8                43.3µs ± 0%   43.1µs ± 0%   -0.41%
NewKeyFromSeed/voi-8                25.3µs ± 0%   27.7µs ± 0%   +9.43%
NewKeyFromSeed/stdlib-8             42.7µs ± 0%   42.8µs ± 0%   +0.07%
Signing/voi-8                       28.0µs ± 0%   30.3µs ± 0%   +8.21%
Signing/stdlib-8                    52.2µs ± 0%   52.2µs ± 0%   -0.04%
Verification/voi-8                  72.9µs ± 0%   79.4µs ± 0%   +8.94%
Verification/voi_stdlib-8           83.3µs ± 0%   90.9µs ± 0%   +9.13%
Verification/stdlib-8               124µs ± 0%    124µs ± 0%    +0.15%
Expanded/NewExpandedPublicKey-8     11.4µs ± 0%   14.2µs ± 0%   +24.98%
Expanded/Verification/voi-8         62.6µs ± 0%   65.5µs ± 0%   +4.65%
Expanded/Verification/voi_stdlib-8  74.9µs ± 0%   78.9µs ± 0%   +5.39%
Expanded/VerifyBatchOnly/1-8        91.9µs ± 0%   96.7µs ± 0%   +5.18%
Expanded/VerifyBatchOnly/2-8        123µs ± 0%    128µs ± 0%    +4.02%
Expanded/VerifyBatchOnly/4-8        184µs ± 0%    197µs ± 0%    +6.76%
Expanded/VerifyBatchOnly/8-8        308µs ± 0%    331µs ± 0%    +7.54%
Expanded/VerifyBatchOnly/16-8       556µs ± 0%    597µs ± 0%    +7.37%
Expanded/VerifyBatchOnly/32-8       1.05ms ± 0%   1.13ms ± 0%   +7.70%
Expanded/VerifyBatchOnly/64-8       2.02ms ± 0%   2.20ms ± 0%   +8.56%
Expanded/VerifyBatchOnly/128-8      3.87ms ± 0%   4.24ms ± 0%   +9.47%
Expanded/VerifyBatchOnly/256-8      7.03ms ± 0%   7.77ms ± 0%   +10.38%
Expanded/VerifyBatchOnly/384-8      10.0ms ± 0%   11.1ms ± 0%   +10.93%
Expanded/VerifyBatchOnly/512-8      13.1ms ± 0%   14.5ms ± 0%   +11.00%
Expanded/VerifyBatchOnly/768-8      18.6ms ± 0%   20.9ms ± 0%   +12.10%
Expanded/VerifyBatchOnly/1024-8     24.0ms ± 0%   26.9ms ± 0%   +11.68%

name                                old time/op   new time/op   delta
ScalarBaseMult/voi-8                24.5µs ± 0%   26.9µs ± 0%   +9.54%
ScalarMult/voi-8                    80.2µs ± 0%   113.0µs ± 0%  +40.98%
Note: The massive regression for X25519 ScalarMult is due to the removal of assembly. Sufficiently recent x/crypto/curve25519 does away with it as well, and clocks in at ~107us/op (~161us/op purego).
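As a reading aid for the delta column: benchstat reports the relative change in time/op, so positive numbers are slowdowns. A minimal sketch of that arithmetic follows; percentDelta is a hypothetical helper, not part of benchstat. Recomputing from the rounded values shown in the tables gives slightly different figures (about +40.9% rather than +40.98% for ScalarMult/voi), presumably because benchstat works from the unrounded measurements.

```go
package main

// percentDelta returns the relative change from oldT to newT in percent.
// For time/op, a positive result means the new code is slower.
// The name and signature are illustrative only.
func percentDelta(oldT, newT float64) float64 {
	return (newT - oldT) / oldT * 100
}
```

For example, percentDelta(80.2, 113.0) corresponds to the AVX2 ScalarMult/voi row, and percentDelta(118, 113) to the purego one, where the fiat-based code comes out slightly faster.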
purego:
name                                old time/op   new time/op   delta
VerifyBatchOnly/1-8                 164µs ± 0%    179µs ± 0%    +9.15%
VerifyBatchOnly/2-8                 238µs ± 0%    247µs ± 0%    +3.79%
VerifyBatchOnly/4-8                 351µs ± 0%    383µs ± 0%    +9.18%
VerifyBatchOnly/8-8                 603µs ± 0%    657µs ± 0%    +9.03%
VerifyBatchOnly/16-8                1.10ms ± 0%   1.21ms ± 0%   +9.29%
VerifyBatchOnly/32-8                2.09ms ± 0%   2.29ms ± 0%   +9.60%
VerifyBatchOnly/64-8                4.09ms ± 0%   4.48ms ± 0%   +9.36%
VerifyBatchOnly/128-8               8.04ms ± 0%   8.93ms ± 0%   +11.05%
VerifyBatchOnly/256-8               14.3ms ± 0%   15.9ms ± 0%   +11.09%
VerifyBatchOnly/384-8               20.2ms ± 0%   22.5ms ± 0%   +11.43%
VerifyBatchOnly/512-8               26.3ms ± 0%   29.3ms ± 0%   +11.46%
VerifyBatchOnly/768-8               37.3ms ± 0%   41.4ms ± 0%   +11.10%
VerifyBatchOnly/1024-8              48.1ms ± 0%   53.4ms ± 0%   +11.20%
GenerateKey/voi-8                   48.8µs ± 0%   52.6µs ± 0%   +7.80%
GenerateKey/stdlib-8                58.3µs ± 0%   58.2µs ± 0%   -0.02%
NewKeyFromSeed/voi-8                48.5µs ± 0%   52.8µs ± 0%   +8.84%
NewKeyFromSeed/stdlib-8             58.2µs ± 0%   58.1µs ± 0%   -0.12%
Signing/voi-8                       51.3µs ± 0%   55.0µs ± 0%   +7.20%
Signing/stdlib-8                    72.4µs ± 0%   72.2µs ± 0%   -0.27%
Verification/voi-8                  108µs ± 0%    117µs ± 0%    +8.55%
Verification/voi_stdlib-8           131µs ± 0%    145µs ± 0%    +10.17%
Verification/stdlib-8               181µs ± 0%    181µs ± 0%    -0.22%
Expanded/NewExpandedPublicKey-8     14.5µs ± 0%   16.0µs ± 0%   +10.85%
Expanded/Verification/voi-8         93.4µs ± 0%   101.9µs ± 0%  +9.16%
Expanded/Verification/voi_stdlib-8  119µs ± 0%    132µs ± 0%    +11.36%
Expanded/VerifyBatchOnly/1-8        149µs ± 0%    163µs ± 0%    +9.48%
Expanded/VerifyBatchOnly/2-8        196µs ± 0%    215µs ± 0%    +9.80%
Expanded/VerifyBatchOnly/4-8        293µs ± 0%    319µs ± 0%    +9.03%
Expanded/VerifyBatchOnly/8-8        484µs ± 0%    530µs ± 0%    +9.44%
Expanded/VerifyBatchOnly/16-8       868µs ± 0%    943µs ± 0%    +8.67%
Expanded/VerifyBatchOnly/32-8       1.62ms ± 0%   1.77ms ± 0%   +9.44%
Expanded/VerifyBatchOnly/64-8       3.13ms ± 0%   3.43ms ± 0%   +9.75%
Expanded/VerifyBatchOnly/128-8      6.33ms ± 0%   7.00ms ± 0%   +10.72%
Expanded/VerifyBatchOnly/256-8      11.3ms ± 0%   12.6ms ± 0%   +11.41%
Expanded/VerifyBatchOnly/384-8      15.9ms ± 0%   17.7ms ± 0%   +11.16%
Expanded/VerifyBatchOnly/512-8      20.7ms ± 0%   23.0ms ± 0%   +11.17%
Expanded/VerifyBatchOnly/768-8      29.1ms ± 0%   32.4ms ± 0%   +11.34%
Expanded/VerifyBatchOnly/1024-8     37.4ms ± 0%   41.5ms ± 0%   +10.98%

name                                old time/op   new time/op   delta
ScalarBaseMult/voi-8                47.8µs ± 0%   51.7µs ± 0%   +8.18%
ScalarMult/voi-8                    118µs ± 0%    113µs ± 0%    -4.05%
This is getting to "an acceptable slowdown" if people think that using the fiat code is preferable to the code that came from dalek, but the issue of having to ship modified routines from the 64/curve25519 implementation remains.
CarryMul is ridiculously slow because of addcarryxU64
CarrySquare is ridiculously slow because of addcarryxU64
#[inline]s fiat_25519_carry_mul among other things. Carry is over the inliner budget.
CarrySquare over to CarryPow2k
Add/Sub/Opp + Carry
Add/Sub + CarryMul/CarrySquare