
@real-or-random commented Nov 6, 2025

This adds a new representation geh of group elements, namely in homogeneous (also called projective) coordinates. This is supposed to be faster for unmixed addition (i.e., addition where the second summand is not an affine ge) in terms of field operations, namely 12M + 0S plus 25 cheap operations vs. 12M + 4S plus 11 cheap operations for Jacobian coordinates.
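For readers unfamiliar with the representation: a homogeneous triple (X : Y : Z) with Z ≠ 0 denotes the affine point (X/Z, Y/Z), and any nonzero scalar multiple of the triple denotes the same point. A minimal Python sketch (illustrative only, not the library's C code; function names are made up for this example):

```python
# Illustrative sketch of homogeneous (projective) coordinates over
# secp256k1's base field. Not the PR's C code; names are hypothetical.
P = 2**256 - 2**32 - 977  # secp256k1 base field order
# Affine coordinates of the secp256k1 generator G:
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

def geh_from_affine(x, y):
    # Trivial embedding: (x, y) -> (x : y : 1).
    return (x, y, 1)

def geh_to_affine(X, Y, Z):
    # One field inversion recovers the affine point (X/Z, Y/Z).
    zi = pow(Z, P - 2, P)  # inverse via Fermat's little theorem
    return (X * zi % P, Y * zi % P)
```

Scaling (X, Y, Z) by any nonzero field element leaves `geh_to_affine` unchanged, which is what makes the representation projective.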

The addition and doubling formulas are due to Renes, Costello, and Batina 2016, Algorithms 7 and 9. The formulas are complete, i.e., they have no special cases. However, this implementation still keeps track of infinity in a dedicated boolean flag for performance reasons. Since the buckets in Pippenger's algorithm are initialized with infinity (=zero), we'll have many additions involving infinity, and going through the entire formula for each of those hurts performance (and the entire point of this PR is performance).
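For concreteness, here is a hedged Python transcription of the complete addition formula (Algorithm 7 in the paper, specialized to a = 0 and b3 = 3b = 21 for secp256k1), checked against textbook affine addition. The register scheduling is simplified relative to the paper's in-place sequence, and this is an illustration, not the PR's C code:

```python
# Illustrative transcription of Renes-Costello-Batina 2016, Algorithm 7
# (complete addition for a = 0 curves), with simplified scheduling.
P = 2**256 - 2**32 - 977  # secp256k1 base field order
B3 = 21                   # 3*b for secp256k1 (b = 7)
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
INF = (0, 1, 0)           # point at infinity in homogeneous coordinates

def geh_add(p1, p2):
    # Complete: also handles doubling and infinity, with no branches.
    X1, Y1, Z1 = p1
    X2, Y2, Z2 = p2
    t0 = X1 * X2 % P
    t1 = Y1 * Y2 % P
    t2 = Z1 * Z2 % P
    t3 = ((X1 + Y1) * (X2 + Y2) - t0 - t1) % P  # X1*Y2 + X2*Y1
    t4 = ((Y1 + Z1) * (Y2 + Z2) - t1 - t2) % P  # Y1*Z2 + Y2*Z1
    t5 = ((X1 + Z1) * (X2 + Z2) - t0 - t2) % P  # X1*Z2 + X2*Z1
    t0 = 3 * t0 % P       # 3*X1*X2
    t2 = B3 * t2 % P      # 3*b*Z1*Z2
    s = (t1 + t2) % P     # Y1*Y2 + 3*b*Z1*Z2
    d = (t1 - t2) % P     # Y1*Y2 - 3*b*Z1*Z2
    t5 = B3 * t5 % P      # 3*b*(X1*Z2 + X2*Z1)
    return ((t3 * d - t4 * t5) % P,
            (s * d + t0 * t5) % P,
            (t4 * s + t0 * t3) % P)

def geh_to_affine(X, Y, Z):
    zi = pow(Z, P - 2, P)
    return (X * zi % P, Y * zi % P)

def affine_add(p1, p2):
    # Textbook chord-and-tangent addition, used as a reference.
    if p1 is None: return p2
    if p2 is None: return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # p1 == -p2, result is infinity
    if (x1, y1) == (x2, y2):
        lam = 3 * x1 * x1 * pow(2 * y1, P - 2, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, P - 2, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)
```

Note that feeding INF through `geh_add` yields a projective multiple of the other summand, which is why the formula needs no special cases; the PR's dedicated infinity flag is purely a performance shortcut on top of that.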

The formulas were implemented by giving GPT-5 mini screenshots of the algorithms in the paper along with field.h. The result was not awesome, but I was able to clean it up manually.

The new representation is used in Pippenger's ecmult_multi for accumulating the buckets after every window iteration. Buckets are still constructed as gej (because it has faster mixed addition) and only converted to geh before accumulation. This is still supposed to be faster even if the conversion is accounted for. The conversion costs 2M+1S but we then do two geh additions in a row, saving 8S. This PR has three different variants of how geh could be used:
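The 2M + 1S conversion works because a Jacobian triple (X : Y : Z) denotes the affine point (X/Z², Y/Z³), so (X·Z : Y : Z³) denotes the same point in homogeneous coordinates. A sketch under those conventions (Python for illustration, hypothetical names):

```python
# Sketch of the gej -> geh conversion whose 2M + 1S cost is discussed
# above. Jacobian (X : Y : Z) is affine (X/Z^2, Y/Z^3); homogeneous
# (X*Z : Y : Z^3) is affine (X*Z/Z^3, Y/Z^3), i.e. the same point.
P = 2**256 - 2**32 - 977  # secp256k1 base field order
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

def gej_to_geh(X, Y, Z):
    Z2 = Z * Z % P          # 1S
    return (X * Z % P,      # 1M
            Y,
            Z2 * Z % P)     # 1M -- total 2M + 1S

def geh_to_affine(X, Y, Z):
    zi = pow(Z, P - 2, P)
    return (X * zi % P, Y * zi % P)
```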

  1. 2ccc5e0 Only the inner accumulation loop is done in geh.
  2. 867fe34 All of the accumulation is done in geh.
  3. 0fc73f3 Like the previous, but we switch back to gej for rows of doublings.

Unfortunately, none of these turns out to be really faster in bench_ecmult pippenger_wnaf on my x86_64 system with gcc 15.2.1 or clang 21.1.4. The best variant (2) beats master by just 0.21%; the other variants are slower than master. :/ If I compile in 32-bit mode, all three variants beat master consistently, but only by 1.2%. This latter result at least gives some hope that this PR could pay off on some platform. I'm not even sure how much we care about 32-bit platforms: maybe we care about hardware wallets in general, but probably not when it comes to ecmult_multi. (Plus this would need real benchmarks; I didn't even run this on a native 32-bit CPU.)

But we'd certainly care about ARM64 which I couldn't test on. Anyone with an ARM Mac willing to benchmark this?

The exact benchmark command was SECP256K1_BENCH_ITERS=100000 bench_ecmult pippenger_wnaf (or 20000 iters for 32-bit). Don't forget the pippenger_wnaf argument to make sure you don't benchmark Strauss' algorithm instead, at least below the point-count threshold where we switch to Pippenger automatically. I ran this on a 12th Gen Intel(R) Core(TM) i7-1260P, pinned to a P-core and with TurboBoost disabled. See the attached spreadsheet benchmark-gcc.ods for details.

If you want to benchmark this, I think it makes sense to get four runs per setup: one for the baseline (d0f3123, just disabling low point counts in bench_ecmult for quicker benchmarking) and the three "step" commits as mentioned above. You could just extend the spreadsheet with your results.

Also, if you have any ideas on how to improve this further, I'd be happy to hear them. I tried various micro-optimizations, but none of them turned out to be significant on my machine; in fact, most of them made the code slower in practice. In theory, this PR should make it possible to increase the window size a bit, but playing around with the window size didn't make a difference in practice either.

Edit: Don't worry about the CI failures. They occur on some platforms because I forgot to mark some functions static; the code should compile locally without issues.

theStack commented Nov 7, 2025

Ran the benchmarks on my arm64 notebook (it's a Lenovo Thinkpad T14s Gen 6, with a Qualcomm Snapdragon X Elite CPU, using gcc 14.2.0) using a hacked-together build-and-benchmark script and got the following results: https://gist.github.com/theStack/897d7b50b5b8a6f288ed2b817fcca9fc
Based on only this, it looks like variant 1 (2ccc5e0) is a bit faster (~1%, judging from the last few lines of each run), variant 2 is pretty much the same, and variant 3 is worse. Will give it a few more runs next week to verify that these results are consistent.
