

@peterdettman
Contributor

Modifies the pippenger bucket-summing loop. This is very early code, posted mainly to benchmark the basic concept.
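For context, the generic running-sum form of Pippenger bucket summing can be sketched as follows. This is an illustration under stated assumptions, not the libsecp256k1 code. The curve group is replaced by uint64_t under wrapping addition, bucket i (1-based) is assumed to carry weight i, and the names elem, add, n_add and sum_buckets are hypothetical. Only running_sum and r follow the names used in this PR. Each iteration performs two general group additions, and that per-iteration cost is what the change below reduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in group: uint64_t under wrapping addition. In the real
 * code the elements would be Jacobian points and add() a point addition. */
typedef uint64_t elem;
static unsigned n_add; /* counts group additions performed */
static elem add(elem a, elem b) { n_add++; return a + b; }

/* Computes sum_{i=1}^{n} i * bucket[i-1] with 2n additions by walking the
 * buckets from the highest index down (the running-sum trick). */
static elem sum_buckets(const elem *bucket, size_t n) {
    assert(bucket != NULL || n == 0);
    elem running_sum = 0, r = 0; /* 0 plays the role of the point at infinity */
    for (size_t i = n; i > 0; i--) {
        running_sum = add(running_sum, bucket[i - 1]); /* B_n + ... + B_i */
        r = add(r, running_sum); /* after the loop, each B_i was added i times */
    }
    return r;
}
```

With buckets {5, 7, 11} this returns 1*5 + 2*7 + 3*11 = 52 after 6 additions.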

In short, we keep running_sum at the same Z coordinate as the accumulator r (the two points are "Co-Z"). Adding running_sum to r then costs only 5M + 2S (M = field multiplication, S = field squaring), and running_sum is updated to the new shared Z for free. This operation is often called ZADDU (or DBLU when it degenerates to a doubling). The downside is that after adding each bucket to running_sum, we must also update r so the two stay Co-Z, which costs 3M + 1S. A typical iteration of the summing loop therefore costs 20M + 7S instead of 24M + 8S.
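A minimal sketch of the two Co-Z primitives described above follows. It assumes Meloni's ZADDU formulas, and the PR's actual code may arrange them differently. To stay self-contained and runnable, the sketch works over a toy prime field (p = 1000003) on the curve y^2 = x^3 + 7 rather than the real secp256k1 field. It ignores the point at infinity and the X1 == X2 (DBLU or cancellation) special cases. It also counts field multiplications and squarings so that the 5M + 2S and 3M + 1S figures can be checked. All names (gej, zaddu, coz_rescale, find_point, affine_add, the counters) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy prime field, p = 1000003 (prime, p % 4 == 3), standing in for the
 * secp256k1 field so that products fit in 64 bits. Not constant-time. */
#define FP 1000003ULL

static unsigned n_mul, n_sqr; /* counted field multiplications / squarings */

static uint64_t mulmod(uint64_t a, uint64_t b) { return a * b % FP; } /* uncounted */
static uint64_t fsub(uint64_t a, uint64_t b) { return (a + FP - b) % FP; }
static uint64_t fmul(uint64_t a, uint64_t b) { n_mul++; return mulmod(a, b); }
static uint64_t fsqr(uint64_t a) { n_sqr++; return mulmod(a, a); }

static uint64_t fpow(uint64_t a, uint64_t e) { /* square-and-multiply, uncounted */
    uint64_t acc = 1;
    for (; e; e >>= 1, a = mulmod(a, a))
        if (e & 1) acc = mulmod(acc, a);
    return acc;
}

/* Hypothetical Jacobian point: affine (X/Z^2, Y/Z^3). No infinity flag. */
typedef struct { uint64_t x, y, z; } gej;

static gej from_affine(uint64_t x, uint64_t y, uint64_t z) {
    uint64_t z2 = mulmod(z, z);
    gej p = { mulmod(x, z2), mulmod(y, mulmod(z2, z)), z };
    return p;
}

static void to_affine(uint64_t *x, uint64_t *y, const gej *p) {
    uint64_t zi = fpow(p->z, FP - 2), zi2 = mulmod(zi, zi); /* Fermat inverse */
    *x = mulmod(p->x, zi2);
    *y = mulmod(p->y, mulmod(zi2, zi));
}

/* ZADDU: requires p->z == q->z and p->x != q->x. Writes r = p + q and
 * rescales p to r's Z at no extra cost. Cost: 5M + 2S.
 * r may alias q (as in "r += running_sum") but must not alias p. */
static void zaddu(gej *r, gej *p, const gej *q) {
    assert(p->z == q->z && p->x != q->x);
    uint64_t dx = fsub(q->x, p->x), dy = fsub(q->y, p->y);
    uint64_t a = fsqr(dx);                 /* A = (X2-X1)^2          1S */
    uint64_t b = fmul(p->x, a);            /* B = X1*A               1M */
    uint64_t c = fmul(q->x, a);            /* C = X2*A               1M */
    uint64_t d = fsqr(dy);                 /* D = (Y2-Y1)^2          1S */
    uint64_t e = fmul(p->y, fsub(c, b));   /* E = Y1*(X2-X1)^3       1M */
    uint64_t x3 = fsub(fsub(d, b), c);     /* X3 = D - B - C            */
    uint64_t z3 = fmul(p->z, dx);          /* Z3 = Z*(X2-X1)         1M */
    r->y = fsub(fmul(dy, fsub(b, x3)), e); /* Y3 = (Y2-Y1)(B-X3) - E 1M */
    r->x = x3;
    r->z = z3;
    p->x = b; p->y = e; p->z = z3;         /* old p, rescaled by dx     */
}

/* Keeps pt Co-Z after its partner's Z was multiplied by lambda:
 * (X, Y, Z) -> (X*l^2, Y*l^3, Z*l), the same affine point. new_z is taken
 * from the partner rather than recomputed, so the cost is 3M + 1S. */
static void coz_rescale(gej *pt, uint64_t lambda, uint64_t new_z) {
    assert(new_z == mulmod(pt->z, lambda));
    uint64_t l2 = fsqr(lambda);            /* 1S */
    pt->x = fmul(pt->x, l2);               /* 1M */
    pt->y = fmul(pt->y, fmul(l2, lambda)); /* 2M */
    pt->z = new_z;
}

/* Finds a point on y^2 = x^3 + 7 over the toy field with x >= x0 (small x0),
 * using sqrt(a) = a^((p+1)/4), valid because p % 4 == 3. */
static void find_point(uint64_t x0, uint64_t *x, uint64_t *y) {
    for (*x = x0;; (*x)++) {
        uint64_t rhs = (mulmod(mulmod(*x, *x), *x) + 7) % FP;
        *y = fpow(rhs, (FP + 1) / 4);
        if (mulmod(*y, *y) == rhs) return;
    }
}

/* Plain affine chord addition (x1 != x2), used only as a cross-check. */
static void affine_add(uint64_t *x3, uint64_t *y3, uint64_t x1, uint64_t y1,
                       uint64_t x2, uint64_t y2) {
    uint64_t m = mulmod(fsub(y2, y1), fpow(fsub(x2, x1), FP - 2));
    *x3 = fsub(fsub(mulmod(m, m), x1), x2);
    *y3 = fsub(mulmod(m, fsub(x1, *x3)), y1);
}
```

In the summing loop, "r += running_sum" would become zaddu(&r, &running_sum, &r). After running_sum += bucket, r would be kept Co-Z with coz_rescale(&r, lambda, running_sum.z), where lambda is the factor by which running_sum's Z changed. libsecp256k1's secp256k1_gej_add_var can report this ratio through its rzr argument. A general Jacobian addition costs 12M + 4S, which is what the 24M + 8S figure for two of them implies. On that basis, one iteration costs (12M + 4S) + (3M + 1S) + (5M + 2S) = 20M + 7S.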

I measure an overall improvement of a few percent for pippenger_wnaf, depending on the number of points, although I would appreciate others sanity-checking that. Unfortunately it involves rather a lot of new code, not helped by the special cases everywhere.

@john-moffett
Contributor

Apple M2 Max, default build; master vs. d95e48b.

secp256k1 configure summary
===========================
Build artifacts:
  library type ........................ Shared
Optional modules:
  ECDH ................................ ON
  ECDSA pubkey recovery ............... OFF
  extrakeys ........................... ON
  schnorrsig .......................... ON
  musig ............................... ON
  ElligatorSwift ...................... ON
Parameters:
  ecmult window size .................. 15
  ecmult gen table size ............... 86 KiB
Optional features:
  assembly ............................ OFF
  external callbacks .................. OFF
Optional binaries:
  benchmark ........................... ON
  noverify_tests ...................... ON
  tests ............................... ON
  exhaustive tests .................... ON
  ctime_tests ......................... OFF
  examples ............................ OFF

Cross compiling ....................... FALSE
API visibility attributes ............. ON
Valgrind .............................. OFF
Preprocessor defined macros ........... ECMULT_WINDOW_SIZE=15 COMB_BLOCKS=43 COMB_TEETH=6
C compiler ............................ AppleClang 17.0.0.17000013, /usr/bin/cc
CFLAGS ................................ 
Compile options ....................... -Wall -pedantic -Wcast-align -Wconditional-uninitialized -Wextra -Wnested-externs -Wno-long-long -Wno-overlength-strings -Wno-unused-function -Wreserved-identifier -Wshadow -Wstrict-prototypes -Wundef
Build type:
 - CMAKE_BUILD_TYPE ................... RelWithDebInfo
 - CFLAGS ............................. -O2 -g 
 - LDFLAGS for executables ............ 
 - LDFLAGS for shared libraries ....... 
| Benchmark | Before: master, avg (µs) | After: d95e48b, avg (µs) | Δ avg (µs) | Δ avg % |
|---|--:|--:|--:|--:|
| ecmult_multi_79p_g | 7.81 | 7.85 | +0.04 | +0.5% |
| ecmult_multi_95p_g | 7.47 | 7.32 | -0.15 | -2.0% |
| ecmult_multi_111p_g | 7.24 | 7.03 | -0.21 | -2.9% |
| ecmult_multi_127p_g | 7.15 | 6.87 | -0.28 | -3.9% |
| ecmult_multi_159p_g | 6.85 | 6.49 | -0.36 | -5.3% |
| ecmult_multi_191p_g | 6.57 | 6.25 | -0.32 | -4.9% |
| ecmult_multi_223p_g | 6.28 | 6.08 | -0.20 | -3.2% |
| ecmult_multi_255p_g | 6.23 | 5.99 | -0.24 | -3.9% |
| ecmult_multi_319p_g | 5.86 | 5.71 | -0.15 | -2.6% |
| ecmult_multi_383p_g | 5.71 | 5.59 | -0.12 | -2.1% |
| ecmult_multi_447p_g | 5.48 | 5.41 | -0.07 | -1.3% |
| ecmult_multi_511p_g | 5.35 | 5.28 | -0.07 | -1.3% |
| ecmult_multi_639p_g | 5.33 | 5.20 | -0.13 | -2.4% |
| ecmult_multi_767p_g | 5.04 | 5.06 | +0.02 | +0.4% |
| ecmult_multi_895p_g | 4.96 | 4.88 | -0.08 | -1.6% |
| ecmult_multi_1023p_g | 4.88 | 4.89 | +0.01 | +0.2% |
| ecmult_multi_1279p_g | 4.76 | 4.67 | -0.09 | -1.9% |
| ecmult_multi_1535p_g | 4.59 | 4.51 | -0.08 | -1.7% |
| ecmult_multi_1791p_g | 4.46 | 4.36 | -0.10 | -2.2% |
| ecmult_multi_2047p_g | 4.36 | 4.29 | -0.07 | -1.6% |
| ecmult_multi_2559p_g | 4.24 | 4.18 | -0.06 | -1.4% |
| ecmult_multi_3071p_g | 4.14 | 4.06 | -0.08 | -1.9% |
| ecmult_multi_3583p_g | 4.19 | 4.04 | -0.15 | -3.6% |
| ecmult_multi_4095p_g | 4.04 | 3.95 | -0.09 | -2.2% |
| ecmult_multi_5119p_g | 3.93 | 3.86 | -0.07 | -1.8% |
| ecmult_multi_6143p_g | 3.82 | 3.78 | -0.04 | -1.0% |
| ecmult_multi_7167p_g | 3.79 | 3.76 | -0.03 | -0.8% |
| ecmult_multi_8191p_g | 3.75 | 3.65 | -0.10 | -2.7% |
| ecmult_multi_10239p_g | 3.63 | 3.57 | -0.06 | -1.7% |
| ecmult_multi_12287p_g | 3.57 | 3.51 | -0.06 | -1.7% |
| ecmult_multi_14335p_g | 3.50 | 3.48 | -0.02 | -0.6% |
| ecmult_multi_16383p_g | 3.53 | 3.38 | -0.15 | -4.2% |
| ecmult_multi_20479p_g | 3.42 | 3.31 | -0.11 | -3.2% |
| ecmult_multi_24575p_g | 3.29 | 3.30 | +0.01 | +0.3% |
| ecmult_multi_28671p_g | 3.22 | 3.14 | -0.08 | -2.5% |
| ecmult_multi_32767p_g | 3.19 | 3.10 | -0.09 | -2.8% |

@siv2r
Contributor

siv2r commented Dec 15, 2025

I benchmarked this pull request on a MacBook Pro (M4 Pro, ARM64, 12 cores: 8P + 4E) running macOS 15.6.1, plugged in with no background apps. Based on my results, this pull request performs better than master, with an average speedup of ~2%.

Anyone who wants to reproduce the benchmarks can use this Python script. It benchmarks both this pull request and the master branch, and outputs an .xlsx file comparing their performance.
