perf: bmt SIMD hasher#5381

Draft
acud wants to merge 4 commits into master from simd-hashing-no-concurrency

@acud acud commented Feb 25, 2026

Description

This PR changes the current BMT hasher to use SIMD instructions when available.

Motivation and Context (Optional)

The current BMT hasher is highly inefficient: it spawns a large number of goroutines to compute a single chunk address. For a full chunk, 255 goroutines are spawned, which puts significant stress on both the GC and the scheduler. For the ReserveSample function used in the redistribution game, this translates into excessive memory and CPU load just to calculate the reserve sample.

The idea here was to take an existing Keccak implementation, compile it to assembly, and call it directly from our Go code without cgo, which comes with its own set of side effects.

I used (I == me + Claude) the XKCP project (from the Keccak authors) and wrote a build script that builds, extracts, and wraps the compiled code correctly. Currently only linux amd64 is supported. Windows and Mac should fall back to the Go legacy sha3 hasher.

So far the results are promising:

  • 1.6x faster BMT hashing on my local machine (laptop)
  • 2.5x faster on AVX2-capable data-center CPUs (Hetzner)
  • 5x faster on newer AVX512 architectures

There are a few more things to iron out and test:

  • is the scalar hashing fallback (to the Go crypto legacy Keccak hashing) reasonably fast? I imagine it being used only for development anyway (Mac devs etc.). Or do we need to fall back to the previous BMT implementation that spawned all those goroutines?
  • following from that: do we need a BMT factory to spin up the right type of hasher? If so, do we want to maintain both implementations (copies)?
  • since SIMD instructions can cause CPU throttling, we must test on different configurations and providers to check that possible throttling doesn't nullify the performance improvements SIMD gives us. In other words, we have to compare workload benchmarks against the current implementation on trunk.
  • figure out whether the excessive stack frame allocation can somehow be avoided
  • see whether using bmtpool would improve anything at all
  • fix the linker warning:
    /usr/bin/ld: warning: /tmp/go-link-1425710637/000001.o: missing .note.GNU-stack section implies executable stack
    /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
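
One possible shape for the BMT factory question above, as a rough sketch only. The `Hasher` interface, the two hasher types and `selectHasher` are hypothetical names, not code from this PR. The `hasSIMD` flag is assumed to come from CPU feature detection done elsewhere (for example via the `golang.org/x/sys/cpu` package), and is passed in as a parameter so that the selection logic can be tested on any machine.

```go
package main

import (
	"fmt"
	"runtime"
)

// Hasher is a hypothetical minimal interface shared by both implementations.
type Hasher interface {
	Name() string
}

type simdHasher struct{}   // placeholder for the assembly-backed hasher
type scalarHasher struct{} // placeholder for the pure-Go fallback

func (simdHasher) Name() string   { return "simd" }
func (scalarHasher) Name() string { return "scalar" }

// selectHasher picks an implementation for the given platform.
// Only linux/amd64 with SIMD support gets the SIMD hasher, matching
// the platform support described in this PR; everything else falls back.
func selectHasher(goos, goarch string, hasSIMD bool) Hasher {
	if goos == "linux" && goarch == "amd64" && hasSIMD {
		return simdHasher{}
	}
	return scalarHasher{}
}

// NewHasher selects a hasher for the platform the binary was built for.
func NewHasher(hasSIMD bool) Hasher {
	return selectHasher(runtime.GOOS, runtime.GOARCH, hasSIMD)
}

func main() {
	// hasSIMD is hard-coded here; a real build would detect it at startup.
	fmt.Println(NewHasher(true).Name())
}
```

Build tags on separate files would be another way to make the same choice at compile time, which avoids carrying the unused implementation in the binary.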

Worth noting:

  • I've also tried to port the existing Go crypto library legacy sha3 implementation (written in Go) to use the new Go 1.26 SIMD primitives. Still needs a bit of research.
  • Windows is not supported because of shadow space (don't ask)

Test plan

  • run on selected production nodes for a month and check that all is well: no panics, better performance. Only linux amd64, various configurations.

Related Issue (Optional)

#5174

References:

@acud acud force-pushed the simd-hashing-no-concurrency branch from 304934f to ad70c60 Compare February 25, 2026 23:56
@acud acud changed the title perf: BMT SIMD hasher perf: bmt SIMD hasher Feb 26, 2026
@acud acud marked this pull request as draft February 26, 2026 17:41
@acud acud force-pushed the simd-hashing-no-concurrency branch from 0267d23 to 74b4d29 Compare February 26, 2026 17:48
@acud acud force-pushed the simd-hashing-no-concurrency branch from af184e8 to 7872f4e Compare February 27, 2026 01:23
@acud acud force-pushed the simd-hashing-no-concurrency branch from 7872f4e to e10454f Compare February 27, 2026 03:59