Neon xoshiro256pp uplift #2512

kielfriedt · 2025-10-06T21:40:26Z

arm changes in folly that provide a uplift in xoshiro256. Optimized NEON version code using VecResCount = 4 to retain all data in registers, eliminating spills and reducing mov instructions to avoid the high register pressure
Achieved 15.76% uplift on xoshiro256_64 case and 8.64% uplift on xoshiro256 case

Grace-Clang19	Original Code	Optimized code (Neon+VecResCount=4)	Uplift
xoshiro256	684.66M	743.83M	8.64%
xoshiro256_64	529.31M	612.75M	15.76%

NOTE, when build with clang, we need manually tell it to use eor3 by specifying "--extra-cmake-defines '{"CMAKE_C_FLAGS": "-O3 -march=native+sha3", "CMAKE_CXX_FLAGS": "-O3 -march=native+sha3"}'" to getdeps.py.

meta-cla · 2025-10-06T21:40:32Z

Hi @kielfriedt!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

mcfi · 2025-10-06T21:55:27Z

folly/random/xoshiro256pp.h

+    }
+    cur = 0;
+#else
+   for (uint64_t i = 0; i < VecResCount; i++) {


What code gen do you get if you add #pragma clang loop unroll_count(4) to unroll this loop 4 times? Is it still worse than the manually unrolled loop above?

Would it make sense to set the loop unroll count to the number of SIMD units?

sorry for the late response

Im testing on N V2 with 4 SIMD

testing with #pragma clang loop unroll_count(4)

arm changes manual unroll

[...]/folly/folly/test/RandomBenchmark.cpp relative time/iter iters/s ============================================================================ xoshiro256 1.10ns 911.82M xoshiro256_64 1.31ns 765.83M #pragma clang loop unroll_count(4) ============================================================================ [...]/folly/folly/test/RandomBenchmark.cpp relative time/iter iters/s ============================================================================ xoshiro256 1.39ns 720.06M xoshiro256_64 1.70ns 587.99M Significant reduction in performance when not manually unrolling

meta-cla · 2025-10-06T22:10:56Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Nicoshev · 2025-10-06T22:43:20Z

folly/random/xoshiro256pp.h

+      uint64x2_t s3_3 = curState3[3];
+
+      uint64x2_t sum0 = vaddq_u64(s0_0, s0_3);
+      uint64x2_t rot0 = vorrq_u64(vshlq_n_u64(sum0, 23),vshrq_n_u64(sum0, 64 - 23));


@kielfriedt What about using instruction XAR for the rotate. It would be a XAR with a vdupq_n_u64(0) and rotating by 41

sorry for the delay.

original changes provided

[...]/folly/folly/test/RandomBenchmark.cpp relative time/iter iters/s ============================================================================ xoshiro256 1.10ns 911.82M xoshiro256_64 1.31ns 765.83M doing a single rotation on the first ============================================================================ [...]/folly/folly/test/RandomBenchmark.cpp relative time/iter iters/s ============================================================================ xoshiro256 1.89ns 528.34M xoshiro256_64 1.40ns 716.72M doing all rotations ============================================================================ [...]/folly/folly/test/RandomBenchmark.cpp relative time/iter iters/s ============================================================================ xoshiro256 2.07ns 482.83M xoshiro256_64 2.36ns 424.47M significant reduction in performance when using this change.

kielfriedt · 2025-12-09T17:11:45Z

@Nicoshev are we waiting on anymore feedback or can this be closed?

kielfriedt and others added 2 commits October 6, 2025 11:47

xoshiro arm changes 15% uplift with unrolling

8d43c40

Merge branch 'facebook:main' into neon-xoshiro256pp

c9689bb

mcfi reviewed Oct 6, 2025

View reviewed changes

meta-cla bot added the CLA Signed label Oct 6, 2025

Nicoshev reviewed Oct 6, 2025

View reviewed changes

kielfriedt mentioned this pull request Dec 9, 2025

Optimize xoshiro256 for NVIDIA Grace #2540

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Neon xoshiro256pp uplift #2512

Neon xoshiro256pp uplift #2512

kielfriedt commented Oct 6, 2025

Uh oh!

meta-cla bot commented Oct 6, 2025

Uh oh!

mcfi Oct 6, 2025

Uh oh!

mcfi Oct 6, 2025

Uh oh!

kielfriedt Nov 6, 2025 •

edited

Loading

Uh oh!

meta-cla bot commented Oct 6, 2025

Uh oh!

Nicoshev Oct 6, 2025

Uh oh!

kielfriedt Nov 6, 2025 •

edited

Loading

Uh oh!

kielfriedt commented Dec 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Neon xoshiro256pp uplift #2512

Are you sure you want to change the base?

Neon xoshiro256pp uplift #2512

Conversation

kielfriedt commented Oct 6, 2025

Uh oh!

meta-cla bot commented Oct 6, 2025

Action Required

Process

Uh oh!

mcfi Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

mcfi Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

kielfriedt Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

meta-cla bot commented Oct 6, 2025

Uh oh!

Nicoshev Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

kielfriedt Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kielfriedt commented Dec 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kielfriedt Nov 6, 2025 •

edited

Loading

kielfriedt Nov 6, 2025 •

edited

Loading