Skip to content

Conversation

@kielfriedt
Copy link

arm changes in folly that provide a uplift in xoshiro256. Optimized NEON version code using VecResCount = 4 to retain all data in registers, eliminating spills and reducing mov instructions to avoid the high register pressure
Achieved 15.76% uplift on xoshiro256_64 case and 8.64% uplift on xoshiro256 case

Grace-Clang19 Original Code Optimized code (Neon+VecResCount=4) Uplift
xoshiro256 684.66M 743.83M 8.64%
xoshiro256_64 529.31M 612.75M 15.76%

NOTE, when build with clang, we need manually tell it to use eor3 by specifying "--extra-cmake-defines '{"CMAKE_C_FLAGS": "-O3 -march=native+sha3", "CMAKE_CXX_FLAGS": "-O3 -march=native+sha3"}'" to getdeps.py.

@meta-cla
Copy link

meta-cla bot commented Oct 6, 2025

Hi @kielfriedt!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

}
cur = 0;
#else
for (uint64_t i = 0; i < VecResCount; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What code gen do you get if you add #pragma clang loop unroll_count(4) to unroll this loop 4 times? Is it still worse than the manually unrolled loop above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to set the loop unroll count to the number of SIMD units?

Copy link
Author

@kielfriedt kielfriedt Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late response

Im testing on N V2 with 4 SIMD

testing with #pragma clang loop unroll_count(4)

arm changes manual unroll

[...]/folly/folly/test/RandomBenchmark.cpp     relative  time/iter   iters/s
============================================================================
xoshiro256                                                  1.10ns   911.82M
xoshiro256_64                                               1.31ns   765.83M

#pragma clang loop unroll_count(4)
============================================================================
[...]/folly/folly/test/RandomBenchmark.cpp     relative  time/iter   iters/s
============================================================================
xoshiro256                                                  1.39ns   720.06M
xoshiro256_64                                               1.70ns   587.99M

Significant reduction in performance when not manually unrolling

@meta-cla meta-cla bot added the CLA Signed label Oct 6, 2025
@meta-cla
Copy link

meta-cla bot commented Oct 6, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

uint64x2_t s3_3 = curState3[3];

uint64x2_t sum0 = vaddq_u64(s0_0, s0_3);
uint64x2_t rot0 = vorrq_u64(vshlq_n_u64(sum0, 23),vshrq_n_u64(sum0, 64 - 23));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kielfriedt What about using instruction XAR for the rotate. It would be a XAR with a vdupq_n_u64(0) and rotating by 41

Copy link
Author

@kielfriedt kielfriedt Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay.

original changes provided

[...]/folly/folly/test/RandomBenchmark.cpp     relative  time/iter   iters/s
============================================================================
xoshiro256                                                  1.10ns   911.82M
xoshiro256_64                                               1.31ns   765.83M

doing a single rotation on the first
============================================================================
[...]/folly/folly/test/RandomBenchmark.cpp     relative  time/iter   iters/s
============================================================================
xoshiro256                                                  1.89ns   528.34M
xoshiro256_64                                               1.40ns   716.72M

doing all rotations
============================================================================
[...]/folly/folly/test/RandomBenchmark.cpp     relative  time/iter   iters/s
============================================================================
xoshiro256                                                  2.07ns   482.83M
xoshiro256_64                                               2.36ns   424.47M

significant reduction in performance when using this change.

@kielfriedt
Copy link
Author

@Nicoshev are we waiting on anymore feedback or can this be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants