Skip to content

Conversation

pcc
Copy link
Contributor

@pcc pcc commented Jul 17, 2024

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

DamonFool and others added 2 commits July 16, 2024 17:51
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@llvmbot llvmbot added the libc label Jul 17, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 17, 2024

@llvm/pr-subscribers-libc

Author: None (pcc)

Changes

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)


Full diff: https://github.com/llvm/llvm-project/pull/99260.diff

1 Files Affected:

  • (modified) libc/src/string/memory_utils/op_aarch64.h (+6-12)
diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h
index 1090ea2617f09..5c08a6ae48b04 100644
--- a/libc/src/string/memory_utils/op_aarch64.h
+++ b/libc/src/string/memory_utils/op_aarch64.h
@@ -84,8 +84,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t a = vld1q_u8(_p1);
       uint8x16_t n = vld1q_u8(_p2);
       uint8x16_t an = veorq_u8(a, n);
-      uint32x2_t an_reduced = vqmovn_u64(vreinterpretq_u64_u8(an));
-      return vmaxv_u32(an_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(an));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -97,12 +96,9 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o).  At least one byte is nonzero if there is
       // a difference between the two buffers.  We reduce this value down to 4
-      // bytes in two steps. First, calculate the saturated move value when
-      // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get
-      // a single 32 bit nonzero value if a mismatch occurred.
+      // bytes using the UMAXV instruction to compute the max across the vector.
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr ((Size % BlockSize) == 0) {
       for (size_t offset = 0; offset < Size; offset += BlockSize)
         if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset))
@@ -129,8 +125,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o)
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -150,9 +145,8 @@ template <size_t Size> struct Bcmp {
       uint8x16_t cpdq = vorrq_u8(cp, dq);
       // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)).  Reduce this to
       // a nonzero 32 bit value if a mismatch occurred.
-      uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq);
-      uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq);
-      return vmaxv_u32(abnocpdq_reduced);
+      uint8x16_t abnocpdq = anbo | cpdq;
+      return vmaxvq_u32(vreinterpretq_u32_u8(abnocpdq));
     } else {
       static_assert(cpp::always_false<decltype(Size)>, "SIZE not implemented");
     }

@pcc pcc requested a review from gchatelet July 17, 2024 00:51
@lntue lntue changed the title libc: Use UMAXV.4S to reduce bcmp result. [libc] Use UMAXV.4S to reduce bcmp result. Jul 17, 2024
@SchrodingerZhu
Copy link
Contributor

Hi,

Thank you for the patch. Unfortunately, I think the proposed change is causing failures in tests:

Ran 5 tests.  PASS: 5  FAIL: 0
[4171/5229] Running unit test libc.test.src.stdio.snprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.snprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:23: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcSNPrintfTest.CutOff
[ RUN      ] LlvmLibcSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:53: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4172/5229] Running unit test libc.test.src.stdio.vsnprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.vsnprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcVSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:35: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcVSNPrintfTest.CutOff
[ RUN      ] LlvmLibcVSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:64: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcVSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4173/5229] Running unit test libc.test.src.stdio.fprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.fprintf_test.__unit__.__build__
[==========] Running 1 test from 1 test suite.
[ RUN      ] LlvmLibcFPrintfTest.WriteToFile
/home/schrodinger/development/llvm-project/libc/test/src/stdio/fprintf_test.cpp:65: FAILURE
      Expected: printf_test::fread(data, 1, sizeof(simple) - 1, file)
      Which is: 0
To be equal to: sizeof(simple) - 1
      Which is: 37
[  FAILED  ] LlvmLibcFPrintfTest.WriteToFile
Ran 1 tests.  PASS: 0  FAIL: 1
[4184/5229] Linking CXX executable libc/test/src/stdio/libc.test.src.stdio.remove_test.__hermetic__.__build__
ninja: build stopped: subcommand failed.

Copy link
Contributor

@SchrodingerZhu SchrodingerZhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment

@pcc pcc closed this Aug 5, 2024
@pcc
Copy link
Contributor Author

pcc commented Aug 15, 2025

can't reproduce this test failure. I did:

cmake -G Ninja -DLLVM_ENABLE_RUNTIMES=libc -DLLVM_ENABLE_PROJECTS=clang\;lld -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD=AArch64 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm
ninja libc
ninja -C ./runtimes/runtimes-bins check-libc

all tests passed.

@pcc pcc reopened this Aug 15, 2025
@SchrodingerZhu
Copy link
Contributor

SchrodingerZhu commented Aug 15, 2025 via email

@lntue lntue requested a review from overmighty August 26, 2025 01:03
@pcc
Copy link
Contributor Author

pcc commented Sep 6, 2025

I will test it again. Thanks for letting me know.On Thu, Aug 14, 2025 at 23:14, Peter Collingbourne @.> wrote: Reopened #99260. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are on a team that was mentioned.Message ID: @.>

Did you have a chance to do this?

@overmighty
Copy link
Member

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

@pcc
Copy link
Contributor Author

pcc commented Sep 12, 2025

Ping

FYI, ~2 weeks ago I tried to reproduce your benchmark results on Android and I managed to cross-compile the benchmark from main and run it, but couldn't get it to compile from this PR's branch.

Yeah, you might need to rebase to main, it rebases cleanly here.

@pcc
Copy link
Contributor Author

pcc commented Oct 10, 2025

Ping

Copy link
Contributor

@michaelrj-google michaelrj-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

pveras and others added 2 commits October 13, 2025 11:20
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@pcc pcc changed the title [libc] Use UMAXV.4S to reduce bcmp result. libc: Use UMAXV.4S to reduce bcmp result. Oct 13, 2025
@pcc pcc changed the title libc: Use UMAXV.4S to reduce bcmp result. [libc] Use UMAXV.4S to reduce bcmp result. Oct 13, 2025
@pcc pcc changed the base branch from users/pcc/spr/main.libc-use-umaxv4s-to-reduce-bcmp-result to main October 13, 2025 18:21
@pcc pcc merged commit 7905ec3 into main Oct 13, 2025
11 of 13 checks passed
@pcc pcc deleted the users/pcc/spr/libc-use-umaxv4s-to-reduce-bcmp-result branch October 13, 2025 18:21
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Oct 13, 2025
We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
  bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
    1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu

Pull Request: llvm/llvm-project#99260
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
  bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
    1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
    1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Reviewers: michaelrj-google, gchatelet, overmighty, SchrodingerZhu

Pull Request: llvm#99260
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants