[compiler-rt][ARM] Optimized f32 add/subtract for Armv6-M. #154093

statham-arm · 2025-08-18T11:08:17Z

This commit replaces the contents of the existing arm/addsf3.S with a much faster implementation that Arm has recently open-sourced in the Arm Optimized Routines git repository.

The new implementation is approximately 1.6× as fast as the old one on average. Some sample cycle timings from a Cortex-M0, with test cases covering both magnitude addition and subtraction and various cases of renormalization:

New code: 73, 63, 53, 81, 81
Old code: 83, 92, 88, 153, 168

This commit also contains a more thorough test suite for single precision addition and subtraction. Using that test suite I also found that the previous arm/addsf3.S had at least one bug, which the new code fixes: adding the largest denormal (0x007fffff) to itself returned 0x007ffffe, a slightly smaller number, instead of the correct 0x00fffffe.

The test suite also includes thorough tests for the NaN handling policy implemented by the new code. This is in line with Arm's hardware FP implementations (so that switching between software and hardware FP makes as little difference as possible to the answers), but doesn't match what compiler-rt does in all other situations, so I've enabled it only under an #ifdef that should match when this implementation is selected.

The new code contains entry points for both addition and subtraction, with cross-branching between them after correcting signs. This avoids the overhead of treating subtraction as a sign-flipping wrapper on addition, but also means I had to add an extra piece of mechanism to the build scripts to allow the wrapper version of subsf3.c to be excluded from the build in the presence of the new addsf3.S. You can indicate that a platform-specific source file replaces an additional platform-independent one by setting its crt_supersedes property in cmake.


github-actions · 2025-08-18T11:11:46Z

✅ With the latest revision this PR passed the C/C++ code formatter.

vhscampos · 2025-08-18T11:40:41Z

The cmake changes LGTM, but I'll leave the core of the change for the others (I'm out of my depth).

smithp35

A disclaimer that I was involved in the internal review for Arm Optimized Routines, and on principle I'm in favour of this change. I've concentrated on the parts unique to compiler-rt.

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance. Obviously can't just compare the length of the file as that contains comments, and instructions can be 2 or 4 bytes. To be clear I'm not advocating for an additional choice of routine based on size/performance.

Presumably the denormal issue you found was unique to the existing Arm assembly implementation and not the general C implementation (otherwise the tests would fail on non v6-m platforms)?

smithp35 · 2025-08-18T13:39:30Z

compiler-rt/test/builtins/Unit/addsf3_test.c

+  status |= test__addsf3(0xbbebe66d, 0x3b267c1f, 0xbb98a85e);
+  status |= test__addsf3(0x01f5b166, 0x81339a37, 0x019be44a);
+
+#if __thumb__ && !__thumb2__


We've got a situation where the generic C implementation differs in behaviour from the assembler version. I expect your intention is to replace all the Arm implementations so that they behave consistently across architectures.

Being paranoid, if someone edits the CMake to use the C version (on v6-m) these tests will start to fail. The comment does make that clear so at least they will know why.

Do you know if there's a way to get CMake to define a symbol when the superseded version is chosen? That could be tested for here.

I'm not sure it could – these test sources are compiled via lit, not via cmake-generated compile commands, so cmake add_definitions or similar wouldn't affect them.

It would probably be easier to set a lit feature name, such as you can query in REQUIRES: lines. Then I could move the NaN-policy-specific tests out into separate files conditioned on librt_has_addsf3_arm_nan or some such.

I don't have a strong opinion for the tests to be split out. It would be nice to have if it is simple, but it may not be worth a lot of additional complexity.

smithp35 · 2025-08-18T13:46:52Z

compiler-rt/lib/builtins/arm/addsf3.S

-  pop {r4, r5, r6, r7, pc}
-
-
+  PUSH {r4,r5,r6,lr}


Apologies in advance for being annoying. All the existing Arm assembly code in the directory uses lower case for instructions and directives, possibly to distinguish them from capitalised macros.

I don't care much myself whether the assembly is capitalised or not. This is more of an observation that it could be an existing undocumented convention. Maybe another reviewer can confirm?

I'd prefer using lower case for instructions and directives for consistency with other assembly files in compiler-rt.

smithp35 · 2025-08-18T14:03:33Z

compiler-rt/lib/builtins/arm/fnan2.c

+//
+//===----------------------------------------------------------------------===//
+
+unsigned __fnan2(unsigned a, unsigned b) {


Being paranoid again, this helper function isn't part of any runtime ABI so we could risk clashing with an arbitrary helper in another system library.

Running nm on compiler-rt, I can see a prefix of __compilerrt_ being used in some cases, such as __compilerrt_abort_impl. Could it be worth using a similar prefix here, possibly even __compilerrt_arm_fnan2?

smithp35 · 2025-08-18T14:05:52Z

compiler-rt/lib/builtins/arm/fnan2.c

+unsigned __fnan2(unsigned a, unsigned b) {
+  unsigned aadj = (a << 1) + 0x00800000;
+  unsigned badj = (b << 1) + 0x00800000;
+  if (aadj > 0xff800000)


Is it worth accommodating some of the additional comments in https://github.com/ARM-software/optimized-routines/blob/master/fp/common/dnan2.c to make it easier to recognize the bit-patterns?

statham-arm · 2025-08-19T10:51:28Z

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance.

You're right, the code size is bigger in this implementation. The new addsf3.S assembles to 648 bytes of code, and (at -Os) another 68 bytes for the helper fnan2.c. The old version was 312 bytes for addsf3.S and 22 bytes for the subsf3.c wrapper.

Presumably the denormal issue you found was unique to the existing Arm assembly implementation and not the general C implementation (otherwise the tests would fail on non v6-m platforms)?

Yes, the C version in lib/builtins/addsf3.c (well, really lib/builtins/fp_add_impl.inc) passes the new test in full. (Or rather, correctly skips all the NaN test cases and passes the rest.)

smithp35 · 2025-08-19T14:11:29Z

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance.

You're right, the code size is bigger in this implementation. The new addsf3.S assembles to 648 bytes of code, and (at -Os) another 68 bytes for the helper fnan2.c. The old version was 312 bytes for addsf3.S and 22 bytes for the subsf3.c wrapper.

For comparison, the v6-m libgcc implementation has a symbol size of 0x41c (1052) bytes. So while this implementation is larger than the previous compiler-rt one, it is not as large as the most commonly used open-source implementation.

Personally I'm comfortable that the size isn't prohibitively large.

statham-arm requested review from MaskRay, TNorthover, efriedma-quic, mstorsjo, smithp35 and vhscampos August 18, 2025 11:08

llvmbot added compiler-rt compiler-rt:builtins labels Aug 18, 2025

clang-format

7f88246

smithp35 reviewed Aug 18, 2025

statham-arm added 2 commits August 18, 2025 17:13

Add comments in fnan2

f4447a1

Rename __fnan2 to mention compiler-rt

9e31425

statham-arm mentioned this pull request Oct 1, 2025

[compiler-rt][ARM] Optimized mulsf3 and divsf3 #161546

Merged

statham-arm added a commit to statham-arm/llvm-project that referenced this pull request Oct 2, 2025

Lowercase instruction mnemonics and shifter operands

7a24535

This was Petr Hosek's comment on llvm#154093, but if we're doing that, we should do it consistently.

statham-arm added 3 commits October 2, 2025 14:26

Lowercase instruction mnemonics

73e7571

Changes to fnan2 to be consistent wih llvm#161546

a6f6263

Conditionalize NaN-handling parts of tests

fd77ee9

Now we only try to test the difficult parts if we're using the new implementations, and otherwise, fall back to treating all NaNs as good enough when a NaN result is expected.
