Skip to content

Conversation

@statham-arm
Copy link
Collaborator

This commit replaces the contents of the existing arm/addsf3.S with a much faster implementation that Arm has recently open-sourced in the Arm Optimized Routines git repository.

The new implementation is approximately 1.6× as fast as the old one on average. Some sample cycle timings from a Cortex-M0, with test cases covering both magnitude addition and subtraction and various cases of renormalization:

New code: 73, 63, 53, 81, 81
Old code: 83, 92, 88, 153, 168

This commit also contains a more thorough test suite for single precision addition and subtraction. Using that test suite I also found that the previous arm/addsf3.S had at least one bug, which the new code fixes: adding the largest denormal (0x007fffff) to itself returned 0x007ffffe, a slightly smaller number, instead of the correct 0x00fffffe.

The test suite also includes thorough tests for the NaN handling policy implemented by the new code. This is in line with Arm's hardware FP implementations (so that switching between software and hardware FP makes as little difference as possible to the answers), but doesn't match what compiler-rt does in all other situations, so I've enabled it only under an #ifdef that should match when this implementation is selected.

The new code contains entry points for both addition and subtraction, with cross-branching between them after correcting signs. This avoids the overhead of treating subtraction as a sign-flipping wrapper on addition, but also means I had to add an extra piece of mechanism to the build scripts to allow the wrapper version of subsf3.c to be excluded from the build in the presence of the new addsf3.S. You can indicate that a platform-specific source file replaces an additional platform-independent one by setting its crt_supersedes property in cmake.

This commit replaces the contents of the existing arm/addsf3.S with a
much faster implementation that Arm has recently open-sourced in the
Arm Optimized Routines git repository.

The new implementation is approximately 1.6× as fast as the old one on
average. Some sample cycle timings from a Cortex-M0, with test cases
covering both magnitude addition and subtraction and various cases of
renormalization:

New code: 73,  63,  53,  81,  81
Old code: 83,  92,  88, 153, 168

This commit also contains a more thorough test suite for single
precision addition and subtraction. Using that test suite I also found
that the previous arm/addsf3.S had at least one bug, which the new
code fixes: adding the largest denormal (0x007fffff) to itself
returned 0x007ffffe, a slightly _smaller_ number, instead of the
correct 0x00fffffe.

The test suite also includes thorough tests for the NaN handling
policy implemented by the new code. This is in line with Arm's
hardware FP implementations (so that switching between software and
hardware FP makes as little difference as possible to the answers),
but doesn't match what compiler-rt does in all other situations, so
I've enabled it only under an `#ifdef` that should match when this
implementation is selected.

The new code contains entry points for both addition and subtraction,
with cross-branching between them after correcting signs. This avoids
the overhead of treating subtraction as a sign-flipping wrapper on
addition, but also means I had to add an extra piece of mechanism to
the build scripts to allow the wrapper version of subsf3.c to be
excluded from the build in the presence of the new addsf3.S. You can
indicate that a platform-specific source file replaces an additional
platform-independent one by setting its `crt_supersedes` property in
cmake.
@github-actions
Copy link

github-actions bot commented Aug 18, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@vhscampos
Copy link
Member

The cmake changes LGTM, but I'll leave the core of the change for the others (I'm out of my depth).

Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A disclaimer that I was involved in the internal review for Arm Optimized Routines, and on principle I'm in favour of this change. I've concentrated on the parts unique to compiler-rt.

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance. Obviously can't just compare the length of the file as that contains comments, and instructions can be 2 or 4 bytes. To be clear I'm not advocating for an additional choice of routine based on size/performance.

Presumably the denormal issue you found was unique to the existing Arm assembly implementation and not the general C implementation (otherwise the tests would fail on non v6-m platforms)?

status |= test__addsf3(0xbbebe66d, 0x3b267c1f, 0xbb98a85e);
status |= test__addsf3(0x01f5b166, 0x81339a37, 0x019be44a);

#if __thumb__ && !__thumb2__
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got a situation where the generic C implemention differs in behaviour from the assembler version. I expect your intention is to replace all the Arm implementations so they will all have consistent behaviours across architecture.

Being paranoid, if someone edits the CMake to use the C version (on v6-m) these tests will start to fail. The comment does make that clear so at least they will know why.

Do you know if there's a way to get CMake to define a symbol when the superseded version is chosen. That could be tested for here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it could – these test sources are compiled via lit, not via cmake-generated compile commands, so cmake add_definitions or similar wouldn't affect them.

It would probably be easier to set a lit feature name, such as you can query in REQUIRES: lines. Then I could move the NaN-policy-specific tests out into separate files conditioned on librt_has_addsf3_arm_nan or some such.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion for the tests to be split out. It would be nice to have if it is simple, but it may not be worth a lot of additional complexity.

pop {r4, r5, r6, r7, pc}


PUSH {r4,r5,r6,lr}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies in advance for being annoying. All the existing Arm assembly code in the directory uses lower case for instructions and directives. Possibly to distinguish it from capitalised Macros.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't personally care that much myself about whether the assembly is captialized or not. This is more of an observation that it could be an existing undocumented convention. Maybe another reviewer can confirm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer using lower case for instructions and directives for consistency with other assembly files in compiler-rt.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

//
//===----------------------------------------------------------------------===//

unsigned __fnan2(unsigned a, unsigned b) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being paranoid again, this helper function isn't part of any runtime ABI so we could risk clashing with an arbitrary helper in another system library.

Running nm on compiler-rt I can see a prefix of __compilerrt_ being used in some cases such as __compilerrt_abort_impl. Could it be worth a similar prefix here. Possibly even __compilerrt_arm_fnan2?

unsigned __fnan2(unsigned a, unsigned b) {
unsigned aadj = (a << 1) + 0x00800000;
unsigned badj = (b << 1) + 0x00800000;
if (aadj > 0xff800000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth accommodating some of the additional comments in https://github.com/ARM-software/optimized-routines/blob/master/fp/common/dnan2.c to make it easier to recognize the bit-patterns?

@statham-arm
Copy link
Collaborator Author

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance.

You're right, the code size is bigger in this implementation. The new addsf3.S assembles to 648 bytes of code, and (at -Os) another 68 bytes for the helper fnan2.c. The old version was 312 bytes for addsf3.S and 22 bytes for the subsf3.c wrapper.

Presumably the denormal issue you found was unique to the existing Arm assembly implementation and not the general C implementation (otherwise the tests would fail on non v6-m platforms)?

Yes, the C version in lib/builtins/addsf3.c (well, really lib/builtins/fp_add_impl.inc) passes the new test in full. (Or rather, correctly skips all the NaN test cases and passes the rest.)

@smithp35
Copy link
Collaborator

Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance.

You're right, the code size is bigger in this implementation. The new addsf3.S assembles to 648 bytes of code, and (at -Os) another 68 bytes for the helper fnan2.c. The old version was 312 bytes for addsf3.S and 22 bytes for the subsf3.c wrapper.

Looking at the size of the v6-m libgcc implementation, which has a symbol size of 0x41c bytes (1052) then while this implementation is larger than previous compiler-rt it is not as large as the most commonly used open-source implementation.

Personally I'm comfortable that the size isn't prohibitively large.

statham-arm added a commit to statham-arm/llvm-project that referenced this pull request Oct 2, 2025
This was Petr Hosek's comment on llvm#154093, but if we're doing that, we
should do it consistently.
Now we only try to test the difficult parts if we're using the new
implementations, and otherwise, fall back to treating all NaNs as good
enough when a NaN result is expected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants