-
Notifications
You must be signed in to change notification settings - Fork 15.4k
[compiler-rt][ARM] Optimized f32 add/subtract for Armv6-M. #154093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit replaces the contents of the existing arm/addsf3.S with a much faster implementation that Arm has recently open-sourced in the Arm Optimized Routines git repository. The new implementation is approximately 1.6× as fast as the old one on average. Some sample cycle timings from a Cortex-M0, with test cases covering both magnitude addition and subtraction and various cases of renormalization: New code: 73, 63, 53, 81, 81 Old code: 83, 92, 88, 153, 168 This commit also contains a more thorough test suite for single precision addition and subtraction. Using that test suite I also found that the previous arm/addsf3.S had at least one bug, which the new code fixes: adding the largest denormal (0x007fffff) to itself returned 0x007ffffe, a slightly _smaller_ number, instead of the correct 0x00fffffe. The test suite also includes thorough tests for the NaN handling policy implemented by the new code. This is in line with Arm's hardware FP implementations (so that switching between software and hardware FP makes as little difference as possible to the answers), but doesn't match what compiler-rt does in all other situations, so I've enabled it only under an `#ifdef` that should match when this implementation is selected. The new code contains entry points for both addition and subtraction, with cross-branching between them after correcting signs. This avoids the overhead of treating subtraction as a sign-flipping wrapper on addition, but also means I had to add an extra piece of mechanism to the build scripts to allow the wrapper version of subsf3.c to be excluded from the build in the presence of the new addsf3.S. You can indicate that a platform-specific source file replaces an additional platform-independent one by setting its `crt_supersedes` property in cmake.
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
|
The cmake changes LGTM, but I'll leave the core of the change for the others (I'm out of my depth). |
smithp35
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A disclaimer that I was involved in the internal review for Arm Optimized Routines, and on principle I'm in favour of this change. I've concentrated on the parts unique to compiler-rt.
Do you happen to have a figure for code-size? One possible objection is someone preferring a smallest possible implementation for M0 at the expense of performance. Obviously can't just compare the length of the file as that contains comments, and instructions can be 2 or 4 bytes. To be clear I'm not advocating for an additional choice of routine based on size/performance.
Presumably the denormal issue you found was unique to the existing Arm assembly implementation and not the general C implementation (otherwise the tests would fail on non v6-m platforms)?
| status |= test__addsf3(0xbbebe66d, 0x3b267c1f, 0xbb98a85e); | ||
| status |= test__addsf3(0x01f5b166, 0x81339a37, 0x019be44a); | ||
|
|
||
| #if __thumb__ && !__thumb2__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got a situation where the generic C implemention differs in behaviour from the assembler version. I expect your intention is to replace all the Arm implementations so they will all have consistent behaviours across architecture.
Being paranoid, if someone edits the CMake to use the C version (on v6-m) these tests will start to fail. The comment does make that clear so at least they will know why.
Do you know if there's a way to get CMake to define a symbol when the superseded version is chosen. That could be tested for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it could – these test sources are compiled via lit, not via cmake-generated compile commands, so cmake add_definitions or similar wouldn't affect them.
It would probably be easier to set a lit feature name, such as you can query in REQUIRES: lines. Then I could move the NaN-policy-specific tests out into separate files conditioned on librt_has_addsf3_arm_nan or some such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion for the tests to be split out. It would be nice to have if it is simple, but it may not be worth a lot of additional complexity.
| pop {r4, r5, r6, r7, pc} | ||
|
|
||
|
|
||
| PUSH {r4,r5,r6,lr} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies in advance for being annoying. All the existing Arm assembly code in the directory uses lower case for instructions and directives. Possibly to distinguish it from capitalised Macros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't personally care that much myself about whether the assembly is captialized or not. This is more of an observation that it could be an existing undocumented convention. Maybe another reviewer can confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer using lower case for instructions and directives for consistency with other assembly files in compiler-rt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done.
compiler-rt/lib/builtins/arm/fnan2.c
Outdated
| // | ||
| //===----------------------------------------------------------------------===// | ||
|
|
||
| unsigned __fnan2(unsigned a, unsigned b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being paranoid again, this helper function isn't part of any runtime ABI so we could risk clashing with an arbitrary helper in another system library.
Running nm on compiler-rt I can see a prefix of __compilerrt_ being used in some cases such as __compilerrt_abort_impl. Could it be worth a similar prefix here. Possibly even __compilerrt_arm_fnan2?
compiler-rt/lib/builtins/arm/fnan2.c
Outdated
| unsigned __fnan2(unsigned a, unsigned b) { | ||
| unsigned aadj = (a << 1) + 0x00800000; | ||
| unsigned badj = (b << 1) + 0x00800000; | ||
| if (aadj > 0xff800000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth accommodating some of the additional comments in https://github.com/ARM-software/optimized-routines/blob/master/fp/common/dnan2.c to make it easier to recognize the bit-patterns?
You're right, the code size is bigger in this implementation. The new
Yes, the C version in |
Looking at the size of the v6-m libgcc implementation, which has a symbol size of 0x41c bytes (1052) then while this implementation is larger than previous compiler-rt it is not as large as the most commonly used open-source implementation. Personally I'm comfortable that the size isn't prohibitively large. |
This was Petr Hosek's comment on llvm#154093, but if we're doing that, we should do it consistently.
Now we only try to test the difficult parts if we're using the new implementations, and otherwise, fall back to treating all NaNs as good enough when a NaN result is expected.
This commit replaces the contents of the existing arm/addsf3.S with a much faster implementation that Arm has recently open-sourced in the Arm Optimized Routines git repository.
The new implementation is approximately 1.6× as fast as the old one on average. Some sample cycle timings from a Cortex-M0, with test cases covering both magnitude addition and subtraction and various cases of renormalization:
New code: 73, 63, 53, 81, 81
Old code: 83, 92, 88, 153, 168
This commit also contains a more thorough test suite for single precision addition and subtraction. Using that test suite I also found that the previous arm/addsf3.S had at least one bug, which the new code fixes: adding the largest denormal (0x007fffff) to itself returned 0x007ffffe, a slightly smaller number, instead of the correct 0x00fffffe.
The test suite also includes thorough tests for the NaN handling policy implemented by the new code. This is in line with Arm's hardware FP implementations (so that switching between software and hardware FP makes as little difference as possible to the answers), but doesn't match what compiler-rt does in all other situations, so I've enabled it only under an
#ifdefthat should match when this implementation is selected.The new code contains entry points for both addition and subtraction, with cross-branching between them after correcting signs. This avoids the overhead of treating subtraction as a sign-flipping wrapper on addition, but also means I had to add an extra piece of mechanism to the build scripts to allow the wrapper version of subsf3.c to be excluded from the build in the presence of the new addsf3.S. You can indicate that a platform-specific source file replaces an additional platform-independent one by setting its
crt_supersedesproperty in cmake.