It looks like Clang is about 100% slower than GCC for the following function (with n = 10000) on Neoverse V2.
Compilation options: -O3 -mcpu=neoverse-v2
#include <math.h>

void f(int n, double *arr, double m) {
    for (int i = 0; i < n; i++) {
        arr[i] = sqrt(arr[i] * m);
    }
}
godbolt: https://godbolt.org/z/57Yqj15KP
I analyzed the root cause and found that the fcmp instruction after fsqrt takes a lot of time. The fcmp checks whether the result of fsqrt is a QNaN and, if it is, branches to the library function call. The slowdown occurs even when every element of arr is positive, so the library call branch is never taken. Avoiding this check with an option such as -fno-honor-nans closed the performance gap between GCC and Clang. I think we should insert a comparison before the fsqrt instruction instead, as GCC does.