https://godbolt.org/z/fE16nGPxP
#include <math.h>
void foo(float *p)
{
    for (int x=0; x<999; ++x) {
        p[x] = sin(p[x]);
    }
}
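This generates calls to armpl_vsinq_f64 instead of armpl_vsinq_f32: sin() takes a double, so each float element is widened before the call and the result is narrowed again on the store.
If we replace sin with sinf, we get calls to armpl_vsinq_f32 as expected: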
#include <math.h>
void foo(float *p)
{
    for (int x=0; x<999; ++x) {
        p[x] = sinf(p[x]);
    }
}
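I'm reasonably sure it's legitimate to implicitly switch to sinf() here, since the argument is a float and the result is immediately converted back to float. This would roughly double the vector throughput, as a 128-bit vector holds four floats but only two doubles.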