It looks like our vectorisation strategy introduces in-loop reductions/dependencies, even for a simple reduction like this:
for (int i = 0; i < N; i++) {
  sum += a[i];
}
For this loop, we generate something like this:
vector.body:
vecsum1 += a[..]
vecsum2 = a[..] + a[..]
vecsum1 += vecsum2
vecsum2 = a[..] + a[..]
vecsum1 += vecsum2
end
// adding partial sums
But GCC is generating something more like this:
vector.body:
vecsum1 += a[i:i+4]
vecsum2 += a[i+4:i+8]
vecsum3 += a[i+8:i+12]
vecsum4 += a[i+12:i+16]
end
// adding partial sums
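Our loop body has more dependencies: every add ultimately feeds vecsum1, which forms one serial chain, whereas GCC keeps four independent accumulators that are only combined after the loop. This can slow us down.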
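
To make the difference in dependency structure concrete, here is a minimal scalar sketch in C. It is only an analogue of the two shapes above, not what either compiler actually emits. The function names are hypothetical and the int element type is an assumption, since the original loop does not state its types.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: each add depends on the result of the previous
 * one, so the adds form one serial dependency chain. */
int sum_single_chain(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent accumulators: the four adds in one iteration do not
 * depend on each other, and the partial sums are combined only once,
 * after the loop. */
int sum_four_chains(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* adding partial sums */
    int sum = (s0 + s1) + (s2 + s3);
    /* scalar tail for the elements that do not fill a group of four */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions return the same result for integer inputs. The second one exposes four chains that can execute in parallel, which is the property the GCC-style loop has and ours lacks.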
Here's an AArch64 code example on Compiler Explorer: https://godbolt.org/z/v1c6hxfGc
I have disabled the interleaver to have a more concise example, but with interleaving things are very similar.