See: https://godbolt.org/z/Wh3sW57qa
There are two problems:
- Is our cost estimate for gather/scatter too low? AArch64 does not vectorize this loop, and RISC-V GCC does not either.
<source>:35:6: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]
35 | while (p <= lastone) {
| ^
<source>:35:6: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
- Can we combine the loads before the LoopVectorizer runs? With LV disabled, SLP kicks in and seemingly generates better code:
.LBB0_5:
vsetvli zero, a4, e32, m8, ta, ma
vlse32.v v8, (a2), a5
vmv.s.x v16, a1
vredsum.vs v8, v8, v16
addi a2, a2, 512
vmv.x.s a1, v8
bgeu a3, a2, .LBB0_5
bnez a0, .LBB0_4
mv a0, a1
tail use_int
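
For readers without the Godbolt link open, here is a rough sketch of the kind of loop that could produce the SLP output above. It is not the source from the link. The function name `sum_strided` and the element stride of 16 are assumptions for illustration. Only the `while (p <= lastone)` condition (from the remark) and the 512-byte advance per iteration (from `addi a2, a2, 512`, i.e. 128 `int`s) come from the report.

```c
#include <stddef.h>

/* Hypothetical reconstruction, NOT the code behind the Godbolt link.
 * Each iteration sums eight ints at a fixed stride and then advances
 * p by 128 ints (512 bytes), matching `addi a2, a2, 512` above.
 * The stride of 16 elements is an assumed value. */
int sum_strided(const int *p, const int *lastone) {
    int sum = 0;
    while (p <= lastone) {
        /* Manually unrolled constant-stride loads: SLP can turn a group
         * like this into one strided load (vlse32.v) plus a reduction. */
        sum += p[0] + p[16] + p[32] + p[48]
             + p[64] + p[80] + p[96] + p[112];
        p += 128;
    }
    return sum;
}
```

With a loop of this shape, the loop vectorizer has to treat the eight loads as gathers across iterations, whereas SLP sees them as one constant-stride group inside a single iteration, which is why the ordering of the two passes matters here.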