Description
The following code illustrates the problem (godbolt):
#include <cstddef>
#include <span>

void add1(int *x, int *end) {
  for (; x != end; ++x) {
    ++*x;
  }
}

void add1_i(int *x, size_t n) {
  for (size_t i = 0; i != n; ++i) {
    ++x[i];
  }
}

void add1_n(int *x, size_t n) {
  add1(x, x + n);
}

void add1_2(int *x, int *end) {
  size_t n = end - x;
  for (size_t i = 0; i != n; ++i) {
    ++x[i];
  }
}

void add1_3(int *x, int *end) {
  add1_i(x, end - x);
}

void add1_s(std::span<int> r) {
  for (int &x : r) {
    ++x;
  }
}

add1 and add1_i generate optimal code with a single induction variable and bne:
add1(int*, int*):
        beq     a0, a1, .LBB0_2
.LBB0_1:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB0_1
.LBB0_2:
        ret

add1_i(int*, unsigned long):
        beqz    a1, .LBB1_3
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB1_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB1_2
.LBB1_3:
        ret
I would expect add1_n to be identical to add1_i, but instead this is generated:
add1_n(int*, unsigned long):
        beqz    a1, .LBB2_3
        slli    a1, a1, 2
.LBB2_2:
        lw      a2, 0(a0)
        addi    a1, a1, -4
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bnez    a1, .LBB2_2
.LBB2_3:
        ret
That is one extra add per iteration (the separate counter decrement) and a totally useless left shift: a1 is only ever used as a loop counter, never for addressing, so scaling it to a byte count gains nothing.
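
For reference, here is a rough C-level rendering of what that add1_n loop does (my own reconstruction with hypothetical names, not actual compiler output), showing the two induction variables:

// Hypothetical source-level equivalent of the add1_n assembly above:
// two induction variables (the pointer and a byte counter), plus the
// slli that scales n to bytes even though the counter is never used
// for addressing.
void add1_n_emitted(int *x, size_t n) {
  if (n == 0)              // beqz a1, .LBB2_3
    return;
  size_t bytes = n << 2;   // slli a1, a1, 2 (a1 is only ever a counter)
  do {
    ++*x;                  // lw / addi / sw
    bytes -= 4;            // addi a1, a1, -4 (the extra add per iteration)
    ++x;                   // addi a0, a0, 4
  } while (bytes != 0);    // bnez a1, .LBB2_2
}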
add1_2 compiles similarly to add1_n, with an additional subtract (and a useless right shift) to compute the length, even though it should produce the same code as add1.
add1_3 is almost correct. It has the correct loop body, but the prologue subtracts and re-adds a0 in a sequence that should have been removed entirely (see the sketch after the listing):
add1_3(int*, int*):
        beq     a1, a0, .LBB4_3
        sub     a1, a1, a0
        srai    a1, a1, 2
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB4_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB4_2
.LBB4_3:
        ret
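
To spell out why that prologue is redundant, here is a rough C++ rendering of it (my own reconstruction, hypothetical name): because the difference between two int* is always a multiple of sizeof(int), the srai/slli pair cancels and the recomputed limit is just end again, so the loop could compare a0 against the original a1 directly, as in add1.

#include <cstddef>

// Hypothetical rendering of the add1_3 prologue above. bytes is always
// a multiple of 4, so (bytes >> 2) << 2 == bytes and the result equals
// end: the whole sub/srai/slli/add sequence recomputes a value the
// function already had in a1.
int *recompute_limit(int *x, int *end) {
  ptrdiff_t bytes = (char *)end - (char *)x;  // sub  a1, a1, a0
  ptrdiff_t n = bytes >> 2;                   // srai a1, a1, 2
  return (int *)((char *)x + (n << 2));       // slli a1, a1, 2; add a1, a1, a0
}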
add1_s results in the same code as add1_n, even though what the source does is almost exactly what the generated add1_i does.
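
For comparison, this is roughly what the range-for in add1_s desugars to (my own sketch, assuming the span iterators boil down to plain pointers over data() and data() + size() after inlining), which is the same "base + length" limit shape that add1_i compiles well:

#include <span>

// Hypothetical desugaring of add1_s: iterate from data() to
// data() + size(), i.e. a pointer limit computed as base + length,
// just like the x + n limit in add1_i above.
void add1_s_desugared(std::span<int> r) {
  int *it = r.data();
  int *end = r.data() + r.size();
  for (; it != end; ++it) {
    ++*it;
  }
}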
The redundant add and subtract seem to happen only on RISC-V.
The variation in the loop body also happens on aarch64 (when I disable vectorization), but I don't know if there is a performance difference.
On aarch64 the same useless shifts also appear where the loop counter is not used for indexing.