Consider the following function from CoreMark:
#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))
void matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B) {
    ee_u32 i,j,k;
    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            C[i*N+j]=0;
            for(k=0;k<N;k++)
            {
                MATRES tmp=(MATRES)A[i*N+k] * (MATRES)B[k*N+j];
                C[i*N+j]+=bit_extract(tmp,2,4)*bit_extract(tmp,5,7);
            }
        }
    }
}
LLVM currently generates worse code than GCC for this function with -O3 -march=rv32imbc -mabi=ilp32.
The inner loop generated by Clang is:
.LBB0_4:
    lhu t6, 0(t5)
    lhu s0, 0(a4)
    addi a5, a5, -1
    add a4, a4, t3
    mul s0, s0, t6
    slli t6, s0, 26
    slli s0, s0, 20
    srli t6, t6, 28
    srli s0, s0, 25
    mul s0, t6, s0
    add t4, t4, s0
    addi t5, t5, 2
    bnez a5, .LBB0_4
while GCC generates one fewer instruction (12 instead of 13):
.L4:
    lh a4,0(a6)
    lh a5,0(a2)
    addi a2,a2,2
    sh1add a6,a0,a6
    mul a5,a5,a4
    srai a4,a5,2
    srai a5,a5,5
    andi a4,a4,15
    andi a5,a5,127
    mul a5,a4,a5
    add a7,a7,a5
    bne t1,a2,.L4
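I could not pinpoint exactly what differs, but I believe GCC uses the memory address itself to terminate the loop (bne t1,a2,.L4), while Clang maintains a separate counter, a5, that it decrements on every iteration (addi a5, a5, -1). Both compilers spend four instructions on the two bit extractions, so the extra counter update would account for the one-instruction difference.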
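To make the suspected difference concrete, here is a minimal C sketch of the two loop shapes. It is an illustration only, not the code either compiler works from: the function names `dot_counter` and `dot_pointer` are hypothetical, fixed-width types stand in for CoreMark's `MATDAT` and `MATRES`, and both operands are walked contiguously (the column stride of `B` is dropped for brevity).

```c
#include <stdint.h>

/* Same macro as in the function above. */
#define bit_extract(x, from, to) (((x) >> (from)) & (~(0xffffffff << (to))))

/* Counter-controlled shape (what Clang's loop resembles): a separate
 * trip count is decremented in addition to advancing both pointers. */
int32_t dot_counter(const int16_t *a, const int16_t *b, uint32_t n)
{
    int32_t acc = 0;
    for (uint32_t k = n; k != 0; k--) {
        int32_t tmp = (int32_t)*a * (int32_t)*b;
        acc += bit_extract(tmp, 2, 4) * bit_extract(tmp, 5, 7);
        a++;
        b++;
    }
    return acc;
}

/* Pointer-controlled shape (what GCC's loop resembles): one of the
 * pointers doubles as the induction variable and is compared against
 * a precomputed end address, so no separate counter is needed. */
int32_t dot_pointer(const int16_t *a, const int16_t *b, uint32_t n)
{
    int32_t acc = 0;
    const int16_t *end = a + n; /* one past the last element of a */
    for (; a != end; a++, b++) {
        int32_t tmp = (int32_t)*a * (int32_t)*b;
        acc += bit_extract(tmp, 2, 4) * bit_extract(tmp, 5, 7);
    }
    return acc;
}
```

Both functions compute the same sum; only the loop-exit test differs. Whether a given compiler actually emits different code for the two shapes depends on its induction-variable optimizations, so this is just a way to reason about the assembly above.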