[RISC-V] Possible performance improvement in matrix multiply function in Coremark #163757

In the following function from Coremark:

```
#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))

void matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B) {
	ee_u32 i,j,k;
	for (i=0; i<N; i++) {
		for (j=0; j<N; j++) {
			C[i*N+j]=0;
			for(k=0;k<N;k++)
			{
				MATRES tmp=(MATRES)A[i*N+k] * (MATRES)B[k*N+j];
				C[i*N+j]+=bit_extract(tmp,2,4)*bit_extract(tmp,5,7);
			}
		}
	}
}
```
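To compile this outside the CoreMark tree, the function needs the benchmark's typedefs. Below is a minimal standalone sketch, assuming `MATDAT` is a signed 16-bit type (consistent with the `lh` loads GCC emits), `MATRES` is a signed 32-bit type, and `ee_u32` is an unsigned 32-bit type. The `<stdint.h>` mappings are assumptions for illustration, not copied from the CoreMark port headers.

```c
#include <stdint.h>

/* Assumed typedefs (not copied from CoreMark's headers). */
typedef uint32_t ee_u32;
typedef int16_t  MATDAT;
typedef int32_t  MATRES;

/* Extract `to` bits of x, starting at bit `from`. */
#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))

void matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B) {
    ee_u32 i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i*N+j] = 0;
            for (k = 0; k < N; k++) {
                MATRES tmp = (MATRES)A[i*N+k] * (MATRES)B[k*N+j];
                /* 4 bits starting at bit 2, times 7 bits starting at bit 5 */
                C[i*N+j] += bit_extract(tmp,2,4) * bit_extract(tmp,5,7);
            }
        }
    }
}
```

As a sanity check, with `N = 2`, `A = {10, 20, 30, 40}` and `B = {5, 6, 7, 8}` this should produce `C = {24, 55, 68, 65}`.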

With `-O3 -march=rv32imbc -mabi=ilp32`, LLVM currently generates worse code than GCC for this function.
The inner loop from clang is:

```
.LBB0_4:
        lhu     t6, 0(t5)
        lhu     s0, 0(a4)
        addi    a5, a5, -1
        add     a4, a4, t3
        mul     s0, s0, t6
        slli    t6, s0, 26
        slli    s0, s0, 20
        srli    t6, t6, 28
        srli    s0, s0, 25
        mul     s0, t6, s0
        add     t4, t4, s0
        addi    t5, t5, 2
        bnez    a5, .LBB0_4
```

while GCC generates one fewer instruction:

```
.L4:
        lh      a4,0(a6)
        lh      a5,0(a2)
        addi    a2,a2,2
        sh1add  a6,a0,a6
        mul     a5,a5,a4
        srai    a4,a5,2
        srai    a5,a5,5
        andi    a4,a4,15
        andi    a5,a5,127
        mul     a5,a4,a5
        add     a7,a7,a5
        bne     t1,a2,.L4
```

I could not pinpoint exactly what differs, but I believe GCC uses the memory address itself to terminate the loop (`bne t1,a2,.L4`), while clang maintains a separate counter (`a5`) that it decrements on every iteration in addition to advancing both pointers.
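The two loop shapes can be sketched at the C level as follows. This is only an illustration of the difference in the induction variables, under the same assumed typedefs as the CoreMark function (`MATDAT` as `int16_t`, `MATRES` as `int32_t`). The helper names `dot_counter` and `dot_pointer` are hypothetical, and writing the source one way or the other does not guarantee which form a compiler emits, since the optimizer is free to rewrite the loop.

```c
#include <stdint.h>

/* Assumed typedefs (not copied from CoreMark's headers). */
typedef uint32_t ee_u32;
typedef int16_t  MATDAT;
typedef int32_t  MATRES;

#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))

/* Counter-terminated (hypothetical helper): three values change per
   iteration, the trip counter plus both pointers, as in clang's loop. */
MATRES dot_counter(ee_u32 N, const MATDAT *a, const MATDAT *b) {
    MATRES acc = 0;
    for (ee_u32 n = N; n != 0; n--) {
        MATRES tmp = (MATRES)*a * (MATRES)*b;
        acc += bit_extract(tmp,2,4) * bit_extract(tmp,5,7);
        a += 1;   /* next element in the row of A */
        b += N;   /* next element in the column of B */
    }
    return acc;
}

/* Pointer-terminated (hypothetical helper): the row pointer doubles as
   the loop condition, so only two values change per iteration, as in
   GCC's loop. */
MATRES dot_pointer(ee_u32 N, const MATDAT *a, const MATDAT *b) {
    MATRES acc = 0;
    const MATDAT *a_end = a + N;   /* one past the last row element */
    for (; a != a_end; a++, b += N) {
        MATRES tmp = (MATRES)*a * (MATRES)*b;
        acc += bit_extract(tmp,2,4) * bit_extract(tmp,5,7);
    }
    return acc;
}
```

Both helpers compute the same inner-loop sum for one `C[i*N+j]` entry when called with `a = &A[i*N]` and `b = &B[j]`. The second form drops the separate counter update, which matches the one-instruction difference between the two listings.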


