[RISC-V] Possible performance improvement in matrix multiply function in Coremark #163757

@christian-herber-nxp

In the following function from CoreMark:

#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))

void matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B) {
	ee_u32 i,j,k;
	for (i=0; i<N; i++) {
		for (j=0; j<N; j++) {
			C[i*N+j]=0;
			for(k=0;k<N;k++)
			{
				MATRES tmp=(MATRES)A[i*N+k] * (MATRES)B[k*N+j];
				C[i*N+j]+=bit_extract(tmp,2,4)*bit_extract(tmp,5,7);
			}
		}
	}
}

LLVM is currently generating worse code than GCC with -O3 -march=rv32imbc -mabi=ilp32.
The inner loop for clang is:

.LBB0_4:
        lhu     t6, 0(t5)
        lhu     s0, 0(a4)
        addi    a5, a5, -1
        add     a4, a4, t3
        mul     s0, s0, t6
        slli    t6, s0, 26
        slli    s0, s0, 20
        srli    t6, t6, 28
        srli    s0, s0, 25
        mul     s0, t6, s0
        add     t4, t4, s0
        addi    t5, t5, 2
        bnez    a5, .LBB0_4

while GCC generates one instruction fewer (12 instead of 13):

.L4:
        lh      a4,0(a6)
        lh      a5,0(a2)
        addi    a2,a2,2
        sh1add  a6,a0,a6
        mul     a5,a5,a4
        srai    a4,a5,2
        srai    a5,a5,5
        andi    a4,a4,15
        andi    a5,a5,127
        mul     a5,a4,a5
        add     a7,a7,a5
        bne     t1,a2,.L4
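
The two listings extract the same bit fields in different ways: Clang uses a shift-left/shift-right pair (slli + srli), while GCC shifts right and masks (srai + andi). GCC's shift is arithmetic, but the mask discards any sign-filled bits, so the results agree. Both take two instructions per field, so the extraction itself does not explain the size difference. A minimal C sketch of the two forms, assuming 32-bit values; the helper names are hypothetical and not part of CoreMark:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; neither is part of CoreMark.
 * Both assume 1 <= width <= 31 and from + width <= 32. */

/* Shift right, then mask: the shape of GCC's srai + andi sequence
 * and of the bit_extract macro itself. */
static uint32_t extract_shift_mask(uint32_t x, unsigned from, unsigned width) {
    return (x >> from) & ~(0xffffffffu << width);
}

/* Shift left to discard the high bits, then logical shift right:
 * the shape of Clang's slli + srli sequence. */
static uint32_t extract_shift_pair(uint32_t x, unsigned from, unsigned width) {
    return (x << (32u - from - width)) >> (32u - width);
}
```

With from=2, width=4 the shift pair is 26/28, and with from=5, width=7 it is 20/25, matching the constants in the Clang listing.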

I could not pinpoint exactly what differs, but I believe GCC uses the memory address itself to terminate the loop (bne t1,a2), while Clang maintains a separate counter (a5) that it decrements on every iteration.
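
The two termination styles can be sketched in C as follows. This is a simplified illustration with a plain dot product instead of the bit-extract body, and the function names are hypothetical. The optimizer is free to rewrite either form into the other, so the sketch shows the idea rather than a way to force a particular code shape:

```c
#include <stddef.h>
#include <stdint.h>

/* Counter-based termination: a separate trip counter is decremented and
 * tested each iteration, similar to a5 in the Clang listing. */
static int32_t dot_counter(const int16_t *a, const int16_t *b,
                           size_t n, size_t stride) {
    int32_t acc = 0;
    size_t off = 0;                     /* offset into the strided operand */
    for (size_t k = n; k != 0; k--) {
        acc += (int32_t)*a * (int32_t)b[off];
        a += 1;
        off += stride;
    }
    return acc;
}

/* Pointer-bound termination: the advancing pointer is compared against a
 * precomputed end address, similar to bne t1,a2 in the GCC listing.
 * No separate counter update is needed. */
static int32_t dot_pointer_end(const int16_t *a, const int16_t *b,
                               size_t n, size_t stride) {
    int32_t acc = 0;
    size_t off = 0;
    const int16_t *end = a + n;         /* one past the last element of a */
    while (a != end) {
        acc += (int32_t)*a * (int32_t)b[off];
        a += 1;
        off += stride;
    }
    return acc;
}
```

In the second form the pointer increment doubles as the loop induction update, which is where the saved instruction would come from if this hypothesis is right.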
