@nakagawa-fj
Contributor

Closes #5347
This PR improves the multi-thread performance of GEMM on A64FX by setting DIVIDE_RATE to 1.
The thread control in GEMM currently uses the default value of DIVIDE_RATE=2, which always splits the N dimension of the matrix into two parts for computation. However, this splitting occurs even when N is small (e.g., N=2), which reduces computational efficiency.
For GEMM on A64FX, I tried DIVIDE_RATE=1 and confirmed performance improvements as shown in the graphs below.
While improvements were expected for narrow matrices with small N dimensions, performance gains were also observed for square matrices.
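
To illustrate the splitting described above, here is a minimal sketch in C. It assumes the column range assigned to a thread is divided by ceiling division into at most DIVIDE_RATE chunks; `split_columns` and its parameters are hypothetical names for illustration, not the actual OpenBLAS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenBLAS function): splits the half-open
 * column range [n_from, n_to) into at most divide_rate chunks using
 * ceiling division, and stores each chunk width in widths[].
 * Returns the number of chunks written. */
size_t split_columns(long n_from, long n_to, long divide_rate,
                     long *widths, size_t max_chunks)
{
    assert(widths != NULL);

    long n = n_to - n_from;
    if (n <= 0 || divide_rate <= 0) {
        return 0; /* nothing to split */
    }

    /* Chunk width, rounded up so that all columns are covered. */
    long div_n = (n + divide_rate - 1) / divide_rate;

    size_t count = 0;
    for (long js = n_from; js < n_to && count < max_chunks; js += div_n) {
        long width = n_to - js;
        if (width > div_n) {
            width = div_n; /* the last chunk may be narrower */
        }
        widths[count++] = width;
    }
    return count;
}
```

Under this assumption, a range of N=2 columns with a divide rate of 2 is processed as two chunks of width 1, whereas a divide rate of 1 keeps it as a single chunk of width 2.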

[Figures: gemm_divide_rate_1, gemm_divide_rate_2, gemm_divide_rate_3]

@martin-frbg martin-frbg added this to the 0.3.31 milestone Jul 1, 2025
@martin-frbg
Collaborator

That's a very interesting result - somewhat counterintuitive, but I guess if it helps to make optimum use of the vector length... Certainly a reminder that some core design decisions for OpenBLAS (GotoBLAS really) were made on and for the CPU architectures of some twenty years ago.
I notice that DIVIDE_RATE is also used in the multithreaded SYRK code (and GETRF/POTRF, though they may be speed-limited by other constraints). And I guess it will be interesting to see if a non-default value would also be beneficial for other (SVE/SME) arm64 or RISCV-vector platforms.

@martin-frbg martin-frbg merged commit a06bcf8 into OpenMathLib:develop Jul 1, 2025
86 of 87 checks passed

Development

Successfully merging this pull request may close these issues.

Inefficiency of thread control with DIVIDE_RATE in GEMM
