Multi-thread Performance Improvement of GEMM with DIVIDE_RATE=1 for A64FX #5353

nakagawa-fj · 2025-06-30T12:49:59Z

Closes #5347
This PR improves the multi-thread performance of GEMM on A64FX by setting DIVIDE_RATE to 1.
The thread control in GEMM currently uses the default value of DIVIDE_RATE=2, which always splits the N dimension of the matrix into two parts for computation. However, this splitting occurs even when N is small (e.g., N=2), which reduces computational efficiency.
For GEMM on A64FX, I tried DIVIDE_RATE=1 and confirmed performance improvements as shown in the graphs below.
While improvements were expected for narrow matrices with small N dimensions, performance gains were also observed for square matrices.
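To make the splitting behavior concrete, here is a minimal sketch in C. It assumes the column range handled by a thread is cut into DIVIDE_RATE pieces using ceiling division. The helper names `chunk_width` and `count_chunks` are hypothetical and exist only for illustration; this is not the OpenBLAS thread-control code.

```c
#include <assert.h>

/* Hypothetical helper (not an OpenBLAS function): width of each sub-block
 * when a range of n columns is cut into divide_rate pieces, rounding up. */
static long chunk_width(long n, long divide_rate) {
    assert(n >= 0 && divide_rate > 0);
    return (n + divide_rate - 1) / divide_rate;
}

/* Hypothetical helper: number of non-empty sub-blocks produced for the
 * column range [0, n) when stepping through it in chunk_width() strides. */
static long count_chunks(long n, long divide_rate) {
    long width = chunk_width(n, divide_rate);
    long chunks = 0;

    if (width == 0) {
        return 0; /* empty range, nothing to split */
    }
    for (long js = 0; js < n; js += width) {
        chunks++;
    }
    return chunks;
}
```

Under these assumptions, `count_chunks(2, 2)` yields two sub-blocks that are each one column wide, while `count_chunks(2, 1)` keeps the two columns together in a single sub-block, which is the small-N case the description refers to.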

martin-frbg · 2025-07-01T11:38:55Z

That's a very interesting result - somewhat counterintuitive, but I guess if it helps to make optimum use of the vector length... Certainly a reminder that some core design decisions for OpenBLAS (GotoBLAS really) were made on and for the CPU architectures of some twenty years ago.
I notice that DIVIDE_RATE is also used in the multithreaded SYRK code (and GETRF/POTRF, though they may be speed-limited by other constraints). And I guess it will be interesting to see if a non-default value would also be beneficial for other (SVE/SME) arm64 or RISC-V vector platforms.

Commit 5253c8f: Multi-thread Performance Improvement of GEMM with DIVIDE_RATE=1 for A64FX.

martin-frbg added this to the 0.3.31 milestone Jul 1, 2025

martin-frbg merged commit a06bcf8 into OpenMathLib:develop Jul 1, 2025
86 of 87 checks passed

nakagawa-fj mentioned this pull request Jul 29, 2025

Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1 #5407

Merged
