Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1 #5407

nakagawa-fj · 2025-07-29T10:09:15Z

This pull request improves multi-threaded GEMM performance on Neoverse V1, addressing Issue #5347.
Unlike the A64FX fix in pull request #5353, this change targets the case where the matrix dimension N is 2.
Although the change primarily improves performance for N=2, further gains may be possible for N up to 6 on certain architectures. To support this, a new macro, GEMM_DIVIDE_LIMIT, has been introduced to control the threshold at which the DIVIDE_RATE handling applies.
With this modification, GEMM performance improved on AWS Graviton3E (Neoverse V1) for N=2, as illustrated in the graph below.
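
As a rough illustration of how a limit macro such as GEMM_DIVIDE_LIMIT could gate a small-N code path, here is a minimal sketch. It is not the code from this pull request: the default values, the helper names `ceil_div` and `use_small_n_path`, and the exact condition are all assumptions made for illustration. Only the macro names DIVIDE_RATE and GEMM_DIVIDE_LIMIT come from the description above.

```c
#include <assert.h>

/* Assumed defaults for illustration only; real values are chosen per target. */
#ifndef DIVIDE_RATE
#define DIVIDE_RATE 1
#endif
#ifndef GEMM_DIVIDE_LIMIT
#define GEMM_DIVIDE_LIMIT 2  /* assumption: largest N that takes the small-N path */
#endif

/* Ceiling division: size of each chunk when n columns are split into `parts` pieces. */
static long ceil_div(long n, long parts) {
    assert(n >= 0 && parts > 0);
    return (n + parts - 1) / parts;
}

/* Hypothetical predicate: take the small-N path only when DIVIDE_RATE is 1
   and N does not exceed the configurable limit. */
static int use_small_n_path(long n) {
    return DIVIDE_RATE == 1 && n <= GEMM_DIVIDE_LIMIT;
}
```

Under these assumptions, building with `-DGEMM_DIVIDE_LIMIT=6` would extend the small-N path to N values up to 6 without touching the code, which is the kind of per-architecture tuning the description suggests.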

Multi-thread GEMM Performance Improvement on NeoverseV1 (DIVIDE_RATE=1)

7e29f11

martin-frbg added this to the 0.3.31 milestone Jul 30, 2025

martin-frbg merged commit d23680b into OpenMathLib:develop Jul 30, 2025
77 of 88 checks passed

martin-frbg mentioned this pull request Aug 3, 2025

test_extensions/test_sgemmt.c fails with SME on Apple M4 #5414

Open
