@nakagawa-fj
Contributor

This pull request improves GEMM performance on Neoverse V1, addressing Issue #5347.
Unlike the fix in pull request #5353 for A64FX, it targets matrix size N=2.
Although the change primarily improves performance for N=2, certain architectures may see further gains for N up to 6. To support this, a new macro, GEMM_DIVIDE_LIMIT, has been introduced to manage the DIVIDE_RATE threshold.
Measurements on AWS Graviton3E (Neoverse V1) show improved GEMM performance for N=2, as illustrated in the graphs below.

[Graphs: pullReq250729_1, pullReq250729_2]
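
As a rough illustration of how a compile-time limit such as GEMM_DIVIDE_LIMIT could gate DIVIDE_RATE, here is a minimal sketch. It is not the code from this pull request: the default values, the helper name `effective_divide_rate`, and the rule "do not subdivide when N is below the limit" are all assumptions made for illustration.

```c
/* Sketch only: these defaults are assumed, not taken from OpenBLAS. */
#ifndef DIVIDE_RATE
#define DIVIDE_RATE 2        /* assumed: pieces each thread's share is split into */
#endif

#ifndef GEMM_DIVIDE_LIMIT
#define GEMM_DIVIDE_LIMIT 3  /* assumed: smallest N for which splitting is applied */
#endif

/* Hypothetical helper: return the divide rate to use for a given N.
 * Below the limit the work is left as a single piece (rate 1);
 * at or above the limit the regular DIVIDE_RATE is used. */
static int effective_divide_rate(long n)
{
    if (n < GEMM_DIVIDE_LIMIT) {
        return 1;
    }
    return DIVIDE_RATE;
}
```

Under these assumptions, an architecture that benefits for N up to 6 could build with `-DGEMM_DIVIDE_LIMIT=7`, so that N=2 through N=6 all take the unsplit path, while other targets keep the smaller default.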

@martin-frbg martin-frbg added this to the 0.3.31 milestone Jul 30, 2025
@martin-frbg martin-frbg merged commit d23680b into OpenMathLib:develop Jul 30, 2025
77 of 88 checks passed
2 participants