Conversation

@JonathanC-ARM
Contributor

Description

Implementation of a special SGEMM path that uses GEMV kernels in cases where M or N is 1.

Additionally, this PR introduces a microkernel interface that uses the typedefs provided by KleidiAI, simplifying the code and removing constructs such as ternary selection between SME1 and SME2 kernels.
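
To illustrate the shape of the dispatch this enables, here is a minimal, hypothetical sketch; the stub structs below stand in for the real kai_matmul_clamp_f32_f32p_f32p_ukernel / kai_matmul_clamp_f32_f32_f32p_ukernel typedefs and the GetKleidiAIS*UKernel() selectors:

```cpp
#include <cstddef>

// Stand-ins for the KleidiAI ukernel typedefs (function-pointer tables in the
// real headers); kept trivial here so the sketch is self-contained.
struct KaiGemmUKernel { int id; };
struct KaiGemvUKernel { int id; };

// One-time selection (SME2 preferred, SME1 fallback) hidden behind accessors,
// so call sites need no ternaries over the kernel families.
static const KaiGemmUKernel& GetSGemmUKernelStub() {
    static const KaiGemmUKernel k{0};
    return k;
}
static const KaiGemvUKernel& GetSGemvUKernelStub() {
    static const KaiGemvUKernel k{1};
    return k;
}

// Route degenerate matmuls (M == 1 or N == 1) to the GEMV path.
static int SelectKernelId(std::size_t M, std::size_t N) {
    return (M == 1 || N == 1) ? GetSGemvUKernelStub().id : GetSGemmUKernelStub().id;
}
```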

Indicative Performance

In lieu of any production model where GEMV is a large contributor to the network, I opted to create a mini model to test with, containing thousands of randomized matmul variants with a distribution of GEMV cases throughout.
[Image: distribution of GEMV cases in the mini model]

Using the ONNX Runtime perf test, I was able to halve the total inference time vs MLAS with this model.
[Image: ort_ops_compare, no-GEMV run (2025-10-07 19:40:30) vs GEMV run (2025-10-07 19:40:58)]

More benchmarks to come shortly.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Comment on lines 30 to 31
kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();
Contributor

GetKleidiAIXUKernel() returns a const&. Do we need to make a copy here?

Contributor

Suggested change
kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();
const kai_matmul_clamp_f32_f32p_f32p_ukernel& sgemm_gemm = GetKleidiAISGemmUKernel();
const kai_matmul_clamp_f32_f32_f32p_ukernel& sgemm_gemv = GetKleidiAISGemvUKernel();

Contributor Author

updated to const in the latest push

//No fallback and putting in guards
if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
if(ArmKleidiAI::SMEInfo::CanUseSME2){
Contributor

there are other places that need to be updated, like:

if (!CPUIDInfo::GetCPUIDInfo().HasArm_SME()) {

if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()) {

I might be missing some.

I think it would be worth making a helper function like MlasIsDynamicQGemmAvailable that has the appropriate checks and using that instead.
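
Something along these lines, as a rough sketch (the helper name is the suggestion above; whether the predicate ends up checking SME or SME2 follows the rest of this thread):

```cpp
// Rough sketch: centralize the capability check so qgemm.cpp,
// dynamic_quantize_matmul.cc, and the tests stay in sync.
bool MLASCALL
MlasIsDynamicQGemmAvailable()
{
    return MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME();
}
```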

Contributor Author

Added the updated checks in various places like these in the latest push.

Contributor

I think it would be worth making a helper function like MlasIsDynamicQGemmAvailable that has the appropriate checks and using that instead.

to clarify, this was the main suggestion.

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 2 times, most recently from a3f4f5b to e8ab1b1 Compare October 24, 2025 15:46
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

void Test(size_t M, size_t N, size_t K, size_t BatchSize) {
// Currently, MlasDynamicQGemmBatch() and associated functions require SME or else they are no-ops.
if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()) {
if (!ArmKleidiAI::SMEInfo::CanUseSME2) {
Member

Nit: I guess the Gtest skip comment needs a corresponding update too.

//No fallback and putting in guards
if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
if(ArmKleidiAI::SMEInfo::CanUseSME2){
Member

I guess after merging #26301, the checks looking for SME2 will go away, i.e. it can be run on both SME1 and SME2 then?

Contributor Author

Yes, that's correct.

Contributor Author

One change I've made in the latest push is to move this structure out of our KleidiAI code specifically and into mlasi.h, removing the ArmKleidiAI namespacing around it. That seemed like a sensible place for it, given that similar CPU-feature code already exists there.
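
Roughly this shape in mlasi.h, as a sketch; only CanUseSME2 and its inline definition appear verbatim in this thread, and the CanUseSME member is an assumption:

```cpp
struct SMEInfo {
    static const bool CanUseSME;   // assumed member; mirrors CanUseSME2 below
    static const bool CanUseSME2;
};

// By default we should try for SME2 first before falling back to SME.
inline const bool SMEInfo::CanUseSME = MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME();
inline const bool SMEInfo::CanUseSME2 = MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME2();
```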

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch 3 times, most recently from 4afc95c to 1d9b7c8 Compare November 11, 2025 15:53
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Could you please rebase this, @JonathanC-ARM?

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from 1d9b7c8 to c9a507f Compare November 14, 2025 12:08
@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've updated the branch now, thanks!

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from c9a507f to 1ead7ca Compare November 14, 2025 12:19
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements FP32 GEMV (matrix-vector) optimizations for KleidiAI, addressing degenerate matrix multiplication cases where M=1 or N=1. The implementation introduces a microkernel interface abstraction to simplify code and remove conditional logic for SME/SME2 kernel selection.

Key changes:

  • Added specialized GEMV path for M=1 and N=1 cases in FP32 SGEMM operations
  • Introduced SMEInfo struct with static boolean constants to replace scattered SME capability checks
  • Refactored microkernel selection using typedef interfaces instead of ternary operations

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Summary per file:
test_fgemm_fixture.h: Added test cases for GEMV scenarios (M=1 and N=1)
test_dynamic_qgemm.cpp: Updated to use SMEInfo for the SME2 capability check
qgemm.cpp: Replaced direct CPU feature checks with the SMEInfo struct
platform.cpp: Updated the SME availability check to use SMEInfo
mlasi.h: Added the SMEInfo struct declaration and inline definitions
sgemm_kleidiai.cpp: Implemented GEMV functions with helper utilities and integrated them with the GEMM batch
mlasi_kleidiai.h: Removed the local UseSME2 variable, added the MlasFp32Gemv declaration
convolve_kleidiai.cpp: Updated to use SMEInfo for SME capability checks
kai_ukernel_interface.h: Added SGEMM/SGEMV ukernel interface declarations
kai_ukernel_interface.cpp: Implemented ukernel selection functions for SGEMM/SGEMV
convolve.cpp: Added an SME availability guard and formatting fixes
dynamic_quantize_matmul.cc: Updated the pytorch_cpuinfo usage to SMEInfo for the SME2 capability check
deps.txt: Updated the pytorch_cpuinfo dependency version


@hariharans29
Member

Please rebase to include this: #26559

@JonathanC-ARM
Contributor Author

Hi @hariharans29, I've gone ahead and synced my fork, so this branch should include that change now. Thanks for letting me know!

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.



Comment on lines +184 to +192
if (N == 1 && TransB == CblasNoTrans)
{
    g_kai_tls.gemv_lhs_row_tmp.resize(K);

    for (size_t k = 0; k < K; ++k) {
        g_kai_tls.gemv_lhs_row_tmp[k] = lhs_base[k * Data[b].ldb];
    }
    lhs_base = g_kai_tls.gemv_lhs_row_tmp.data();
}

Copilot AI Nov 20, 2025


The gather logic has an issue when M == 1 && N == 1. In this case:

  • Line 158 sets lhs_base = Data[b].A (taking the M == 1 path)
  • But line 184 checks TransB and line 189 uses Data[b].ldb to stride through the data
  • This is inconsistent: when M == 1, we're using A as the LHS vector, so we should check TransA and use Data[b].lda for striding

When M == 1 && N == 1, if we take the M == 1 path (which the code does), the gather should check TransA and use lda since the LHS is now A, not B.
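
One possible shape of the fix, as an untested sketch reusing the variables from the snippet above (lhs_is_A and lhs_stride are introduced here purely for illustration):

```cpp
const bool lhs_is_A = (M == 1);
// When the LHS vector is A, the gather condition and stride come from
// TransA/lda; when it is a column of B, they come from TransB/ldb.
const bool needs_gather = lhs_is_A ? (TransA == CblasTrans)
                                   : (TransB == CblasNoTrans);
const size_t lhs_stride = lhs_is_A ? Data[b].lda : Data[b].ldb;

if (needs_gather) {
    g_kai_tls.gemv_lhs_row_tmp.resize(K);
    for (size_t k = 0; k < K; ++k) {
        g_kai_tls.gemv_lhs_row_tmp[k] = lhs_base[k * lhs_stride];
    }
    lhs_base = g_kai_tls.gemv_lhs_row_tmp.data();
}
```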

// of the ONNX Runtime source tree. OpenMP may or may not be enabled in this
// configuration.
//
struct SMEInfo {
Member

Can we guard this with #if defined(MLAS_TARGET_ARM64), as this is ARM-specific?

Contributor Author

Sure, makes sense; will make the change.
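
i.e. something like this (a sketch; exact placement in mlasi.h is assumed):

```cpp
#if defined(MLAS_TARGET_ARM64)
struct SMEInfo {
    static const bool CanUseSME2;
    // ...
};
#endif  // MLAS_TARGET_ARM64
```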


// Boolean condition to determine if we can use SME2
// By default we should try for SME2 first before falling back to SME.
inline const bool SMEInfo::CanUseSME2 = MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME2();
Member

Can we avoid duplicate initializations by just moving them out of the BUILD_MLAS_NO_ONNXRUNTIME guarded sections?

Contributor Author

Not 100% sure, but I can check, and I'd prefer that to be honest.
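
For reference, the deduplicated arrangement being suggested would look something like this (sketch only; whether the single definition can sit outside the guards is exactly what needs checking):

```cpp
#if defined(MLAS_TARGET_ARM64)
// Single definition shared by both build flavors.
inline const bool SMEInfo::CanUseSME2 = MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME2();
#endif

#if defined(BUILD_MLAS_NO_ONNXRUNTIME)
// ... only code that genuinely differs between flavors remains guarded ...
#endif
```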

pthreadpool;https://github.com/google/pthreadpool/archive/dcc9f28589066af0dbd4555579281230abbf74dd.zip;533a77943203ef15ca608bcd9dbe2c94da7451d2
pybind11;https://github.com/pybind/pybind11/archive/refs/tags/v2.13.6.zip;f780292da9db273c8ef06ccf5fd4b623624143e9
pytorch_cpuinfo;https://github.com/pytorch/cpuinfo/archive/877328f188a3c7d1fa855871a278eb48d530c4c0.zip;9152d4bf6b8bde9f19b116de3bd8a745097ed9df
pytorch_cpuinfo;https://github.com/pytorch/cpuinfo/archive/de0ce7c7251372892e53ce9bc891750d2c9a4fd8.zip;c45b8d3619b9bccbd26dc5f657959aee38b18b7a
Member

@hariharans29 Nov 20, 2025

Can you please remind me why we need this cpuinfo dependency update for this PR?

Contributor Author

I don't think I made this change; it might somehow be from an older version of the deps file. Will look into it.

// Currently, MlasDynamicQGemmBatch() and associated functions require SME2 or else they are no-ops.
// We check that here too before attempting to use them.
if (!CPUIDInfo::GetCPUIDInfo().HasArm_SME2()) {
if (!SMEInfo::CanUseSME2) {
Member

@hariharans29 Nov 20, 2025

Can we skip the qgemm.cpp / dynamic_quantize_matmul.cc / test_dynamic_qgemm.cpp changes in this PR? #26598 is taking care of it with a new MLAS API for this.

Contributor Author

Sure, I can revert this change altogether in favor of #26598.

// By default we should try for SME2 first before falling back to SME.
inline const bool UseSME2 = MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME2();

//
Member

Stray change?

Contributor Author

Yes, that seems to be the case. I removed the above code but it looks like I left the initial //.


bool
MLASCALL
MlasFp32Gemv(
Member

Consider renaming to MlasGemvBatch to be consistent with MlasGemmBatch? Thoughts on this?

Contributor Author

Seems like a sensible suggestion; it doesn't really impact anything other than making things consistent, so I'd be happy to make the change.

// Attempt GEMV (M==1 or N==1)
if (M == 1 || N == 1)
{
    if (ArmKleidiAI::MlasFp32Gemv(TransA, TransB, M, N, K, Data, BatchSize)) {
Member

@hariharans29 Nov 20, 2025

Any scope for using multiple threads in the Gemv implementation? (I see that the Gemv routine doesn't take in the ThreadPool param.)

If there are plans to add a multi-threaded implementation for Gemv in the future, can you please add a TODO for that?

Contributor Author

I will likely add a TODO. It's not something we tested, to be honest, so we'd need to investigate whether a threaded implementation provides any performance benefit, given that the kernels in question can handle the entire matrix in a single execution as is. Maybe there are cases where splitting the workload would lead to benefits, but without testing I'm just speculating.
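
Purely as a speculative sketch of what a threaded split could look like if it ever proved worthwhile (NSplits and RunGemvSlice are hypothetical; MlasTrySimpleParallel is the existing MLAS threading helper):

```cpp
// TODO: investigate a multi-threaded GEMV; the ukernels currently consume the
// whole matrix in a single execution.
MlasTrySimpleParallel(ThreadPool, NSplits, [&](ptrdiff_t slice) {
    RunGemvSlice(slice);  // each slice covers a contiguous range of the output
});
```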

if (M == 1 || N == 1)
{
    if (ArmKleidiAI::MlasFp32Gemv(TransA, TransB, M, N, K, Data, BatchSize)) {
        return true;
Member

Just checking: in case the Gemv execution flow returns false for some reason, I am guessing you want to try KleidiAI's Gemm before falling back to MLAS? That is how it is right now, but I wanted to check that this is the intended flow.

Contributor Author

That's the intention, yes: attempt the op and fall back to MLAS if we cannot proceed for some reason.
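
So the intended flow, sketched (names are from the PR snippets; the surrounding function is assumed):

```cpp
if (M == 1 || N == 1) {
    if (ArmKleidiAI::MlasFp32Gemv(TransA, TransB, M, N, K, Data, BatchSize)) {
        return true;  // GEMV kernel handled the batch
    }
    // fall through: attempt KleidiAI GEMM next
}
// ... KleidiAI GEMM attempt; returning false hands the op back to MLAS ...
```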

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).
