
[MLAS] Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path #26773

Open
patryk-kaiser-ARM wants to merge 3 commits into microsoft:main from patryk-kaiser-ARM:kai_bf16_kernel_integration


@patryk-kaiser-ARM patryk-kaiser-ARM commented Dec 11, 2025

Description
This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel through the MLAS SBGEMM path.

Rework of #24346

Motivation and Context
This kernel provides performance improvements on SME-enabled devices.

@patryk-kaiser-ARM patryk-kaiser-ARM marked this pull request as draft December 11, 2025 11:48
@patryk-kaiser-ARM patryk-kaiser-ARM marked this pull request as ready for review January 6, 2026 13:59
@patryk-kaiser-ARM
Contributor Author

@microsoft-github-policy-service agree company="Arm"

Contributor

Copilot AI left a comment

Pull request overview

This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel into the MLAS SBGEMM (single-precision to bfloat16 GEMM) path. The integration provides performance improvements for bfloat16 matrix multiplication operations on ARM devices with SME2 support.
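For readers unfamiliar with what an SBGEMM computes, here is a minimal reference sketch. It is not MLAS or KleidiAI code. It assumes the fp32 inputs are rounded to bfloat16 with round-to-nearest-even and that products are accumulated in fp32; the actual kernel's rounding mode and accumulation order may differ. The names `Fp32ToBf16`, `Bf16ToFp32`, and `ReferenceSbgemm` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round an fp32 value to bf16 (the upper 16 bits of the fp32 encoding),
// using round-to-nearest-even. Assumption: this is an illustrative
// conversion, not the one used by the PR's kernel.
inline uint16_t Fp32ToBf16(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate and set a mantissa bit so the result stays a NaN.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    // Adding 0x7fff plus the lowest kept bit rounds ties to even.
    const uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Widen a bf16 value back to fp32 by restoring the dropped low bits as zero.
inline float Bf16ToFp32(uint16_t value) {
    const uint32_t bits = static_cast<uint32_t>(value) << 16;
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Reference GEMM, row-major: C[m x n] = A[m x k] * B[k x n].
// Inputs are rounded to bf16 element by element; accumulation stays in fp32.
inline std::vector<float> ReferenceSbgemm(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          size_t m, size_t n, size_t k) {
    std::vector<float> c(m * n, 0.0f);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                const float lhs = Bf16ToFp32(Fp32ToBf16(a[i * k + p]));
                const float rhs = Bf16ToFp32(Fp32ToBf16(b[p * n + j]));
                acc += lhs * rhs;
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

A reference of this shape is mainly useful as a test oracle: an optimized kernel's output can be compared against it with a tolerance that accounts for the 8-bit bf16 mantissa.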

Changes:

  • Added new sbgemm_kleidiai.cpp implementation with KleidiAI BF16 SME2 kernel
  • Introduced BIsPacked flag to MLAS_SBGEMM_DATA_PARAMS to track pre-packed matrix B state
  • Added override mechanism in SBGEMM path for KleidiAI kernels on SME2-enabled platforms
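The override mechanism and the `BIsPacked` flag listed above can be pictured with a small sketch. Everything here except the field name `BIsPacked` is hypothetical: the struct, the function-pointer type, and the dispatch function are stand-ins, not the actual MLAS declarations. The rule that the override declines unpacked B is purely illustrative; the PR summary does not state when the KleidiAI path falls back.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for MLAS_SBGEMM_DATA_PARAMS. Only the BIsPacked
// field name comes from the PR summary; the rest is assumed for illustration.
struct SbgemmParams {
    const float* A = nullptr;
    const void* B = nullptr;
    float* C = nullptr;
    bool BIsPacked = false;  // true when B was pre-packed ahead of time
};

// Hypothetical override signature: returns true if it handled the call.
using SbgemmOverrideFn = bool (*)(size_t m, size_t n, size_t k,
                                  const SbgemmParams& params);

// Hypothetical platform table; a real one would be filled in at startup
// after probing CPU features such as SME2.
struct Platform {
    SbgemmOverrideFn sbgemm_override = nullptr;
};

enum class SbgemmPath { kOverride, kDefault };

// Illustrative override that only accepts pre-packed B (assumed policy).
inline bool FakeSme2Override(size_t, size_t, size_t, const SbgemmParams& params) {
    return params.BIsPacked;
}

// Try the registered override first; fall back to the default path when
// no override is registered or the override declines the call.
inline SbgemmPath DispatchSbgemm(const Platform& platform, size_t m, size_t n,
                                 size_t k, const SbgemmParams& params) {
    if (platform.sbgemm_override != nullptr &&
        platform.sbgemm_override(m, n, k, params)) {
        return SbgemmPath::kOverride;
    }
    return SbgemmPath::kDefault;
}
```

Keeping the override behind a nullable function pointer means platforms without the feature pay only a null check, and the default path remains the single fallback.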

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/kleidiai/sbgemm_kleidiai.cpp | New implementation of SBGEMM using KleidiAI BF16 SME2 kernel |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Added function declarations for SBGEMM KleidiAI overrides |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.h | Added SBGEMM ukernel interface declaration |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.cpp | Added SBGEMM ukernel instantiation for SME2 |
| onnxruntime/core/mlas/lib/mlasi.h | Added typedef declarations for SBGEMM override functions |
| onnxruntime/core/mlas/lib/sbgemm.h | Added override mechanism to call KleidiAI SBGEMM functions |
| onnxruntime/core/mlas/lib/platform.cpp | Registered KleidiAI SBGEMM overrides for SME2-enabled platforms |
| onnxruntime/core/mlas/inc/mlas.h | Added BIsPacked field to MLAS_SBGEMM_DATA_PARAMS struct |
| onnxruntime/core/providers/cpu/math/matmul.cc | Set BIsPacked flag when using pre-packed matrix B |
| onnxruntime/test/mlas/unittest/test_sbgemm.h | Updated tests to initialize and set BIsPacked flag |
| cmake/onnxruntime_mlas.cmake | Added sbgemm_kleidiai.cpp to build system |


@hariharans29 hariharans29 changed the title from "Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path" to "[MLAS] Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path" Jan 21, 2026
@hariharans29
Member

Hi @patryk-kaiser-ARM / @damdoo01-arm - Can you please resolve the conflicts for this PR if it is still on the agenda? We can target merging this PR next. Thanks.

@patryk-kaiser-ARM patryk-kaiser-ARM force-pushed the kai_bf16_kernel_integration branch from 51617ca to 509c420 Compare February 4, 2026 15:15
@patryk-kaiser-ARM
Contributor Author

Hi @hariharans29, I resolved the conflicts. This one is still on the agenda. I am currently investigating adding fastmath support to more operators so that this change can have a larger impact. However, it would be a good idea to get this one in first and then open subsequent PRs to bring more ops down this path for fastmath.

@patryk-kaiser-ARM
Contributor Author

Can the workflows be approved, please?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.



// SPDX-License-Identifier: MIT
//

#if defined(__aarch64__) && defined(__linux__)
Member

Is this feature Linux-only?

Contributor Author

@patryk-kaiser-ARM patryk-kaiser-ARM Feb 9, 2026

The #if defined(__aarch64__) && defined(__linux__) guard was added for consistency with the existing sbgemm.h and related BF16/SBGEMM paths in the codebase. The __linux__ define also covers Android.
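As a small illustration of how such a guard behaves, the sketch below exposes the result of the same preprocessor condition as a compile-time constant. The names `kIsAarch64`, `kIsLinux`, `kSbgemmPathCompiled`, and `SbgemmBuildStatus` are hypothetical and do not come from the PR. The `#error` check encodes the statement above that Android toolchains also define `__linux__`.

```cpp
#include <cassert>
#include <cstring>

// Record each half of the guard separately so the combined result can be
// inspected. These constants are illustrative, not part of MLAS.
#if defined(__aarch64__)
constexpr bool kIsAarch64 = true;
#else
constexpr bool kIsAarch64 = false;
#endif

#if defined(__linux__)
constexpr bool kIsLinux = true;
#else
constexpr bool kIsLinux = false;
#endif

// Same condition as the guard discussed in the review thread.
#if defined(__aarch64__) && defined(__linux__)
constexpr bool kSbgemmPathCompiled = true;
#else
constexpr bool kSbgemmPathCompiled = false;
#endif

// Android targets are expected to define __linux__ as well, so the guard
// above also admits Android builds. Fail loudly if that ever stops holding.
#if defined(__ANDROID__) && !defined(__linux__)
#error "Expected an Android toolchain to define __linux__"
#endif

// Human-readable summary of whether the guarded code would be built.
inline const char* SbgemmBuildStatus() {
    return kSbgemmPathCompiled ? "compiled in" : "compiled out";
}
```

On other targets, such as Windows ARM64 or macOS on Apple silicon, `__linux__` is not defined, so code behind this guard is compiled out there.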

@patryk-kaiser-ARM patryk-kaiser-ARM force-pushed the kai_bf16_kernel_integration branch from 509c420 to 6a201f1 Compare February 27, 2026 10:22
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@hariharans29 hariharans29 requested a review from Copilot February 27, 2026 18:32
@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.



Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
@patryk-kaiser-ARM patryk-kaiser-ARM force-pushed the kai_bf16_kernel_integration branch from 6a201f1 to bb7d2cd Compare March 2, 2026 12:45
3 participants