Reworked sgemm_kleidi memory allocations to reuse memory buffers (#26166)
### **Key changes**
This PR changes the KleidiAI integration within the existing
sgemm_kleidiai.cpp implementation.
Internal testing showed that the overhead of repeatedly allocating
vectors was having a negative impact on performance figures.
The changes introduce thread-local buffers so that memory is reused
during inference.
Android platforms are particularly sensitive to this: we have observed
inference times being significantly impacted by memory allocation
overhead.
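
As an illustration of the buffer-reuse idea described above, here is a minimal sketch of a thread-local scratch buffer. It is not the code from this PR: the names `AcquireScratch` and `DemoScratchReuse` are hypothetical, and the sketch assumes the buffer only ever grows and is released when the owning thread exits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's actual code): returns a per-thread
// scratch buffer that can hold at least `count` floats.
inline float* AcquireScratch(std::size_t count) {
    // One vector per thread; it is destroyed when the thread exits.
    thread_local std::vector<float> scratch;
    if (scratch.size() < count) {
        scratch.resize(count);  // allocates only when a larger buffer is needed
    }
    return scratch.data();
}

// Demonstration: a second, smaller request reuses the same storage
// instead of triggering a new allocation.
inline void DemoScratchReuse() {
    float* first = AcquireScratch(1024);
    float* second = AcquireScratch(256);
    assert(first == second);
    (void)first;
    (void)second;
}
```

The trade-off in a sketch like this is that each thread keeps its high-water-mark allocation alive, which exchanges some resident memory for fewer allocator calls on the hot path.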
### Example performance
All runs were captured using onnxruntime_perf_test, e.g.
`onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1000`
**Android Platform**
<img width="996" height="286" alt="image"
src="https://github.com/user-attachments/assets/252165af-c864-4b24-b1f2-c28ada208b06"
/>
In addition, on M4 we have also observed slight improvements on
models; however, the gain is not as significant because allocation
overhead is a smaller share of total time on that platform.
**Mac Mini M4**
<img width="741" height="153" alt="image"
src="https://github.com/user-attachments/assets/93e6c545-96fd-4bfc-b90f-3a845a1551bc"
/>
**Onnxruntime Mlas Benchmark**
The MLAS benchmark was executed on a Mac Mini M4 with SME2 instructions.
The code was tested with and without the changes in this PR, and the
following results were observed (subset shown). The comparison was
generated using compare.py, located in the tools directory of the
Google Benchmark repo.
`./onnxruntime_mlas_benchmark --benchmark_filter="SGEMM/NORMAL*"
--benchmark_repetitions=100`
```
Benchmark                                             Time       CPU  Time Old  Time New   CPU Old   CPU New
------------------------------------------------------------------------------------------------------------
SGEMM/NORMAL_NoTrans/M:63/N:63/K:63/real_time      -0.1897   -0.1897      3270      2650      3270      2650
SGEMM/NORMAL_NoTrans/M:255/N:63/K:63/real_time     -0.1468   -0.1469      8383      7152      8382      7151
SGEMM/NORMAL_NoTrans/M:1023/N:63/K:63/real_time    -0.1506   -0.1506     19072     16200     19072     16200
SGEMM/NORMAL_NoTrans/M:63/N:255/K:63/real_time     -0.1957   -0.1957      7742      6227      7742      6227
SGEMM/NORMAL_NoTrans/M:255/N:255/K:63/real_time    -0.1032   -0.1032     14323     12845     14322     12845
SGEMM/NORMAL_TransB/M:63/N:63/K:63/real_time       -0.2221   -0.2221      3356      2611      3356      2610
SGEMM/NORMAL_TransB/M:255/N:63/K:63/real_time      -0.0439   -0.0438      8602      8224      8601      8224
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time     +0.0436   +0.0436     16488     17206     16487     17206
SGEMM/NORMAL_TransB/M:63/N:255/K:63/real_time      -0.2000   -0.1999      8046      6437      8046      6437
SGEMM/NORMAL_TransB/M:255/N:255/K:63/real_time     -0.0979   -0.0979     14131     12747     14130     12747
SGEMM/NORMAL_TransB/M:1023/N:255/K:63/real_time    -0.2836   -0.2836     62540     44802     62540     44802
SGEMM/NORMAL_TransB/M:63/N:1023/K:63/real_time     -0.2183   -0.2183     15342     11993     15342
```
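
To help read the Time and CPU columns: they are relative changes between the old and new measurements, so negative values mean the new build is faster. The sketch below assumes the usual (new - old) / old definition; `RelativeChange` is a hypothetical helper name, not part of the benchmark tooling, and because the table shows rounded timings, recomputed values may differ from the reported ones in the last digit.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative change between two timings, assuming the
// (new - old) / old definition. Negative means the new timing is lower.
inline double RelativeChange(double old_time, double new_time) {
    return (new_time - old_time) / old_time;
}

// Demonstration using two rows from the table above: the first is an
// improvement (negative), the second a regression (positive).
inline void DemoRelativeChange() {
    assert(RelativeChange(3270.0, 2650.0) < 0.0);
    assert(RelativeChange(16488.0, 17206.0) > 0.0);
}
```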
Some small regressions were observed and are difficult to explain.
Machine variance during the run is the suspected cause of results such as:
```
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time +0.0436 +0.0436 16488 17206 16487 17206
```
For example, while testing these results, sgemm_kleidi.cpp was
instrumented (after the benchmark runs above) with timer code in
`MlasGemmBatch`, `MlasGemmPackB`, and `MlasGemmPackBSize`.
This produced the following figures, which indicate that on average the
code with the thread-local buffer (TLB) changes performs better in this
case than the baseline currently in main.
```
Head of main
Function             Count     Avg (ns)    Avg (pretty)
----------------------------------------------------------
MlasGemmBatch        42664    19601.015    19.601 us
MlasGemmPackB        42664      373.943    373.943 ns
MlasGemmPackBSize    42664       17.179    17.179 ns

TLB changes
Function             Count     Avg (ns)    Avg (pretty)
----------------------------------------------------------
MlasGemmBatch        55492    16985.256    16.985 us
MlasGemmPackB        55492      344.800    344.800 ns
MlasGemmPackBSize    55492       16.788    16.788 ns
```
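
The timer code itself is not shown in this PR description. As a rough sketch of how per-function call counts and average durations like those above could be collected, the following uses an RAII timer. `TimerStats`, `ScopedTimer`, and `TimedWork` are hypothetical names, and the accumulator is not thread-safe, so it assumes one instance per thread or external synchronisation.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical accumulator: call count and total elapsed nanoseconds.
struct TimerStats {
    std::uint64_t count = 0;
    std::uint64_t total_ns = 0;

    double AverageNs() const {
        return count == 0 ? 0.0
                          : static_cast<double>(total_ns) / static_cast<double>(count);
    }
};

// RAII timer: adds the elapsed time to `stats` when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(TimerStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats_.total_ns += static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
        ++stats_.count;
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimerStats& stats_;
    std::chrono::steady_clock::time_point start_;
};

// Stand-in for an instrumented function: the timer covers the whole body.
inline void TimedWork(TimerStats& stats) {
    ScopedTimer timer(stats);
    volatile int sink = 0;
    for (int i = 0; i < 1000; ++i) {
        sink = sink + i;
    }
}
```

Placing a `ScopedTimer` at the top of each function of interest gives a count and an average per function. The timer adds its own overhead, so very short functions are measured less accurately than long ones.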
---------
Signed-off-by: Jonathan Clohessy <[email protected]>