Commit 94470f7
Improve kleidi-ai matmul register usage (#5165)
Summary:
X-link: facebookresearch/FBGEMM#2164
Refactor kleidi-ai matrix multiplication routines to only rely on temporary registers.
This removes the need for each call to save and restore registers [x19, x30] and, for the smaller routines, SIMD registers [d9, d15].
The loads that restore these registers are likely to miss in cache, since the matrix processing in between should fill the cache.
Around 10 memory load/store instructions are removed from each subroutine.
Because larger matrices are broken down into small pieces, these savings apply once per piece.
Reducing the code size of these routines also makes them more likely to be in cache when they need to execute.
Benchmarks seem to show a small improvement. Some of the better runs now show almost the same throughput as BGM: P2050470491, P2050484212
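As a rough illustration of the once-per-piece savings described above, the sketch below counts how many kernel calls cover an output matrix and multiplies by the roughly 10 load/store instructions removed per call. The helper names and the 8x8 tile size are hypothetical and are not taken from the kleidi-ai routines; the actual piece sizes and call structure may differ.

```cpp
#include <cstdint>

// Hypothetical helper: number of kernel calls needed to cover an m x n output
// with tile_m x tile_n pieces (a partial edge piece still costs one call).
constexpr std::uint64_t tile_calls(std::uint64_t m, std::uint64_t n,
                                   std::uint64_t tile_m, std::uint64_t tile_n) {
  return ((m + tile_m - 1) / tile_m) * ((n + tile_n - 1) / tile_n);
}

// Approximate number of load/store instructions avoided in total, assuming
// saved_per_call instructions are removed from every kernel call
// (about 10, per the summary above).
constexpr std::uint64_t avoided_mem_ops(std::uint64_t calls,
                                        std::uint64_t saved_per_call) {
  return calls * saved_per_call;
}

// Assumed example: a 1024 x 1024 output split into 8 x 8 pieces.
static_assert(tile_calls(1024, 1024, 8, 8) == 16384, "128 * 128 pieces");
static_assert(avoided_mem_ops(tile_calls(1024, 1024, 8, 8), 10) == 163840,
              "about 10 instructions saved per piece");
```

Under these assumed sizes, a single 1024 x 1024 multiplication would skip on the order of 160,000 save/restore instructions, which is consistent with a small but measurable improvement.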
Reviewed By: mcfi
Differential Revision: D876564681
Parent: eb1ae89
1 file changed: +965, -965 lines