@Nicoshev Nicoshev commented Nov 21, 2025

Summary:
Refactor the kleidi-ai matrix multiplication routines to rely only on temporary registers.
This removes from each call the need to save and restore registers [x19, x30] and, for the smaller routines, SIMD registers [d9, d15] as well.
About 10 memory load/store instructions are removed from each subroutine.
Because larger matrices are broken down into small pieces, these savings apply once per piece.
Reducing the code size of these routines also makes them more likely to be in cache when they need to execute.
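
To give a rough sense of how the per-piece savings add up, here is a minimal sketch. It assumes the output matrix is covered by fixed-size tiles and that each tile costs one subroutine call; the tile sizes and function names are hypothetical and not taken from the kleidi-ai kernels. The figure of 10 instructions per call is the approximate number quoted above.

```python
import math


def microkernel_calls(m: int, n: int, tile_m: int, tile_n: int) -> int:
    """Number of tile-sized pieces needed to cover an m x n output matrix.

    Assumes one subroutine call per tile (hypothetical tiling scheme).
    """
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


def saved_memory_ops(m: int, n: int, tile_m: int, tile_n: int,
                     ops_per_call: int = 10) -> int:
    """Load/store instructions avoided, at roughly ops_per_call per call."""
    return microkernel_calls(m, n, tile_m, tile_n) * ops_per_call


# Hypothetical example: a 512 x 512 output covered by 8 x 8 tiles.
calls = microkernel_calls(512, 512, 8, 8)
saved = saved_memory_ops(512, 512, 8, 8)
print(f"{calls} calls, about {saved} load/store instructions avoided")
```

The count grows with the number of pieces rather than with the work done inside each piece, which is why the saving matters most for small tiles.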

Differential Revision: D87656468

meta-codesync bot commented Nov 21, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87656468.

@meta-cla meta-cla bot added the cla signed label Nov 21, 2025
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 23, 2025
Summary:
X-link: facebookresearch/FBGEMM#2164


Refactor the kleidi-ai matrix multiplication routines to rely only on temporary registers.
This removes from each call the need to save and restore registers [x19, x30] and, for the smaller routines, SIMD registers [d9, d15] as well.
These loads are likely to miss in the cache, because the matrix processing should fill it.
About 10 memory load/store instructions are removed from each subroutine.
Because larger matrices are broken down into small pieces, these savings apply once per piece.
Reducing the code size of these routines also makes them more likely to be in cache when they need to execute.

Benchmarks seem to show a small improvement. Some of the better runs now show almost the same throughput as BGM: P2050470491, P2050484212

Reviewed By: mcfi

Differential Revision: D87656468
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 24, 2025
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 24, 2025