cpu: rv64: gemm: improve performance of gemm f32 kernel #4414
## Description

This PR improves the performance of the `rvv_gemm_f32` kernel with a set of fine-tuned approaches:

- `copy_A` method with a software pipeline.
- Tail handling in the `block_ker` kernel using RVV intrinsics.
- `LMUL` optimization to combine vector registers for a larger effective `vl`.
- Tuned kernel traits in `gemm_traits_t`, including the `m` and `BN` factors.

## Key Features
### `copy_A` Software Pipeline

In the `copy_A` method, matrix A is repacked in memory for fast loading in `kernel_mxn` via simple load/store operations. We pipeline these loads and stores in software to hide memory latency and to better utilize multiple vector load/store units in hardware.
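The repacking scheme can be sketched in portable scalar C++. This is a simplified illustration, not the PR's code: `copy_A_sketch` and the panel height `mr` are hypothetical names, and the real kernel issues RVV vector loads/stores whose iterations are software-pipelined rather than the scalar copies shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch (not the actual oneDNN code): repack an M x K matrix A
// (row-major, leading dimension lda) into contiguous panels of `mr` rows so
// the micro-kernel can stream it with unit-stride loads. In the real kernel
// these copies are RVV vector ops, and iteration i's loads are issued while
// iteration i-1's stores retire, hiding memory latency.
void copy_A_sketch(const float *a, std::size_t lda, std::size_t m,
                   std::size_t k, std::size_t mr, float *packed) {
    for (std::size_t i0 = 0; i0 < m; i0 += mr) {
        std::size_t mb = (m - i0 < mr) ? (m - i0) : mr;
        for (std::size_t p = 0; p < k; ++p) {
            for (std::size_t i = 0; i < mr; ++i) {
                // Zero-pad the tail rows so the micro-kernel never branches.
                *packed++ = (i < mb) ? a[(i0 + i) * lda + p] : 0.0f;
            }
        }
    }
}
```

After packing, each `mr x k` panel of A is contiguous, so the micro-kernel reads it sequentially regardless of the original `lda`.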
### Tail Vectorization

The previous version handled tail elements with scalar computation. We now use RVV intrinsics for tail processing so that all elements, including the tails, benefit from vector execution.
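The strip-mining idiom that makes this possible can be sketched portably. This is a scalar stand-in for the intrinsics, not the PR's code: `kVlMax` plays the role of `VLMAX = VLEN/SEW * LMUL`, and the inner scalar loop stands in for a single vector instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Portable sketch of RVV strip-mining (not the actual intrinsics code): each
// iteration requests up to kVlMax lanes and receives vl = min(n - i, kVlMax),
// which is how vsetvl behaves. The final partial iteration runs through the
// same vector body with a smaller vl, so no scalar tail loop is needed.
constexpr std::size_t kVlMax = 8; // stands in for VLEN/SEW * LMUL

void saxpy_strip_mined(std::size_t n, float alpha, const float *x, float *y) {
    for (std::size_t i = 0; i < n;) {
        std::size_t vl = std::min(n - i, kVlMax); // vsetvl shrinks vl at tail
        for (std::size_t j = 0; j < vl; ++j)      // one vector op per group
            y[i + j] += alpha * x[i + j];
        i += vl;
    }
}
```

With `n = 10` and `kVlMax = 8`, the loop runs once with `vl = 8` and once with `vl = 2`; no element falls back to a separate scalar path.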
### `LMUL` Optimization

A larger `vl` provides higher computational parallelism and better throughput per instruction. RVV allows us to use the `LMUL` parameter to form vector register groups and effectively extend the hardware `VLEN` for a single vector operation. Because many vector registers were previously unused in the kernel, we adopt `LMUL = m4` to increase the number of elements processed per vector instruction without increasing loop overhead. This improves compute utilization for compute-bound shapes.
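A quick back-of-the-envelope model of the trade-off. These helper functions are illustrative only (not from the PR), and the `VLEN = 128` used in the test below is an assumed value, not a claim about the target hardware.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model (not from the PR) of the LMUL trade-off: grouping
// registers with LMUL = m4 quadruples the elements processed per vector
// instruction, but divides the 32 architectural vector registers into
// 32/LMUL usable groups. It therefore pays off only when the kernel has
// registers to spare, which the PR notes was the case here.
constexpr std::size_t elems_per_op(std::size_t vlen_bits, std::size_t sew_bits,
                                   std::size_t lmul) {
    return vlen_bits / sew_bits * lmul; // VLMAX = VLEN/SEW * LMUL
}

constexpr std::size_t register_groups(std::size_t lmul) {
    return 32 / lmul; // RVV defines 32 architectural vector registers
}
```

For example, with `VLEN = 128` and 32-bit `f32` elements, moving from `LMUL = m1` to `m4` raises the elements per instruction from 4 to 16 while leaving 8 register groups for the kernel.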
### Kernel Trait Parameter Tuning

The original kernel used `m = 8`. We increase this to `m = 16` and re-tune `BN` so that each kernel invocation computes over a larger block of rows and columns. This increases data reuse in the L1/L2 caches and reduces per-call overhead (such as loop control and address computation). The result is higher effective FLOP/s for both GEMM and the primitives that build on it (matmul and convolution).
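The effect of larger blocking factors can be illustrated with a toy blocked GEMM. This is a sketch only: `M_BLK`/`N_BLK` stand in for the PR's `m` and `BN`, and the actual `rvv_gemm_f32` micro-kernel is vectorized and considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy blocked GEMM (illustrative only): C[MxN] += A[MxK] * B[KxN], all
// row-major. Each (i0, j0) iteration models one kernel invocation computing
// an M_BLK x N_BLK tile; larger tiles mean fewer invocations (less loop
// control and address computation) and more reuse of each A element across
// the columns of the tile.
template <std::size_t M_BLK, std::size_t N_BLK>
void gemm_blocked(std::size_t M, std::size_t N, std::size_t K,
                  const float *A, const float *B, float *C) {
    for (std::size_t i0 = 0; i0 < M; i0 += M_BLK)
        for (std::size_t j0 = 0; j0 < N; j0 += N_BLK) {
            std::size_t im = std::min(M, i0 + M_BLK);
            std::size_t jm = std::min(N, j0 + N_BLK);
            // One "kernel invocation": computes an (im-i0) x (jm-j0) tile.
            for (std::size_t i = i0; i < im; ++i)
                for (std::size_t p = 0; p < K; ++p) {
                    float a = A[i * K + p]; // reused across the whole tile row
                    for (std::size_t j = j0; j < jm; ++j)
                        C[i * N + j] += a * B[p * N + j];
                }
        }
}
```

Doubling `M_BLK` (as the PR does with `m`: 8 to 16) halves the number of tile invocations per column block while each loaded element of A is reused across more output columns.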
## Checklist

### General

- Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
## Performance Improvements

We evaluated the optimized `rvv_gemm_f32` kernel through the `rvv_matmul` and `rvv_gemm_convolution` primitives (which have been verified to use the optimized GEMM kernel).

All measurements were taken on an SG2044 platform with fixed CPU resources (`taskset -c 32`) and the same compilation flags (gcc 14.2 with `-O3`). We used:

- `benchdnn` matmul and convolution workloads
- `f32` data type
- `--mode=P`
### Results

On average, the optimized kernel improves performance by 1.54x and 1.34x over the existing RVV GEMM implementation on the `matmul` and `conv` primitives, respectively. The detailed results are shown below.
Table I. Runtime Comparisons on `matmul`

Table II. Runtime Comparisons on `conv`

Table III. Improvement Contribution of the Four Methods
To find out how each of our four optimization methods contributes to the total improvement, we enable ONLY one of the following methods at a time and compare performance:

1. `copy_A` pipeline
2. Tail vectorization
3. `LMUL` optimization on Method 1 & 2
4. Kernel trait parameter tuning

Note that the Contribution Percentage row does not sum to 100% across Methods 1-4 because these optimizations are not independent.