
@zhangjian29
Contributor

Description

This PR improves the performance of the rvv_gemm_f32 kernel with a set of fine‑tuned approaches:

  • Optimization of the copy_A method with a software pipeline.
  • Tail vectorization of the block_ker kernel using RVV intrinsics.
  • LMUL optimization to combine vector registers for a larger effective vl.
  • Fine‑tuning of gemm_traits_t, including the m and BN factors.

Key Features

copy_A Software Pipeline

In the copy_A method, matrix A is repacked in memory for fast loading in kernel_mxn via simple load/store operations. We pipeline these loads and stores in software to hide memory latency and to better utilize multiple vector load/store units in hardware.
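A minimal sketch of the idea is shown below. This is not the actual oneDNN code: the function name `copy_A_pipelined`, its arguments, and the use of m4 load/store intrinsics are illustrative assumptions. The point is that the load for block i+1 is issued before the store for block i, so the two memory streams can overlap.

```cpp
// Illustrative sketch of a two-stage software-pipelined row copy.
// Assumes f32 data and RVV 1.0 intrinsics; names are hypothetical and
// do not mirror the exact oneDNN implementation.
#include <riscv_vector.h>
#include <cstddef>

static void copy_A_pipelined(const float *a, float *a_packed, size_t n,
        size_t lda, size_t rows) {
    for (size_t r = 0; r < rows; ++r) {
        const float *src = a + r * lda;
        float *dst = a_packed + r * n;
        size_t i = 0;
        size_t vl = __riscv_vsetvl_e32m4(n);
        // Prologue: issue the first load before entering the steady state.
        vfloat32m4_t v0 = __riscv_vle32_v_f32m4(src, vl);
        while (i + vl < n) {
            size_t vl_next = __riscv_vsetvl_e32m4(n - (i + vl));
            // The next load is issued before the current store so the two
            // memory operations can overlap in the pipeline.
            vfloat32m4_t v1 = __riscv_vle32_v_f32m4(src + i + vl, vl_next);
            __riscv_vse32_v_f32m4(dst + i, v0, vl);
            v0 = v1;
            i += vl;
            vl = vl_next;
        }
        // Epilogue: store the last in-flight vector.
        __riscv_vse32_v_f32m4(dst + i, v0, vl);
    }
}
```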

Tail Vectorization

The previous version handled tail elements with scalar computation. We now use RVV intrinsics for tail processing so that all elements, including the tails, benefit from vector execution.
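The sketch below shows the general pattern, assuming hypothetical names (`tail_fma_row`, `acc`, `b_row`, `a_val`) rather than the kernel's actual ones: vsetvl clamps vl to the remaining element count, so the last partial vector is handled by the same vector loop and no scalar clean-up is needed.

```cpp
// Illustrative only: processing a tail of `tail_n` columns with RVV
// intrinsics instead of a scalar loop.
#include <riscv_vector.h>
#include <cstddef>

static void tail_fma_row(float *acc, const float *b_row, float a_val,
        size_t tail_n) {
    size_t i = 0;
    while (i < tail_n) {
        // vsetvl returns min(remaining, VLMAX), so the final partial
        // vector needs no separate scalar path.
        size_t vl = __riscv_vsetvl_e32m4(tail_n - i);
        vfloat32m4_t vb = __riscv_vle32_v_f32m4(b_row + i, vl);
        vfloat32m4_t vc = __riscv_vle32_v_f32m4(acc + i, vl);
        vc = __riscv_vfmacc_vf_f32m4(vc, a_val, vb, vl); // vc += a_val * vb
        __riscv_vse32_v_f32m4(acc + i, vc, vl);
        i += vl;
    }
}
```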

LMUL Optimization

A larger vl provides higher parallelism and better throughput per instruction. RVV's LMUL parameter lets us group vector registers so that a single vector instruction operates on a register group of LMUL × VLEN bits, effectively extending the vector length beyond a single register. Because many vector registers were previously unused in the kernel, we adopt LMUL = m4 to increase the number of elements processed per vector instruction without increasing loop overhead. This improves compute utilization for compute‑bound shapes.
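As a small illustration of the effect (a sketch, not part of the kernel, and assuming a hypothetical VLEN = 256 implementation): the granted vl scales with LMUL, so one m4 load/FMA/store does the work of four m1 operations and cuts loop-control and address-update overhead accordingly.

```cpp
// Illustrative only: the effect of LMUL on the granted vector length.
// With VLEN = 256 and 32-bit elements:
//   __riscv_vsetvl_e32m1(1024) ->  8  (one register,         8 floats)
//   __riscv_vsetvl_e32m4(1024) -> 32  (four grouped registers, 32 floats)
#include <riscv_vector.h>
#include <cstdio>

int main() {
    size_t vl_m1 = __riscv_vsetvl_e32m1(1024);
    size_t vl_m4 = __riscv_vsetvl_e32m4(1024);
    std::printf("vl with LMUL=1: %zu, with LMUL=4: %zu\n", vl_m1, vl_m4);
    return 0;
}
```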

Kernel Trait Parameter Tuning

The original kernel used m = 8. We increase this to m = 16 and re‑tune BN so that each kernel invocation computes over a larger block of rows and columns. This increases data reuse in L1/L2 caches and reduces per‑call overhead (such as loop control and address computation). The result is higher effective FLOP/s for both GEMM and the primitives that build on it (matmul and convolution).
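For context, a hypothetical sketch of what such a traits bundle looks like is below. The real gemm_traits_t in src/cpu/rv64/gemm/rvv_gemm_utils_f32.hpp has more members and a different layout; the values shown are only the tuned ones described above (m = 16, BN = 256 for the non-transposed-A path).

```cpp
// Hypothetical sketch; not the actual gemm_traits_t definition.
struct gemm_traits_sketch_t {
    static constexpr int m = 16;   // rows of A computed per kernel call (was 8)
    static constexpr int BN = 256; // column block of B kept hot in cache (was 48)
};
```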

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance Improvements

  • Have you submitted performance data that demonstrates performance improvements?

We evaluated the optimized rvv_gemm_f32 kernel through the rvv_matmul and rvv_gemm_convolution primitives (which have been verified to use the optimized GEMM kernel).

All measurements were taken on an SG2044 platform, pinned to a fixed CPU core (taskset -c 32), with the same compilation flags (gcc 14.2, -O3). We used:

  • Benchmark: benchdnn matmul and convolution workloads
  • Data type: f32
  • benchdnn mode: --mode=P

Results

On average, the optimized kernel improves performance by 1.54× and 1.34× over the existing RVV GEMM implementation on matmul and conv primitives, respectively.

The detailed results are shown below.

Table I. Runtime Comparisons on matmul

| Batch Shape | Before (ms) | After (ms) | Speedup (×) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 214.809 | 124.429 | 1.73 |
| shapes_converted_ip_inf_lb_gmnt | 29.1979 | 14.7028 | 1.99 |
| shapes_converted_ip_inf_lb_googlenet | 293.425 | 189.086 | 1.55 |
| shapes_converted_ip_inf_lb_resnet | 127.713 | 78.6858 | 1.62 |
| shapes_transformer | 138.743 | 96.306 | 1.44 |
| shapes_converted_ip_inf_lb_vgg16 | 3434.78 | 2545.61 | 1.35 |
| shapes_converted_ip_inf_lb_ncf | 40.4079 | 32.8516 | 1.23 |
| shapes_converted_ip_inf_lb_alexnet | 25741.8 | 16720.4 | 1.54 |
| shapes_converted_ip_inf_lb_maskrcnn | 4576.55 | 2709.46 | 1.69 |
| shapes_converted_ip_inf_lb_rnn_t | 5306.91 | 3450 | 1.54 |
| shapes_converted_ip_inf_lb_dlrm | 1545.22 | 1017.79 | 1.52 |
| total | 41449.56 | 26979.32 | 1.54 |

Table II. Runtime Comparisons on conv

| Batch Shape | Before (ms) | After (ms) | Speedup (×) |
|---|---:|---:|---:|
| shapes_alexnet | 46378.8 | 36729.6 | 1.26 |
| shapes_densnet | 725.906 | 135.532 | 5.36 |
| shapes_efficientdet | 2770.54 | 937.953 | 2.96 |
| shapes_fastrcnn_p1 | 6170.95 | 4716.56 | 1.31 |
| shapes_gemm | 19193.6 | 8216.67 | 2.34 |
| shapes_googlenet_v3 | 22755.1 | 15149.4 | 1.50 |
| shapes_mobilenet | 3531.9 | 2273.79 | 1.55 |
| shapes_resnet_50 | 33813.1 | 19113 | 1.77 |
| shapes_segnet | 75042.4 | 69441.4 | 1.08 |
| shapes_vgg_11 | 21062.5 | 11062.7 | 1.90 |
| shapes_unet | 3131.37 | 2307.11 | 1.36 |
| shapes_yolov2 | 107280 | 85753.5 | 1.25 |
| total | 341856.17 | 255837.22 | 1.34 |

Table III. Improvement Contribution on Four Methods

To quantify how each of the four optimization methods contributes to the total improvement, we measure configurations that enable the methods selectively (Method 3 is measured on top of Methods 1 and 2, on which it builds) and compare their performance:

  • Only Method 1: copy_A pipeline
  • Only Method 2: Tail vectorization
  • Method 3+2+1: LMUL optimization on Method 1 & 2
  • Only Method 4: Kernel trait parameter optimization

| Batch Shape | Method 1 (ms) | Method 2 (ms) | Method 3+2+1 (ms) | Method 4 (ms) | None (ms) | All (ms) |
|---|---:|---:|---:|---:|---:|---:|
| matmul/shapes_converted_ip_inf_lb_alexnet | 24969.4 | 25509.4 | 23197.8 | 19654.4 | 25741.8 | 16720.4 |
| conv/shapes_googlenet_v3 | 22308.6 | 16692.1 | 15748.6 | 19655.5 | 22755.1 | 15149.4 |
| total | 47278 | 42201.5 | 38946.4 | 39309.9 | 48496.9 | 31869.8 |
| Contribution Percentage | 7.3% | 38% | 57% | 55% | 0% | 100% |

Note that the Contribution Percentage row does not sum to 100% across Method 1–4 because these optimizations are not independent.

@zhangjian29 zhangjian29 requested a review from a team as a code owner December 4, 2025 09:35
@vpirogov vpirogov requested a review from a team December 9, 2025 17:00

Copilot AI left a comment

Pull request overview

This PR optimizes the RISC-V Vector (RVV) GEMM f32 kernel to improve performance of matrix multiplication operations used in matmul and convolution primitives. The optimization increases the kernel's compute efficiency through four complementary approaches: software pipelining in the copy_A method, tail vectorization using RVV intrinsics, LMUL optimization to process more elements per instruction, and tuning of kernel blocking parameters.

Key Changes:

  • Increased block size parameters (m: 8→16, BN: 48→256 for non-transposed A) to improve cache reuse
  • Added two-way software pipelining in copy_A to overlap memory operations and hide latency
  • Replaced scalar tail processing with vectorized RVV intrinsics using LMUL=m4 for higher throughput

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/cpu/rv64/gemm/rvv_gemm_utils_f32.hpp | Updated GEMM kernel traits: increased the m unroll factor from 8 to 16 and the BN block size from 48 to 256 (non-transposed case) |
| src/cpu/rv64/gemm/rvv_gemm_f32.cpp | Optimized copy_A with software pipelining and LMUL=m4; vectorized tail processing for both row and column tails using RVV m4 intrinsics instead of scalar loops |


@zhangjian29 zhangjian29 requested a review from vpirogov December 12, 2025 06:10
@zhangjian29 zhangjian29 requested a review from dzarukin January 7, 2026 09:16
@zhangjian29 zhangjian29 merged commit 62737b7 into uxlfoundation:main Jan 7, 2026
13 checks passed
@zhangjian29 zhangjian29 deleted the improve-rvv-gemm-f32-kernel branch January 7, 2026 23:36