
@zhangjian29
Contributor

Description

This PR improves the performance of the rvv_gemm_f32 kernel with a set of fine‑tuned approaches:

  • Optimization of the copy_A method with a software pipeline.
  • Tail vectorization of the block_ker kernel using RVV intrinsics.
  • LMUL optimization to combine vector registers for a larger effective vl.
  • Fine‑tuning of gemm_traits_t, including the m and BN factors.

Key Features

copy_A Software Pipeline

In the copy_A method, matrix A is repacked in memory for fast loading in kernel_mxn via simple load/store operations. We pipeline these loads and stores in software to hide memory latency and to better utilize multiple vector load/store units in hardware.
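A minimal sketch of the idea is shown below. This is not the actual oneDNN code: the function name `copy_A_pipelined`, its arguments, and the use of m4 load/store intrinsics are illustrative assumptions. The point is that the load for block i+1 is issued before the store for block i, so the two memory streams can overlap.

```cpp
// Illustrative sketch of a two-stage software-pipelined row copy.
// Assumes f32 data and RVV 1.0 intrinsics; names are hypothetical and
// do not mirror the exact oneDNN implementation.
#include <riscv_vector.h>
#include <cstddef>

static void copy_A_pipelined(const float *a, float *a_packed, size_t n,
        size_t lda, size_t rows) {
    for (size_t r = 0; r < rows; ++r) {
        const float *src = a + r * lda;
        float *dst = a_packed + r * n;
        size_t i = 0;
        size_t vl = __riscv_vsetvl_e32m4(n);
        // Prologue: issue the first load before entering the steady state.
        vfloat32m4_t v0 = __riscv_vle32_v_f32m4(src, vl);
        while (i + vl < n) {
            size_t vl_next = __riscv_vsetvl_e32m4(n - (i + vl));
            // The next load is issued before the current store so the two
            // memory operations can overlap in the pipeline.
            vfloat32m4_t v1 = __riscv_vle32_v_f32m4(src + i + vl, vl_next);
            __riscv_vse32_v_f32m4(dst + i, v0, vl);
            v0 = v1;
            i += vl;
            vl = vl_next;
        }
        // Epilogue: store the last in-flight vector.
        __riscv_vse32_v_f32m4(dst + i, v0, vl);
    }
}
```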

Tail Vectorization

The previous version handled tail elements with scalar computation. We now use RVV intrinsics for tail processing so that all elements, including the tails, benefit from vector execution.
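The sketch below shows the general pattern, assuming hypothetical names (`tail_fma_row`, `acc`, `b_row`, `a_val`) rather than the kernel's actual ones: vsetvl clamps vl to the remaining element count, so the last partial vector is handled by the same vector loop and no scalar clean-up is needed.

```cpp
// Illustrative only: processing a tail of `tail_n` columns with RVV
// intrinsics instead of a scalar loop.
#include <riscv_vector.h>
#include <cstddef>

static void tail_fma_row(float *acc, const float *b_row, float a_val,
        size_t tail_n) {
    size_t i = 0;
    while (i < tail_n) {
        // vsetvl returns min(remaining, VLMAX), so the final partial
        // vector needs no separate scalar path.
        size_t vl = __riscv_vsetvl_e32m4(tail_n - i);
        vfloat32m4_t vb = __riscv_vle32_v_f32m4(b_row + i, vl);
        vfloat32m4_t vc = __riscv_vle32_v_f32m4(acc + i, vl);
        vc = __riscv_vfmacc_vf_f32m4(vc, a_val, vb, vl); // vc += a_val * vb
        __riscv_vse32_v_f32m4(acc + i, vc, vl);
        i += vl;
    }
}
```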

LMUL Optimization

A larger vl provides higher parallelism and better throughput per instruction. RVV's LMUL parameter lets us group vector registers so that a single vector instruction operates on a register group of LMUL × VLEN bits, effectively extending the vector length beyond a single register. Because many vector registers were previously unused in the kernel, we adopt LMUL = m4 to increase the number of elements processed per vector instruction without increasing loop overhead. This improves compute utilization for compute‑bound shapes.
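As a small illustration of the effect (a sketch, not part of the kernel, and assuming a hypothetical VLEN = 256 implementation): the granted vl scales with LMUL, so one m4 load/FMA/store does the work of four m1 operations and cuts loop-control and address-update overhead accordingly.

```cpp
// Illustrative only: the effect of LMUL on the granted vector length.
// With VLEN = 256 and 32-bit elements:
//   __riscv_vsetvl_e32m1(1024) ->  8  (one register,         8 floats)
//   __riscv_vsetvl_e32m4(1024) -> 32  (four grouped registers, 32 floats)
#include <riscv_vector.h>
#include <cstdio>

int main() {
    size_t vl_m1 = __riscv_vsetvl_e32m1(1024);
    size_t vl_m4 = __riscv_vsetvl_e32m4(1024);
    std::printf("vl with LMUL=1: %zu, with LMUL=4: %zu\n", vl_m1, vl_m4);
    return 0;
}
```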

Kernel Trait Parameter Tuning

The original kernel used m = 8. We increase this to m = 16 and re‑tune BN so that each kernel invocation computes over a larger block of rows and columns. This increases data reuse in L1/L2 caches and reduces per‑call overhead (such as loop control and address computation). The result is higher effective FLOP/s for both GEMM and the primitives that build on it (matmul and convolution).
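For context, a hypothetical sketch of what such a traits bundle looks like is below. The real gemm_traits_t in src/cpu/rv64/gemm/rvv_gemm_utils_f32.hpp has more members and a different layout; the values shown are only the tuned ones described above (m = 16, BN = 256 for the non-transposed-A path).

```cpp
// Hypothetical sketch; not the actual gemm_traits_t definition.
struct gemm_traits_sketch_t {
    static constexpr int m = 16;   // rows of A computed per kernel call (was 8)
    static constexpr int BN = 256; // column block of B kept hot in cache (was 48)
};
```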

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance Improvements

  • Have you submitted performance data that demonstrates performance improvements?

We evaluated the optimized rvv_gemm_f32 kernel through the rvv_matmul and rvv_gemm_convolution primitives (which have been verified to use the optimized GEMM kernel).

All measurements were taken on an SG2044 platform, pinned to a fixed CPU core (taskset -c 32), with the same compilation flags (gcc 14.2, -O3). We used:

  • Benchmark: benchdnn matmul and convolution workloads
  • Data type: f32
  • benchdnn mode: --mode=P

Results

On average, the optimized kernel improves performance by 1.54× and 1.34× over the existing RVV GEMM implementation on matmul and conv primitives, respectively.

The detailed results are shown below.

Table I. Runtime Comparisons on matmul

| Batch Shape | Before (ms) | After (ms) | Speedup (×) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 214.809 | 124.429 | 1.73 |
| shapes_converted_ip_inf_lb_gmnt | 29.1979 | 14.7028 | 1.99 |
| shapes_converted_ip_inf_lb_googlenet | 293.425 | 189.086 | 1.55 |
| shapes_converted_ip_inf_lb_resnet | 127.713 | 78.6858 | 1.62 |
| shapes_transformer | 138.743 | 96.306 | 1.44 |
| shapes_converted_ip_inf_lb_vgg16 | 3434.78 | 2545.61 | 1.35 |
| shapes_converted_ip_inf_lb_ncf | 40.4079 | 32.8516 | 1.23 |
| shapes_converted_ip_inf_lb_alexnet | 25741.8 | 16720.4 | 1.54 |
| shapes_converted_ip_inf_lb_maskrcnn | 4576.55 | 2709.46 | 1.69 |
| shapes_converted_ip_inf_lb_rnn_t | 5306.91 | 3450 | 1.54 |
| shapes_converted_ip_inf_lb_dlrm | 1545.22 | 1017.79 | 1.52 |
| total | 41449.56 | 26979.32 | 1.54 |

Table II. Runtime Comparisons on conv

| Batch Shape | Before (ms) | After (ms) | Speedup (×) |
|---|---:|---:|---:|
| shapes_alexnet | 46378.8 | 36729.6 | 1.26 |
| shapes_densnet | 725.906 | 135.532 | 5.36 |
| shapes_efficientdet | 2770.54 | 937.953 | 2.96 |
| shapes_fastrcnn_p1 | 6170.95 | 4716.56 | 1.31 |
| shapes_gemm | 19193.6 | 8216.67 | 2.34 |
| shapes_googlenet_v3 | 22755.1 | 15149.4 | 1.50 |
| shapes_mobilenet | 3531.9 | 2273.79 | 1.55 |
| shapes_resnet_50 | 33813.1 | 19113 | 1.77 |
| shapes_segnet | 75042.4 | 69441.4 | 1.08 |
| shapes_vgg_11 | 21062.5 | 11062.7 | 1.90 |
| shapes_unet | 3131.37 | 2307.11 | 1.36 |
| shapes_yolov2 | 107280 | 85753.5 | 1.25 |
| total | 341856.17 | 255837.22 | 1.34 |

Table III. Improvement Contribution on Four Methods

To quantify how each of the four optimization methods contributes to the total improvement, we measure configurations that enable the methods selectively (Method 3 is measured on top of Methods 1 and 2, on which it builds) and compare their performance:

  • Only Method 1: copy_A pipeline
  • Only Method 2: Tail vectorization
  • Method 3+2+1: LMUL optimization on Method 1 & 2
  • Only Method 4: Kernel trait parameter optimization

| Batch Shape | Method 1 (ms) | Method 2 (ms) | Method 3+2+1 (ms) | Method 4 (ms) | None (ms) | All (ms) |
|---|---:|---:|---:|---:|---:|---:|
| matmul/shapes_converted_ip_inf_lb_alexnet | 24969.4 | 25509.4 | 23197.8 | 19654.4 | 25741.8 | 16720.4 |
| conv/shapes_googlenet_v3 | 22308.6 | 16692.1 | 15748.6 | 19655.5 | 22755.1 | 15149.4 |
| total | 47278 | 42201.5 | 38946.4 | 39309.9 | 48496.9 | 31869.8 |
| Contribution Percentage | 7.3% | 38% | 57% | 55% | 0% | 100% |

Note that the Contribution Percentage row does not sum to 100% across Method 1–4 because these optimizations are not independent.

@zhangjian29 zhangjian29 requested a review from a team as a code owner December 4, 2025 09:35
@vpirogov vpirogov requested a review from a team December 9, 2025 17:00

Copilot AI left a comment

Pull request overview

This PR optimizes the RISC-V Vector (RVV) GEMM f32 kernel to improve performance of matrix multiplication operations used in matmul and convolution primitives. The optimization increases the kernel's compute efficiency through four complementary approaches: software pipelining in the copy_A method, tail vectorization using RVV intrinsics, LMUL optimization to process more elements per instruction, and tuning of kernel blocking parameters.

Key Changes:

  • Increased block size parameters (m: 8→16, BN: 48→256 for non-transposed A) to improve cache reuse
  • Added two-way software pipelining in copy_A to overlap memory operations and hide latency
  • Replaced scalar tail processing with vectorized RVV intrinsics using LMUL=m4 for higher throughput

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/cpu/rv64/gemm/rvv_gemm_utils_f32.hpp | Updated GEMM kernel traits: increased the m unroll factor from 8 to 16 and the BN block size from 48 to 256 (non-transposed case) |
| src/cpu/rv64/gemm/rvv_gemm_f32.cpp | Optimized copy_A with software pipelining and LMUL=m4; vectorized tail processing for both row and column tails using RVV m4 intrinsics instead of scalar loops |


@zhangjian29 zhangjian29 requested a review from vpirogov December 12, 2025 06:10
@zhangjian29 zhangjian29 requested a review from dzarukin January 7, 2026 09:16
@zhangjian29 zhangjian29 merged commit 62737b7 into uxlfoundation:main Jan 7, 2026
13 checks passed
@zhangjian29 zhangjian29 deleted the improve-rvv-gemm-f32-kernel branch January 7, 2026 23:36