Optimize fp32 RVV distance kernels with e32m8 accumulator #1613
ihb2032 wants to merge 1 commit into zilliztech:main
Conversation
Signed-off-by: ihb2032 <hebome@foxmail.com>
Force-pushed from 719c698 to b3440d4.
@ihb2032 e2e jenkins job failed.
@ihb2032 Thanks for your contribution. A question: is this …
Thanks for the question. According to the RISC-V Vector spec, LMUL is the vector register group multiplier: when LMUL is greater than 1, it is the number of vector registers combined to form one vector register group, and implementations must support the integer LMUL values 1, 2, 4, and 8. RVV also uses a vector-length-agnostic / strip-mining model: the program provides an AVL (application vector length), and `vsetvl` returns the VL actually granted for each iteration.
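The strip-mining model described above can be sketched with a small scalar model in C. This is illustrative only (the helper names are hypothetical), assuming the spec formula `VLMAX = VLEN * LMUL / SEW` for integer LMUL:

```c
#include <stddef.h>

/* Scalar model of RVV strip-mining (illustrative, not the knowhere code).
 * Per the V spec, VLMAX = VLEN * LMUL / SEW for integer LMUL. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    return vlen_bits * lmul / sew_bits;
}

/* Number of vsetvl iterations needed to cover `avl` elements when each
 * iteration is granted vl = min(remaining avl, VLMAX). */
static size_t strip_mine_iterations(size_t avl, size_t vmax) {
    size_t iters = 0;
    while (avl > 0) {
        size_t vl = avl < vmax ? avl : vmax; /* what vsetvl would grant */
        avl -= vl;
        ++iters;
    }
    return iters;
}
```

For example, with VLEN = 128 and SEW = 32, `4 * VLMAX(e32m2)` is `4 * 8 = 32`, which equals `VLMAX(e32m8)`; this is the equivalence the PR exploits, and it holds for any VLEN since both sides scale linearly with it.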
understood. /hold |
/unhold |
@ihb2032 please rebase to master. Thanks. |
What
This PR optimizes the non-batch fp32 RVV distance kernels by replacing the current `e32m2 x 4` accumulator implementation with a simpler `e32m8 x 1` strip-mining implementation.

Updated kernels:
- `fvec_inner_product_rvv`
- `fvec_L2sqr_rvv`
- `fvec_norm_L2sqr_rvv`

The new implementation uses a single `vfloat32m8_t` accumulator and drives the whole loop through `vsetvl_e32m8`, removing the manual 4-way unrolled main loop and the separate tail loop.

Why

The previous implementation used four `e32m2` accumulators to process `4 * VLMAX(e32m2)` elements per main-loop iteration. Since this is equivalent to `VLMAX(e32m8)`, the same number of elements can be processed with a single `e32m8` vector operation.

On Muse Pi Pro, the `e32m8 x 1` version is consistently faster than the current `e32m2 x 4` implementation. The improvement is especially stable for medium and large dimensions.

Benchmark
Tested on Muse Pi Pro.

`fvec_inner_product_rvv` (`speedup = old_m2x4_ns / new_m8x1_ns`):
For `d >= 256`, the speedup is stable around `2.62x ~ 2.74x`. The maximum observed absolute difference from the long-double reference was around `2e-6`.

`fvec_L2sqr_rvv` (`speedup = old_m2x4_ns / new_m8x1_ns`):
For `d >= 256`, the speedup is stable around `2.65x ~ 2.77x`. The maximum observed absolute difference from the long-double reference was around `7.1e-5`.

Notes
The new implementation may produce slightly different floating-point results from the previous RVV implementation because the accumulation/reduction order is different. This is expected for vectorized floating-point reductions, and the observed numerical differences are small.
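For reference, the `e32m8 x 1` strip-mining pattern can be sketched as below. This is an illustrative sketch, not the exact knowhere kernel: the function name is hypothetical, the vector path assumes the RVV v1.0 C intrinsics (a tail-undisturbed `vfmacc` so partial-VL iterations leave inactive accumulator lanes intact, then one `vfredusum` reduction at the end), and a scalar fallback is included so the sketch also compiles off RISC-V.

```c
#include <stddef.h>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Hedged sketch of an e32m8 x 1 strip-mining inner product. */
static float inner_product_f32(const float* x, const float* y, size_t d) {
#if defined(__riscv_v_intrinsic)
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    /* Single m8 accumulator, zero-initialized across all VLMAX lanes. */
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    size_t n = d;
    while (n > 0) {
        /* One vsetvl drives both the main loop and the tail. */
        size_t vl = __riscv_vsetvl_e32m8(n);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
        /* Tail-undisturbed FMA: lanes past vl keep their accumulated values. */
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, vx, vy, vl);
        x += vl;
        y += vl;
        n -= vl;
    }
    /* Reduce all VLMAX lanes of the accumulator into a scalar. */
    vfloat32m1_t zero = __riscv_vfmv_s_f_f32m1(0.0f, 1);
    vfloat32m1_t sum = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
#else
    /* Portable scalar fallback with the same result (up to reduction order). */
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
#endif
}
```

As the Notes say, the vector path reduces in a different order than a sequential scalar loop, so tiny floating-point differences between the two paths are expected.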
issue: #1614