Optimize fp32 RVV distance kernels with e32m8 accumulator#1613

Open
ihb2032 wants to merge 1 commit into zilliztech:main from ihb2032:optimize-rvv-fp32-m8-kernels
Conversation

@ihb2032 ihb2032 commented May 4, 2026

What

This PR optimizes non-batch fp32 RVV distance kernels by replacing the current e32m2 x 4 accumulator implementation with a simpler e32m8 x 1 strip-mining implementation.

Updated kernels:

  • fvec_inner_product_rvv
  • fvec_L2sqr_rvv
  • fvec_norm_L2sqr_rvv

The new implementation uses one vfloat32m8_t accumulator and handles the whole loop through vsetvl_e32m8, removing the manual 4-way unrolled main loop and separate tail loop.
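As a rough scalar model of the new structure (not the actual knowhere RVV intrinsic code, which runs only on RISC-V hardware), the strip-mined loop that replaces the unrolled main loop and separate tail loop can be sketched like this; the fixed `VLMAX` constant and the scalar inner loop stand in for what `vsetvl_e32m8` and a single `vfmacc` would do on real hardware:

```c
#include <stddef.h>

/* Hypothetical VLMAX: 8 * VLEN / 32 with VLEN = 128. On hardware this
 * value comes from vsetvl_e32m8 and may differ per implementation. */
enum { VLMAX = 32 };

/* Scalar model of the strip-mined e32m8 inner-product kernel.
 * Real code: one vfloat32m8_t accumulator, one vfmacc per iteration,
 * and a final vfredusum; the tail needs no separate loop because the
 * last vsetvl simply grants fewer elements. */
static float fvec_inner_product_model(const float* x, const float* y, size_t d) {
    float acc = 0.0f;                      /* models the single e32m8 accumulator */
    while (d > 0) {
        size_t vl = d < VLMAX ? d : VLMAX; /* models vsetvl_e32m8(d) */
        for (size_t i = 0; i < vl; i++)    /* models one vfmacc over vl lanes */
            acc += x[i] * y[i];
        x += vl;
        y += vl;
        d -= vl;
    }
    return acc;                            /* models the final reduction */
}
```

The same shape applies to `fvec_L2sqr_rvv` (accumulate `(x[i] - y[i])^2`) and `fvec_norm_L2sqr_rvv` (accumulate `x[i] * x[i]`).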

Why

The previous implementation used four e32m2 accumulators to process 4 * VLMAX(e32m2) elements per main-loop iteration. Since this is equivalent to VLMAX(e32m8), the same number of elements can be processed with a single e32m8 vector operation.
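The equivalence follows directly from the spec's formula VLMAX = LMUL * VLEN / SEW, for any VLEN (the concrete VLEN values below are only examples):

```c
/* VLMAX = LMUL * VLEN / SEW, per the RVV spec.
 * E.g. with VLEN = 128: VLMAX(e32m2) = 8, so the old 4-way unrolled
 * loop covered 32 elements per iteration, exactly VLMAX(e32m8). */
static unsigned vlmax(unsigned lmul, unsigned vlen, unsigned sew) {
    return lmul * vlen / sew;
}
```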

On Muse Pi Pro, the e32m8 x 1 version is consistently faster than the current e32m2 x 4 implementation. The improvement is especially stable for medium and large dimensions.

Benchmark

Tested on Muse Pi Pro.

fvec_inner_product_rvv

speedup = old_m2x4_ns / new_m8x1_ns

| dim  | speedup |
|-----:|--------:|
|   16 |   1.38x |
|   32 |   1.85x |
|   64 |   2.39x |
|  128 |   2.52x |
|  256 |   2.62x |
|  512 |   2.68x |
| 1024 |   2.72x |
| 2048 |   2.74x |

For d >= 256, the speedup is stable around 2.62x ~ 2.74x.

The maximum observed absolute difference from the long-double reference was around 2e-6.

fvec_L2sqr_rvv

speedup = old_m2x4_ns / new_m8x1_ns

| dim  | speedup |
|-----:|--------:|
|   16 |   1.33x |
|   32 |   1.85x |
|   64 |   2.43x |
|  128 |   2.57x |
|  256 |   2.66x |
|  512 |   2.71x |
| 1024 |   2.74x |
| 2048 |   2.77x |

For d >= 256, the speedup is stable around 2.65x ~ 2.77x.

The maximum observed absolute difference from the long-double reference was around 7.1e-5.

Notes

The new implementation may produce slightly different floating-point results from the previous RVV implementation because the accumulation/reduction order is different. This is expected for vectorized floating-point reduction and the observed numerical differences are small.
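The effect is easy to reproduce in plain C: summing the same fp32 values in a different order can change the rounded result. The values below are chosen specifically to force a visible difference; this is purely illustrative, not knowhere code:

```c
/* Adding 1.0f to a running total of 1e8f is absorbed by rounding
 * (the ulp of float at 1e8 is 8), so order of accumulation matters. */
static float sum_forward(void) {
    float s = 1e8f;
    for (int i = 0; i < 100; i++) s += 1.0f; /* each add rounds back to 1e8f */
    return s - 1e8f;                         /* yields 0.0f */
}

static float sum_small_first(void) {
    float s = 0.0f;
    for (int i = 0; i < 100; i++) s += 1.0f; /* exact: 100.0f */
    s += 1e8f;
    return s - 1e8f;                         /* yields a nonzero value */
}
```

A vectorized reduction effectively performs a reordering of this kind (per-lane partial sums combined at the end), which is why small differences from the previous implementation are expected.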
issue: #1614

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ihb2032
To complete the pull request process, please assign cqy123456 after the PR has been reviewed.
You can assign the PR to them by writing /assign @cqy123456 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot
Collaborator

Welcome @ihb2032! It looks like this is your first PR to zilliztech/knowhere 🎉

@mergify

mergify Bot commented May 4, 2026

@ihb2032 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

Signed-off-by: ihb2032 <hebome@foxmail.com>
@ihb2032 ihb2032 force-pushed the optimize-rvv-fp32-m8-kernels branch from 719c698 to b3440d4 on May 4, 2026 at 11:54
@mergify

mergify Bot commented May 4, 2026

@ihb2032 e2e jenkins job failed, comment /run-e2e-sse can trigger the job again.

@alexanderguzhva
Collaborator

@ihb2032 Thanks for your contribution. A question: is this e32m8 supported only on high-end chips? Basically, I wonder whether the situation with e32m2 vs e32m8 is somewhat similar to AVX2 vs AVX512 for x86.

@ihb2032
Author

ihb2032 commented May 5, 2026

Thanks for the question.

According to the RISC-V Vector spec, LMUL is the vector register group multiplier. When LMUL is greater than 1, it represents the number of vector registers combined to form one vector register group. The spec also says implementations must support integer LMUL values 1, 2, 4, and 8.

So e32m8 means SEW=32 with LMUL=8: one vector operand uses a group of 8 vector registers, and its VLMAX is 8 * VLEN / 32.

RVV also uses a vector-length-agnostic / strip-mining model. The program provides AVL, and vsetvl/vsetvli sets vl to the number of elements the hardware will process in that iteration, based on the implementation and the current vtype. Therefore the same loop is not tied to a fixed SIMD width.
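A small model of how vl evolves across iterations under this scheme (hypothetical VLMAX = 32, i.e. VLEN = 128 at e32m8; real code would get vl from `vsetvl_e32m8(avl)`):

```c
#include <stddef.h>

/* Models vsetvl: the hardware grants min(avl, VLMAX) elements, so the
 * tail is just a shorter final iteration rather than a separate loop. */
static size_t model_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Number of strip-mining iterations for a given problem size.
 * E.g. n = 100, vlmax = 32 -> vl sequence 32, 32, 32, 4. */
static size_t strip_count(size_t n, size_t vlmax) {
    size_t iters = 0;
    while (n > 0) {
        n -= model_vsetvl(n, vlmax);
        iters++;
    }
    return iters;
}
```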

@mergify mergify Bot added the ci-passed label May 5, 2026
@alexanderguzhva
Collaborator

understood.

/hold
wait for #1605

@alexanderguzhva
Collaborator

/unhold

@alexanderguzhva
Collaborator

@ihb2032 please rebase to master. Thanks.
