Optimize fp32 RVV distance kernels with e32m8 accumulator #1613
ihb2032 wants to merge 1 commit into zilliztech:main
Conversation
Signed-off-by: ihb2032 <hebome@foxmail.com>
Force-pushed from 719c698 to b3440d4.
@ihb2032 e2e jenkins job failed.
@ihb2032 Thanks for your contribution. A question: is this …
Thanks for the question. According to the RISC-V Vector spec, LMUL is the vector register group multiplier: when LMUL is greater than 1, it is the number of vector registers combined to form one vector register group, and implementations must support the integer LMUL values 1, 2, 4, and 8. RVV also uses a vector-length-agnostic / strip-mining model: the program provides an AVL (application vector length), and `vsetvl` returns the VL actually granted for each iteration.
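The strip-mining model described above can be sketched with a small scalar model in C. This is illustrative only (the helper names are hypothetical), assuming the spec formula `VLMAX = VLEN * LMUL / SEW` for integer LMUL:

```c
#include <stddef.h>

/* Scalar model of RVV strip-mining (illustrative, not the knowhere code).
 * Per the V spec, VLMAX = VLEN * LMUL / SEW for integer LMUL. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    return vlen_bits * lmul / sew_bits;
}

/* Number of vsetvl iterations needed to cover `avl` elements when each
 * iteration is granted vl = min(remaining avl, VLMAX). */
static size_t strip_mine_iterations(size_t avl, size_t vmax) {
    size_t iters = 0;
    while (avl > 0) {
        size_t vl = avl < vmax ? avl : vmax; /* what vsetvl would grant */
        avl -= vl;
        ++iters;
    }
    return iters;
}
```

For example, with VLEN = 128 and SEW = 32, `4 * VLMAX(e32m2)` is `4 * 8 = 32`, which equals `VLMAX(e32m8)`; this is the equivalence the PR exploits, and it holds for any VLEN since both sides scale linearly with it.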
understood. /hold |
/unhold |
@ihb2032 please rebase to master. Thanks. |
What
This PR optimizes the non-batch fp32 RVV distance kernels by replacing the current `e32m2 x 4` accumulator implementation with a simpler `e32m8 x 1` strip-mining implementation.

Updated kernels:
- `fvec_inner_product_rvv`
- `fvec_L2sqr_rvv`
- `fvec_norm_L2sqr_rvv`

The new implementation uses a single `vfloat32m8_t` accumulator and drives the whole loop through `vsetvl_e32m8`, removing the manual 4-way unrolled main loop and the separate tail loop.

Why

The previous implementation used four `e32m2` accumulators to process `4 * VLMAX(e32m2)` elements per main-loop iteration. Since this is equivalent to `VLMAX(e32m8)`, the same number of elements can be processed with a single `e32m8` vector operation.

On Muse Pi Pro, the `e32m8 x 1` version is consistently faster than the current `e32m2 x 4` implementation. The improvement is especially stable for medium and large dimensions.

Benchmark
Tested on Muse Pi Pro.

`fvec_inner_product_rvv` (`speedup = old_m2x4_ns / new_m8x1_ns`):
For `d >= 256`, the speedup is stable around `2.62x ~ 2.74x`. The maximum observed absolute difference from the long-double reference was around `2e-6`.

`fvec_L2sqr_rvv` (`speedup = old_m2x4_ns / new_m8x1_ns`):
For `d >= 256`, the speedup is stable around `2.65x ~ 2.77x`. The maximum observed absolute difference from the long-double reference was around `7.1e-5`.

Notes
The new implementation may produce slightly different floating-point results from the previous RVV implementation because the accumulation/reduction order is different. This is expected for vectorized floating-point reductions, and the observed numerical differences are small.
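For reference, the `e32m8 x 1` strip-mining pattern can be sketched as below. This is an illustrative sketch, not the exact knowhere kernel: the function name is hypothetical, the vector path assumes the RVV v1.0 C intrinsics (a tail-undisturbed `vfmacc` so partial-VL iterations leave inactive accumulator lanes intact, then one `vfredusum` reduction at the end), and a scalar fallback is included so the sketch also compiles off RISC-V.

```c
#include <stddef.h>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Hedged sketch of an e32m8 x 1 strip-mining inner product. */
static float inner_product_f32(const float* x, const float* y, size_t d) {
#if defined(__riscv_v_intrinsic)
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    /* Single m8 accumulator, zero-initialized across all VLMAX lanes. */
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    size_t n = d;
    while (n > 0) {
        /* One vsetvl drives both the main loop and the tail. */
        size_t vl = __riscv_vsetvl_e32m8(n);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
        /* Tail-undisturbed FMA: lanes past vl keep their accumulated values. */
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, vx, vy, vl);
        x += vl;
        y += vl;
        n -= vl;
    }
    /* Reduce all VLMAX lanes of the accumulator into a scalar. */
    vfloat32m1_t zero = __riscv_vfmv_s_f_f32m1(0.0f, 1);
    vfloat32m1_t sum = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
#else
    /* Portable scalar fallback with the same result (up to reduction order). */
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
#endif
}
```

As the Notes say, the vector path reduces in a different order than a sequential scalar loop, so tiny floating-point differences between the two paths are expected.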
issue: #1614