ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361
Conversation
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_1, q8bytes_1), scale_lane_1);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_2, q8bytes_2), scale_lane_2);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_3, q8bytes_3), scale_lane_3);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_4, q8bytes_4), scale_lane_4);
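// note: all four multiply-accumulates above feed the same isum_tmp, so within
// one loop iteration they form a single serial dependency chain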
Maybe something to try is to have 4 separate accumulators here. Don't have a machine that supports SVE to give this a try.
When I implemented this fix and measured the elapsed time of ggml_vec_dot_q6_K_q8_K with perf, I found a performance degradation of about 5%.
So I think it's better to leave it as it is.
My reasoning is as follows.
By providing separate accumulators:
- The dependency chain of seven on the critical path inside the loop is reduced to one.
- Three add instructions are added outside the for loop to sum up the separated accumulators (a dependency chain of depth two).
So the loop-carried dependency chain shrinks from 7 to 1, while the instruction count increases by 3.
If the loop iterated many times, the proposed modification would be expected to improve performance. In this case, however, the loop count is only 2, so the performance degradation from the increased number of instructions dominates.
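For reference, a minimal sketch of the four-accumulator variant that was benchmarked above, reusing the names from the quoted diff (pg32_8, vzero, q6bytes_1..4, q8bytes_1..4 and scale_lane_1..4 are assumed to be produced per iteration by the surrounding kernel, and nb_blocks is a placeholder loop bound); this is not the code that was merged:

svint32_t acc_1 = svdup_n_s32(0);
svint32_t acc_2 = svdup_n_s32(0);
svint32_t acc_3 = svdup_n_s32(0);
svint32_t acc_4 = svdup_n_s32(0);

for (int j = 0; j < nb_blocks; ++j) {  // same block loop as in the kernel
    // the four multiply-accumulates now target independent registers,
    // so they no longer wait on each other within an iteration
    acc_1 = svmla_s32_x(pg32_8, acc_1, svdot_s32(vzero, q6bytes_1, q8bytes_1), scale_lane_1);
    acc_2 = svmla_s32_x(pg32_8, acc_2, svdot_s32(vzero, q6bytes_2, q8bytes_2), scale_lane_2);
    acc_3 = svmla_s32_x(pg32_8, acc_3, svdot_s32(vzero, q6bytes_3, q8bytes_3), scale_lane_3);
    acc_4 = svmla_s32_x(pg32_8, acc_4, svdot_s32(vzero, q6bytes_4, q8bytes_4), scale_lane_4);
}

// three extra adds outside the loop combine the partial sums
// (a dependency chain of depth two: two independent adds, then one)
svint32_t isum_tmp = svadd_s32_x(pg32_8, svadd_s32_x(pg32_8, acc_1, acc_2),
                                         svadd_s32_x(pg32_8, acc_3, acc_4));

As measured above, with only two loop iterations the three extra adds outweigh the shorter dependency chain.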
Hey, is this implementation expected to give a boost on mobile devices as well (which AFAIK usually have 128-bit SVE vectors), or is it meant for server-grade CPUs (with larger SVE widths)? I tested it on my Pixel 9 (which has SVE and SVE2 at 128-bit) and couldn't see any performance improvement, so I wanted to check whether I am missing something or this is expected.
This implementation will improve performance if the processor's SIMD width is 256 bits or more. With a SIMD width of 128 bits, no improvement is expected.
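As a side note, the available SVE vector length can be checked at run time. Below is a small stand-alone sketch, not ggml's actual dispatch code, which selects kernels differently:

// build with an SVE-capable toolchain, e.g. gcc -march=armv8-a+sve
#include <arm_sve.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // svcntb() returns the number of 8-bit lanes per SVE vector,
    // so multiplying by 8 gives the vector length in bits
    uint64_t vl_bits = svcntb() * 8;
    printf("SVE vector length: %llu bits\n", (unsigned long long) vl_bits);
    return 0;  // expect 128 on a Pixel 9, 512 on an FX700 (A64FX)
}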
@fj-y-saito what compiler did you use for these tests? I'm trying to replicate the tg128 result on an FX700, but your results are 50% faster in tg128. Also, if you could share your cmake options, that'd be great. Thanks!
I used gcc 11.3.0 for the tests. At the moment, I'm not sure why the performance differs in this specific test.
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q6_K_q8_K vector dot on the Arm architecture. A similar proposal for SVE support was made in PR #11227.
Verifying Features
This PR contains the SVE implementation of the vector dot used to compute the Q6_K quantization.
By running a Q4_K_M quantized model of Llama-3.1-8B, I confirmed that the values match.
The outputs of the NEON and SVE implementations were compared one after another, and they always matched.
I also verified that the perplexity matches between the NEON and SVE implementations.
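One hypothetical shape for such an element-wise check is sketched below; vec_dot_fn, run_neon and run_sve are placeholders rather than ggml symbols, and the comparison in this PR was done during the model run itself:

#include <stdio.h>

typedef void (*vec_dot_fn)(int n, float *s, const void *vx, const void *vy);

// returns 0 when both kernel variants produce identical results for one row
static int check_match(vec_dot_fn run_neon, vec_dot_fn run_sve,
                       int n, const void *vx, const void *vy) {
    float s_neon = 0.0f, s_sve = 0.0f;
    run_neon(n, &s_neon, vx, vy);
    run_sve (n, &s_sve,  vx, vy);
    if (s_neon != s_sve) {  // exact agreement is expected, as reported above
        fprintf(stderr, "mismatch: NEON %.8f vs SVE %.8f\n", s_neon, s_sve);
        return 1;
    }
    return 0;
}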
Performance check
Performance was measured on an FX700 and improved as follows (measured with llama-bench).
original
This PR