arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… #15277
base: master
Conversation
Btw, it's probably a better idea to implement GEMM improvements through the repack mechanism in ggml. It would give you more flexibility for rearranging the data to better fit the instructions.
r1 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_h), svreinterpret_s64_s8(q8bytes_1_h)));
r2 = svreinterpret_s8_s64(svzip1_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
r3 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
sumi2 = svmmla_s32(svmmla_s32(svmmla_s32(svmmla_s32(svdup_n_s32(0), r0, l0), r1, l1), r2, l2), r3, l3);
Does svmmla_s32 require a check for __ARM_FEATURE_SVE_MATMUL_INT8?
Yes. svmmla_s32() is an intrinsic for an SVE instruction, and it requires the I8MM feature on the CPU.
Could you please explain what the repack mechanism refers to, or point me to the relevant implementation details?
@ggerganov
This PR improves the q4_K_q8_K and q6_K_q8_K GEMM kernels using the arm64 i8mm instructions with SVE.
A similar proposal for NEON support was made in PR #13886.
Since it uses SVE instructions, it improves performance even on machines with a SIMD width of 128 bits or more.
Verifying features
This PR contains the SVE implementation of the vector dot product used to compute the Q4_K quantization.
By running a Q4_K_M-quantized model of Llama-3.1-8B, I confirmed that the output values match.
I also verified that the perplexity matches between the NEON and SVE implementations.
Performance check
Performance was measured on AWS Graviton3 and improves as follows (measured with llama-bench):
original
This PR