Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #15275
base: master
Conversation
Interesting that AVX512 is so much faster at prompt processing. Which of these changes is making the most difference?

@jukofyork Repacking the weights enables much more efficient use of AVX512, which is not possible with the existing layout. Thanks
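For context, the idea behind the repacking, as a minimal sketch rather than the PR's actual code (the simplified `blk` type and `repack_8` helper below are hypothetical): eight row-adjacent blocks are interleaved so the GEMM kernel can feed 8 output rows from contiguous wide loads instead of 8 strided ones.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical simplified block: 128 quant bytes (real q6_K blocks also carry
// sub-block scales and a super-block scale, interleaved the same way).
struct blk   { uint8_t q[128]; };
struct blkx8 { uint8_t q[8 * 128]; };

// Interleave `chunk`-byte pieces of 8 source blocks round-robin, so that one
// wide SIMD load reads the same chunk position from all 8 rows at once.
static void repack_8(const blk in[8], blkx8 * out, int chunk) {
    int pos = 0;
    for (int off = 0; off < 128; off += chunk) {
        for (int b = 0; b < 8; b++) {
            std::memcpy(out->q + pos, in[b].q + off, chunk);
            pos += chunk;
        }
    }
}
```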
Update: the scalar-code accuracy issues are fixed and the code is ready for further review. Thanks

Thanks - when it gets finalised I will give this a try with my dual Xeon Gold 6248 (https://en.wikichip.org/wiki/intel/xeon_gold/6248); the main thing they have is AVX-512, and I currently run large MoE models with everything in Q6_K.

Hi @slaren / @ggerganov, any thoughts on next steps for this PR? Thanks
ggerganov left a comment:

Some minor formatting comments.

The main issue, as usual, is that we don't have CI for AVX512, so it is hard to approve these changes. Will ping you if we encounter any problems in the future.
ggml/src/ggml-cpu/repack.cpp (Outdated)

```diff
-block_q6_Kx8* dst = (block_q6_Kx8*)t->data;
-const block_q6_K* src = (const block_q6_K*)data;
+block_q6_Kx8 * dst = (block_q6_Kx8 *)t->data;
+const block_q6_K * src = (const block_q6_K *)data;
```
ggml/src/ggml-cpu/repack.cpp (Outdated)

```diff
     GGML_UNUSED(data_size);
 }

-static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {
+static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor * t, int interleave_block, const void * GGML_RESTRICT data, size_t data_size) {
```
ggml/src/ggml-cpu/repack.cpp

```diff
     }
     return out;
-
-
```
ggml/src/ggml-cpu/repack.cpp

```diff
     for (int i = 0; i < 128; i++) {
-
         // Index for selecting which q6k super block
```
ggml/src/ggml-cpu/repack.cpp (Outdated)

```diff
 }

-static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {
+static block_q6_Kx8 make_block_q6_Kx8(block_q6_K * in, unsigned int blck_size_interleave) {
```
ggml/src/ggml-cpu/repack.cpp (Outdated)

```diff
-const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
-const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
-const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
-const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
+const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;
+const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
```
ggml/src/ggml-cpu/repack.cpp (Outdated)

```diff
-const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
-const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
-const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
-const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
+const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;
+const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
```
ggml/src/ggml-cpu/repack.cpp (Outdated)

```suggestion
const __m256i rhs_mat_0145_30_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_30, 221); //B30(4-7) B31(4-7) B30(4-7) B31(4-7) B34(4-7) B35(4-7) B34(4-7) B35(4-7)
const __m256i rhs_mat_2367_30_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_30, 221); //B32(4-7) B33(4-7) B32(4-7) B33(4-7) B36(4-7) B37(4-7) B36(4-7) B37(4-7)
const __m256i rhs_mat_0145_31_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_31, 221); //B30(12-15) B31(12-15) B30(12-15) B31(12-15) B34(12-15) B35(12-15) B34(12-15) B35(12-15)
const __m256i rhs_mat_2367_31_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_31, 221); //B32(12-15) B33(12-15) B32(12-15) B33(12-15) B36(12-15) B37(12-15) B36(12-15) B37(12-15)
const __m256i rhs_mat_0145_40_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_40, 221); //B40(4-7) B41(4-7) B40(4-7) B41(4-7) B44(4-7) B45(4-7) B44(4-7) B45(4-7)
const __m256i rhs_mat_2367_40_s
}
#else
ggml_gemm_q6_K_8x8_q8_K_generic(n, s, bs, vx, vy, nr, nc);
#endif
```
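A side note on the shuffle immediate used throughout this block: 221 is 0b11011101, which for `_mm256_shuffle_epi32` selects the 32-bit elements {1, 3, 1, 3} within each 128-bit lane, i.e. the odd dwords, duplicated. A quick self-contained check (my own illustration, not from the PR):

```cpp
#include <immintrin.h>
#include <cstdio>

// Compile with -mavx2. Demonstrates what _mm256_shuffle_epi32(v, 221) does:
// 221 == 0b11'01'11'01 picks dwords {1,3,1,3} within each 128-bit lane.
int main() {
    __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i r = _mm256_shuffle_epi32(v, 221);
    int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]); // prints: 1 3 1 3 5 7 5 7
    printf("\n");
    return 0;
}
```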
This failure is a bit suspicious: https://github.com/ggml-org/llama.cpp/actions/runs/19328651203/job/55285977638?pr=15275 - will rerun the CI and see if it happens again.
Block Interleaving Formats

Block_Q6_Kx8:
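The description does not reproduce the struct here, but based on the existing block_q6_K definition (QK_K = 256) the repacked type plausibly looks like the following sketch; the field order is an assumption for illustration:

```cpp
#include <cstdint>

#define QK_K 256
typedef uint16_t ggml_half; // fp16 bit pattern, as stored by ggml

// Sketch of eight q6_K super-blocks repacked into one interleaved structure.
// Member sizes follow from block_q6_K (ql: QK_K/2, qh: QK_K/4, scales: QK_K/16
// bytes per block, times 8 blocks); the exact ordering is assumed.
struct block_q6_Kx8 {
    ggml_half d[8];                  // one super-block scale per source block
    int8_t    scales[8 * QK_K / 16]; // per-16-element sub-block scales, interleaved
    uint8_t   ql[8 * QK_K / 2];      // lower 4 bits of the 6-bit quants, interleaved
    uint8_t   qh[8 * QK_K / 4];      // upper 2 bits of the 6-bit quants, interleaved
};
```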
Performance numbers with the llama2 7B model quantized to Q6_K are attached here.

GCC Linux:

Q6_K Model:

GCC Version = 12.3
The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |