
Conversation

Srihari-mcw (Collaborator) commented on Aug 12, 2025

  • The PR adds a block interleaving approach for Q6_K quantization on the x86/x64 SIMD architecture
  • Initial gains were observed in prompt processing with these changes compared to the existing Q6_K implementation
  • The GEMM kernels are implemented for the AVX512 and AVX2 architectures, and the GEMV kernels are implemented for AVX2
  • The repack_q6_K_to_q6_K_8_bl function rearranges the weights from the Q6_K format into the Q6_Kx8 format (block_q6_Kx8); a rough sketch of this entry point is shown right after this list
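
A minimal sketch of what this repack entry point might look like, modeled on the existing repack_*_to_*_8_bl helpers in ggml-cpu. The signature and the dst/src casts are taken from the review excerpts further down; the loop structure and the assertions are assumptions, and block_q6_Kx8 / make_block_q6_Kx8 are the interleaved block type and packing helper described in the next section:

static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor * t, int interleave_block, const void * GGML_RESTRICT data, size_t data_size) {
    GGML_ASSERT(t->type == GGML_TYPE_Q6_K);
    GGML_ASSERT(interleave_block == 8);

    block_q6_Kx8 * dst = (block_q6_Kx8 *)t->data;       // repacked destination (assumed layout, see next section)
    const block_q6_K * src = (const block_q6_K *)data;  // original Q6_K weights
    block_q6_K dst_tmp[8];

    const int     nrows_interleaved = 8;
    const int64_t nrow    = ggml_nrows(t);
    const int64_t nblocks = t->ne[0] / QK_K;

    GGML_ASSERT(data_size == (size_t)(nrow * nblocks) * sizeof(block_q6_K));

    if (t->ne[1] % nrows_interleaved != 0) {
        return -1; // rows must come in groups of 8 to interleave
    }

    // every group of 8 source rows is fused into one row of block_q6_Kx8 blocks
    for (int64_t b = 0; b < nrow; b += nrows_interleaved) {
        for (int64_t x = 0; x < nblocks; x++) {
            // gather the x-th Q6_K block of each of the 8 rows ...
            for (int i = 0; i < nrows_interleaved; i++) {
                dst_tmp[i] = src[x + i * nblocks];
            }
            // ... and interleave them into a single block_q6_Kx8
            *dst++ = make_block_q6_Kx8(dst_tmp, interleave_block);
        }
        src += nrows_interleaved * nblocks;
    }
    return 0;
}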

Block Interleaving Formats

Block_Q6_Kx8:

  • Contains the data of 8 Q6_K blocks in an interleaved fashion
  • uint8 scales[128] - the scales are taken from the source Q6_K blocks. Every 16 bytes are packed so that they hold the scales of the corresponding sub-blocks of the Q6_K structure (there are 16 sub-blocks in the original Q6_K structure)
  • The d values from the source Q6_K blocks are stored together in an array
  • The quant values (hbits and lbits) from the source Q6_K blocks are extracted sequentially and interleaved in groups of eight bytes; a sketch of the resulting layout follows below
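
For reference, a minimal sketch of the two layouts in C, assuming QK_K = 256 and the standard ggml block_q6_K definition; the field order and exact typing of block_q6_Kx8 in this PR may differ (the description above lists the scales as uint8, while block_q6_K stores them as int8):

#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;    // fp16 storage type, as in ggml

// standard Q6_K super-block of 256 quants, as defined in ggml
typedef struct {
    uint8_t   ql[QK_K/2];      // low 4 bits of the quants
    uint8_t   qh[QK_K/4];      // high 2 bits of the quants
    int8_t    scales[QK_K/16]; // scales of the 16 sub-blocks
    ggml_half d;               // super-block scale
} block_q6_K;

// hypothetical interleaved layout holding 8 Q6_K blocks (block_q6_Kx8)
typedef struct {
    ggml_half d[8];                // the 8 super-block d values stored together
    uint8_t   ql[QK_K/2 * 8];      // low bits of all 8 blocks, interleaved in groups of 8 bytes
    uint8_t   qh[QK_K/4 * 8];      // high bits of all 8 blocks, interleaved in groups of 8 bytes
    int8_t    scales[QK_K/16 * 8]; // sub-block scales of the 8 blocks, packed in 16-byte groups as described above
} block_q6_Kx8;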

Performance numbers with the llama2 7B model quantized to Q6_K are tabulated below

GCC Linux:

Q6_K Model:

| model         | size     | params | backend | threads | test   | t/s          | speedup | Commit id                |
| ------------- | -------- | ------ | ------- | ------- | ------ | ------------ | ------- | ------------------------ |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | pp 512 | 40.22 ± 0.04 |         | 79c116 - Base Commit     |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | pp 512 | 45.51 ± 0.07 | 13.15%  | 3b3d551 - AVX2 Version   |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | pp 512 | 59.81 ± 0.11 | 48.71%  | 3b3d551 - AVX512 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | tg 128 | 10.55 ± 0.00 |         | 79c116 - Base Commit     |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | tg 128 | 10.29 ± 0.00 | -2.46%  | 3b3d551 - AVX2 Version   |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU     | 6       | tg 128 | 10.29 ± 0.00 | -2.46%  | 3b3d551 - AVX512 Version |

GCC Version = 12.3

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 12, 2025
Srihari-mcw (Collaborator, Author) commented:

The perplexity results with llama2 7B are tabulated as follows:

| model         | perplexity (Final estimate PPL) | Commit id                |
| ------------- | ------------------------------- | ------------------------ |
| llama 7B Q6_K | 5.8164 +/- 0.03250              | 79c116 - Base Commit     |
| llama 7B Q6_K | 5.8163 +/- 0.03250              | 3b3d551 - Updated Commit |

jukofyork (Collaborator) commented:

> AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting that AVX512 is so much faster for prompt processing. Which of these is making the most difference?

Srihari-mcw (Collaborator, Author) commented on Aug 15, 2025

> > AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1
>
> Interesting that AVX512 is so much faster for prompt processing. Which of these is making the most difference?

@jukofyork Repacking the weights enables much more efficient use of AVX512, which is not the case with the existing setup. Thanks

Srihari-mcw (Collaborator, Author) commented:

Update: the scalar code accuracy issues are fixed and the code is ready for further review. Thanks

jukofyork (Collaborator) commented on Aug 15, 2025

> > > AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1
> >
> > Interesting that AVX512 is so much faster for prompt processing. Which of these is making the most difference?
>
> @jukofyork Repacking the weights enables much more efficient use of AVX512, which is not the case with the existing setup. Thanks

Thanks - when it gets finalised I will give this a try on my dual Xeon Gold 6248:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

system_info: n_threads = 80 (n_threads_batch = 80) / 80 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

The main thing they have is AVX512_VNNI = 1 due to being Cascade Lake, so it will be interesting to see what effect this PR has.

I currently run large MoE models with everything in Q6_K apart from the non-shared expert tensors, for which I use Q4_K, and the Q4_K is only kept in CPU RAM for small batch sizes where the cost of offloading to the GPU over the PCIe bus is too high.

Srihari-mcw (Collaborator, Author) commented:

Hi @slaren / @ggerganov, any thoughts on further steps with regard to this PR? Thanks

ggerganov (Member) left a review comment:

Some minor formatting comments.

The main issue, as usual, is that we don't have CI for AVX512, so it is hard to approve these changes. Will ping you if we encounter any problems in the future.

Comment on lines 1423 to 1424
block_q6_Kx8* dst = (block_q6_Kx8*)t->data;
const block_q6_K* src = (const block_q6_K*)data;

Suggested change
- block_q6_Kx8* dst = (block_q6_Kx8*)t->data;
- const block_q6_K* src = (const block_q6_K*)data;
+ block_q6_Kx8 * dst = (block_q6_Kx8 *)t->data;
+ const block_q6_K * src = (const block_q6_K *)data;

GGML_UNUSED(data_size);
}

static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {

Suggested change
- static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {
+ static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor * t, int interleave_block, const void * GGML_RESTRICT data, size_t data_size) {

Comment on lines 1321 to 1323
}
return out;


Suggested change
}
return out;
}
return out;

Comment on lines 1313 to 1315
for (int i = 0; i < 128; i++) {

// Index for selecting which q6k super block

Suggested change
for (int i = 0; i < 128; i++) {
// Index for selecting which q6k super block
for (int i = 0; i < 128; i++) {
// Index for selecting which q6k super block

}


static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {

Suggested change
- static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {
+ static block_q6_Kx8 make_block_q6_Kx8(block_q6_K * in, unsigned int blck_size_interleave) {

Comment on lines 988 to 991
const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

Suggested change
- const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
- const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
- const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
- const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
+ const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;
+ const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+ const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+ const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

Comment on lines 522 to 525
const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

Suggested change
- const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
- const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
- const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
- const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;
+ const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;
+ const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+ const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+ const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

Comment on lines 10227 to 10234
const __m256i rhs_mat_0145_30_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_30, 221); //B30(4-7) B31(4-7) B30(4-7) B31(4-7) B34(4-7) B35(4-7) B34(4-7) B35(4-7)
const __m256i rhs_mat_2367_30_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_30, 221); //B32(4-7) B33(4-7) B32(4-7) B33(4-7) B36(4-7) B37(4-7) B36(4-7) B37(4-7)

const __m256i rhs_mat_0145_31_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_31, 221); //B30(12-15) B31(12-15) B30(12-15) B31(12-15) B34(12-15) B35(12-15) B34(12-15) B35(12-15)
const __m256i rhs_mat_2367_31_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_31, 221); //B32(12-15) B33(12-15) B32(12-15) B33(12-15) B36(12-15) B37(12-15) B36(12-15) B37(12-15)

const __m256i rhs_mat_0145_40_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_40, 221); //B40(4-7) B41(4-7) B40(4-7) B41(4-7) B44(4-7) B45(4-7) B44(4-7) B45(4-7)
const __m256i rhs_mat_2367_40_s

Suggested change
const __m256i rhs_mat_0145_30_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_30, 221); //B30(4-7) B31(4-7) B30(4-7) B31(4-7) B34(4-7) B35(4-7) B34(4-7) B35(4-7)
const __m256i rhs_mat_2367_30_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_30, 221); //B32(4-7) B33(4-7) B32(4-7) B33(4-7) B36(4-7) B37(4-7) B36(4-7) B37(4-7)
const __m256i rhs_mat_0145_31_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_31, 221); //B30(12-15) B31(12-15) B30(12-15) B31(12-15) B34(12-15) B35(12-15) B34(12-15) B35(12-15)
const __m256i rhs_mat_2367_31_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_31, 221); //B32(12-15) B33(12-15) B32(12-15) B33(12-15) B36(12-15) B37(12-15) B36(12-15) B37(12-15)
const __m256i rhs_mat_0145_40_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_40, 221); //B40(4-7) B41(4-7) B40(4-7) B41(4-7) B44(4-7) B45(4-7) B44(4-7) B45(4-7)
const __m256i rhs_mat_2367_40_s
}
#else
ggml_gemm_q6_K_8x8_q8_K_generic(n, s, bs, vx, vy, nr, nc);
#endif

Srihari-mcw requested a review from slaren as a code owner on November 13, 2025 10:36
ggerganov (Member) commented:

This failure is a bit suspicious: https://github.com/ggml-org/llama.cpp/actions/runs/19328651203/job/55285977638?pr=15275

Will rerun the CI and see if it happens again.
