
Conversation

@shalinib-ibm
Contributor

This patch implements tiled GEMM for large blocks: blocks of 64x64 are packed and the matmul is performed on the packed tiles.

This gives a 30–50% improvement in llama-bench and llama-batched-bench with quantized Meta-Llama-3-8B models (Q4_0 and Q8_0).


Signed-off-by: Shalini Salomi Bodapati <[email protected]>
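
For readers new to the technique, the following is a minimal, hardware-agnostic sketch of the tiling idea in plain C++ (float matrices, no quantization, no Power10 MMA intrinsics, and no blocking over k); the patch itself packs quantized 64x64 tiles and feeds them to the MMA kernels.

#include <algorithm>
#include <cstdint>
#include <vector>

// Multiply an m x k matrix A by a k x n matrix B into an m x n matrix C,
// working on 64 x 64 output tiles: the rows of A and columns of B needed
// for one tile are packed into contiguous buffers so they stay cache
// resident while the tile is computed.
static void gemm_tiled(const float * A, const float * B, float * C,
                       int64_t m, int64_t n, int64_t k) {
    const int64_t TILE = 64;
    std::vector<float> A_pack(TILE * k);
    std::vector<float> B_pack(TILE * k);

    for (int64_t ii = 0; ii < m; ii += TILE) {
        const int64_t mc = std::min(TILE, m - ii);
        for (int64_t i = 0; i < mc; ++i) {          // pack mc rows of A
            for (int64_t p = 0; p < k; ++p) {
                A_pack[i*k + p] = A[(ii + i)*k + p];
            }
        }
        for (int64_t jj = 0; jj < n; jj += TILE) {
            const int64_t nc = std::min(TILE, n - jj);
            for (int64_t j = 0; j < nc; ++j) {      // pack nc columns of B
                for (int64_t p = 0; p < k; ++p) {
                    B_pack[j*k + p] = B[p*n + jj + j];
                }
            }
            for (int64_t i = 0; i < mc; ++i) {      // dense matmul on the packed tile
                for (int64_t j = 0; j < nc; ++j) {
                    float sum = 0.0f;
                    for (int64_t p = 0; p < k; ++p) {
                        sum += A_pack[i*k + p] * B_pack[j*k + p];
                    }
                    C[(ii + i)*n + jj + j] = sum;
                }
            }
        }
    }
}

Packing turns strided accesses into sequential ones, so a tile's operands stay in cache while the inner loops run.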
@shalinib-ibm
Contributor Author

@taronaeo Can you please review this PR?

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 4, 2025
@shalinib-ibm
Contributor Author

@ggerganov Can you please review this PR?

Comment on lines +122 to +145

#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

typedef struct {
    vec_t * A_pack;
    vec_t * B_pack;
    int   * comparray;
} thread_scratchpad_t;

void thread_cleanup(void * arg) {
    thread_scratchpad_t * data = (thread_scratchpad_t *) arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;
        delete data;
    }
}

static bool key_created = false;
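
For context, this scratchpad is the pthread thread-local-storage pattern: each worker thread lazily allocates its own buffers once and reuses them across tiles, with thread_cleanup() freeing them at thread exit. A hypothetical getter (not part of the diff; the get_scratchpad name and the one-time key setup are illustrative and would need pthread_once() to be race-free) might look like:

// Hypothetical per-thread accessor: the first call on each thread allocates
// its scratchpad and registers it with the key; later calls reuse it, and
// thread_cleanup() frees it when the thread exits.
static thread_scratchpad_t * get_scratchpad(int64_t mc, int64_t nc, int64_t kc) {
    if (!key_created) {                       // real code should use pthread_once()
        pthread_key_create(&t_data_key, thread_cleanup);
        key_created = true;
    }
    thread_scratchpad_t * data = (thread_scratchpad_t *) pthread_getspecific(t_data_key);
    if (data == nullptr) {
        data = new thread_scratchpad_t;
        data->A_pack    = new vec_t[mc * kc * 2];
        data->B_pack    = new vec_t[nc * kc * 2];
        data->comparray = new int[mc * kc];
        pthread_setspecific(t_data_key, data);
    }
    return data;
}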

Member

It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
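
A minimal sketch of that sizing, assuming a fixed 64x64 tile, the vec_t typedef from the excerpt above, and ggml's GGML_PAD macro (the tile_scratch_size name and the 128-byte alignment are illustrative, not actual patch code): the worst-case wsize would be roughly nth times this value, and each thread would then carve its own slice of params->wdata.

// Illustrative worst-case scratch bytes for one thread's 64x64 tile.
// The op would reserve about nth * tile_scratch_size() (plus padding) in
// wsize, and each thread would index its own slice of params->wdata by ith.
static size_t tile_scratch_size(void) {
    const size_t TILE  = 64;
    const size_t align = 128;
    return GGML_PAD(TILE*TILE*2*sizeof(vec_t), align)   // A_pack
         + GGML_PAD(TILE*TILE*2*sizeof(vec_t), align)   // B_pack
         + GGML_PAD(TILE*TILE*sizeof(int),     align);  // comparray
}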

Contributor Author

@ggerganov Thank you for the input. I tried to remove the dynamic allocation from the code, but performance drops without the pthread-based approach. Below is the code and a performance comparison after integrating a thread-local scratchpad using wdata.

void matmul_tiled(const ggml_compute_params * params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char * wdata = (char *) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char * ptr, size_t alignment) {
        return (char *)(((uintptr_t) ptr + alignment - 1) & ~(alignment - 1));
    };
    char * ptr = align_ptr(wdata, ALIGN);
    vec_t * A_pack = (vec_t *) ptr;  ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t * B_pack = (vec_t *) ptr;  ptr += sizeof(vec_t) * nc * kc * 2;
    int * comparray = (int *) align_ptr(ptr, ALIGN);  // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}

Benchmark (llama-bench)    Baseline    pthread-based TLS    ggml wdata-based TLS
pp128                      69 t/s      89 t/s               36 t/s
pp256                      69 t/s      94 t/s               36 t/s

This regression is likely due to:

  1. Loss of persistent per-thread cache locality — the previous pthread-based version reused buffers effectively across tiles.
  2. Higher memory initialization or shared buffer contention across threads.

I have also tried static allocation on the stack with just this code, but it suffers from a similar performance loss (38 t/s):

vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int comparray[mc*kc];

Can you please suggest an alternative?

Member

void matmul_tiled()

Your wdata pointer has to take the thread ID into account. For example:

float * wdata = (float *) params->wdata + (ne00 + CACHE_LINE_SIZE_F32) * ith;

From what I can tell from your codeblock, all threads are currently working on the same wdata range and most likely they are data racing. Unless I'm missing something 🤔

@shalinib-ibm
Contributor Author, Nov 12, 2025

@ggerganov Thank you so much for the input. I have implemented the approach you suggested, but we still see the best performance with the pthread-based dynamic memory allocation. Here are the code and results.

void matmul_tiled(const struct ggml_compute_params * params, int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    const int ith = params->ith;
    const int nth = params->nth;

    const int64_t TILE_SIZE = 64;
    const size_t vec_t_sz = 16;
    const size_t int_sz   = 4;
    const size_t align    = (size_t)GGML_CACHE_LINE_SIZE;

    const size_t A_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t B_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t C_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * int_sz;

    const size_t A_aligned = GGML_PAD(A_raw_bytes, align);
    const size_t B_aligned = GGML_PAD(B_raw_bytes, align);
    const size_t C_aligned = GGML_PAD(C_raw_bytes, align);

    const size_t S_PER_THREAD_MAX = GGML_PAD(A_aligned + B_aligned + C_aligned, align);
    uint8_t * base_u8 = reinterpret_cast<uint8_t *>(params->wdata);
    uint8_t * thread_base_unaligned = base_u8 + (S_PER_THREAD_MAX + (align - 1)) * (size_t)ith;
    uint8_t * p = (uint8_t *)GGML_PAD((uintptr_t)thread_base_unaligned, align);
    vec_t * A_pack = reinterpret_cast<vec_t *>(p);
    p += A_aligned;
    vec_t * B_pack = reinterpret_cast<vec_t *>(p);
    p += B_aligned;
    int * comparray = reinterpret_cast<int *>(p);

    constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;
    int64_t ytiles = m / mc;
    int64_t xtiles = n / nc;
    int64_t tiles  = xtiles * ytiles;
    int64_t duty   = (tiles + nth - 1) / nth;
    int64_t start  = duty * ith;
    int64_t end    = start + duty;
    if (end > tiles) {
        end = tiles;
    }

    for (int64_t job = start; job < end; ++job) {
        int64_t ii = (job / xtiles) * mc;
        int64_t jj = (job % xtiles) * nc;
        for (int64_t kk = 0; kk < k; kk += kc) {
            if constexpr (is_Ablock_q4) {
                packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t*)A_pack, comparray);
            } else {
                packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t*)A_pack, false, comparray);
            }
            packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t*)B_pack, true);
            KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
        }
    }
}

Summary of Thread Model Performance Evaluation (Power10)

We compared three builds of llama.cpp on Power10 with the same configuration (Meta-Llama-3-8B Q4_0, 20 threads, prompt length 128, 1 generated token, measured with llama-bench):

Build Type             pp128 t/s    Cycles (↓ better)    IPC     Elapsed Time (s)
Base (upstream)        68.08        841B                 2.56    13.34
GGML thread patch      52.32        1076B                1.59    16.60
Pthread-based patch    84.27        625B                 2.46    11.04

Observations

  • The ggml-thread patch shows a ~25% regression vs base (pp128: 68 → 52 t/s) and a ~28% increase in total cycles, indicating higher synchronization or scheduling overhead.
  • The pthread-based version outperforms both:
    • +24% faster than base for pp128 (84.27 vs 68.08 t/s),
    • ~34% fewer cycles and ~17% lower elapsed time (11.0s vs 13.3s),
    • IPC and cache behavior remain healthy and consistent.

Given these results:

  • The params->wdata approach adds noticeable overhead on Power10.
  • The pthread-based implementation provides clear performance benefits and better scaling with available cores.

Contributor Author

Hi @ggerganov
I’ve added a note explaining that the params->wdata approach didn’t provide benefits.
When you have some time, could you please take another look at the patch?
Thank you!
