Q4/Q8 Tiled Gemm Optimization. #16999
base: master
Conversation
This patch implements tiled GEMM for large blocks: we pack 64x64 blocks and perform the matmul on them. It gives a 30~50% improvement in llama-bench and llama-batched-bench with Meta-Llama-3-8B quantized models (Q4_0 and Q8_0). Signed-off-by: Shalini Salomi Bodapati <[email protected]>
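For reference, here is a minimal scalar sketch of the blocked GEMM loop structure described above. It is illustrative only: the actual patch packs quantized Q4_0/Q8_0 blocks and runs vectorized kernels on the packed panels, and the sketch assumes B is stored transposed (n x k), matching the B + jj*ldb + kk indexing used later in this thread.

#include <algorithm>
#include <cstdint>

// Illustrative only: 64x64-tiled int8 GEMM with int32 accumulation.
// A is m x k, B is stored as n x k (transposed), C is m x n, all row-major.
static void gemm_tiled_ref(const int8_t * A, const int8_t * B, int32_t * C,
                           int64_t m, int64_t n, int64_t k) {
    const int64_t MC = 64, NC = 64, KC = 64;
    for (int64_t ii = 0; ii < m; ii += MC)
    for (int64_t jj = 0; jj < n; jj += NC)
    for (int64_t kk = 0; kk < k; kk += KC)
        // update the MC x NC tile of C with the current KC slice of A and B
        for (int64_t i = ii; i < std::min(ii + MC, m); ++i)
        for (int64_t j = jj; j < std::min(jj + NC, n); ++j) {
            int32_t acc = (kk == 0) ? 0 : C[i*n + j];
            for (int64_t l = kk; l < std::min(kk + KC, k); ++l) {
                acc += (int32_t) A[i*k + l] * (int32_t) B[j*k + l];
            }
            C[i*n + j] = acc;
        }
}

The real kernels additionally pack each A and B panel into contiguous, vector-friendly scratch buffers (A_pack/B_pack) once per tile before running the micro-kernel, which is where the 64x64 packing mentioned above comes in.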
@taronaeo Can you please review this PR?

@ggerganov Can you please review this PR?
#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

typedef struct {
    vec_t* A_pack;
    vec_t* B_pack;
    int*   comparray;
} thread_scratchpad_t;

// destructor registered with the pthread key: frees one thread's scratch buffers
void thread_cleanup(void* arg) {
    thread_scratchpad_t* data = (thread_scratchpad_t*)arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;
        delete data;
    }
}

static bool key_created = false;
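For context, here is a minimal sketch of how such a per-thread scratchpad would typically be created and looked up; the helper name get_scratchpad and the once-only key handling are illustrative and not necessarily identical to the patch:

// Illustrative: returns this thread's scratch buffers, allocating them on first use.
// thread_cleanup (above) is registered as the key destructor and frees the buffers
// when the thread exits.
static thread_scratchpad_t * get_scratchpad(int64_t mc, int64_t nc, int64_t kc) {
    if (!key_created) {  // note: real code should guard this, e.g. with pthread_once
        pthread_key_create(&t_data_key, thread_cleanup);
        key_created = true;
    }
    thread_scratchpad_t * data = (thread_scratchpad_t *) pthread_getspecific(t_data_key);
    if (data == NULL) {
        data = new thread_scratchpad_t;
        data->A_pack    = new vec_t[mc * kc * 2];
        data->B_pack    = new vec_t[nc * kc * 2];
        data->comparray = new int[mc * kc];
        pthread_setspecific(t_data_key, data);
    }
    return data;
}

The buffers persist for the lifetime of the thread, which is what the later discussion about cache locality refers to.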
It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
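A minimal sketch of what this suggestion amounts to, assuming 64x64 tiles; the sizing is illustrative, and the worst-case wsize would have to be reserved wherever the CPU backend computes work-buffer sizes for this op:

// Illustrative: each thread takes its own cache-line-padded slice of params->wdata.
const size_t per_thread = GGML_PAD(2*64*64*sizeof(vec_t)   // A_pack
                                 + 2*64*64*sizeof(vec_t)   // B_pack
                                 +   64*64*sizeof(int),    // comparray
                                   GGML_CACHE_LINE_SIZE);
char * scratch = (char *) params->wdata + per_thread * params->ith;
GGML_ASSERT(params->wsize >= per_thread * params->nth);    // worst case reserved up front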
@ggerganov Thank you for the input. I tried to avoid dynamic allocation in the code, but lost performance without the pthread-based code. Below are the code and the performance comparison after integrating a thread-local scratchpad using wdata.
void matmul_tiled(const ggml_compute_params* params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char* wdata = (char*) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char* ptr, size_t alignment) {
        return (char*)(((uintptr_t)ptr + alignment - 1) & ~(alignment - 1));
    };
    char* ptr = align_ptr(wdata, ALIGN);
    vec_t* A_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t* B_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * nc * kc * 2;
    int* comparray = (int*)align_ptr(ptr, ALIGN); // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}
| Benchmark (llama-bench) | Baseline | pthread-based TLS | ggml wdata-based TLS |
|---|---|---|---|
| pp128 | 69 t/s | 89 t/s | 36 t/s |
| pp256 | 69 t/s | 94 t/s | 36 t/s |
This regression is likely due to:
- Loss of persistent per-thread cache locality — the previous pthread-based version reused buffers effectively across tiles.
- Higher memory initialization or shared buffer contention across threads.
I have also tried static allocation on the stack with just this code, but it suffers from a similar perf loss (38 t/s):
vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int comparray[mc*kc];
Can you please suggest?
void matmul_tiled()
Your wdata pointer has to take into account the thread ID. For example:
llama.cpp/ggml/src/ggml-cpu/ops.cpp, lines 615 to 617 in 7f09a68:
    float * wdata = (float *) params->wdata + (ne00 + CACHE_LINE_SIZE_F32) * ith;
From what I can tell from your codeblock, all threads are currently working on the same wdata range and most likely they are data racing. Unless I'm missing something 🤔
@ggerganov Thank you so much for the input. I have implemented the approach you suggested, but we still get the best performance with pthread-based dynamic memory allocation only. Here are the code and results.
void matmul_tiled(const struct ggml_compute_params* params, int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    const int ith = params->ith;
    const int nth = params->nth;

    // per-thread scratch sizes for 64x64 tiles, padded to the cache line size
    const int64_t TILE_SIZE = 64;
    const size_t vec_t_sz = 16;
    const size_t int_sz = 4;
    const size_t align = (size_t)GGML_CACHE_LINE_SIZE;
    const size_t A_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t B_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t C_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * int_sz;
    const size_t A_aligned = GGML_PAD(A_raw_bytes, align);
    const size_t B_aligned = GGML_PAD(B_raw_bytes, align);
    const size_t C_aligned = GGML_PAD(C_raw_bytes, align);
    const size_t S_PER_THREAD_MAX = GGML_PAD(A_aligned + B_aligned + C_aligned, align);

    // each thread carves its own aligned slice out of params->wdata
    uint8_t* base_u8 = reinterpret_cast<uint8_t*>(params->wdata);
    uint8_t* thread_base_unaligned = base_u8 + (S_PER_THREAD_MAX + (align - 1)) * (size_t)ith;
    uint8_t* p = (uint8_t*)GGML_PAD((uintptr_t)thread_base_unaligned, align);
    vec_t* A_pack = reinterpret_cast<vec_t*>(p);
    p += A_aligned;
    vec_t* B_pack = reinterpret_cast<vec_t*>(p);
    p += B_aligned;
    int* comparray = reinterpret_cast<int*>(p);

    constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;

    // split the output tiles evenly across threads
    int64_t ytiles = m / mc;
    int64_t xtiles = n / nc;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles) {
        end = tiles;
    }
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = (job / xtiles) * mc;
        int64_t jj = (job % xtiles) * nc;
        for (int64_t kk = 0; kk < k; kk += kc) {
            if constexpr (is_Ablock_q4) {
                packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t*)A_pack, comparray);
            } else {
                packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t*)A_pack, false, comparray);
            }
            packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t*)B_pack, true);
            KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
        }
    }
}
Summary of Thread Model Performance Evaluation (Power10)
We compared three builds of llama.cpp on Power10 for the same configuration (Meta-Llama-3-8B Q4_0, 20 threads, prompt 128, 1 token with llama-bench):
| Build Type | pp128 t/s | Cycles (↓ better) | IPC | Elapsed Time (s) |
|---|---|---|---|---|
| Base (upstream) | 68.08 | 841B | 2.56 | 13.34 |
| GGML thread patch | 52.32 | 1076B | 1.59 | 16.60 |
| Pthread-based patch | 84.27 | 625B | 2.46 | 11.04 |
Observations
- The ggml-thread patch shows a ~25% regression vs base (pp128: 68 → 52 t/s) and a ~28% increase in total cycles, indicating higher synchronization or scheduling overhead.
- The pthread-based version outperforms both:
  - +24% faster than base for pp128 (84.27 vs 68.08 t/s),
  - ~34% fewer cycles and ~17% lower elapsed time (11.0 s vs 13.3 s),
  - IPC and cache behavior remain healthy and consistent.
Given these results:
- The params->wdata approach adds noticeable overhead on Power10.
- The pthread-based implementation provides clear performance benefits and better scaling with available cores.
Hi @ggerganov
I’ve added a note explaining that the params->wdata approach didn’t provide benefits.
When you have some time, could you please take another look at the patch?
Thank you!