[Perf] [CPU] eliminate redundant memory access in group query attention #13319
Conversation
Signed-off-by: ZelinMa557 <[email protected]>
Just tested this out of curiosity: Qwen 3 degrades in quality (ignores […]

Hi, thanks for your reply! It does not break compatibility with older models in theory, but there might be small bugs in my implementation. In my test it works with Qwen 2.5 7B. Can you tell me the Qwen 3 model size you used to test? I will test both Qwen 3 and Mistral to debug.

I've tested both the 8B and 4B models in Q6; both worked correctly without this PR. Mistral Small 2 is in Q5_K_L and works correctly on main too.

Thanks, I have reproduced the same problem. I will try to fix it.
Signed-off-by: ZelinMa557 <[email protected]>
I have fixed the bug. Are there any scripts to format the code locally? This PR cannot pass the lint check right now.

Thank you! I've already deleted the Qwen models, unfortunately, but Mistral Small 2 generates text correctly now. I'll test it a bit more with other models, but so far it seems to be fixed. On an i7-8700 with Mistral Small 3 (the 24B one, Q4_K_M) I get 2.08 t/s with this PR vs. 1.97 t/s on current main.
Signed-off-by: ZelinMa557 <[email protected]>
Hm, I opened your PR in my editor and saw this: […] I ran […] Here's a patch to fix the change, if you can't find the line:

From 3c7b2ed48acfcb5a9c06846ed0b548b3e48707af Mon Sep 17 00:00:00 2001
From: Excigma <[email protected]>
Date: Mon, 12 May 2025 15:46:03 +1200
Subject: [PATCH] style: remove trailing whitespace
---
 ggml/src/ggml-cpu/ops.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 250b6abc..a1481d9e 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -7055,7 +7055,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
 
             const float * pq = (const float *) ((char *) q->data + (iq1*nbq1 + (iq2 + i_gqa)*nbq2 + iq3*nbq3));
             q_to_vec_dot(pq, Q_q[i_gqa], DK);
-  
+
             const uint32_t h = iq2 + i_gqa;
             slope[i_gqa] = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
         }
-- 
2.49.0
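
(Aside, in case it's useful: the patch above is git format-patch mailbox output, so it can be applied directly. The saved filename below is an assumption.)

    # apply as a commit, preserving the author and message
    git am fix-trailing-whitespace.patch

    # or apply just the diff without creating a commit
    git apply fix-trailing-whitespace.patch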
Signed-off-by: ZelinMa557 <[email protected]>
@slaren Hi, would you mind reviewing this PR when you have time? I've verified the changes pass all tests and followed the contribution guidelines. Happy to address any feedback! Thanks for your time! 🙏
I tried this, but the performance I see on my system is not always better, and in some cases it is worse than master. The problem with testing with […] You can try these commands to get started with these tools:
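
(A hedged sketch of typical invocations of the two llama.cpp tools used for this kind of measurement; the model path and flag values here are assumptions, not the exact commands from this comment.)

    # end-to-end CPU benchmark: prompt processing (pp) and token generation (tg) speed
    ./build/bin/llama-bench -m models/qwen2.5-7b-instruct-q4_k_m.gguf -p 512 -n 128 -fa 1

    # microbenchmark of the flash-attention op in isolation
    ./build/bin/test-backend-ops perf -o FLASH_ATTN_EXT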
    
          
Thanks for your patient guidance! I will try these commands.

Modern LLMs (Llama 3, Qwen 2.5, etc.) usually use grouped-query attention (GQA), which significantly reduces the memory used by the KV cache. With GQA, the query rows of neighboring query heads share the KV rows of the same KV head, so we can reorder the loop (see the sketch below) to improve the spatial locality of memory access. The original implementation of the CPU flash-attention kernel didn't take advantage of this; this PR improves it.
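
As a minimal compilable sketch of the reordering: identifiers other than i_gqa and iq2 (which appear in the patch earlier in this thread) are illustrative, not the actual names in ggml-cpu/ops.cpp, and the real kernel is ggml_compute_forward_flash_attn_ext_f16.

    // Sketch: why grouping query heads per KV head improves locality.
    #include <cstdio>

    constexpr int n_q_heads  = 32;                     // e.g. Llama 3 8B
    constexpr int n_kv_heads = 8;
    constexpr int n_gqa      = n_q_heads / n_kv_heads; // 4 query heads share one KV head
    constexpr int n_kv       = 4;                      // KV cache length (tiny, for demo)

    // stand-in for one "query row attends to one KV row" step
    static void attend(int iq2, int ik2, int ic) {
        std::printf("q head %2d reads kv head %d, row %d\n", iq2, ik2, ic);
    }

    int main() {
        // Before: each query head scans the whole KV cache on its own,
        // so every K/V row is streamed from memory n_gqa times.
        for (int iq2 = 0; iq2 < n_q_heads; iq2++) {
            const int ik2 = iq2 / n_gqa;   // KV head shared by this group
            for (int ic = 0; ic < n_kv; ic++) {
                attend(iq2, ik2, ic);
            }
        }

        // After: process the n_gqa query heads of a group together, so each
        // K/V row is loaded once and reused while it is still hot in cache.
        for (int ik2 = 0; ik2 < n_kv_heads; ik2++) {
            for (int ic = 0; ic < n_kv; ic++) {
                for (int i_gqa = 0; i_gqa < n_gqa; i_gqa++) {
                    attend(ik2*n_gqa + i_gqa, ik2, ic);
                }
            }
        }
        return 0;
    }

In this sketch the redundant K/V reads drop by a factor of n_gqa; the end-to-end gain of the real kernel is smaller (see the numbers below) because attention is only part of the total work.
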
This is my test command: […]

The master branch result: […]

With the optimization, the result is: […]

We can see a slight speed-up in prefill and a 25% speed-up in decode! Decode is typically memory-bandwidth bound, which is likely why reusing each loaded KV row across the whole group of query heads helps it the most.

Further work: […]

My test environment: […]