Fix GLM-4.5 attention #700
Conversation
Thank you! On it!
Thank you!
That appears to be OS-specific. It is faster for me and also for @ubergarm on Linux. Or maybe it is Blackwell? Not sure, hard to figure out with my computing resources.
Improved hybrid inference too. As context builds, TG no longer drops off so fast for big GLM.
I am not sure exactly what full command you are using for your llama-sweep-bench tests, but looking at some of your llama-server commands above, a few observations:
As such, give this slightly massaged command a try (replacing the binary with one built from the latest tip of main):
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 \
~/ik_llama-ik-try_cuda_graphs-b4107-7693263-bin-win-cuda-12.8-x64-avx512/llama-sweep-bench \
-m GLM-4.5-Air-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00804.gguf \
-fa -fmoe \
-c 135168 \
-ngl 99 \
-b 4096 -ub 4096 \
--threads 1 \
--main-gpu 0 \
  --warmup-batch
Then for mainline just remove the ik_llama.cpp-specific flags. I'll post again in a bit as I'm doing longer context sweeps out to ~100k.
Okay, for this specific config/quant here is what I'm seeing:
[Chart: llama-sweep-bench results for this config]
(In my previous graph in an earlier comment, I probably had been running q8_0 despite not showing it in the table, so this chart is the most up-to-date and representative one.)
Details:
./build/bin/llama-sweep-bench \
--model "$model"\
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-c 102400 \
-ngl 99 \
--threads 1 \
-ub 4096 -b 4096 \
  --warmup-batch
ik_llama.cpp@a3a52300 CUDA ubergarm/GLM-4.5-Air-Q4_0
ik_llama.cpp@a3a52300 CUDA ubergarm/GLM-4.5-Air-Q4_0 -ctk q8_0 -ctv q8_0
llama.cpp@19f4dec CUDA ubergarm/GLM-4.5-Air-Q4_0
llama.cpp@19f4dec CUDA ubergarm/GLM-4.5-Air-Q4_0 -ctk q8_0 -ctv q8_0
@ubergarm - thank you for the tips and explanations. You are maybe the 4th person this month to comment about the number of threads I use.
CPU: Intel® Core™ i9-7980XE with 18 cores and 36 threads
Commands:
Binaries:
Recipe used:
I have also conducted additional tests with some layers assigned to the CPU here: Thireus/GGUF-Tool-Suite#20 (comment)

I believe the fact that you (and others) are observing a 5% loss when using too many threads may be due to how Windows manages threads differently than Linux. And I believe I'm confusing a lot of Linux folks, so I should probably use whatever number of threads is optimal on Linux despite running these benchmarks on Windows (when it has no effect on perf, of course)...

For DeepSeek-R1-0528 I have already benchmarked that, surprisingly, using --threads 36 (my max number of threads) performs noticeably better than --threads 18:
My conclusion is that I should keep using --threads 36. Let me know your thoughts.
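To make the thread-count heuristic under discussion concrete, here is a minimal C++ sketch of how one might derive a default for `--threads`; the `default_thread_count` helper and the halve-for-physical-cores rule are illustrative assumptions, not code from ik_llama.cpp or llama.cpp.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <thread>

// Pick a default thread count for token generation.
// hardware_concurrency() reports logical threads (36 on an i9-7980XE);
// halving it approximates physical cores (18), the usual Linux advice,
// while the benchmarks above suggest all logical threads can be faster
// on this Windows box.
static unsigned default_thread_count(bool prefer_physical_cores) {
    unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0) logical = 1;  // unknown -> be conservative
    return prefer_physical_cores ? std::max(1u, logical / 2) : logical;
}

int main(int argc, char ** argv) {
    // e.g. pass "physical" to mimic the common Linux recommendation
    bool physical = argc > 1 && std::string(argv[1]) == "physical";
    std::printf("--threads %u\n", default_thread_count(physical));
    return 0;
}
```

Whether the physical-core or logical-thread default wins is exactly what the benchmarks above are probing, and it evidently differs between OS schedulers.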
There have been reports of `llama.cpp` being faster than `ik_llama.cpp` for the GLM-4.5 MoE models, see e.g. here or here, with the `llama.cpp` advantage increasing with context length, which points to a problem with the self-attention implementation in `ik_llama.cpp` for this specific model.

After wasting a lot of time scrutinizing the various kernels involved, I finally located the problem in this function, where we have:
Hahaha. GLM-4.5 has a GQA ratio of 12, hence none of the conditions for the GQA-related optimizations take effect, so the slow path is taken, resulting in poor performance for this, or any other, model that does not have a GQA ratio of 2, 4, or 8.
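To make the problem concrete, here is a minimal C++ sketch of the kind of GQA-gated dispatch described above; the function names, the ratio constants, and the overall shape are illustrative assumptions, not the actual ik_llama.cpp code.

```cpp
// Illustrative only: a simplified stand-in for the GQA-gated kernel
// dispatch described above, not the actual ik_llama.cpp function.
#include <cstdio>

// Pretend these select increasingly specialized attention kernels.
static void attn_kernel_gqa2()    { std::puts("fast path, GQA step 2"); }
static void attn_kernel_gqa4()    { std::puts("fast path, GQA step 4"); }
static void attn_kernel_gqa8()    { std::puts("fast path, GQA step 8"); }
static void attn_kernel_generic() { std::puts("slow generic path");     }

// gqa_ratio = n_head / n_head_kv; per the PR text it is 12 for GLM-4.5.
static void launch_attention(int gqa_ratio) {
    if      (gqa_ratio == 8) attn_kernel_gqa8();
    else if (gqa_ratio == 4) attn_kernel_gqa4();
    else if (gqa_ratio == 2) attn_kernel_gqa2();
    else                     attn_kernel_generic();  // GQA 12 lands here
}

int main() {
    launch_attention(12);  // none of the exact matches fire for GLM-4.5
    return 0;
}
```

A natural generalization, which may or may not be exactly what this PR does, is to dispatch on the largest supported step that divides the GQA ratio (so 12 could reuse the 4-way path) instead of requiring an exact match.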
This PR fixes it. I'm RAM (64 GB) and VRAM (16 GB) poor, so I'm testing with Unsloth's `GLM-4.5-Air-UD-IQ2_XXS.gguf` with all experts left on the CPU (but even then, I can only go up to a context of 32k tokens). As attention is computed on the GPU, where we have the problem at hand, this should still be representative of what VRAM-rich people can expect with full GPU offload after this PR. CPU is Ryzen-7950X, GPU is RTX-4080. OS is Linux.

[Chart: TG performance]
[Chart: PP performance]
Legend: ik_llama.cpp, main · ik_llama.cpp, PR · llama.cpp, build: 6181 (de2192794)
@Thireus It would be great if you could repeat your GLM-4.5-Air sweep bench with this PR. Thanks in advance!