-
Just migrated from Ollama (Windows) to llama.cpp (Ubuntu 24.04) and need help optimizing. Hardware: i7-14700K + 192 GB @ 5600 MT/s + RTX 5090 (via PCIe) + RTX 4090 (via OCuLink). Model example: GLM-4.5-Air (UD-Q4_K_XL).
Prompt processing is suspiciously slow (~21 tokens/sec) while generation is decent (~27 tokens/sec) – but I'm seeing reports of much faster prompt processing on weaker hardware. llama.cpp build:
llama-swap config for GLM-4.5-Air
log:
nvidia-smi output during prompt processing
nvidia-smi output during generation
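The collapsed configs above aren't reproduced here. For readers following along, below is a minimal sketch of the kind of llama-server invocation that typically sits inside a llama-swap `cmd:` entry for this model; the model path, context size, tensor-split ratio, and `--n-cpu-moe` value are illustrative assumptions, not the poster's actual settings.

```sh
# Illustrative sketch only; values are assumptions, not the poster's config.
#   -ngl 99           offload all layers to the GPUs
#   --n-cpu-moe 20    keep the MoE expert weights of the first 20 layers on the CPU
#   --tensor-split    rough VRAM ratio between the cards (5090 32 GB : 4090 24 GB)
./llama-server \
  -m /models/GLM-4.5-Air-UD-Q4_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 20 \
  --tensor-split 32,24 \
  --port 8081
```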
-
For small batches, or slow PCIe connections, …
-
What does `--main-gpu 1` do for you? Why does the RTX 4090 display as Device 0 and not the RTX 5090? Could you place the RTX 5090 as device 0 using the `CUDA_VISIBLE_DEVICES` env variable to change the order, so that the RTX 5090 goes first?

I have less powerful hardware than you, and can run GLM-4.5 Air IQ4 with a 3090 (PCIe 16x) + 1660 Super (PCIe 1x) with PP of 278 t/s for 9279 tokens:

My llama-swap config:

If the GTX 1660 Super is Device 0, the prompt processing is eternal. But if my RTX 3090 is Device 0, the prompt processing is good.
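For reference, a minimal sketch of the reordering idea, assuming the RTX 4090 currently enumerates as CUDA device 0 and the RTX 5090 as device 1 (check `nvidia-smi -L` and the llama.cpp startup log for the real indices; the model path and flags are placeholders):

```sh
# Listing device 1 first makes the RTX 5090 become CUDA device 0 inside
# llama.cpp, so it is used as the main GPU by default.
CUDA_VISIBLE_DEVICES=1,0 ./llama-server -m /models/GLM-4.5-Air-UD-Q4_K_XL.gguf -ngl 99 --port 8081

# Optionally force CUDA enumeration to match the PCI bus order shown by nvidia-smi:
# CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0 ./llama-server ...
```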