
Prompt processing is slow if model does not fit in VRAM. Is it OK? #15211

Answered by abc-nix
Galliot asked this question in Q&A

What does --main-gpu 1 do for you? Why does the RTX 4090 show up as device 0 instead of the RTX 5090? Could you make the RTX 5090 device 0 by setting the CUDA_VISIBLE_DEVICES env variable to change the enumeration order, so that the 5090 goes first?
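
If the 4090 currently enumerates as device 0, something along these lines should put the 5090 first (the device indices and model path are assumptions for illustration, not your exact setup):

    # physical device 1 (assumed here to be the 5090) becomes logical device 0
    CUDA_VISIBLE_DEVICES=1,0 llama-server -m /path/to/model.gguf -ngl 99 --main-gpu 0

With the order flipped, --main-gpu 0 then refers to the 5090.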

I have less powerful hardware than you and can run GLM-4.5 Air IQ4 with a 3090 (PCIe 16x) + 1660 Super (PCIe 1x) at a prompt-processing (PP) rate of 278 t/s over 9279 tokens:

prompt eval time =   33366.42 ms /  9279 tokens (    3.60 ms per token,   278.09 tokens per second)
       eval time =  326033.94 ms /  3218 tokens (  101.32 ms per token,     9.87 tokens per second)
      total time =  359400.36 ms / 12497 tokens

My llama-swap config:

  "glm-4.5-air-IQ4":
    cmd: |
      …
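
(Just a rough sketch of the shape of such an entry, since the cmd body above is truncated; the model key, path, ${PORT} macro usage, and layer count are placeholders rather than my exact settings:)

  # hypothetical llama-swap model entry (placeholder values)
  "some-model":
    cmd: |
      CUDA_VISIBLE_DEVICES=0,1 llama-server --port ${PORT} -m /path/to/model.gguf -ngl 99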

Answer selected by Galliot