
Prompt processing is slow if model does not fit in VRAM. Is it OK? #15211

Answered by abc-nix
Galliot asked this question in Q&A

What does --main-gpu 1 do for you? Why does the RTX 4090 show up as device 0 instead of the RTX 5090? Could you make the RTX 5090 device 0 by setting the CUDA_VISIBLE_DEVICES env variable to change the enumeration order, so that the 5090 goes first?
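
If the 4090 currently enumerates as device 0, something along these lines should put the 5090 first (the device indices and model path are assumptions for illustration, not your exact setup):

    # physical device 1 (assumed here to be the 5090) becomes logical device 0
    CUDA_VISIBLE_DEVICES=1,0 llama-server -m /path/to/model.gguf -ngl 99 --main-gpu 0

With the order flipped, --main-gpu 0 then refers to the 5090.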

I have less powerful hardware than you and can run GLM-4.5 Air IQ4 with a 3090 (PCIe 16x) + 1660 Super (PCIe 1x) at a prompt-processing (PP) rate of 278 t/s over 9279 tokens:

prompt eval time =   33366.42 ms /  9279 tokens (    3.60 ms per token,   278.09 tokens per second)
       eval time =  326033.94 ms /  3218 tokens (  101.32 ms per token,     9.87 tokens per second)
      total time =  359400.36 ms / 12497 tokens

My llama-swap config:

  "glm-4.5-air-IQ4":
    cmd: |
      …
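
(Just a rough sketch of the shape of such an entry, since the cmd body above is truncated; the model key, path, ${PORT} macro usage, and layer count are placeholders rather than my exact settings:)

  # hypothetical llama-swap model entry (placeholder values)
  "some-model":
    cmd: |
      CUDA_VISIBLE_DEVICES=0,1 llama-server --port ${PORT} -m /path/to/model.gguf -ngl 99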

Answer selected by Galliot