-
I personally do advise using ad-hoc experimentation. If you'd like to document some best practices for multi-GPU offloading strategies for multiple LLMs, that would be a welcome contribution! However, keep in mind that things change fast, so honestly spend some time looking through recently closed and newly opened PRs, as some quants are getting a big boost for PP etc.
Offload early, offload often! No really, if you can offload the whole thing, that is great. If not, put as much as possible on your fastest GPUs first. Try to keep the kv-cache near the attn tensors, ideally all on a single GPU. Usually the dense ffn layers get offloaded to CPU first as they are just larger; hopefully your quant has optimized those for CPU/RAM usage. I don't recommend separating the kv-cache from its attn tensors.
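As a rough sketch of that idea (the flags and regex below are illustrative, assuming a MoE model whose routed-expert tensors are named `ffn_*_exps`; check your own GGUF), you can offload everything with `-ngl` and then override the big expert tensors back onto CPU:

```bash
# Illustrative sketch: attention tensors and kv-cache stay on GPU,
# the large routed-expert tensors fall back to CPU/RAM.
# The model path is a placeholder; match --threads to your CPU.
./build/bin/llama-server \
    --model /models/your-model.gguf \
    -ngl 99 \
    -fa \
    -c 32768 \
    -ot "ffn_.*_exps=CPU" \
    --threads 24
```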
This works pretty well a lot of the time!
See above: put more layers on the fast GPUs first.
I'd suggest some combination of the following depending on which model you're running: use the `-ot`/`--override-tensor` option to pin specific tensors to specific devices, as sketched below.
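A hypothetical split across two GPUs of different speeds might look like the following (block ranges are placeholders, not tuned values; the first matching rule wins, so list the GPU rules before the CPU catch-all):

```bash
# Faster card (CUDA0) takes the expert tensors of more blocks than CUDA1;
# anything not matched by the first two rules falls through to CPU.
-ot "blk\.(3|4|5|6|7|8|9|10|11|12)\.ffn_.*_exps=CUDA0" \
-ot "blk\.(13|14|15|16|17|18)\.ffn_.*_exps=CUDA1" \
-ot "ffn_.*_exps=CPU"
```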
I believe there are some device-ordering environment variables you can use to swap those around; I thought I saw some chatter near the linked discussion above. Cheers!
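(For reference, the variables in question are presumably the standard CUDA ones rather than anything ik_llama.cpp-specific; the ordering below is just a placeholder.)

```bash
# Reorder which physical cards the CUDA runtime exposes as CUDA0, CUDA1, ...
# so the fastest GPU comes first; adjust the indices for your rig.
export CUDA_VISIBLE_DEVICES=1,0
# then launch llama-server / llama-sweep-bench as usual
```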
-
It's a good idea to view all the layers and their file sizes in the model; that way you know what you can fit onto your GPUs. Not all blocks have the same sized layers. I have used llama.cpp mainline to print the sizes and adjusted accordingly. The more you cram onto the GPUs, the more t/s you generally get. A smaller `-amb` gets you smaller compute buffers and can perhaps fit another layer. Benchmarking is your friend. Once the model is cached in system RAM, it's easy to re-load and try things. There is some variance in llama-sweep-bench, but you can still spot larger trends. After a lot of testing, it might be wise to drop your cache and re-test your best configurations.
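If it helps, here is one way to do both of those things; the dump-script path assumes a llama.cpp checkout with the gguf-py scripts, so adjust it to wherever your copy lives:

```bash
# Print tensor names, shapes and quant types to estimate per-layer sizes.
python3 gguf-py/scripts/gguf_dump.py /models/your-model.gguf | less

# Drop the Linux page cache so the next run is a genuinely cold re-test.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```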
-
@ubergarm and @ikawrakow I have recently obtained 2x Blackwell Pro 6000 GPUs, so I have 192 GB of VRAM. I am able to offload your Qwen3-235B model completely onto the GPUs and I get over 1000 t/s prompt processing and 50 t/s generation. But for the DeepSeek model, I can't get beyond 12-13 t/s. Would you have any advice for me? Below is the video where I compare all the different options; I have added chapters so that you can jump to the right sections. Any help will be really appreciated. I am starting to give up on DeepSeek. If two Blackwells aren't enough, then what is :/
https://www.youtube.com/watch?v=cFddXR1nPLg
8:01 - Test 1: Qwen 3 235B (Fully GPU Offloaded)
-
Hi Mukul. Thanks for your helpful videos.
Hardware:
Command line options:
I used Ubergarm's high-quality DeepSeek V3 R4 quantized model, before ik_llama.cpp had GPU support for that quantization type, with a 4090 and all experts except the shared one on CPU, and got about 12 output t/s at under 10000 context. I will try again with the latest model from Ubergarm later.
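For comparison, here is a hedged sketch of that kind of single-4090 DeepSeek setup in ik_llama.cpp, not the exact command used above; the model path, context size, and thread count are placeholders:

```bash
# Everything offloaded except the routed experts, which stay on CPU;
# MLA and flash attention enabled, quantized K cache to save VRAM.
./build/bin/llama-server \
    --model /models/DeepSeek-V3-IQ4_K_R4.gguf \
    -ngl 99 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    -ctk q8_0 \
    -c 16384 \
    -ot exps=CPU \
    --threads 24
```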
-
@ikawrakow or @ubergarm
I've recently expanded my GPU rig to include (2x RTX 5090 + 2x RTX 4090) and am seeking your expertise to develop a systematic approach for offloading layers across these GPUs.
While I have experience with hardware configurations, I'd like to avoid ad-hoc experimentation and instead follow best practices or documented methodologies specific to ik_llama.cpp's architecture. Could you please share recommendations regarding:
Which types of layers (e.g., attention, feed-forward) benefit most from GPU acceleration? How do I know which layers I need to offload? Currently I have been randomly offloading whatever I can.
Optimal strategies for distributing work across heterogeneous GPUs (5090 vs 4090)?
Are there built-in features/flags in ik_llama.cpp to control layer distribution?
I'm particularly interested in any rationale behind layer offloading decisions in GPU-accelerated LLMs.
This is one of the commands that I used:
For some reason my nvidia-smi shows GPUs 0 and 3 as RTX 5090, but in reality CUDA_VISIBLE_DEVICES sees GPUs 2 and 3 as the RTX 5090s. So I have arranged it such that the first and last -ot parameters are for the RTX 5090s and the two -ot parameters in between are for the RTX 4090s.
for Qwen3-235B
and for a DeepSeek-R1
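On the device numbering mismatch: nvidia-smi lists GPUs in PCI bus order, while the CUDA runtime defaults to a fastest-first ordering, which is why the indices disagree. You can make both agree and pick an explicit order with standard CUDA environment variables (the index values below are placeholders for this rig):

```bash
# Make CUDA0..CUDA3 match nvidia-smi's PCI-bus numbering.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Optionally expose the cards in a chosen order, e.g. both 5090s first.
export CUDA_VISIBLE_DEVICES=0,3,1,2
```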