Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
With the original llama.cpp, I can specify which device(s) the model will be running on, using -dev/--device and -devd/--device-draft. This way, when using speculative decoding, I can run the target model on CUDA0 and the draft model on CUDA1, making use of both GPUs. Like this:
C:\apps\llama.cpp\llama-server.exe --port 15900 `
--model "E:\AI\LLM\gguf\unsloth\Qwen3-235B-A22B-Instruct-2507-GGUF\Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL-00001-of-00003.gguf" `
--ctx-size 16384 `
-ctk q8_0 -ctv q8_0 `
-fa on `
--n-gpu-layers 999 `
-mg 0 `
-dev CUDA0 `
-ot "blk\.(0?[0-9]|1[0-6])\.ffn_.*_exps.=CUDA0" `
-ot ".ffn_.*_exps.=CPU" `
--parallel 1 `
-tb 30 -t 15 `
--jinja --reasoning-budget 0 --no-mmap `
--model-draft "E:\AI\LLM\gguf\unsloth\Qwen3-4B-Instruct-2507-GGUF\Qwen3-4B-Instruct-2507-UD-Q6_K_XL.gguf" `
-devd CUDA1 `
-ngld 999 `
--ctx-size-draft 16384 `
-ctkd q8_0 -ctvd q8_0
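Reduced to just the placement-related flags (the model paths below are placeholders), the split is simply:
llama-server.exe --model target.gguf -dev CUDA0 -ngl 999 --model-draft draft.gguf -devd CUDA1 -ngld 999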
Note that my two GPUs are different: an RTX 5090 (32 GB) and an RTX 4070 Ti Super (16 GB).
In ik_llama.cpp, as far as I understand, the only way to choose which GPUs to use is the CUDA_VISIBLE_DEVICES environment variable. But since that applies to the whole process, it doesn't let you use different cards for the target and the draft model.
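To illustrate, here is a hypothetical ik_llama.cpp invocation (binary and model paths are placeholders, and I'm assuming its server accepts --model-draft the way llama.cpp's does):
# The variable filters the devices visible to the whole process:
$env:CUDA_VISIBLE_DEVICES = "0,1"
C:\apps\ik_llama.cpp\llama-server.exe --model target.gguf --model-draft draft.gguf
# Both models now pick from the same visible device set; there is no flag to
# pin the target to GPU 0 and the draft to GPU 1.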
Motivation
Being able to run the target and the draft model on different GPUs would make speculative decoding more effective, since each model would get a card's VRAM and compute to itself.
Possible Implementation
No response