@g2mt g2mt (Contributor) commented Aug 16, 2025

See ggml-org/llama.cpp#12635 for the equivalent PR in mainline llama.cpp. Related to #645.

@g2mt g2mt marked this pull request as ready for review August 16, 2025 18:19
@ikawrakow ikawrakow (Owner) commented

It would be nice to have an example telling us how to test and use this (which models to use, command line, etc.). For people like me who don't have the hardware to run giant models, it would be nice to have an example with main/draft model pairs that can be loaded in 64 GB RAM + 16 GB VRAM.

@g2mt g2mt (Contributor, Author) commented Aug 18, 2025

It should work the same as passing a compatible draft model. I also added the --spec-replace argument for translating chat template tags between the main and draft models.
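
For intuition, here is a minimal sketch of what such tag translation could look like: a plain string-replacement pass over the text handed to the draft model. The names spec_replace_pair and translate_draft_prompt and the overall structure are illustrative assumptions, not the PR's actual code.

#include <string>
#include <utility>
#include <vector>

// One (FROM, TO) pair as collected from a --spec-replace FROM TO argument.
// Illustrative sketch only; the PR's real implementation may differ.
using spec_replace_pair = std::pair<std::string, std::string>;

// Rewrite the prompt fed to the draft model, substituting every occurrence
// of a main-model template tag with the corresponding draft-model tag.
static std::string translate_draft_prompt(std::string prompt,
                                          const std::vector<spec_replace_pair> & repls) {
    for (const auto & [from, to] : repls) {
        if (from.empty()) continue; // guard against an infinite loop
        size_t pos = 0;
        while ((pos = prompt.find(from, pos)) != std::string::npos) {
            prompt.replace(pos, from.size(), to);
            pos += to.size(); // continue searching after the inserted text
        }
    }
    return prompt;
}

With the replacement pairs from the command below, a Devstral-style prompt such as "[INST] hi [/INST]" would be rewritten into ChatML-style text before the Qwen draft model sees it.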

Here are my prompt/generation speeds for a simple repeat prompt. I'm using Devstral Small with Qwen2.5 Coder 0.5B as the draft model.

without speculative decoding:

"ik_llama.cpp/build/bin/Release/llama-server.exe"  "-m" "Devstral-Small-2507-UD-Q4_K_XL.gguf" "-c" "32768" "--run-time-repack" "-t" "12"
[screenshot: before-spec timings]

with it enabled:

"ik_llama.cpp/build/bin/Release/llama-server.exe"  "--port" "10006" "-m" "Devstral-Small-2507-UD-Q4_K_XL.gguf" "-c" "32768" "--run-time-repack" "-t" "12" "-md" "Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" "--spec-replace" "[INST]" "<|im_begin|>user\n" "--spec-replace" "[/INST]" "<|im_end|><|im_begin|>assistant" "--spec-replace" "</s>" "<|im_begin|>user\n"  
[screenshot: after-spec timings]

@ikawrakow ikawrakow merged commit 23fe18c into ikawrakow:main Aug 18, 2025