@g2mt g2mt (Contributor) commented Aug 16, 2025

See ggml-org/llama.cpp#12635 for the equivalent PR in mainline llama.cpp. Related to #645.

@g2mt g2mt marked this pull request as ready for review August 16, 2025 18:19
@ikawrakow ikawrakow (Owner) commented

It would be nice to have an example telling us how to test and use this (which models to use, command line, etc.). For people like me who don't have the hardware to run giant models, it would be nice to have an example with main/draft model pairs that can be loaded in 64 GB RAM + 16 GB VRAM.

@g2mt g2mt (Contributor, Author) commented Aug 18, 2025

It should work the same as passing a compatible draft model. I also added the --spec-replace argument for translating chat template tags between the main and draft models.
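
For intuition, here is a minimal sketch of what such tag translation could look like: a plain string-replacement pass over the text handed to the draft model. The names spec_replace_pair and translate_draft_prompt and the overall structure are illustrative assumptions, not the PR's actual code.

#include <string>
#include <utility>
#include <vector>

// One (FROM, TO) pair as collected from a --spec-replace FROM TO argument.
// Illustrative sketch only; the PR's real implementation may differ.
using spec_replace_pair = std::pair<std::string, std::string>;

// Rewrite the prompt fed to the draft model, substituting every occurrence
// of a main-model template tag with the corresponding draft-model tag.
static std::string translate_draft_prompt(std::string prompt,
                                          const std::vector<spec_replace_pair> & repls) {
    for (const auto & [from, to] : repls) {
        if (from.empty()) continue; // guard against an infinite loop
        size_t pos = 0;
        while ((pos = prompt.find(from, pos)) != std::string::npos) {
            prompt.replace(pos, from.size(), to);
            pos += to.size(); // continue searching after the inserted text
        }
    }
    return prompt;
}

With the replacement pairs from the command below, a Devstral-style prompt such as "[INST] hi [/INST]" would be rewritten into ChatML-style text before the Qwen draft model sees it.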

Here are my prompt/generation speeds for a simple repeat prompt. I'm using Devstral Small with Qwen2.5 Coder 0.5B as the draft model.

without speculative decoding:

"ik_llama.cpp/build/bin/Release/llama-server.exe"  "-m" "Devstral-Small-2507-UD-Q4_K_XL.gguf" "-c" "32768" "--run-time-repack" "-t" "12"
[screenshot: before-spec timings]

with it enabled:

"ik_llama.cpp/build/bin/Release/llama-server.exe"  "--port" "10006" "-m" "Devstral-Small-2507-UD-Q4_K_XL.gguf" "-c" "32768" "--run-time-repack" "-t" "12" "-md" "Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" "--spec-replace" "[INST]" "<|im_begin|>user\n" "--spec-replace" "[/INST]" "<|im_end|><|im_begin|>assistant" "--spec-replace" "</s>" "<|im_begin|>user\n"  
[screenshot: after-spec timings]

@ikawrakow ikawrakow merged commit 23fe18c into ikawrakow:main Aug 18, 2025