
Bug: (Speculative decoding) Massive slowdown when going past draft model's ctx size (when -cd < -c) #732

@alint77

Description


What happened?

When using speculative decoding in llama-server with different context sizes for the target model (-c) and the draft model (-cd), where the draft context is smaller than the target's (i.e. -cd < -c), performance at the start of generation is excellent: high throughput and low GPU power consumption. But once the draft model fills its context window, token generation slows drastically (to less than half the initial speed) and GPU power draw increases.

This bug is also present in mainline llama.cpp. I know there is some kind of sliding-window mechanism at play in this situation, and there is clearly too much overhead when passing only the latest {cd} tokens to the draft model. I tried to fix the code myself but wasn't successful.
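The suspected overhead can be pictured with a toy cost model (purely illustrative; the function names and the cost assumptions are mine, not llama.cpp internals). If the draft window is re-evaluated from scratch each step once it is full, the per-token cost jumps from one draft forward pass to roughly `cd` of them:

```python
def tokens_processed_naive(total_tokens: int, cd: int) -> int:
    """Draft-model token evaluations if the full window of `cd` tokens
    is re-prompted from scratch on every step after it fills."""
    processed = 0
    for step in range(total_tokens):
        if step < cd:
            processed += 1   # cache still has room: decode one new token
        else:
            processed += cd  # window full: re-evaluate the whole window
    return processed


def tokens_processed_shifted(total_tokens: int, cd: int) -> int:
    """Draft-model token evaluations if the KV cache is shifted in place,
    so only the single new token is evaluated each step."""
    return total_tokens


# With cd = 1024 over 4096 generated tokens, the naive scheme evaluates
# ~3.1M draft tokens versus 4096 for the shifted scheme.
print(tokens_processed_naive(4096, 1024), tokens_processed_shifted(4096, 1024))
```

This matches the observed symptom: speed is fine until exactly the point where the draft context fills, then drops sharply while GPU load rises.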

I think fixing this would greatly improve the feasibility of using speculative decoding on llama.cpp in general: the small model doesn't really need a big context window to produce high-quality draft tokens, and running the draft model with the same context size as the target model is a complete waste of precious VRAM.

In my testing with Qwen3 Coder 30B in a partially offloaded setup, running the Qwen3 0.6B Q4_K_M draft model with -ctkd q8_0 -ctvd q8_0 -cd 1024 gives pretty much the same acceptance rate and speedup as running without -cd, while using only around 1.5 GB of extra VRAM (compared to no draft model) for a 1.2-1.8x speedup in token generation. I'd imagine the gains would be even more pronounced with bigger models.
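For reference, a sketch of the kind of invocation described above (the model file names and the -c value are placeholders, not the exact command used):

```shell
# Hypothetical reproduction command; model paths are placeholders.
# -m:  target model      -md: draft model for speculative decoding
# -c:  target context    -cd: draft context (smaller than -c)
# -ctkd/-ctvd: quantized K/V cache types for the draft model
./llama-server \
  -m ./Qwen3-Coder-30B-Q4_K_M.gguf \
  -md ./Qwen3-0.6B-Q4_K_M.gguf \
  -c 32768 -cd 1024 \
  -ctkd q8_0 -ctvd q8_0
# The slowdown appears once generation passes ~1024 tokens
# (i.e. once the draft model's context window fills).
```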

Name and Version

./llama-server --version
version: 3860 (0cc32ff)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output
