
Bug: (Speculative decoding) Massive slowdown when going past draft model's ctx size (when -cd < -c) #732

@alint77

Description


What happened?

When using speculative decoding in llama-server with different context sizes for the target model (-c) and the draft model (-cd), where the draft context is smaller than the target's (i.e. -cd < -c), performance at the start of generation is excellent: high throughput and low GPU power consumption. But once the draft model fills its context window, token generation slows drastically (to less than half the initial speed) and GPU power draw increases.

This bug is also present in mainline llama.cpp. I know there is some kind of sliding-window mechanism at play in this situation, and there is clearly too much overhead when passing only the latest {cd} tokens to the draft model. I tried to fix the code myself but wasn't successful.
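The suspected overhead can be pictured with a toy cost model (purely illustrative; the function names and the cost assumptions are mine, not llama.cpp internals). If the draft window is re-evaluated from scratch each step once it is full, the per-token cost jumps from one draft forward pass to roughly `cd` of them:

```python
def tokens_processed_naive(total_tokens: int, cd: int) -> int:
    """Draft-model token evaluations if the full window of `cd` tokens
    is re-prompted from scratch on every step after it fills."""
    processed = 0
    for step in range(total_tokens):
        if step < cd:
            processed += 1   # cache still has room: decode one new token
        else:
            processed += cd  # window full: re-evaluate the whole window
    return processed


def tokens_processed_shifted(total_tokens: int, cd: int) -> int:
    """Draft-model token evaluations if the KV cache is shifted in place,
    so only the single new token is evaluated each step."""
    return total_tokens


# With cd = 1024 over 4096 generated tokens, the naive scheme evaluates
# ~3.1M draft tokens versus 4096 for the shifted scheme.
print(tokens_processed_naive(4096, 1024), tokens_processed_shifted(4096, 1024))
```

This matches the observed symptom: speed is fine until exactly the point where the draft context fills, then drops sharply while GPU load rises.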

I think fixing this would greatly improve the feasibility of using speculative decoding on llama.cpp in general: the small model doesn't really need a big context window to produce high-quality draft tokens, and running the draft model with the same context size as the target model is a complete waste of precious VRAM.

In my testing with Qwen3 Coder 30B in a partially offloaded setup, running the Qwen3 0.6B Q4_K_M draft model with -ctkd q8_0 -ctvd q8_0 -cd 1024 gives pretty much the same acceptance rate and speedup as running without -cd, while using only around 1.5 GB of extra VRAM (compared to no draft model) for a 1.2-1.8x speedup in token generation. I'd imagine the gains would be even more pronounced with bigger models.
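For reference, a sketch of the kind of invocation described above (the model file names and the -c value are placeholders, not the exact command used):

```shell
# Hypothetical reproduction command; model paths are placeholders.
# -m:  target model      -md: draft model for speculative decoding
# -c:  target context    -cd: draft context (smaller than -c)
# -ctkd/-ctvd: quantized K/V cache types for the draft model
./llama-server \
  -m ./Qwen3-Coder-30B-Q4_K_M.gguf \
  -md ./Qwen3-0.6B-Q4_K_M.gguf \
  -c 32768 -cd 1024 \
  -ctkd q8_0 -ctvd q8_0
# The slowdown appears once generation passes ~1024 tokens
# (i.e. once the draft model's context window fills).
```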

Name and Version

./llama-server --version
version: 3860 (0cc32ff)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output
