common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters #15191

Copilot · 2025-08-09T07:18:28Z

CLI Flags Now Working

All three speculative draft model offload flags are now fully functional:

--override-tensor-draft: Specify tensor buffer type overrides for the draft model
--cpu-moe-draft: Keep all MoE weights in CPU for the draft model
--n-cpu-moe-draft N: Keep MoE weights of the first N layers in CPU for the draft model
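
As a rough illustration of the "first N layers" behaviour, the sketch below expands a layer count into one tensor-name pattern per layer. The helper name `moe_patterns_for_first_layers` is hypothetical, and the pattern text is an assumption about how MoE expert tensors are named; it is not taken from this PR's diff.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): build one tensor-name pattern per
// layer index in [0, n_layers). The pattern text is an assumed naming scheme
// for MoE expert tensors.
std::vector<std::string> moe_patterns_for_first_layers(int n_layers) {
    std::vector<std::string> patterns;
    for (int i = 0; i < n_layers; ++i) {
        patterns.push_back("blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps");
    }
    return patterns;
}
```

Under these assumptions, `--n-cpu-moe-draft 3` would correspond to three patterns (layers 0, 1 and 2), each paired with the CPU buffer type for the draft model only.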

Example Usage

# Use different tensor overrides for draft model
./llama-speculative --override-tensor-draft ".*\.weight=CPU" --model-draft draft.gguf --model main.gguf

# Keep MoE weights in CPU for draft model
./llama-server --cpu-moe-draft --model-draft draft.gguf --model main.gguf

# Combined usage
./llama-speculative --override-tensor-draft "blk.0.*=CPU" --n-cpu-moe-draft 3 --model-draft draft.gguf --model main.gguf

Entrypoints Updated

All speculative decoding entrypoints properly apply draft-specific tensor overrides:

examples/speculative/speculative.cpp
examples/speculative-simple/speculative-simple.cpp
tools/server/server.cpp

Validation Results

All executables build successfully
CLI help shows all flags correctly
Flag validation works (shows available buffer types on error)
No assertion failures when using draft tensor overrides
Draft overrides are isolated from main model overrides
All flags can be used together without conflicts

The implementation ensures draft model tensor overrides are completely independent from main model overrides, enabling flexible heterogeneous hardware setups and advanced MoE configurations for speculative decoding workflows.
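
A minimal sketch of the isolation described above, assuming the main and draft models each own a separate override list that is terminated with a null entry before being handed to code that walks a raw array (one of the commits in this PR adds null termination for the speculative overrides). The types `override_entry` and `override_params` and both functions are hypothetical stand-ins, not the structures used in the codebase.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one override entry; names are illustrative only.
struct override_entry {
    const char * pattern;   // tensor-name pattern; nullptr marks the end of the list
    const char * buft_name; // buffer type name, e.g. "CPU"
};

// Hypothetical parameter bundle: the main and draft models own separate lists,
// so changing one never affects the other.
struct override_params {
    std::vector<override_entry> main_overrides;
    std::vector<override_entry> draft_overrides;
};

// Append the {nullptr, nullptr} sentinel once, so the list can be passed to
// code that walks a raw pointer until it sees a null pattern.
void terminate_overrides(std::vector<override_entry> & list) {
    if (list.empty() || list.back().pattern != nullptr) {
        list.push_back({nullptr, nullptr});
    }
}

// Count entries by walking the raw array up to the sentinel.
std::size_t count_overrides(const override_entry * entries) {
    std::size_t n = 0;
    while (entries[n].pattern != nullptr) {
        ++n;
    }
    return n;
}
```

In this sketch, adding an entry to `draft_overrides` leaves `main_overrides` untouched, and calling `terminate_overrides` more than once on the same list does not add a second sentinel.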

Co-authored-by: CISC <[email protected]>

bitbottrap · 2025-08-09T23:02:39Z

I didn't test exhaustively, but it works for my use case. Small tests show ~25% improvement for a few prompts and a Q4 draft model.

slaren · 2025-08-11T10:59:39Z

I don't mind AI generated code, but you need to take responsibility of the review yourself.

CISC · 2025-08-11T11:09:52Z

> I don't mind AI generated code, but you need to take responsibility of the review yourself.

Sure, it was a test to see how well Copilot can resolve simple issues without too much interaction. It went fairly well I think; I just need to be more hands-on next time. :)

ggerganov · 2025-08-11T11:13:23Z

the future is now 👀

CISC and others added 2 commits August 9, 2025 07:18

Checkpoint from VS Code for coding agent session

b9675d9

Initial plan

f98ee9e

Copilot AI assigned Copilot and CISC Aug 9, 2025

Copilot started work on behalf of CISC August 9, 2025 07:18 View session

github-actions bot added examples server labels Aug 9, 2025

CISC linked an issue Aug 9, 2025 that may be closed by this pull request

Feature Request: Add separate --override-tensor control for draft models. #15185

Open

4 tasks

CISC changed the title ~~[WIP] Speculative draft model offload: Add --override-tensor-draft, --cpu-moe-draft, --n-cpu-moe-draft CLI flags~~ common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters Aug 9, 2025

Copilot AI and others added 2 commits August 9, 2025 07:30

Fix typo in --override-tensor-draft flag implementation

06f7547

Co-authored-by: CISC <[email protected]>

Add null termination for speculative tensor buffer overrides

e41f1de

Co-authored-by: CISC <[email protected]>

Copilot AI changed the title ~~common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters~~ Fix speculative draft model offload CLI flags implementation Aug 9, 2025

Copilot AI requested a review from CISC August 9, 2025 07:34

Copilot finished work on behalf of CISC August 9, 2025 07:34

CISC changed the title ~~Fix speculative draft model offload CLI flags implementation~~ common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters Aug 9, 2025

CISC approved these changes Aug 9, 2025

common/arg.cpp (outdated, resolved)

common/arg.cpp (outdated, resolved)

common/arg.cpp (outdated, resolved)

CISC requested a review from slaren August 9, 2025 07:51

Apply suggestions from code review

0d38374

CISC marked this pull request as ready for review August 9, 2025 07:52

CISC requested a review from ngxson as a code owner August 9, 2025 07:52

slaren reviewed Aug 10, 2025

tools/server/server.cpp (outdated, resolved)

CISC reviewed Aug 10, 2025

examples/speculative-simple/speculative-simple.cpp (outdated, resolved)

examples/speculative/speculative.cpp (outdated, resolved)

tools/server/server.cpp (outdated, resolved)

Apply suggestions from code review

c87f4b0

CISC requested a review from slaren August 11, 2025 10:03

slaren reviewed Aug 11, 2025

common/arg.cpp (resolved)

Copilot started work on behalf of ggerganov August 11, 2025 11:12 View session

Extract tensor override parsing logic to common function (addresses @slaren's feedback)

072ab9c

Co-authored-by: ggerganov <[email protected]>

Copilot AI requested a review from ggerganov August 11, 2025 11:26

Copilot finished work on behalf of ggerganov August 11, 2025 11:26

slaren reviewed Aug 11, 2025

common/arg.cpp (outdated, resolved)

Apply suggestions from code review

7906f01

Co-authored-by: Diego Devesa <[email protected]>
