
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters #15191


Merged
CISC merged 9 commits into master from copilot/vscode1754723901399 on Aug 13, 2025


Copilot
Contributor

@Copilot Copilot AI commented Aug 9, 2025

CLI Flags Now Working

All three speculative draft model offload flags are now fully functional:

  • --override-tensor-draft: Specify tensor buffer type overrides for the draft model
  • --cpu-moe-draft: Keep all MoE weights in CPU for the draft model
  • --n-cpu-moe-draft N: Keep MoE weights of first N layers in CPU for the draft model
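As a rough illustration of what "first N layers" could mean in practice, the sketch below expands a layer count into one tensor-name pattern per layer and checks which tensors would stay on the CPU. The helper names are hypothetical, and the tensor naming convention (`blk.<i>.ffn_{up,down,gate}_exps`) is an assumption made for the example rather than something stated in this PR.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: build one regex pattern per layer for the first n_layers layers.
// Assumption: expert tensors are named blk.<i>.ffn_{up,down,gate}_exps(.weight).
std::vector<std::string> moe_patterns_for_first_layers(int n_layers) {
    std::vector<std::string> patterns;
    for (int i = 0; i < n_layers; ++i) {
        patterns.push_back("blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps");
    }
    return patterns;
}

// Returns true if any pattern is found inside the tensor name (search, not full match).
bool stays_on_cpu(const std::vector<std::string> & patterns, const std::string & tensor_name) {
    for (const auto & p : patterns) {
        if (std::regex_search(tensor_name, std::regex(p))) {
            return true;
        }
    }
    return false;
}
```

With `moe_patterns_for_first_layers(3)`, a name such as `blk.2.ffn_gate_exps.weight` would match, while `blk.3.ffn_up_exps.weight` and non-expert tensors such as `blk.1.attn_q.weight` would not.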

Example Usage

# Use different tensor overrides for draft model (the pattern is matched as a regex against tensor names)
./llama-speculative --override-tensor-draft ".*\.weight=CPU" --model-draft draft.gguf --model main.gguf

# Keep MoE weights in CPU for draft model
./llama-server --cpu-moe-draft --model-draft draft.gguf --model main.gguf

# Combined usage
./llama-speculative --override-tensor-draft "blk.0.*=CPU" --n-cpu-moe-draft 3 --model-draft draft.gguf --model main.gguf
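The override values above follow a `<pattern>=<buffer type>` shape. A minimal sketch of how such a value might be split is shown here. `override_spec` and `parse_overrides` are hypothetical names, and the comma-separated list form is an assumption for illustration, not the PR's actual parsing code.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result type for one parsed override entry.
struct override_spec {
    std::string pattern;  // pattern matched against tensor names
    std::string buft;     // buffer type name, e.g. "CPU"
};

// Hypothetical parser: splits "pat1=BUF1,pat2=BUF2" into specs.
// Returns std::nullopt if any entry lacks '=' or has an empty side.
// Note: splitting on ',' means patterns containing commas are not supported here.
std::optional<std::vector<override_spec>> parse_overrides(const std::string & arg) {
    std::vector<override_spec> out;
    std::size_t start = 0;
    while (start <= arg.size()) {
        std::size_t end = arg.find(',', start);
        if (end == std::string::npos) {
            end = arg.size();
        }
        const std::string entry = arg.substr(start, end - start);
        const std::size_t eq = entry.find('=');  // split at the first '='
        if (eq == std::string::npos || eq == 0 || eq + 1 == entry.size()) {
            return std::nullopt;
        }
        out.push_back({entry.substr(0, eq), entry.substr(eq + 1)});
        start = end + 1;
    }
    return out;
}
```

A real implementation would additionally look up the buffer type name among the available devices and report the valid names on failure, which this sketch leaves out.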

Entrypoints Updated

All speculative decoding entrypoints properly apply draft-specific tensor overrides:

  • examples/speculative/speculative.cpp
  • examples/speculative-simple/speculative-simple.cpp
  • tools/server/server.cpp

Validation Results

  • All executables build successfully
  • CLI help shows all flags correctly
  • Flag validation works (shows available buffer types on error)
  • No assertion failures when using draft tensor overrides
  • Draft overrides are isolated from main model overrides
  • All flags can be used together without conflicts

The implementation ensures draft model tensor overrides are completely independent from main model overrides, enabling flexible heterogeneous hardware setups and advanced MoE configurations for speculative decoding workflows.
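One way to picture this isolation is two separate override lists, one per model, each closed with a null sentinel before being handed to C-style code (one of the commits in this PR mentions adding null termination). The types below are hypothetical stand-ins written for this sketch, not the project's actual structures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C-style override entry.
// A null pattern marks the end of the list.
struct buft_override {
    const char * pattern;
    const char * buft_name;
};

// One instance per model keeps main and draft overrides independent.
struct model_overrides {
    std::vector<buft_override> entries;

    void add(const char * pattern, const char * buft_name) {
        entries.push_back({pattern, buft_name});
    }

    // Append the sentinel exactly once so the list can be passed as a raw pointer.
    void finalize() {
        if (entries.empty() || entries.back().pattern != nullptr) {
            entries.push_back({nullptr, nullptr});
        }
    }
};

// Count entries the way a C-style consumer would: walk until the null pattern.
std::size_t count_overrides(const buft_override * list) {
    std::size_t n = 0;
    while (list[n].pattern != nullptr) {
        ++n;
    }
    return n;
}
```

Because each model owns its list, adding an entry to the draft list leaves the main list untouched, and forgetting `finalize()` is exactly the kind of bug a missing terminator would cause in a consumer like `count_overrides`.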



@Copilot Copilot AI assigned Copilot and CISC Aug 9, 2025
@CISC CISC linked an issue Aug 9, 2025 that may be closed by this pull request
4 tasks
@CISC CISC changed the title from "[WIP] Speculative draft model offload: Add --override-tensor-draft, --cpu-moe-draft, --n-cpu-moe-draft CLI flags" to "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" Aug 9, 2025
@Copilot Copilot AI changed the title from "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" to "Fix speculative draft model offload CLI flags implementation" Aug 9, 2025
@Copilot Copilot AI requested a review from CISC August 9, 2025 07:34
Copilot finished work on behalf of CISC August 9, 2025 07:34
@CISC CISC changed the title from "Fix speculative draft model offload CLI flags implementation" to "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" Aug 9, 2025
@CISC CISC requested a review from slaren August 9, 2025 07:51
@CISC CISC marked this pull request as ready for review August 9, 2025 07:52
@CISC CISC requested a review from ngxson as a code owner August 9, 2025 07:52
@bitbottrap

I didn't test exhaustively, but it works for my use case. Small tests show a ~25% improvement for a few prompts with a Q4 draft model.

@CISC CISC requested a review from slaren August 11, 2025 10:03
@slaren
Member

slaren commented Aug 11, 2025

I don't mind AI generated code, but you need to take responsibility of the review yourself.

@CISC
Collaborator

CISC commented Aug 11, 2025

> I don't mind AI generated code, but you need to take responsibility of the review yourself.

Sure, it was a test to see how well Copilot can resolve simple issues without too much interaction. It went fairly well, I think; I just need to be more hands-on next time. :)

@ggerganov
Member

[image]

the future is now 👀

@Copilot Copilot AI requested a review from ggerganov August 11, 2025 11:26
Copilot finished work on behalf of ggerganov August 11, 2025 11:26
@mfurseman

Wouldn't it be consistent to also offer the short form -otd as an argument?

@CISC CISC merged commit d8914fc into master Aug 13, 2025
87 of 88 checks passed
@CISC CISC deleted the copilot/vscode1754723901399 branch August 13, 2025 10:44
the-phobos pushed a commit to the-phobos/llama.cpp that referenced this pull request Aug 14, 2025
…e-draft parameters (ggml-org#15191)

* Checkpoint from VS Code for coding agent session

* Initial plan

* Fix typo in --override-tensor-draft flag implementation

* Add null termination for speculative tensor buffer overrides

* Apply suggestions from code review

* Apply suggestions from code review

* Extract tensor override parsing logic to common function (addresses @slaren's feedback)

* Apply suggestions from code review

* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Development

Successfully merging this pull request may close these issues.

Feature Request: Add separate --override-tensor control for draft models.
6 participants