
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters #15191


Open. Wants to merge 8 commits into base branch master.


@Copilot Copilot AI commented Aug 9, 2025

CLI Flags Now Working

All three speculative draft model offload flags are now fully functional:

  • --override-tensor-draft: Specify tensor buffer type overrides for the draft model
  • --cpu-moe-draft: Keep all MoE weights of the draft model on the CPU
  • --n-cpu-moe-draft N: Keep the MoE weights of the first N layers of the draft model on the CPU
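
As a rough illustration of what the two MoE flags amount to, the sketch below expands them into tensor-name patterns that could then be treated like ordinary tensor overrides. The helper names (`moe_patterns_all`, `moe_patterns_first_n`) and the exact pattern strings are assumptions made for illustration; they are not taken from this PR's diff.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: a single pattern covering the expert FFN tensors of
// every layer, i.e. what a flag like --cpu-moe-draft conceptually requests.
// The pattern text is an assumption, not copied from the PR.
std::vector<std::string> moe_patterns_all() {
    return { "\\.ffn_(up|down|gate)_exps" };
}

// Hypothetical helper: one pattern per layer for layers 0..n-1, i.e. what
// --n-cpu-moe-draft N conceptually requests.
std::vector<std::string> moe_patterns_first_n(int n) {
    assert(n >= 0); // a negative layer count makes no sense here
    std::vector<std::string> out;
    for (int i = 0; i < n; ++i) {
        out.push_back("blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps");
    }
    return out;
}
```

Under these assumptions, `--n-cpu-moe-draft 3` would correspond to three patterns, one each for layers 0, 1 and 2, each paired with the CPU buffer type.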

Example Usage

# Use different tensor overrides for the draft model (the pattern is a regular expression matched against tensor names)
./llama-speculative --override-tensor-draft '\.weight$=CPU' --model-draft draft.gguf --model main.gguf

# Keep MoE weights in CPU for draft model
./llama-server --cpu-moe-draft --model-draft draft.gguf --model main.gguf

# Combined usage
./llama-speculative --override-tensor-draft "blk.0.*=CPU" --n-cpu-moe-draft 3 --model-draft draft.gguf --model main.gguf
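
The override arguments in these commands follow a `<pattern>=<buffer type>` shape. A minimal sketch of how such a value could be split is given below; the comma-separated multi-entry form, the `override_spec` struct and the `parse_overrides` function are assumptions for illustration and do not reproduce the project's actual parser.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one "<pattern>=<buffer type>" entry.
struct override_spec {
    std::string pattern; // tensor-name pattern, e.g. "blk.0.*"
    std::string buft;    // buffer type name, e.g. "CPU"
};

// Splits a value such as "a=CPU,b=CPU" into entries.
// Returns false if any entry lacks a pattern, an '=' or a buffer type.
bool parse_overrides(const std::string & value, std::vector<override_spec> & out) {
    size_t start = 0;
    while (start <= value.size()) {
        size_t end = value.find(',', start);
        if (end == std::string::npos) {
            end = value.size();
        }
        const std::string item = value.substr(start, end - start);
        const size_t eq = item.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            return false; // malformed entry
        }
        out.push_back({ item.substr(0, eq), item.substr(eq + 1) });
        start = end + 1;
    }
    return true;
}
```

A real implementation would additionally check that the buffer type name is one the build actually provides, which is what the "shows available buffer types on error" validation in this PR refers to.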

Entrypoints Updated

All speculative decoding entrypoints properly apply draft-specific tensor overrides:

  • examples/speculative/speculative.cpp
  • examples/speculative-simple/speculative-simple.cpp
  • tools/server/server.cpp

Validation Results

  • All executables build successfully
  • CLI help shows all flags correctly
  • Flag validation works (shows available buffer types on error)
  • No assertion failures when using draft tensor overrides
  • Draft overrides are isolated from main model overrides
  • All flags can be used together without conflicts

Draft model tensor overrides are stored and applied independently of the main model's overrides, so the draft model's tensors can be placed on different devices than the main model's, including separate MoE placement for each model in a speculative decoding setup.
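
To make the isolation property concrete, the sketch below keeps one override list per model and resolves a tensor's buffer type against only that model's list. The struct, the function name and the "first matching pattern wins" rule are assumptions for illustration, not the PR's actual data structures.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical override entry: regex pattern plus buffer type name.
struct buft_override {
    std::string pattern;
    std::string buft;
};

// Returns the buffer type of the first override whose pattern matches the
// tensor name, or `fallback` if none matches (assumed lookup rule).
std::string resolve_buft(const std::vector<buft_override> & overrides,
                         const std::string & tensor_name,
                         const std::string & fallback) {
    for (const auto & o : overrides) {
        if (std::regex_search(tensor_name, std::regex(o.pattern))) {
            return o.buft;
        }
    }
    return fallback;
}
```

With separate lists, an entry added for the draft model (for example `blk\.0\.` mapped to `CPU`) changes the result only when the draft list is passed in; the same tensor name resolved against an empty main-model list still yields the fallback.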



@Copilot Copilot AI assigned Copilot and CISC Aug 9, 2025
@CISC CISC linked an issue Aug 9, 2025 that may be closed by this pull request
4 tasks
@CISC CISC changed the title from "[WIP] Speculative draft model offload: Add --override-tensor-draft, --cpu-moe-draft, --n-cpu-moe-draft CLI flags" to "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" Aug 9, 2025
@Copilot Copilot AI changed the title from "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" to "Fix speculative draft model offload CLI flags implementation" Aug 9, 2025
@Copilot Copilot AI requested a review from CISC August 9, 2025 07:34
Copilot finished work on behalf of CISC August 9, 2025 07:34
@CISC CISC changed the title from "Fix speculative draft model offload CLI flags implementation" to "common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters" Aug 9, 2025
@CISC CISC requested a review from slaren August 9, 2025 07:51
@CISC CISC marked this pull request as ready for review August 9, 2025 07:52
@CISC CISC requested a review from ngxson as a code owner August 9, 2025 07:52
@bitbottrap

I didn't test exhaustively, but it works for my use case. Small tests show a ~25% improvement for a few prompts with a Q4 draft model.

@CISC CISC requested a review from slaren August 11, 2025 10:03
@slaren
Member

slaren commented Aug 11, 2025

I don't mind AI-generated code, but you need to take responsibility for the review yourself.

@CISC
Collaborator

CISC commented Aug 11, 2025

> I don't mind AI-generated code, but you need to take responsibility for the review yourself.

Sure, it was a test to see how well Copilot can resolve simple issues without too much interaction. It went fairly well, I think; I just need to be more hands-on next time. :)

@ggerganov
Member

[image]

the future is now 👀

@Copilot Copilot AI requested a review from ggerganov August 11, 2025 11:26
Copilot finished work on behalf of ggerganov August 11, 2025 11:26
Successfully merging this pull request may close these issues.

Feature Request: Add separate --override-tensor control for draft models.
5 participants