common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters #15191
base: master
Co-authored-by: CISC <[email protected]>
I didn't test exhaustively, but it works for my use case. Small tests show a ~25% improvement for a few prompts with a Q4 draft model.
I don't mind AI-generated code, but you need to take responsibility for the review yourself.
Sure, it was a test to see how well Copilot can resolve simple issues without too much interaction. It went fairly well, I think; I just need to be more hands-on next time. :)
…slaren's feedback) Co-authored-by: ggerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
CLI Flags Now Working
All three speculative draft model offload flags are now fully functional:
--override-tensor-draft: Specify tensor buffer type overrides for the draft model
--cpu-moe-draft: Keep all MoE weights in CPU for the draft model
--n-cpu-moe-draft N: Keep MoE weights of first N layers in CPU for the draft model
Example Usage
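As a rough illustration of what the two MoE flags select, the sketch below expands them into tensor-name regex patterns and checks which tensors would stay in CPU memory. The helper names (`moe_patterns_first_n`, `moe_pattern_all`, `stays_on_cpu`) are hypothetical, and the pattern strings and tensor names are an assumption about how the existing `--cpu-moe` / `--n-cpu-moe` options name MoE expert tensors; this is not the code from the PR.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: patterns that --n-cpu-moe-draft N is assumed to
// generate, one per layer index in [0, n_layers).
std::vector<std::string> moe_patterns_first_n(int n_layers) {
    std::vector<std::string> patterns;
    for (int i = 0; i < n_layers; ++i) {
        patterns.push_back("blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps");
    }
    return patterns;
}

// Hypothetical helper: single pattern assumed for --cpu-moe-draft,
// matching the expert tensors of every layer.
std::string moe_pattern_all() {
    return "\\.ffn_(up|down|gate)_exps";
}

// True if any pattern matches somewhere inside the tensor name.
bool stays_on_cpu(const std::string & tensor_name,
                  const std::vector<std::string> & patterns) {
    for (const auto & p : patterns) {
        if (std::regex_search(tensor_name, std::regex(p))) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, `stays_on_cpu("blk.1.ffn_gate_exps.weight", moe_patterns_first_n(2))` is true, while the same check for `blk.2.ffn_down_exps.weight` or `blk.0.attn_q.weight` is false, since only expert tensors of layers 0 and 1 are matched.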
Entrypoints Updated
All speculative decoding entrypoints properly apply draft-specific tensor overrides:
examples/speculative/speculative.cpp
examples/speculative-simple/speculative-simple.cpp
tools/server/server.cpp
Validation Results
The implementation ensures draft model tensor overrides are completely independent from main model overrides, enabling flexible heterogeneous hardware setups and advanced MoE configurations for speculative decoding workflows.
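The independence described above can be pictured as two separate override lists, each consulted only when loading its own model. The following is a minimal sketch under stated assumptions: `tensor_override`, `spec_params`, and `resolve_buffer` are hypothetical names, buffer types are reduced to string labels, and the "first matching override wins" rule is an assumption rather than a statement about the actual loader.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical override entry: regex on the tensor name -> buffer type label.
struct tensor_override {
    std::string pattern;
    std::string buffer_type; // e.g. "CPU"; real code would hold a backend handle
};

// Hypothetical parameter layout: main and draft models keep separate lists.
struct spec_params {
    std::vector<tensor_override> main_overrides;
    std::vector<tensor_override> draft_overrides;
};

// Assumed rule: the first matching override wins; otherwise use the fallback.
std::string resolve_buffer(const std::string & tensor_name,
                           const std::vector<tensor_override> & overrides,
                           const std::string & fallback) {
    for (const auto & o : overrides) {
        if (std::regex_search(tensor_name, std::regex(o.pattern))) {
            return o.buffer_type;
        }
    }
    return fallback;
}
```

With a single draft-only entry mapping `\.ffn_(up|down|gate)_exps` to `"CPU"`, resolving `blk.3.ffn_up_exps.weight` against `draft_overrides` yields `"CPU"`, while resolving the same name against the empty `main_overrides` yields the fallback, so the main model's placement is unaffected.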