Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I'm using --override-tensor to load selected parts of Qwen3 480B onto a GPU and keep the rest in system memory. I'm trying to use Qwen3 30B as a draft model to improve performance, since it can run entirely in GPU memory. However, draft model performance is significantly degraded because it shares the main model's --override-tensor setting, so its MoE layers are also executed on the CPU.
The enhancement would add a new command-line flag, --override-tensor-draft, to specify different offload parameters for the draft model. In addition, with this change, the default behavior when --override-tensor-draft is not specified should be to offload all of the draft model's layers/tensors to the GPU (if GPU offloading is specified with -ngld), to match main model behavior.
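To make the requested semantics concrete, here is a minimal Python sketch of the placement logic, not the actual llama.cpp code. It assumes overrides are written as comma-separated `pattern=buffer` pairs whose patterns are regex-searched against tensor names. The function names, the example tensor names, and the `CUDA0` default are hypothetical and only illustrate the idea that the draft model never inherits the main model's overrides.

```python
import re


def parse_overrides(spec):
    """Parse 'pattern=buffer,pattern=buffer' into (compiled regex, buffer) pairs."""
    if not spec:
        return []
    pairs = []
    for item in spec.split(","):
        pattern, sep, buffer_type = item.rpartition("=")
        if not sep or not pattern or not buffer_type:
            raise ValueError(f"expected pattern=buffer, got {item!r}")
        pairs.append((re.compile(pattern), buffer_type))
    return pairs


def place_tensor(name, overrides, default_buffer):
    """Return the buffer of the first matching override, else the default."""
    for pattern, buffer_type in overrides:
        if pattern.search(name):
            return buffer_type
    return default_buffer


def draft_overrides(main_spec, draft_spec):
    """Proposed behavior (assumption): the draft model uses only its own
    override spec and ignores the main model's spec entirely."""
    return parse_overrides(draft_spec)


# Hypothetical example: main model keeps expert tensors on the CPU.
main_spec = r"ffn_.*_exps=CPU"
main = parse_overrides(main_spec)
draft = draft_overrides(main_spec, None)  # no draft override given

expert = "blk.0.ffn_up_exps.weight"  # illustrative tensor name
print(place_tensor(expert, main, "CUDA0"))   # main model: overridden
print(place_tensor(expert, draft, "CUDA0"))  # draft model: GPU default
```

With no draft spec, every draft tensor falls through to the GPU default, which is the default behavior requested above. Passing a draft spec would let the draft model get its own independent placement rules.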
Motivation
Improved performance when using draft models with an MoE architecture.
Possible Implementation
No response