A slightly better version of offloading only activated experts #762
Conversation
I will have to test. I've been tinkering with distributed wan. Will test by EOD and see what happens after all of these commits. Currently I am built at 46968d4, so I will use that as a baseline.
Wow, that's a pretty significant difference. I gain just 5% for Qwen3-30B-A3B. I'm going to merge it. I ran it many times while testing, up to a context of 65k tokens with 3 different models, and didn't observe the system hanging. But if the issue turns out to be real, the feature is behind a command line argument, so one does not have to use it.
I only experienced a similar bug with NCCL tensor parallel: a GPU gets out of sync and then gets stuck at 100%. I'll give it some regular use and see if it reoccurs.
Is the flag available for llama-server? I've compiled the latest version and it isn't there, or maybe I didn't understand "it is behind a command line argument, so one does not have to use it."
Yes, it is available. Just add `-ooae` to your other server command line arguments.
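For example, a launch could look like the sketch below; the model path and the other flags are only placeholders for whatever arguments you already use, with `-ooae` appended at the end:

```bash
# Hypothetical example: -m, -c, and -ngl stand in for your existing arguments.
./llama-server -m /models/Qwen3-30B-A3B.gguf -c 65536 -ngl 99 -ooae
```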
See #698
@Ph0rk0z
Does this fix the problem with your multi-GPU setup?
You need to add `--ooae` to your command line to activate the Offload Only Activated Experts feature.