Training a model with parallel linear layer #7870
Unanswered
Honzys asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hey! I'm not exactly sure, but there may be a way with fully sharded training, which enables a sort of model parallelism.
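For reference, here is a minimal sketch of what enabling fully sharded training could look like; the `strategy="fsdp"` flag and the `lightning.pytorch` import reflect recent Lightning releases and are assumptions, not something stated in this thread:

```python
import lightning.pytorch as pl  # older releases use `import pytorch_lightning as pl`

# Hypothetical sketch: with the FSDP strategy, model parameters (including a very
# large final linear layer) are sharded across the GPUs instead of being fully
# replicated on each device, as plain DDP would do.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="fsdp",
)
# trainer.fit(model, datamodule=dm)  # `model` and `dm` are placeholders for your
#                                    # own LightningModule and LightningDataModule
```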
-
Hello,
first of all, thanks for your great work on this package!
I stumbled upon a problem and I am wondering if someone can point me in the right direction or share their knowledge.
I am trying to train a model with a huge number of classes. The final linear layer can barely fit on one GPU, which is why I want to split it across my GPUs (let's say I have 8 GPUs available and the number of classes is 800K).
Then each GPU would hold roughly 100K of the output classes, i.e. its own slice of the final weight matrix.
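For illustration, a minimal sketch of this kind of output-sharded final layer (the class and its names are hypothetical, not code from this discussion; it assumes `torch.distributed` is already initialised and that every rank sees the same batch of features, i.e. pure model parallelism for this layer):

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ShardedClassifier(nn.Module):
    """Hypothetical sketch: the final linear layer is split across ranks by output class.

    With 800K classes on 8 GPUs, every rank owns a 100K-wide slice of the weight matrix.
    Assumes the default process group is initialised and all ranks receive the same batch.
    """

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        assert num_classes % self.world_size == 0, "classes must split evenly across ranks"
        self.shard_size = num_classes // self.world_size
        # Only this rank's slice of the huge output layer lives on this GPU.
        self.local_head = nn.Linear(in_features, self.shard_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Logits for the classes owned by this rank: (batch, shard_size).
        local_logits = self.local_head(features)
        # Collect the logit shards from every rank to form (batch, num_classes).
        shards = [torch.empty_like(local_logits) for _ in range(self.world_size)]
        dist.all_gather(shards, local_logits)
        # all_gather does not track gradients, so re-insert the local shard to keep
        # the autograd connection to this rank's slice of the weights.
        shards[self.rank] = local_logits
        return torch.cat(shards, dim=-1)
```

Since every rank computes the same loss over the same concatenated logits, the gradient that flows into each rank's local weight slice is exactly that slice's full gradient, with no extra synchronisation needed for this layer.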
Is there a simple way to achieve this?
I've found out that I would somehow have to prevent the "TrainingType" plugin from wrapping the linear layer with DDP, or prevent the Horovod optimizer from wrapping the linear layer's parameters.
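Outside of Lightning's strategy plugins, the usual way to keep DDP from touching a particular layer is to wrap only the rest of the model; a hedged sketch (names invented, the process group is assumed to be initialised, and the unwrapped head still needs its own synchronisation scheme such as the sharding above):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class BackboneOnlyDDP(nn.Module):
    """Hypothetical sketch: DDP wraps only the backbone, never the huge head."""

    def __init__(self, backbone: nn.Module, head: nn.Module, local_rank: int):
        super().__init__()
        device = torch.device(f"cuda:{local_rank}")
        # The backbone's gradients are all-reduced by DDP as usual.
        self.backbone = DDP(backbone.to(device), device_ids=[local_rank])
        # The head is deliberately left outside DDP, so it is never replicated or
        # bucketed; how it is synchronised (e.g. via class sharding) is up to you.
        self.head = head.to(device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))
```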
Has any of you tried a similar approach? I would love to hear any information regarding this issue.
Thanks in advance! :)