Training a model with parallel linear layer #7870
Unanswered
Honzys asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hey! I'm not exactly sure, but there may be a way with fully sharded training, which enables a sort of model parallelism.
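For reference, here is a minimal sketch of what enabling fully sharded training could look like; the `strategy="fsdp"` flag and the `lightning.pytorch` import reflect recent Lightning releases and are assumptions, not something stated in this thread:

```python
import lightning.pytorch as pl  # older releases use `import pytorch_lightning as pl`

# Hypothetical sketch: with the FSDP strategy, model parameters (including a very
# large final linear layer) are sharded across the GPUs instead of being fully
# replicated on each device, as plain DDP would do.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="fsdp",
)
# trainer.fit(model, datamodule=dm)  # `model` and `dm` are placeholders for your
#                                    # own LightningModule and LightningDataModule
```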
-
Hello,
first of all, thanks for your great work on this package!
I stumbled upon a problem and I am wondering if someone can point me in the right direction or share their knowledge.
I am trying to train a model with a huge number of classes. The final linear layer can barely fit on one GPU, which is why I want to split it across my GPUs (let's say I have 8 GPUs available and the number of classes is 800K).
Then each GPU would hold roughly 100K of the output classes, i.e. its own slice of the final weight matrix.
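For illustration, a minimal sketch of this kind of output-sharded final layer (the class and its names are hypothetical, not code from this discussion; it assumes `torch.distributed` is already initialised and that every rank sees the same batch of features, i.e. pure model parallelism for this layer):

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ShardedClassifier(nn.Module):
    """Hypothetical sketch: the final linear layer is split across ranks by output class.

    With 800K classes on 8 GPUs, every rank owns a 100K-wide slice of the weight matrix.
    Assumes the default process group is initialised and all ranks receive the same batch.
    """

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        assert num_classes % self.world_size == 0, "classes must split evenly across ranks"
        self.shard_size = num_classes // self.world_size
        # Only this rank's slice of the huge output layer lives on this GPU.
        self.local_head = nn.Linear(in_features, self.shard_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Logits for the classes owned by this rank: (batch, shard_size).
        local_logits = self.local_head(features)
        # Collect the logit shards from every rank to form (batch, num_classes).
        shards = [torch.empty_like(local_logits) for _ in range(self.world_size)]
        dist.all_gather(shards, local_logits)
        # all_gather does not track gradients, so re-insert the local shard to keep
        # the autograd connection to this rank's slice of the weights.
        shards[self.rank] = local_logits
        return torch.cat(shards, dim=-1)
```

Since every rank computes the same loss over the same concatenated logits, the gradient that flows into each rank's local weight slice is exactly that slice's full gradient, with no extra synchronisation needed for this layer.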
Is there a simple way to achieve this?
I've found out that I would somehow have to prevent the "TrainingType" plugin from wrapping the linear layer with DDP, or prevent the Horovod optimizer from wrapping the linear layer's parameters.
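Outside of Lightning's strategy plugins, the usual way to keep DDP from touching a particular layer is to wrap only the rest of the model; a hedged sketch (names invented, the process group is assumed to be initialised, and the unwrapped head still needs its own synchronisation scheme such as the sharding above):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class BackboneOnlyDDP(nn.Module):
    """Hypothetical sketch: DDP wraps only the backbone, never the huge head."""

    def __init__(self, backbone: nn.Module, head: nn.Module, local_rank: int):
        super().__init__()
        device = torch.device(f"cuda:{local_rank}")
        # The backbone's gradients are all-reduced by DDP as usual.
        self.backbone = DDP(backbone.to(device), device_ids=[local_rank])
        # The head is deliberately left outside DDP, so it is never replicated or
        # bucketed; how it is synchronised (e.g. via class sharding) is up to you.
        self.head = head.to(device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))
```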
Has any of you tried a similar approach? I would love to hear any information regarding this issue.
Thanks in advance! :)