Split one model's different parts on different gpus #7162
Replies: 8 comments 3 replies
-
PyTorch Lightning has support for sequential model parallelism: wrap the sub-modules in an `nn.Sequential` and tell the `RPCSequentialPlugin` how many layers to place on each GPU via `balance`:

```python
self.model = nn.Sequential(Bert(), Linear(10, 20))  # __init__()
...
self.model(x)  # forward()
...
plugin = RPCSequentialPlugin(balance=[1, 1])
trainer = Trainer(plugins=[plugin])
```
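For context, here is a more complete sketch of how those pieces fit together. This is only a sketch: `Bert` is a stand-in module, the import path, `accelerator` setting, and exact Trainer arguments for `RPCSequentialPlugin` vary by Lightning version (it also needs fairscale installed), and the plugin was later deprecated, as discussed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.plugins import RPCSequentialPlugin  # import path may differ by version


class Bert(nn.Module):
    """Stand-in for a large pretrained backbone (placeholder, as in the snippet above)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 768)

    def forward(self, x):
        return self.encoder(x)


class SplitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # a flat nn.Sequential lets the plugin cut the model into per-GPU stages
        self.model = nn.Sequential(Bert(), nn.Linear(768, 20))

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


# balance=[1, 1]: first Sequential stage on GPU 0, second stage on GPU 1 (assumed 2-GPU machine)
trainer = pl.Trainer(gpus=2, accelerator="ddp", plugins=[RPCSequentialPlugin(balance=[1, 1])])
trainer.fit(SplitModel())
```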
-
Hey @dalek-who, I wouldn't recommend using the RPCSequentialPlugin, as it is being deprecated. Instead, you can use the DeepSpeed integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed#deepspeed. We managed to scale crazy large models with it. It can also be used on only 1 GPU with CPU offloading. Give it a try and give us feedback. Best,
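For a concrete starting point, here is a minimal sketch of enabling the DeepSpeed integration from the Trainer. The `'deepspeed_stage_3'` alias is the one used later in this thread; the single-GPU offload alias and `precision=16` are assumptions that may differ between Lightning versions.

```python
import pytorch_lightning as pl

# Multi-GPU: ZeRO Stage 3 shards parameters, gradients and optimizer state across GPUs.
trainer = pl.Trainer(gpus=4, precision=16, plugins="deepspeed_stage_3")

# Single GPU: the same integration, with states offloaded to CPU memory to fit a larger model.
trainer = pl.Trainer(gpus=1, precision=16, plugins="deepspeed_stage_3_offload")
```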
-
@tchaton Can you provide a simple example?
-
Oh, I wasn't aware of the deprecation. Sorry about that.
-
Hey guys :) Regarding the deprecation of the RPCSequentialPlugin: DeepSpeed Stage 3 offers the same practice, and we already have it within Lightning. A minimal example of how all this can work can be found here: https://github.com/SeanNaren/minGPT/tree/stage3

Regarding a layer (in this case the classifier) being too large for a single GPU, you can instantiate it inside `configure_sharded_model` so it is sharded across GPUs as soon as it is created. We are planning a refresh of the documentation to make it easier to find these tidbits, as things have become a bit complex in the ecosystem.

For a small example:

```python
import torch.nn as nn
import pytorch_lightning as pl


class MyLargeModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # a large backbone like bert
        self.bert = Bert()

    def configure_sharded_model(self):
        # a very very large classifier layer with 6 million classes
        # is now sharded instantly onto all GPUs using DeepSpeed Stage 3
        self.classifier = nn.Linear(768, 6_000_000)

    def forward(self, x):
        emb = self.bert(x)
        score = self.classifier(emb)
        return score


model = MyLargeModel()
trainer = pl.Trainer(
    gpus=4,
    plugins='deepspeed_stage_3'
)
trainer.fit(model)
```

DeepSpeed Stage 3 shards the model across all GPUs, but a layer created inside `configure_sharded_model` is sharded as soon as it is instantiated, so the full classifier never has to fit on a single GPU.
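If you need settings beyond what the `'deepspeed_stage_3'` alias provides, the plugin can also be passed as an object. This is a sketch assuming the Lightning 1.3-era `DeepSpeedPlugin` arguments (`stage`, `cpu_offload`), which were renamed in later releases, so check the version you have installed.

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

trainer = pl.Trainer(
    gpus=4,
    precision=16,
    # stage=3 shards parameters as well; cpu_offload moves optimizer state to CPU memory
    plugins=DeepSpeedPlugin(stage=3, cpu_offload=True),
)
```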
-
@SeanNaren which torch and pytorch-lightning version should I use?
-
Dear @dalek-who, You should use PyTorch Lightning 1.3.0rc1 and the latest PyTorch. Best,
-
@tchaton I use pl-1.3.0rc1 and torch-1.8.1. Here are some problems with this solution:

```
File "/home/projects/long_tail_link/link_main.py", line 479, in main
trainer.test(model=pl_module, verbose=False)
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 956, in test
results = self.fit(model)
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in fit
self.pre_dispatch()
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 512, in pre_dispatch
self.accelerator.pre_dispatch(self)
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 105, in pre_dispatch
self.training_type_plugin.pre_dispatch()
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 234, in pre_dispatch
self.init_deepspeed()
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 239, in init_deepspeed
self._format_config()
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 395, in _format_config
self._format_batch_size_and_grad_accum_config()
File "/home/anaconda3/envs/conda-long-tail-link/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 407, in _format_batch_size_and_grad_accum_config
batch_size = self.lightning_module.train_dataloader().batch_sampler.batch_size
AttributeError: 'NoneType' object has no attribute 'batch_sampler'
```
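The traceback shows the DeepSpeed plugin trying to read the training batch size from `train_dataloader()` even though only `trainer.test(...)` is being run. As a possible workaround (an assumption on my part, not something verified in this thread), you could pass an explicit DeepSpeed config so the plugin does not have to infer the batch size; `train_micro_batch_size_per_gpu` is a standard DeepSpeed config key.

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

deepspeed_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 8,  # set this to your actual batch size
}

# pl_module is the LightningModule from the failing script above (defined elsewhere)
trainer = pl.Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(config=deepspeed_config))
trainer.test(model=pl_module, verbose=False)
```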
-
🚀 Feature
Motivation
In my case, I have a simplified large model in which `self.classifier` is so large that it must live on another GPU. However, if I simply set `gpus=2` in `pl.Trainer`, it copies the whole model onto both GPUs (and both raise `CUDA out of memory`) rather than splitting it across the two GPUs.
Pitch
An easy way to manually split one model across different devices, like the tutorial above.
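For illustration, a minimal sketch of the kind of manual split being asked for here, using plain PyTorch device placement; `Backbone`, the layer sizes, and the device ids are assumptions for the example, not the issue's original code.

```python
import torch
import torch.nn as nn


class Backbone(nn.Module):
    """Stand-in for a large encoder such as BERT (hypothetical sizes)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 768)

    def forward(self, x):
        return self.encoder(x)


class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # place the backbone on the first GPU and the huge classifier on the second
        self.backbone = Backbone().to("cuda:0")
        self.classifier = nn.Linear(768, 6_000_000).to("cuda:1")

    def forward(self, x):
        emb = self.backbone(x.to("cuda:0"))
        # move activations between devices by hand
        return self.classifier(emb.to("cuda:1"))


model = SplitModel()
scores = model(torch.randn(4, 128))  # output lives on cuda:1
```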
Alternatives
Additional context