GPU memory consumption fluctuates rapidly with FSDP training #13594
manideep2510 asked this question in DDP / multi-GPU / multi-node
Hi, I'm trying to use FSDP training with PyTorch Lightning for a classification task whose classification layer has 1 million classes, with 512 units in the layer before it. Without FSDP, I could not fit the model onto 4 GPUs, as expected, since the last layer alone has around 512 million parameters. With FSDP sharding that last layer, I am able to fit the model and train it on the same 4 GPUs.
However, with FSDP the GPU memory consumption fluctuates rapidly between 6.5 GB and 10 GB, which prevents me from increasing the batch size. Is this behavior expected, is it a bug in fairscale/Lightning, or am I using FSDP incorrectly? Any help would be appreciated.
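To pin down where the swings come from, a small helper like the one below can be dropped into `training_step` (the helper name is just illustrative); it distinguishes memory actually held by tensors (activations, all-gathered FSDP shards, gradients) from memory merely reserved by PyTorch's caching allocator:

```python
import torch


def log_cuda_memory(tag: str) -> None:
    # memory_allocated(): bytes currently occupied by live tensors on this device
    # memory_reserved(): total bytes held by PyTorch's caching allocator
    alloc_gb = torch.cuda.memory_allocated() / 2**30
    reserved_gb = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={alloc_gb:.2f} GB, reserved={reserved_gb:.2f} GB")


# e.g. call log_cuda_memory("before_forward") and log_cuda_memory("after_backward")
# inside training_step to see at which point the peaks appear
```

If only the reserved number swings, the fluctuation is the caching allocator rather than the model itself; if the allocated number swings, it is likely the transient all-gather of the sharded last layer during forward/backward.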
System & Environment
4 x RTX 2080Ti, CUDA 11.6
12 Core Intel Xeon
64 GB RAM
Torch 1.12.0
Pytorch-lightning 1.6.4
fairscale 0.4.6
Python 3.9.13
Memory Consumption
(Attached screen recording: Screen.Recording.2022-07-11.at.12.05.49.PM.mov)
The part of the LightningModule where the final Linear layer is wrapped with FSDP's wrap():
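In sketch form, it follows the standard `configure_sharded_model` pattern from the Lightning docs (the class name, placeholder backbone, loss, and optimizer below are illustrative, not my exact code):

```python
import pytorch_lightning as pl
import torch
import torch.nn as nn
from fairscale.nn import wrap


class MillionClassClassifier(pl.LightningModule):
    def __init__(self, embed_dim: int = 512, num_classes: int = 1_000_000):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.backbone = nn.Sequential(nn.Linear(2048, embed_dim), nn.ReLU())  # placeholder backbone
        self.classifier = None  # built and sharded in configure_sharded_model

    def configure_sharded_model(self):
        # With the fairscale-backed strategy="fsdp", Lightning calls this hook inside
        # an enable_wrap() context, so wrap() shards the 512 x 1M linear layer here.
        self.classifier = wrap(nn.Linear(self.embed_dim, self.num_classes))

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.classifier(self.backbone(x))
        loss = nn.functional.cross_entropy(logits, y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="fsdp", precision=16)
```

Building the large layer inside `configure_sharded_model` (rather than in `__init__`) lets it be wrapped and sharded as it is created, instead of first materializing the full 512M-parameter layer on one device.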