Turn off ddp_sharded during evaluation #8534
Unanswered
zhu-y11
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 2 replies
-
Dear @zhu-y11, using ddp_sharded is not what causes your data to be split. That is actually the result of Lightning injecting a DistributedSampler into your DataLoader. Here is one option to resolve your problem: rely on Lightning to perform the metric reduction, as yours seems to be wrongly implemented there.
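For concreteness, here is a minimal sketch of what that could look like (not from the original reply; it assumes a manually computed per-batch accuracy and uses `self.log(..., sync_dist=True)` to let Lightning reduce the value across processes):

```python
# Minimal sketch, not the original code from this thread: log validation metrics
# with sync_dist=True so Lightning all-reduces them across the sharded processes.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class MyClassifier(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model  # placeholder, e.g. a Hugging Face sequence classifier

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=-1) == y).float().mean()
        # sync_dist=True averages the scalars over all processes, so callbacks
        # such as EarlyStopping see a metric over the whole dev set, not one shard.
        self.log("val_loss", loss, sync_dist=True)
        self.log("val_acc", acc, sync_dist=True)
        return loss
```

Note that averaging per-batch accuracy this way is only exact when all batches have the same size; a torchmetrics metric (e.g. `torchmetrics.Accuracy`) handles the cross-process synchronization and exact counting for you.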
Best,
-
Hi there,
I am using ddp_sharded with FairScale, and it works fine during training with the Lightning Trainer. However, I found that during evaluation on the dev/test set, ddp_sharded is still active, i.e. the dataset is split into shards and evaluated separately, which makes it difficult to compute evaluation metrics (e.g. accuracy) or to use early stopping. Is there any way to use ddp_sharded during training but turn it off for evaluation, so that evaluation runs on a single GPU?
Here is a code snippet of my trainer (see the sketch after the version list below); the model is a simple PyTorch classifier built with Hugging Face transformers.
pytorch: v1.8.1
pytorch lightning: v1.3.8
fairscale: v0.3.8
transformers: v4.6.1
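The snippet from the original post did not come through here; the following is only an illustrative sketch of a comparable setup for the versions listed above, where `model`, `train_dataloader`, and `val_dataloader` are placeholders:

```python
# Illustrative sketch, not the original snippet: a Lightning 1.3.x Trainer using
# the FairScale sharded DDP plugin; model and dataloaders are placeholders.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

trainer = pl.Trainer(
    gpus=4,                     # number of GPUs used for sharded training
    plugins="ddp_sharded",      # FairScale sharded DDP
    max_epochs=10,
    callbacks=[EarlyStopping(monitor="val_acc", mode="max")],
)
trainer.fit(model, train_dataloader, val_dataloaders=val_dataloader)
```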