A question about metrics on ddp #5781

haichao592 · 2020-06-28T12:02:22Z

haichao592
Jun 28, 2020

Thanks for the nice work! 🥳🥳🥳

Backgroud

The new metrics package is released with build-in ddp support.
DistributedSampler is used by dataloaders to work with ddp.
It adds extra samples to make the samples evenly divisible.
This means during evaluation and testing, the added extra samples affect the eval result.

Question

Does the metrics package remove the extra redundant samples to make sure the eval result is not affected?

I have looked into the code and find the answer is NO.
Did I miss something or it's just not been resolved for now?

justusschock · 2020-07-02T06:50:27Z

justusschock
Jul 2, 2020
Maintainer

Hi @haichao592 , AFAIK the DistributedSampler simply does not behave like this. It only divides the dataset into disjunct subsets and does not add any extra samples. This is also the reason, we haven't implemented this.

0 replies

zerogerc · 2020-07-02T07:41:45Z

zerogerc
Jul 2, 2020

@justusschock, the DistributedSampler actually adds samples from the start of the dataset to make the number of examples evenly divisible:

        # add extra samples to make it evenly divisible
        indices += indices[:(self.total_size - len(indices))]

Reference: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L79

0 replies

justusschock · 2020-07-02T07:53:04Z

justusschock
Jul 2, 2020
Maintainer

Okay, thanks @zerogerc learned something new :)

But to answer the question here: No we don't over that and I also think we can't do that so easily. Because we don't know which samples that are (there may also be shuffling) and we also don't know which batch you currently feed in your metric. So far this is only a module that calculates and syncs with some sanity checks. The more advanced features are yet to come, but I'm not sure, if this is possible :)

@SkafteNicki do you think, we can add this feature? Probably not without knowing the index and removing duplicates, which we won't do, right?

0 replies

zerogerc · 2020-07-02T08:04:16Z

zerogerc
Jul 2, 2020

Hi @justusschock, just my thoughts on this matter.

I believe that even divisibility is only critical for backward propagation and gradient calculation. I think that writing a distributed sampler which does not add additional samples would solve the problem for the metrics calculation.

0 replies

SkafteNicki · 2020-07-02T08:04:26Z

SkafteNicki
Jul 2, 2020
Collaborator

I was neither aware of this, and cannot really understand why this is default behavior of pytorch (at least without any kind of warning).
I guess that the only way to prevent this would be to write a custom DistributedSampler that corrects this.

0 replies

justusschock · 2020-07-02T08:08:33Z

justusschock
Jul 2, 2020
Maintainer

@SkafteNicki @zerogerc when you think about it, this makes sense (and can not be changed that easily without dropping the last batches), since you cannot sync/reduce something that is not available on some of the processes (which would be the case if you have a not evenly divisible number of batches/samples.

0 replies

SkafteNicki · 2020-07-02T08:13:27Z

SkafteNicki
Jul 2, 2020
Collaborator

@justusschock for something like that to work, a torch.distributed.barrier would be needed before any reduction?

0 replies

zerogerc · 2020-07-02T08:37:16Z

zerogerc
Jul 2, 2020

@justusschock, @SkafteNicki
One more concern.

I think that custom sampler would break the MEAN reduce operation for metrics. For example, accuracy cannot be correctly reduced if each process handled the different number of examples. For correct reduction, each process should calculate the number of handled examples. Another approach is to return tps and sups and always use SUM operation to reduce ddp.

0 replies

SkafteNicki · 2020-07-02T08:42:58Z

SkafteNicki
Jul 2, 2020
Collaborator

@zerogerc we are aware of that problem, and I have an upcoming PR that will fix that issue by using SUM to sync metric states (for accuracy tps and sups) which afterwards gets divided

0 replies

SkafteNicki · 2020-07-02T08:49:32Z

SkafteNicki
Jul 2, 2020
Collaborator

This problem was also discussed in pytorch (pytorch/pytorch#22584) with the solution that the only way to do this would be to keep track of the sample ids and then filter the computations based on that.

Interestingly enough this is the very reason the official imagenet example only adds distributed sampler to their train dataloader and not their validation dataloader (https://github.com/pytorch/examples/blob/49ec0bd72b85be55579ae8ceb278c66145f593e1/imagenet/main.py#L216-L233)

0 replies

zerogerc · 2020-07-02T08:59:42Z

zerogerc
Jul 2, 2020

I think that tracking of sample ids and filtering would add significant overhead to validation step.

0 replies

zerogerc · 2020-07-02T09:11:16Z

zerogerc
Jul 2, 2020

Judging from the implementation of DistributedDataParallel the computation only hangs if the gradient computation is required, so the uneven distribution should work. Needs testing, though.

https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L501

0 replies

carmocca · 2020-10-20T02:12:05Z

carmocca
Oct 20, 2020

Judging from the implementation of DistributedDataParallel the computation only hangs if the gradient computation is required, so the uneven distribution should work. Needs testing, though.

@zerogerc did you test it?

0 replies

SkafteNicki · 2020-12-16T10:52:53Z

SkafteNicki
Dec 16, 2020
Collaborator

This issue should be solved by this PR #5141 which enables support for even input in LightningDDP

3 replies

miquelmarti Feb 10, 2022

This PR was never merged but the original issue was closed. Has there been any progress on this? Is there any way to have distributed (and correct 😄) evaluation?

carmocca Feb 10, 2022

It's still WIP.

In the meantime you can try using this distributed sampler:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/overrides/distributed.py#L81

miquelmarti Feb 10, 2022

Great, I will give it a try. Thanks!

A question about metrics on ddp #5781

Uh oh!

Backgroud

Question

Replies: 14 comments · 3 replies

Uh oh!

justusschock Jul 2, 2020 Maintainer

Uh oh!

Uh oh!

Uh oh!

justusschock Jul 2, 2020 Maintainer

Uh oh!

Uh oh!

SkafteNicki Jul 2, 2020 Collaborator

Uh oh!

justusschock Jul 2, 2020 Maintainer

Uh oh!

SkafteNicki Jul 2, 2020 Collaborator

Uh oh!

Uh oh!

SkafteNicki Jul 2, 2020 Collaborator

Uh oh!

SkafteNicki Jul 2, 2020 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SkafteNicki Dec 16, 2020 Collaborator

Uh oh!

Uh oh!

Uh oh!

Replies: 14 comments 3 replies

justusschock
Jul 2, 2020
Maintainer

justusschock
Jul 2, 2020
Maintainer

SkafteNicki
Jul 2, 2020
Collaborator

justusschock
Jul 2, 2020
Maintainer

SkafteNicki
Jul 2, 2020
Collaborator

SkafteNicki
Jul 2, 2020
Collaborator

SkafteNicki
Jul 2, 2020
Collaborator

SkafteNicki
Dec 16, 2020
Collaborator