A question about metrics on ddp #5781
Replies: 14 comments 3 replies
-
Hi @haichao592, AFAIK the DistributedSampler simply does not behave like this. It only divides the dataset into disjoint subsets and does not add any extra samples. This is also the reason we haven't implemented this.
-
@justusschock, the DistributedSampler actually does add extra samples to make the dataset evenly divisible across replicas.
Reference: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L79
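For anyone who wants to verify this, a minimal sketch (assuming only a recent torch install; passing num_replicas/rank explicitly avoids needing an initialized process group):

```python
# Demonstrates that DistributedSampler pads the index list to make it
# evenly divisible across replicas: 10 samples over 3 replicas yields
# 12 indices in total, so two samples are seen twice.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples, 3 replicas -> uneven
samplers = [
    DistributedSampler(dataset, num_replicas=3, rank=r, shuffle=False)
    for r in range(3)
]
all_indices = sorted(i for s in samplers for i in s)
print(len(all_indices))  # 12, not 10
print(all_indices)       # [0, 0, 1, 1, 2, 3, ..., 9] -> indices 0 and 1 repeat
```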
-
Okay, thanks @zerogerc, learned something new :) But to answer the question here: no, we don't cover that, and I also think we can't do that so easily, because we don't know which samples the duplicates are (there may also be shuffling) and we also don't know which batch you currently feed into your metric. So far this is only a module that calculates and syncs with some sanity checks. The more advanced features are yet to come, but I'm not sure if this is possible :) @SkafteNicki, do you think we can add this feature? Probably not without knowing the index and removing duplicates, which we won't do, right?
-
Hi @justusschock, just my thoughts on this matter. I believe that even divisibility is only critical for backpropagation and gradient calculation. I think that writing a distributed sampler which does not add additional samples would solve the problem for metric calculation.
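Something along these lines could work for evaluation (a hypothetical sketch, not an existing sampler; the class name and the plain round-robin split are my own assumptions):

```python
from torch.utils.data import Sampler

class UnpaddedDistributedSampler(Sampler):
    """Split a dataset across replicas WITHOUT padding: the union of all
    ranks' indices is exactly the dataset and nothing is duplicated.
    Ranks may receive different numbers of samples, which is fine for
    metric computation but would desync DDP's backward pass."""

    def __init__(self, dataset, num_replicas: int, rank: int):
        self.indices = list(range(len(dataset)))[rank::num_replicas]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

# With 10 samples and 3 replicas the ranks receive 4, 3 and 3 samples:
print([len(UnpaddedDistributedSampler(range(10), 3, r)) for r in range(3)])
```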
-
I was not aware of this either, and cannot really understand why this is the default behavior of PyTorch (at least without any kind of warning).
-
@SkafteNicki @zerogerc when you think about it, this makes sense (and cannot be changed that easily without dropping the last batches), since you cannot sync/reduce something that is not available on some of the processes (which would be the case if you have a number of batches/samples that is not evenly divisible).
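To make that concrete, an illustrative sketch of the collective calls involved (compute_grad is a hypothetical stand-in; this needs an initialized process group and, with mismatched batch counts, would really deadlock, which is exactly the point):

```python
import torch
import torch.distributed as dist

def compute_grad(batch: torch.Tensor) -> torch.Tensor:
    # stand-in for a real forward/backward pass
    return batch.mean().reshape(1)

def train_loop(batches):
    # Every iteration enters an all_reduce that ALL ranks must join.
    # If len(batches) differs across ranks, the rank with fewer batches
    # leaves the loop early and the remaining ranks block forever in
    # their next all_reduce.
    for batch in batches:
        grad = compute_grad(batch)
        dist.all_reduce(grad)
```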
-
@justusschock for something like that to work, a custom sampler that does not add the extra samples would be needed.
-
@justusschock, @SkafteNicki I think that custom sampler would break the even division across processes that DDP relies on for synchronization.
-
@zerogerc we are aware of that problem, and I have an upcoming PR that will fix that issue by using DDP's support for uneven inputs.
-
This problem was also discussed in PyTorch (pytorch/pytorch#22584), with the conclusion that the only way to do this is to keep track of the sample ids and then filter the computations based on them. Interestingly enough, this is the very reason the official ImageNet example only adds a distributed sampler to its train dataloader and not its validation dataloader (https://github.com/pytorch/examples/blob/49ec0bd72b85be55579ae8ceb278c66145f593e1/imagenet/main.py#L216-L233).
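A hedged sketch of that bookkeeping (names are illustrative; it assumes the dataset index is carried alongside each prediction, e.g. returned from __getitem__, and is applied after gathering results from all ranks):

```python
import torch

def dedup_by_index(preds: torch.Tensor, target: torch.Tensor, idx: torch.Tensor):
    """Drop padded duplicates: sort by dataset index so duplicates become
    adjacent, then keep only the first occurrence of each index."""
    order = torch.argsort(idx)
    idx, preds, target = idx[order], preds[order], target[order]
    keep = torch.ones_like(idx, dtype=torch.bool)
    keep[1:] = idx[1:] != idx[:-1]
    return preds[keep], target[keep]

# 6 gathered predictions, but samples 0 and 1 were padded duplicates:
preds = torch.tensor([0.9, 0.1, 0.8, 0.3, 0.9, 0.1])
target = torch.tensor([1, 0, 1, 0, 1, 0])
idx = torch.tensor([0, 1, 2, 3, 0, 1])
print(dedup_by_index(preds, target, idx))  # 4 unique samples remain
```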
-
I think that tracking sample ids and filtering would add significant overhead to the validation step.
-
Judging from the implementation of https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L501
-
@zerogerc did you test it? |
-
This issue should be solved by PR #5141, which enables support for uneven inputs in LightningDDP.
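For reference, a minimal self-contained sketch of the underlying uneven-input mechanism, DDP's join() context manager (torch >= 1.7; not the PR's actual code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(4, 1))
    n_batches = 3 if rank == 0 else 2  # deliberately uneven inputs
    # join() makes ranks that run out of batches shadow the collective
    # calls of the others instead of leaving them deadlocked.
    with model.join():
        for _ in range(n_batches):
            loss = model(torch.randn(8, 4)).sum()
            loss.backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```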
-
Thanks for the nice work! 🥳🥳🥳
Background
The new metrics package is released with built-in DDP support.
DistributedSampler is used by dataloaders to work with DDP.
It adds extra samples to make the dataset evenly divisible across processes.
This means that during evaluation and testing, the added extra samples affect the eval result.
Question
Does the metrics package remove the extra redundant samples to make sure the eval result is not affected?
I have looked into the code and found the answer is NO.
Did I miss something, or has it just not been resolved yet?