Logging Metrics Manually in DDP mode with 2 GPUs #3088
Replies: 1 comment
When using torchmetrics with PyTorch Lightning in DDP mode (multi-GPU), metric states need to be synchronized across all processes before calling `compute()`. A recommended approach: use the `sync_on_compute=True` flag (it is the default), which makes `compute()` gather the metric states from all ranks before reducing:

```python
from torchmetrics import Accuracy

# task/num_classes are required in torchmetrics >= 0.11;
# the values here are illustrative
metric = Accuracy(task="multiclass", num_classes=10, sync_on_compute=True)
```

Then you can safely call:

```python
metric.update(preds, target)
result = metric.compute()
metric.reset()
```

Alternatively, you can manually call `metric.sync()` before `compute()` (and `metric.unsync()` afterwards). Also, make sure you call `compute()` from code that runs on every rank: it performs a collective operation, so if only one process reaches it, the others block until the distributed timeout fires. This should fix the hanging issue in DDP mode. Let me know if you want a minimal example to demonstrate this!
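Following up on that offer, here is a minimal sketch (not from the thread) of the manual-logging pattern inside a LightningModule; the class name `LitClassifier`, the 10-class setup, and the wrapped `model` are illustrative assumptions:

```python
# Minimal sketch (hypothetical, not from the thread) of manually logging
# a torchmetrics metric under DDP. LitClassifier, num_classes=10, and the
# wrapped `model` are illustrative assumptions.
import pytorch_lightning as pl
from torchmetrics import Accuracy


class LitClassifier(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # sync_on_compute=True (the default) makes compute() gather the
        # metric state from all ranks before reducing.
        self.val_acc = Accuracy(task="multiclass", num_classes=10,
                                sync_on_compute=True)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self.model(x)
        # update() only accumulates local state; no cross-process
        # communication happens here.
        self.val_acc.update(preds, y)

    def on_validation_epoch_end(self):
        # Every rank runs this hook, so the collective inside compute()
        # is matched on all processes and cannot hang.
        acc = self.val_acc.compute()
        self.log("val_acc", acc)
        self.val_acc.reset()
```

The key point is that `update()` is purely local while `compute()` is the synchronization point, so `compute()` should live in an epoch-end hook that every rank executes.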
The docs for torchmetrics when used with PyTorch Lightning describe how to manually call `compute()` and `reset()` at the end of the train/val epoch when logging a metric manually instead of letting `self.log(...)` do the work. This works for me on a single GPU, but with 2 GPUs and the DDP strategy, the `metric.compute()` call hangs for a long time and then a timeout error occurs. Is there a different set of instructions for manual logging in DDP mode with more than one GPU?
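For reference, a common way to trigger exactly this hang is guarding the `compute()` call so that only one rank executes it. A hypothetical sketch, continuing the `LitClassifier` example above (`is_global_zero` is a Trainer property; the guard shown is the anti-pattern):

```python
# Anti-pattern (hypothetical sketch): only rank 0 reaches compute(),
# so the collective inside it is never matched on the other rank and
# DDP blocks until the distributed timeout fires.
def on_validation_epoch_end(self):
    if self.trainer.is_global_zero:  # do NOT guard compute() like this
        acc = self.val_acc.compute()
        self.log("val_acc", acc)
    self.val_acc.reset()
```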