Replies: 1 comment
The issue you are facing with moving metrics in a MetricCollection to the GPU under DDP might be related to a bug, or to incomplete handling of device placement, in older versions of torchmetrics.
I am using torchmetrics with several custom metric classes in a DDP setup. The metrics live inside a MetricCollection defined as part of the model. DDP replicates the model across 4 GPUs, but the call to the metrics' forward method fails with the following error:
I define metrics like this in the model's init method:
Methods defined in the distributed class:
Call to metrics while training:
When I print the metric object from the model at this point, I can see the MetricCollection and all the metrics inside it printed 4 times:
This suggests the metrics exist in all 4 processes, but the individual metric states still do not appear to have been moved to the GPU.
What am I missing in the code to move the torchmetrics to the GPU?