Low results when validated using multiple GPUs / DDP as compared to single GPU #7133
Replies: 3 comments
-
Are you computing an F1 score here for classification? Apologies, I have not studied the code in detail, but for correct metric reduction and logging we recommend torchmetrics. TorchMetrics overview: https://torchmetrics.readthedocs.io/en/latest/pages/overview.html#overview Is there evidence that this is a bug, or is it OK to convert this to a question/implementation help?
-
Sorry, I think this should be converted to a question/implementation help. The target in this problem is the list of `posting_id` values that are similar to the current posting id, so there is no particular class here. In validation, I am doing unsupervised clustering. If you look at the getMetric function, I am basically calculating the intersection of the posting ids predicted by my model and the target posting ids. Each row in the dataframe has a different posting_id. For each row, I calculate an F1 score and take the mean over all rows as the final F1 score. So there is no notion of classes here, and I can't think of any way to make the torchmetrics F1 score work: it asks me to declare a number of classes, which I don't have, and each sample can have any number of posting ids as its target. Any suggestions for the getMetric function to make it work for multi-GPU validation?
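The per-row set-intersection F1 described above can be sketched framework-free as follows (names like `row_f1`, `all_preds`, and `all_targets` are illustrative, not taken from the thread's actual getMetric code):

```python
def row_f1(pred_ids, target_ids):
    """F1 between the predicted and target sets of posting_ids."""
    pred, target = set(pred_ids), set(target_ids)
    hits = len(pred & target)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(target)
    return 2 * precision * recall / (precision + recall)

# One list of predicted posting_ids and one list of target
# posting_ids per dataframe row; the final score is the row mean.
all_preds = [["a", "b"], ["c"]]
all_targets = [["a", "c"], ["c"]]
mean_f1 = sum(row_f1(p, t) for p, t in zip(all_preds, all_targets)) / len(all_preds)
```

Because each row contributes one scalar, the metric has no fixed class count, which is why the standard classification F1 interfaces do not apply directly.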
-
Can you guarantee that [syncing across GPUs and then computing the metric] gives the same result as [computing the metric on each GPU and then averaging the metric over the GPUs]? Also, if the dataloader size is not divisible by the number of GPUs, the DistributedSampler will repeat samples so that every GPU gets the same number of batches. For this reason, I recommend running trainer.test() on one GPU first to make sure you get the expected metrics before moving to the multi-GPU case.
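A tiny arithmetic demonstration of why those two orderings can differ when the per-GPU shards are unequal in size (the numbers here are made up for illustration):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Per-row F1 scores landing on two hypothetical GPUs.
shard_gpu0 = [1.0, 1.0, 1.0]  # 3 rows on GPU 0
shard_gpu1 = [0.0]            # 1 row on GPU 1

# Compute on each GPU, then average the per-GPU metrics:
avg_of_per_gpu_means = mean([mean(shard_gpu0), mean(shard_gpu1)])  # 0.5

# Gather all rows first, then compute once over everything:
global_mean = mean(shard_gpu0 + shard_gpu1)                        # 0.75
```

The first ordering weights every GPU equally regardless of how many rows it processed; the second weights every row equally, which is what the single-GPU run effectively does.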
-
I trained a model using the following LightningModule. When validating with a single GPU and the trained weights, I get a val_f1 score of 0.288, but when validating with multiple GPUs and DDP, I get a val_f1 score of 0.082. How can I resolve this?
In the following code, I have also tried using the gather_all_tensors function, but it didn't help.
I use the following Trainer for single GPU.
For 2 GPUs, I do the following:
When I use the following for a single GPU, I also get val_f1 of 0.082:
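(The Trainer snippets referenced above did not survive extraction, so the exact arguments are unknown. As a hedged sketch only, single- and multi-GPU setups in the Lightning 1.x API of that era typically looked like this:)

```python
from pytorch_lightning import Trainer

# Single-GPU baseline (the sanity check recommended in the replies):
trainer = Trainer(gpus=1)

# Two GPUs with DDP; Lightning 1.x-style flag. Newer Lightning
# versions use devices=2 and strategy="ddp" instead.
trainer = Trainer(gpus=2, accelerator="ddp")
```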