How do I set the steps_per_epoch parameter of a lr scheduler in multi-GPU environment? #2149
-
What is your question?
For some learning rate schedulers there is a required steps_per_epoch parameter; one example is the OneCycleLR scheduler. On a CPU or a single GPU, this parameter should be set to the length of the train dataloader. My question is: how should this parameter be set on a multi-GPU machine using DDP? Does it need to be adjusted for the number of GPUs?

What have you tried?
I've tried manually dividing the steps_per_epoch of the OneCycleLR scheduler by the number of GPUs when training on a multi-GPU machine. The LR doesn't seem to follow the expected update pattern, and I think the scheduler may be the source of the problem.

What's your environment?
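For reference, here is a minimal single-process sketch of how steps_per_epoch is typically wired up; the dataset, model, and hyperparameter values are made up for illustration and are not code from this thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data and model, just to make the sketch self-contained.
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
train_loader = DataLoader(dataset, batch_size=32)

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs = 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=epochs,
    # On CPU / single GPU this is just the dataloader length:
    # one scheduler step per optimizer step.
    steps_per_epoch=len(train_loader),
)
```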
-
After some more investigation, it seems that dividing the dataloader size by the number of GPUs is the correct approach. The documentation could be clearer on this, but I'm closing this now.
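A hedged sketch of that approach inside a LightningModule, assuming the module defines its own train_dataloader(); the trainer attribute used for the device count and the dummy model/data below are assumptions for illustration, not code from this thread:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self, max_epochs: int = 10):
        super().__init__()
        self.max_epochs = max_epochs
        self.net = torch.nn.Linear(16, 1)

    def train_dataloader(self):
        # Dummy data to keep the sketch self-contained.
        dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
        return DataLoader(dataset, batch_size=32)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)

        # Under DDP, Lightning injects a DistributedSampler, so each rank only
        # performs roughly len(dataloader) / num_gpus optimizer steps per epoch.
        # The attribute name for the device count may differ between versions.
        num_devices = max(1, getattr(self.trainer, "num_devices", 1))
        steps_per_epoch = math.ceil(len(self.train_dataloader()) / num_devices)

        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=0.1,
            epochs=self.max_epochs,
            steps_per_epoch=steps_per_epoch,
        )
        # OneCycleLR is a per-step scheduler, so step it every optimizer step.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```

As a side note, newer Lightning versions expose self.trainer.estimated_stepping_batches, which (if available in your version) can be passed to OneCycleLR as total_steps instead of computing epochs and steps_per_epoch by hand.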