Hello, and thanks for your code. It is elegant and clear, and it has helped me a lot.
I have run into a problem: the training loss starts out very well, around 0.001, at the beginning of training.
The default end epoch is set to 10000, but after 2000+ epochs the loss explodes to a surprising value ("Training Loss : 325440.0592"). I am curious: have you encountered this issue before?
The training batch size is 96 on 4 GPUs with PyTorch DDP. Since the full training set contains only about 4000 images, the 4 GPUs need only about 10 iterations to finish an epoch. Do you think this could be the cause?
Thanks again for your code.
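As a quick sanity check of the iteration count above, here is a minimal sketch of the arithmetic. It assumes the stated batch size of 96 is the per-GPU batch (the usual DDP convention); if 96 is instead the global batch split across 4 GPUs, the numbers change accordingly.

```python
import math

# Setup described in this issue (assumption: 96 is the per-GPU batch size)
dataset_size = 4000
per_gpu_batch = 96
num_gpus = 4

# Under DDP, each of the 4 processes consumes its own batch per step,
# so one optimizer step covers per_gpu_batch * num_gpus samples.
global_batch = per_gpu_batch * num_gpus
iters_per_epoch = math.ceil(dataset_size / global_batch)

print(global_batch)      # 384 samples per optimizer step
print(iters_per_epoch)   # about 10-11 steps per epoch, matching the report
```

With only ~10 optimizer steps per epoch, 2000 epochs is still only ~20000 steps, so a very small epoch does not by itself explain the blow-up, though it can interact with a per-epoch learning-rate schedule.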