May I ask which version of PyTorch Lightning (pl) you used to develop this codebase?
I tried the newest 2.0 but ran into many errors from deprecated parameters and functions, so I downgraded to 1.5 with the compatible torch 1.8.0 and torchmetrics. Training still gets stuck at step 1770/1850 of epoch 0, which is very confusing.
I thought it might have entered the validation step, because pl printed the following warning:
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the 'batch_size' from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use 'self.log(..., batch_size=batch_size)'.
The inferred batch size is 1, and this warning is new in pl 1.5. I don't know whether it causes any error in computation.
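To make the concern concrete: as far as I understand (this is my assumption, not something I have confirmed in the pl source), the batch size passed to logging is only used to weight epoch-level averages of logged values. If so, a wrongly inferred size of 1 would skew the logged metric, not the gradients. A minimal sketch of that effect, with made-up losses and batch sizes and a hypothetical helper `weighted_epoch_mean`:

```python
# Illustrative only: the numbers and the helper name are hypothetical,
# and this does not call pytorch_lightning at all.
def weighted_epoch_mean(values, batch_sizes):
    """Average per-batch values, weighting each one by its batch size."""
    total = sum(v * n for v, n in zip(values, batch_sizes))
    return total / sum(batch_sizes)


losses = [0.5, 0.5, 2.0]     # hypothetical per-batch mean losses
true_sizes = [32, 32, 8]     # the last batch is smaller than the others
inferred_sizes = [1, 1, 1]   # what the warning says was inferred

correct = weighted_epoch_mean(losses, true_sizes)
skewed = weighted_epoch_mean(losses, inferred_sizes)
print(f"weighted by true sizes: {correct:.4f}")
print(f"weighted by size 1:     {skewed:.4f}")
```

The two averages differ only because the small last batch gets the same weight as the full ones. Passing the size explicitly, as the warning itself suggests with `self.log(..., batch_size=batch_size)`, should remove the ambiguity.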
Back to the hang: I waited more than 30 minutes, which is much longer than the ETA for training one epoch. It is still stuck, with no errors or warnings. Desperate...
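For anyone trying to reproduce this, one way to see where the main process is sitting during such a hang is Python's standard `faulthandler` module. The sketch below is not part of this codebase; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 600-second interval are my own choices. It only covers the process that calls it, so DataLoader worker processes would not be included.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600, file=None):
    """Dump the stacks of all threads to `file` every `timeout_s` seconds.

    The dump is periodic, not hang detection: during a hang, successive
    dumps simply keep showing the same frames.
    """
    out = file if file is not None else sys.stderr
    # Also dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=out, all_threads=True)
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disarm_hang_watchdog():
    """Stop the periodic dumps started by arm_hang_watchdog."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage near the top of the training script:
# arm_hang_watchdog(timeout_s=600)
```

If the repeated dumps show the main thread waiting inside the validation loop or a data loader, that would narrow down whether the hang is in validation or elsewhere.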
There are too many unknowns with pl training, so I have to ask which version works with this codebase. Thanks a lot!