Fix OverflowError when resuming from checkpoint with an iterable dataset #20624
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #20624     +/-   ##
=========================================
- Coverage      88%      79%      -9%
=========================================
  Files         267      264       -3
  Lines       23366    23315      -51
=========================================
- Hits        20475    18368    -2107
- Misses       2891     4947    +2056
@lantiga mind double-checking the fix, please?
Thank you @adamreeve, great catch. The work I did late last year overhauled progress tracking to guarantee it is fully correct upon restart, but I clearly overlooked this case. Skipping the increment as in this PR produces a validation progress state that depends on whether you restarted or not, so we should think hard about whether we can fix it in a way that is consistent. The right thing to do would be to increment by the correct amount, that is, to actually consume the validation dataloader. This would only happen in this particular case, so it may not be too bad, wdyt?
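A minimal sketch of the "consume the dataloader" option under discussion; `count_batches` is a hypothetical helper, not part of Lightning's API:

```python
def count_batches(dataloader) -> int:
    # The only way to get an exact batch count for an iterable DataLoader
    # without __len__ is to iterate it, which re-reads the whole stream
    # once and can be expensive for large streaming datasets.
    return sum(1 for _ in dataloader)
```

This is what makes the option costly: the count is exact, but it is paid for with a full pass over the validation data.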
This does seem a little wasteful, as consuming the data loader could be an expensive operation, but maybe it's necessary. I don't think I've fully understood the problem in #14579 or the fix in #20379, but it seems like the original issue was related to the training loop counters, and it's not clear to me that the evaluation loop counters being off could be a problem, as they'll get reset once evaluation starts. But I don't have a great understanding of how the counters are used.
Commenting because I've recently been hitting the issue that this PR fixes. My current hacky workaround is just to set … I definitely don't think lightning should consume the validation dataloader in this case, as that is potentially very expensive. In my use case, our dataloader streams a lot of data from disk and we definitely don't want to repeat that operation. I also don't have a good understanding of what the counter is doing here, but lightning currently crashes in this case, so unless it leads to unexpected behaviour downstream, this change seems like an improvement?
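For context, a minimal sketch of the kind of setup that triggers the crash; `StreamingDataset` is hypothetical, but any `IterableDataset` without `__len__` behaves the same way:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamingDataset(IterableDataset):
    # Deliberately no __len__: streaming sources often can't know their
    # size up front, so len(DataLoader(...)) raises TypeError and the
    # number of validation batches is treated as unbounded.
    def __iter__(self):
        for i in range(8):
            yield torch.tensor([float(i)])

val_loader = DataLoader(StreamingDataset(), batch_size=None)
# Resuming a Trainer from a checkpoint with this validation loader is
# the scenario that previously raised OverflowError.
```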
Thank you @adamreeve and @dannyfriar for the insightful comments. After some consideration, I'm in favor of merging this as is; it's the best option. Thanks again!
This will be pushed out with the patch release in a few hours. |
What does this PR do?
Fixes #20565
When resuming from a checkpoint, the evaluation loop tries to increment the batch progress by `max_batch`, but this is `inf` if the validation DataLoader is iterable and doesn't have a `len`. I'm not 100% sure it's OK to just skip the batch progress increment here, so I would appreciate some feedback on whether there's a better approach.
PR review
Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines.
📚 Documentation preview 📚: https://pytorch-lightning--20624.org.readthedocs.build/en/20624/