
Conversation

@adamreeve (Contributor) commented Mar 7, 2025

What does this PR do?

Fixes #20565

When resuming from a checkpoint, the evaluation loop tries to increment the batch progress by `max_batch`, but this is `inf` if the validation DataLoader is iterable and doesn't define `__len__`, which leads to an `OverflowError`. I'm not 100% sure it's OK to just skip the batch progress increment here, so I would appreciate feedback on whether there's a better approach.
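
A minimal sketch of the guard this change introduces (the helper name and `increment_by` call are assumptions for illustration, not the exact merged diff):

```python
import math

def _fast_forward_eval_progress(batch_progress, max_batch: float) -> None:
    # On restore, fast-forward the evaluation batch progress to the end of
    # the previous validation run. For an iterable DataLoader without
    # __len__, max_batch is float("inf") and int(max_batch) raises
    # OverflowError, so the increment is skipped in that case.
    if math.isinf(max_batch):
        return
    batch_progress.increment_by(int(max_batch))
```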

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20624.org.readthedocs.build/en/20624/

github-actions bot added the pl (Generic label for PyTorch Lightning package) label Mar 7, 2025
codecov bot commented Mar 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (1f5add3) to head (87aca9c).
Report is 126 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (1f5add3) and HEAD (87aca9c).

HEAD has 2689 fewer uploads than BASE.

| Flag | BASE (1f5add3) | HEAD (87aca9c) |
|---|---|---|
| gpu | 2 | 0 |
| pytest | 314 | 0 |
| lightning | 468 | 18 |
| cpu | 623 | 24 |
| python3.9 | 156 | 6 |
| lightning_fabric | 79 | 0 |
| python3.10 | 78 | 3 |
| python3.11 | 155 | 6 |
| python3.12.7 | 234 | 9 |
| pytorch2.1 | 117 | 9 |
| pytest-full | 311 | 24 |
| pytorch_lightning | 78 | 6 |
| pytorch2.3 | 38 | 3 |
| pytorch2.2.2 | 39 | 3 |
| pytorch2.4.1 | 39 | 3 |
| pytorch2.5.1 | 78 | 6 |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20624     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      264      -3     
  Lines       23366    23315     -51     
=========================================
- Hits        20475    18368   -2107     
- Misses       2891     4947   +2056     

Borda previously approved these changes Mar 8, 2025
@Borda (Collaborator) commented Mar 10, 2025

@lantiga mind double-checking the fix, pls?

@lantiga (Collaborator) commented Mar 11, 2025

Thank you @adamreeve, great catch.

The work I did late last year overhauled progress tracking to guarantee its correctness upon restart, but I clearly overlooked this case. Skipping the increment as in this PR produces a validation progress state that depends on whether you restarted or not, so we should think hard about whether we can fix it in a way that is consistent.

The right thing to do would be to increment by the correct amount, that is, to actually consume the validation dataloader. This would only happen in this particular case, so it may not be too bad, wdyt?
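
A sketch of the alternative being suggested here, counting batches by consuming the iterable once (the helper is illustrative, not part of the PR):

```python
def _count_validation_batches(dataloader) -> int:
    # When the DataLoader defines no __len__, the only way to learn the
    # true batch count is to iterate it to exhaustion. This keeps the
    # restored progress exact, but can be very expensive for streaming
    # dataloaders (see the discussion below).
    return sum(1 for _ in dataloader)
```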

Borda dismissed their stale review March 11, 2025 08:31

need some more validation

Borda self-requested a review March 11, 2025 08:31
@mergify mergify bot removed the has conflicts label Mar 11, 2025
@adamreeve (Contributor, Author)

> The right thing to do would be to increment by the correct amount, that is, to actually consume the validation dataloader. This would only happen in this particular case, so it may not be too bad, wdyt?

This does seem a little wasteful, as consuming the dataloader could be an expensive operation, but maybe it's necessary. I don't think I've fully understood the problem in #14579 and the fix in #20379, but it seems the original issue was related to the training loop counters, and it's not clear to me that the evaluation loop counters being off would be a problem, as they get reset once evaluation starts. I don't have a great understanding of how the counters are used, though.

@dannyfriar commented Mar 12, 2025

Commenting because I've recently been hitting the issue that this PR fixes. My current hacky workaround is just to set `limit_test_batches` to a large number to avoid the `float("inf")` setting (see the sketch below).
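
A minimal sketch of that workaround (the limit value is illustrative; anything larger than the stream will ever produce works):

```python
from lightning.pytorch import Trainer

# Workaround sketch: cap the number of test batches at a large finite
# integer so max_batches is never float("inf") when restoring from a
# checkpoint. limit_val_batches can be capped the same way.
trainer = Trainer(limit_test_batches=10_000_000)
```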

I definitely don't think Lightning should consume the validation dataloader in this case, as that is potentially very expensive. In our use case, the dataloader streams a lot of data from disk, and we definitely don't want to repeat that operation.

I also don't have a good understanding of what the counter is doing here, but Lightning currently crashes in this case, so unless it leads to unexpected behaviour downstream, this change seems like an improvement?

Borda added the priority: 0 (High priority task) label Mar 14, 2025
@lantiga (Collaborator) commented Mar 14, 2025

Thank you @adamreeve and @dannyfriar for the insightful comments.

After some consideration I'm in favor of merging this as is; it's the best option. Thanks again!

lantiga merged commit 1278308 into Lightning-AI:master Mar 14, 2025
79 checks passed
mergify bot added the ready (PRs ready to be merged) label Mar 14, 2025
@lantiga (Collaborator) commented Mar 14, 2025

This will be pushed out with the patch release in a few hours.

adamreeve deleted the inf_dataloader_restore branch March 17, 2025 20:49
Borda pushed a commit that referenced this pull request Mar 18, 2025
…set (#20624)

* Add test to reproduce OverflowError exception

* Don't increment batch progress in eval loop with inf max batch

* Update changelog

(cherry picked from commit 1278308)
lexierule pushed a commit that referenced this pull request Mar 18, 2025
…set (#20624)

* Add test to reproduce OverflowError exception

* Don't increment batch progress in eval loop with inf max batch

* Update changelog

(cherry picked from commit 1278308)

Labels

  • pl (Generic label for PyTorch Lightning package)
  • priority: 0 (High priority task)
  • ready (PRs ready to be merged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Max batches float(inf) handled incorrectly

4 participants