
Conversation

@adamreeve (Contributor) commented Mar 7, 2025

What does this PR do?

Fixes #20565

When resuming from a checkpoint, the evaluation loop tries to increment the batch progress by `max_batch`, but this is `inf` if the validation DataLoader is iterable and doesn't define `__len__`, which leads to an `OverflowError`. I'm not 100% sure it's OK to just skip the batch progress increment here, so I would appreciate feedback on whether there's a better approach.
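
A minimal sketch of the guard this change introduces (the helper name and `increment_by` call are assumptions for illustration, not the exact merged diff):

```python
import math

def _fast_forward_eval_progress(batch_progress, max_batch: float) -> None:
    # On restore, fast-forward the evaluation batch progress to the end of
    # the previous validation run. For an iterable DataLoader without
    # __len__, max_batch is float("inf") and int(max_batch) raises
    # OverflowError, so the increment is skipped in that case.
    if math.isinf(max_batch):
        return
    batch_progress.increment_by(int(max_batch))
```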

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20624.org.readthedocs.build/en/20624/

github-actions bot added the pl (Generic label for PyTorch Lightning package) label Mar 7, 2025
codecov bot commented Mar 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (1f5add3) to head (87aca9c).
Report is 126 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (1f5add3) and HEAD (87aca9c).

HEAD has 2689 fewer uploads than BASE.

| Flag | BASE (1f5add3) | HEAD (87aca9c) |
|---|---|---|
| gpu | 2 | 0 |
| pytest | 314 | 0 |
| lightning | 468 | 18 |
| cpu | 623 | 24 |
| python3.9 | 156 | 6 |
| lightning_fabric | 79 | 0 |
| python3.10 | 78 | 3 |
| python3.11 | 155 | 6 |
| python3.12.7 | 234 | 9 |
| pytorch2.1 | 117 | 9 |
| pytest-full | 311 | 24 |
| pytorch_lightning | 78 | 6 |
| pytorch2.3 | 38 | 3 |
| pytorch2.2.2 | 39 | 3 |
| pytorch2.4.1 | 39 | 3 |
| pytorch2.5.1 | 78 | 6 |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20624     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      264      -3     
  Lines       23366    23315     -51     
=========================================
- Hits        20475    18368   -2107     
- Misses       2891     4947   +2056     

Borda previously approved these changes Mar 8, 2025
@Borda (Collaborator) commented Mar 10, 2025

@lantiga mind double-checking the fix, pls?

@lantiga (Collaborator) commented Mar 11, 2025

Thank you @adamreeve, great catch.

The work I did late last year overhauled progress tracking to guarantee its correctness upon restart, but I clearly overlooked this case. Skipping the increment as in this PR produces a validation progress state that depends on whether you restarted or not, so we should think hard about whether we can fix it in a way that is consistent.

The right thing to do would be to increment by the correct amount, that is, to actually consume the validation dataloader. This would only happen in this particular case, so it may not be too bad, wdyt?
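
A sketch of the alternative being suggested here, counting batches by consuming the iterable once (the helper is illustrative, not part of the PR):

```python
def _count_validation_batches(dataloader) -> int:
    # When the DataLoader defines no __len__, the only way to learn the
    # true batch count is to iterate it to exhaustion. This keeps the
    # restored progress exact, but can be very expensive for streaming
    # dataloaders (see the discussion below).
    return sum(1 for _ in dataloader)
```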

Borda dismissed their stale review March 11, 2025 08:31

need some more validation

Borda self-requested a review March 11, 2025 08:31
@mergify mergify bot removed the has conflicts label Mar 11, 2025
@adamreeve (Contributor, Author)

> The right thing to do would be to increment by the correct amount, that is, to actually consume the validation dataloader. This would only happen in this particular case, so it may not be too bad, wdyt?

This does seem a little wasteful, as consuming the dataloader could be an expensive operation, but maybe it's necessary. I don't think I've fully understood the problem in #14579 and the fix in #20379, but it seems the original issue was related to the training loop counters, and it's not clear to me that the evaluation loop counters being off would be a problem, as they get reset once evaluation starts. I don't have a great understanding of how the counters are used, though.

@dannyfriar commented Mar 12, 2025

Commenting because I've recently been hitting the issue that this PR fixes. My current hacky workaround is just to set `limit_test_batches` to a large number to avoid the `float("inf")` setting (see the sketch below).
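
A minimal sketch of that workaround (the limit value is illustrative; anything larger than the stream will ever produce works):

```python
from lightning.pytorch import Trainer

# Workaround sketch: cap the number of test batches at a large finite
# integer so max_batches is never float("inf") when restoring from a
# checkpoint. limit_val_batches can be capped the same way.
trainer = Trainer(limit_test_batches=10_000_000)
```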

I definitely don't think Lightning should consume the validation dataloader in this case, as that is potentially very expensive. In our use case, the dataloader streams a lot of data from disk, and we definitely don't want to repeat that operation.

I also don't have a good understanding of what the counter is doing here, but Lightning currently crashes in this case, so unless it leads to unexpected behaviour downstream, this change seems like an improvement?

Borda added the priority: 0 (High priority task) label Mar 14, 2025
@lantiga (Collaborator) commented Mar 14, 2025

Thank you @adamreeve and @dannyfriar for the insightful comments.

After some consideration I'm in favor of merging this as is; it's the best option. Thanks again!

lantiga merged commit 1278308 into Lightning-AI:master Mar 14, 2025
79 checks passed
mergify bot added the ready (PRs ready to be merged) label Mar 14, 2025
@lantiga (Collaborator) commented Mar 14, 2025

This will be pushed out with the patch release in a few hours.

adamreeve deleted the inf_dataloader_restore branch March 17, 2025 20:49
Borda pushed a commit that referenced this pull request Mar 18, 2025
…set (#20624)

* Add test to reproduce OverflowError exception

* Don't increment batch progress in eval loop with inf max batch

* Update changelog

(cherry picked from commit 1278308)
lexierule pushed a commit that referenced this pull request Mar 18, 2025
…set (#20624)

* Add test to reproduce OverflowError exception

* Don't increment batch progress in eval loop with inf max batch

* Update changelog

(cherry picked from commit 1278308)

Labels

  • pl (Generic label for PyTorch Lightning package)
  • priority: 0 (High priority task)
  • ready (PRs ready to be merged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Max batches float(inf) handled incorrectly

4 participants