Fix ParallelStreamingDataset with resume=True not resuming after loading a state dict when breaking early #771
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | main | #771 | +/- |
|---|---|---|---|
| Coverage | 80% | 80% | |
| Files | 52 | 52 | |
| Lines | 7371 | 7373 | +2 |
| Hits | 5912 | 5915 | +3 |
| Misses | 1459 | 1458 | -1 |
Pull request overview
This PR fixes a bug where ParallelStreamingDataset with resume=True did not resume correctly when a state dict was loaded manually (for example, after a training crash) and the previous epoch had been cut short by breaking out of the loop early. The fix restructures the control flow in StreamingDataLoader.__iter__() so that the ParallelStreamingDataset cycling logic runs even when self.restore=True, which enables proper resume behavior.
Key changes:
- Restructured the __iter__ method in StreamingDataLoader to handle ParallelStreamingDataset cycling logic before checking the restore flag
- Enhanced test coverage with new scenarios that simulate training crashes and manual state dict loading
- Updated test parameters and expected values to better validate cycling and resume behavior
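
The control-flow change can be illustrated with a toy model. The sketch below is not litdata's code: ToyLoader, its fixed switch, position, and dataset_offset are hypothetical names invented for illustration, and the "cycling step" is reduced to copying the loader position into the dataset's start offset. It only aims to show why guarding that step behind the restore flag can lose the resume point after a manual load_state_dict(), and why clearing restore unconditionally is harmless once the step always runs.

```python
from itertools import islice


class ToyLoader:
    """Toy stand-in for a resumable, cycling loader. NOT litdata's implementation."""

    def __init__(self, data, fixed):
        self.data = list(data)
        self.fixed = fixed         # hypothetical switch: old vs. new control flow
        self.restore = False       # set by load_state_dict()
        self.position = 0          # loader-level count of samples consumed
        self.dataset_offset = 0    # where the toy dataset starts reading

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]
        self.restore = True

    def __iter__(self):
        if self.fixed:
            # New flow: the cycling step runs whether or not state was restored.
            self.dataset_offset = self.position
            # Cleared unconditionally; the restored state has now been applied.
            self.restore = False
        elif not self.restore:
            # Old flow, normal epoch transition: the cycling step runs.
            self.dataset_offset = self.position
        else:
            # Old flow, right after load_state_dict(): the cycling step is skipped.
            self.restore = False

        index = self.dataset_offset
        while True:  # cycle over the data indefinitely
            item = self.data[index % len(self.data)]
            index += 1
            self.position = index
            yield item


def crash_and_resume(fixed):
    """Break early, checkpoint, then resume in a fresh loader."""
    data = range(10)
    first = ToyLoader(data, fixed)
    seen = list(islice(iter(first), 3))     # break early after 3 samples
    checkpoint = first.state_dict()

    second = ToyLoader(data, fixed)         # fresh loader after the simulated crash
    second.load_state_dict(checkpoint)      # manual restore
    return seen, list(islice(iter(second), 3))


print(crash_and_resume(fixed=False))  # ([0, 1, 2], [0, 1, 2])
print(crash_and_resume(fixed=True))   # ([0, 1, 2], [3, 4, 5])
```

Under these assumptions, the old flow restarts from the first sample after the simulated crash, while the new flow continues from the fourth.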
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/litdata/streaming/dataloader.py | Fixed the __iter__ method to properly handle ParallelStreamingDataset cycling when resuming from a manually loaded state dict by restructuring conditional logic |
| tests/streaming/test_parallel.py | Added comprehensive test scenarios for crash simulation and manual resume, updated helper function to support tmpdir reuse, and adjusted test parameters and expected values for better coverage |
- Clarify that the fix handles both automatic epoch transitions and manual checkpoint resume scenarios
- Explain why self.restore is cleared unconditionally
- Use accurate terminology (persist vs restore state)
What does this PR do?
Fixes #770.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃