Fix ParallelStreamingDataset with resume=True not resuming after loading a state dict when breaking early #771

Merged
bhimrazy merged 6 commits into Lightning-AI:main from philgzl:fix-parallel-streaming-dset-resume-with-load-state on Jan 8, 2026

@philgzl philgzl commented Dec 9, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #770.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80%. Comparing base (b567e87) to head (73444bb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #771   +/-   ##
===================================
  Coverage    80%    80%           
===================================
  Files        52     52           
  Lines      7371   7373    +2     
===================================
+ Hits       5912   5915    +3     
+ Misses     1459   1458    -1     

@philgzl philgzl changed the title from "Fix resuming after loading state" to "Fix ParallelStreamingDataset with resume=True not resuming after loading a state dict when breaking early" Dec 9, 2025
@bhimrazy bhimrazy requested a review from Copilot December 16, 2025 06:27

Copilot AI left a comment

Pull request overview

This PR fixes a bug where ParallelStreamingDataset with resume=True would not properly resume after manually loading a state dict following a training crash when iterations were incomplete (breaking early from an epoch). The fix restructures the control flow in StreamingDataLoader.__iter__() to ensure that the ParallelStreamingDataset cycling logic executes even when self.restore=True, enabling proper resume behavior.

Key changes:

  • Restructured the __iter__ method in StreamingDataLoader to handle ParallelStreamingDataset cycling logic before checking the restore flag
  • Enhanced test coverage with new scenarios that simulate training crashes and manual state dict loading
  • Updated test parameters and expected values to better validate cycling and resume behavior
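
The control-flow change described in this overview can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in src/litdata/streaming/dataloader.py: `ToyResumableLoader`, `_apply_cycling`, `num_yielded`, `start`, `fixed`, and `first_item_after_crash` are hypothetical names made up for illustration. Only the `restore` flag and the idea of running the cycling logic before checking it come from the PR discussion.

```python
class ToyResumableLoader:
    """Toy stand-in for a resumable loader; NOT the litdata implementation."""

    def __init__(self, data, fixed):
        self.data = list(data)
        self.fixed = fixed      # True: control flow after the fix; False: before
        self.restore = False    # set when a state dict is loaded manually
        self.num_yielded = 0    # samples yielded so far (saved in the state dict)
        self.start = 0          # index at which the next iteration begins

    def state_dict(self):
        return {"num_yielded": self.num_yielded}

    def load_state_dict(self, state):
        self.num_yielded = state["num_yielded"]
        self.restore = True

    def _apply_cycling(self):
        # Hypothetical cycling logic: continue where the last iteration stopped.
        self.start = self.num_yielded % len(self.data)

    def __iter__(self):
        if self.fixed:
            # After the fix: cycling logic runs before the restore flag is checked.
            self._apply_cycling()
        elif not self.restore:
            # Before the fix: cycling logic was skipped whenever restore was set.
            self._apply_cycling()
        self.restore = False  # cleared unconditionally in this sketch
        for i in range(len(self.data)):
            self.num_yielded += 1
            yield self.data[(self.start + i) % len(self.data)]


def first_item_after_crash(fixed):
    loader = ToyResumableLoader(range(6), fixed)
    for step, _ in enumerate(loader):
        if step == 1:  # break early after two samples
            break
    state = loader.state_dict()

    # Simulate a crash: build a fresh loader and load the saved state manually.
    resumed = ToyResumableLoader(range(6), fixed)
    resumed.load_state_dict(state)
    return next(iter(resumed))
```

In this toy model, `first_item_after_crash(True)` returns 2 (iteration continues after the two samples already seen), while `first_item_after_crash(False)` returns 0 (iteration restarts from the beginning because the cycling step is skipped when `restore` is set).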

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/litdata/streaming/dataloader.py | Fixed the __iter__ method to properly handle ParallelStreamingDataset cycling when resuming from a manually loaded state dict by restructuring conditional logic |
| tests/streaming/test_parallel.py | Added comprehensive test scenarios for crash simulation and manual resume, updated helper function to support tmpdir reuse, and adjusted test parameters and expected values for better coverage |

- Clarify that the fix handles both automatic epoch transitions and manual
  checkpoint resume scenarios
- Explain why self.restore is cleared unconditionally
- Use accurate terminology (persist vs restore state)

@bhimrazy bhimrazy left a comment

Thanks @philgzl, LGTM! 🙌

Probably at some point, might need to look into the test_parallel.py file in detail to simplify things a bit.

@bhimrazy bhimrazy enabled auto-merge (squash) January 8, 2026 10:17
@bhimrazy bhimrazy merged commit 3e79da4 into Lightning-AI:main Jan 8, 2026
49 of 51 checks passed
Development

Successfully merging this pull request may close these issues.

ParallelStreamingDataset with resume=True does not resume after manually loading state dict when breaking early