Cont. Failed to Resume Training w/ CombinedStreamingDataset #694

@karinazad

🐛 Bug

I'm encountering the same issue as #363 on the latest litdata version (0.2.52). Resuming training from a checkpoint fails with:

ValueError: The provided `num_samples_yielded` state is greater than the dataset length. Found `10566743` instead of `9017936`.
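To make the numbers in the message concrete, here is a minimal sketch of the kind of check that would produce it. This is not litdata's code: `check_resume_state` and `overshoot` are hypothetical helpers written only for illustration, and only the two counts are taken from the error above.

```python
def check_resume_state(num_samples_yielded: int, dataset_length: int) -> None:
    """Hypothetical stand-in for the validation that raises the error above."""
    if num_samples_yielded > dataset_length:
        raise ValueError(
            "The provided `num_samples_yielded` state is greater than the dataset length. "
            f"Found `{num_samples_yielded}` instead of `{dataset_length}`."
        )


def overshoot(num_samples_yielded: int, dataset_length: int) -> int:
    """How many samples the saved counter exceeds the dataset length by (0 if none)."""
    return max(0, num_samples_yielded - dataset_length)


if __name__ == "__main__":
    # Values copied from the error message in this issue.
    saved, length = 10566743, 9017936
    print("overshoot:", overshoot(saved, length))
    try:
        check_resume_state(saved, length)
    except ValueError as err:
        print(err)
```

With these values, the saved counter is larger than the dataset length the loader computes on resume, so any check of this shape rejects the checkpoint state.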

Expected behavior

Training should resume from a checkpoint without raising this error.

Additional context

Environment detail
  • PyTorch Version (e.g., 1.0): 2.8
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): uv
  • Python version: 12.9
  • GPU models and configuration: 40 nodes, 8 B200s per node (observed also on H100s)
  • CUDA/cuDNN version: 13

Labels: bug, help wanted