Max batches float(inf) handled incorrectly #20565

@dannyfriar

Bug description

When using a dataloader that does not implement `__len__`, Lightning sets `max_batches` to `float("inf")`, which later breaks when the evaluation loop tries to convert that value to an integer.
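
To illustrate how an infinite sentinel can end up in `max_batches`, here is a minimal pure-Python sketch. `StreamingBatches` and `num_batches_or_inf` are hypothetical names made up for this example. They are not Lightning's actual code and only mimic the behaviour described above.

```python
class StreamingBatches:
    """Iterable with no __len__, standing in for a length-less dataloader."""

    def __iter__(self):
        yield from range(3)


def num_batches_or_inf(loader):
    # Hypothetical helper (an assumption, not Lightning's implementation):
    # fall back to infinity when the loader cannot report its length.
    try:
        return len(loader)
    except TypeError:
        return float("inf")


max_batches = [num_batches_or_inf(StreamingBatches()), num_batches_or_inf([1, 2])]
print(max_batches)  # [inf, 2]
```

Any later code that assumes every entry of `max_batches` is a finite number will then misbehave.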

What version are you seeing the problem on?

v2.5

How to reproduce the bug

I am struggling to provide a simple repro, but it happens when loading a checkpoint, i.e. any time `self.resetting` is `True` in the eval loop.

Error messages and logs

        trainer.fit(
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
        call._call_and_handle_interrupt(
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
        self._run(model, ckpt_path=ckpt_path)
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
        results = self._run_stage()
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
        self.fit_loop.run()
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
        self.advance()
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
        self.epoch_loop.run(self._data_fetcher)
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
        self.advance(data_fetcher)
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 270, in advance
        self.val_loop.increment_progress_to_evaluation_end()
      File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 271, in increment_progress_to_evaluation_end
        max_batch = int(max(self.max_batches))
    OverflowError: cannot convert float infinity to integer
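
The failing expression at the bottom of the traceback can be reproduced without Lightning. The sketch below assumes `max_batches` is a list containing `float("inf")`, as described in the bug description. `last_batch_index` is a hypothetical guard written for this example, not a proposed or actual Lightning fix.

```python
import math

max_batches = [float("inf")]  # assumed value for a loader without __len__

# Same expression as in the traceback, applied to the assumed value.
try:
    int(max(max_batches))
except OverflowError as err:
    message = str(err)
    print(message)  # cannot convert float infinity to integer


def last_batch_index(max_batches):
    """Hypothetical guard: return None when the batch count is unbounded."""
    largest = max(max_batches)
    if math.isinf(largest):
        return None
    return int(largest)
```

Whether returning `None`, skipping the progress fast-forward, or some other handling is appropriate in the eval loop is a design decision for the maintainers.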

Environment

Current environment
- PyTorch Lightning version: 2.5.0
- PyTorch version: 2.5
- Python version: 3.10
- OS: Ubuntu
- CUDA/cuDNN version: CUDA12, cuDNN9
- GPU models and configuration: A100
- How Lightning was installed (`conda`, `pip`, source): pip

More info

No response

Labels: bug, needs triage, ver: 2.5.x
