Closed
Labels: bug, needs triage, ver: 2.5.x
Description
Bug description
When using a dataloader that doesn't implement `__len__`, Lightning sets `max_batches` to `float("inf")` here, which then breaks things further on.
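The following stdlib-only sketch shows how the fallback plays out. The helper `num_batches_or_inf` and the `SizedStream` class are hypothetical names used only for illustration. The helper mimics the behaviour described above: it uses `len()` when it is available and otherwise treats the count as `float("inf")`. It is not Lightning's actual code. The sketch also shows one possible user-side workaround, assuming the data source can report its size: define `__len__` on it. As far as I know, PyTorch's `DataLoader` over an `IterableDataset` derives its `len()` from the dataset's own `__len__` when one is defined, but treat that as an assumption to verify against your version.

```python
from typing import Iterable, Iterator, Union


def num_batches_or_inf(loader: Iterable) -> Union[int, float]:
    """Hypothetical helper: len(loader), or float("inf") if it has no __len__."""
    try:
        return len(loader)  # type: ignore[arg-type]
    except TypeError:
        # len() raises TypeError for objects without __len__ (e.g. generators).
        return float("inf")


def stream() -> Iterator[int]:
    # A generator has no __len__, like an iterable-style data source.
    yield from range(3)


class SizedStream:
    """Hypothetical iterable source that also reports its length."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.n))

    def __len__(self) -> int:
        return self.n


print(num_batches_or_inf([1, 2, 3]))       # 3
print(num_batches_or_inf(stream()))        # inf
print(num_batches_or_inf(SizedStream(5)))  # 5
```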
What version are you seeing the problem on?
v2.5
How to reproduce the bug
I'm struggling to provide a simple repro, but it happens when loading a checkpoint, i.e., any time `self.resetting` is `True` in the eval loop.
Error messages and logs
    trainer.fit(
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
    call._call_and_handle_interrupt(
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
    results = self._run_stage()
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
    self.fit_loop.run()
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
    self.advance()
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
    self.advance(data_fetcher)
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 270, in advance
    self.val_loop.increment_progress_to_evaluation_end()
  File "/venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 271, in increment_progress_to_evaluation_end
    max_batch = int(max(self.max_batches))
OverflowError: cannot convert float infinity to integer
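The failing expression from the last frame can be reproduced on its own, without Lightning: `int()` refuses an infinite float. Below is a sketch of one possible guard. `max_finite_batches` is a hypothetical helper, not part of Lightning's API. The actual upstream fix may handle the unknown-length case differently, for example by skipping the progress update entirely.

```python
import math
from typing import List, Optional, Union


def max_finite_batches(max_batches: List[Union[int, float]]) -> Optional[int]:
    """Hypothetical guard: largest finite batch count, or None if all are infinite."""
    # Drop entries that came from loaders without __len__ (float("inf")).
    finite = [b for b in max_batches if not math.isinf(b)]
    return int(max(finite)) if finite else None


# Same expression as evaluation_loop.py line 271, reproduced in isolation.
try:
    int(max([float("inf")]))
except OverflowError as err:
    print(err)  # cannot convert float infinity to integer

print(max_finite_batches([float("inf")]))      # None
print(max_finite_batches([10, float("inf")]))  # 10
```

Returning `None` leaves the caller to decide what an unknown evaluation length means when resuming, rather than silently picking an arbitrary finite number.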
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.0
#- PyTorch Version (e.g., 2.5): 2.5
#- Python version (e.g., 3.12): 3.10
#- OS (e.g., Linux): Ubuntu
#- CUDA/cuDNN version: CUDA12, cuDNN9
#- GPU models and configuration: A100
#- How you installed Lightning(`conda`, `pip`, source): pip
More info
No response