Description
Bug description
During training, I noticed that no checkpoints are being saved. Upon investigation, I found that although the progress bar shows the correct step and epoch values after each step, the model's internal global_step and current_epoch remain unchanged.
I am using the model.fit() method and suspect the issue may be related to using multiple optimizers and manual optimization operations. There are no errors in the logs, and I haven't found any documentation addressing this behavior.
Has anyone encountered this or know what might be causing it? Any help is appreciated.
I discovered that global_step wasn't updating, which causes _should_skip_saving_checkpoint to always return True:
def _should_skip_saving_checkpoint(self, trainer: "pl.Trainer") -> bool:
    from lightning.pytorch.trainer.states import TrainerFn

    return (
        bool(trainer.fast_dev_run)  # disable checkpointing with fast_dev_run
        or trainer.state.fn != TrainerFn.FITTING  # don't save anything during non-fit
        or trainer.sanity_checking  # don't save anything during sanity check
        or self._last_global_step_saved == trainer.global_step  # already saved at the last step
    )
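To make the failure mode concrete, here is a minimal pure-Python sketch of the same check. `FakeTrainer` and `should_skip` are hypothetical stand-ins, not Lightning APIs, and the starting value of 0 for the last saved step is an assumption about how the callback is initialized. It shows that a step counter that never moves makes the last condition true on every call, while a counter that advances does not.

```python
from dataclasses import dataclass


@dataclass
class FakeTrainer:
    """Hypothetical stand-in for pl.Trainer; holds only what the check reads."""
    global_step: int = 0
    fast_dev_run: bool = False
    fitting: bool = True
    sanity_checking: bool = False


def should_skip(last_global_step_saved: int, trainer: FakeTrainer) -> bool:
    # Mirrors the four conditions of _should_skip_saving_checkpoint quoted above.
    return (
        bool(trainer.fast_dev_run)
        or not trainer.fitting
        or trainer.sanity_checking
        or last_global_step_saved == trainer.global_step
    )


stuck = FakeTrainer()      # global_step never changes
advancing = FakeTrainer()  # global_step grows by one per batch

skips_stuck, skips_advancing = [], []
last_saved = 0  # assumed initial value of the "last saved" marker
for _ in range(3):
    # Stuck counter: the marker and the step are both 0, so saving is skipped.
    skips_stuck.append(should_skip(0, stuck))

    # Advancing counter: the step differs from the marker, so a save happens.
    advancing.global_step += 1
    skips_advancing.append(should_skip(last_saved, advancing))
    last_saved = advancing.global_step  # record the step that was just saved

print(skips_stuck, skips_advancing)
```

In the stuck case every call reports "skip", which matches the symptom of no checkpoint ever being written.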
After examining the code, I believe this is an optimizer-related issue:
@property
def global_step(self) -> int:
    lightning_module = self.trainer.lightning_module
    if lightning_module is None or lightning_module.automatic_optimization:
        return self.automatic_optimization.optim_progress.optimizer_steps
    return self.manual_optimization.optim_step_progress.total.completed
A temporary solution is to save the checkpoint manually:
self.trainer.save_checkpoint("./model.ckpt")
I'm still uncertain why the model's global_step isn't updating, especially since the optimization has been rewritten and is quite complex.
What version are you seeing the problem on?
master, v2.5
Reproduced in studio
No response
How to reproduce the bug
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.2
#- PyTorch Version (e.g., 2.5): 2.19
#- Python version (e.g., 3.12): 3.12.9
#- OS (e.g., Linux): Ubuntu 22.04
#- CUDA/cuDNN version: 12.9
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip
More info
No response
cc @lantiga