
Checkpoints not saving: global_step and current_epoch not updating despite correct progress bar #21096

@homjay

Bug description

During training I noticed that no checkpoints were being saved. On investigation, I found that although the progress bar shows the correct step and epoch values after each step, the trainer's internal global_step and current_epoch never change.

I am training with Trainer.fit() and suspect the issue is related to using multiple optimizers with manual optimization. There are no errors in the logs, and I haven't found any documentation that addresses this behavior.

Has anyone encountered this, or does anyone know what might be causing it? Any help is appreciated.
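
For context, my setup looks roughly like the sketch below (module, layer, and optimizer names are made up for illustration, not taken from my actual code): a LightningModule with automatic_optimization disabled and two optimizers stepped manually in training_step.

import torch
import lightning.pytorch as pl


class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # manual optimization
        self.encoder = torch.nn.Linear(32, 16)
        self.decoder = torch.nn.Linear(16, 32)

    def training_step(self, batch, batch_idx):
        opt_enc, opt_dec = self.optimizers()
        recon = self.decoder(self.encoder(batch))
        loss = torch.nn.functional.mse_loss(recon, batch)

        opt_enc.zero_grad()
        opt_dec.zero_grad()
        self.manual_backward(loss)
        opt_enc.step()
        opt_dec.step()
        self.log("train_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.encoder.parameters(), lr=1e-3),
            torch.optim.Adam(self.decoder.parameters(), lr=1e-3),
        )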


I discovered that global_step wasn't updating, which causes ModelCheckpoint._should_skip_saving_checkpoint to always return True:

def _should_skip_saving_checkpoint(self, trainer: "pl.Trainer") -> bool:
    from lightning.pytorch.trainer.states import TrainerFn

    return (
        bool(trainer.fast_dev_run)  # disable checkpointing with fast_dev_run
        or trainer.state.fn != TrainerFn.FITTING  # don't save anything during non-fit
        or trainer.sanity_checking  # don't save anything during sanity check
        or self._last_global_step_saved == trainer.global_step  # already saved at the last step
    )
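
One way to confirm the symptom is a small debugging callback I wrote for this (the class name is my own); it prints trainer.global_step next to the batch index from on_train_batch_end:

import lightning.pytorch as pl


class StepProbe(pl.Callback):
    # Prints the counters that ModelCheckpoint compares against.
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # If global_step stays at 0 while batch_idx keeps advancing,
        # _should_skip_saving_checkpoint will keep returning True.
        print(
            f"batch_idx={batch_idx} "
            f"global_step={trainer.global_step} "
            f"epoch={trainer.current_epoch}"
        )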

After examining the code, I believe this is related to how optimizer steps are tracked:

@property
def global_step(self) -> int:
    lightning_module = self.trainer.lightning_module
    if lightning_module is None or lightning_module.automatic_optimization:
        return self.automatic_optimization.optim_progress.optimizer_steps
    return self.manual_optimization.optim_step_progress.total.completed
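
If I read this correctly, with manual optimization the counter optim_step_progress.total.completed only advances when step() is called on the LightningOptimizer wrappers returned by self.optimizers(); stepping the raw torch optimizer directly would bypass that tracking. A sketch of the difference (compute_loss is a hypothetical helper, and the bypass explanation is just my current guess):

def training_step(self, batch, batch_idx):
    opt = self.optimizers()  # LightningOptimizer wrapper
    loss = self.compute_loss(batch)  # hypothetical helper

    opt.zero_grad()
    self.manual_backward(loss)
    opt.step()  # goes through the wrapper, so the step is counted

    # By contrast, stepping the underlying torch optimizer directly, e.g.
    #     opt.optimizer.step()
    # would skip the wrapper and (my working theory) never increment
    # optim_step_progress, leaving trainer.global_step stuck at 0.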

A temporary workaround is to save the checkpoint manually:

self.trainer.save_checkpoint("./model.ckpt")
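
For example, I currently call it from on_train_epoch_end so at least one checkpoint per epoch lands on disk (the path is just a placeholder):

def on_train_epoch_end(self):
    # Manual workaround: bypass ModelCheckpoint and write the checkpoint directly.
    self.trainer.save_checkpoint(f"./manual-epoch={self.current_epoch}.ckpt")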

I'm still not sure why global_step isn't updating, especially since my optimization logic has been heavily rewritten and is quite complex.

What version are you seeing the problem on?

master, v2.5

Reproduced in studio

No response

How to reproduce the bug

Error messages and logs


Environment

Current environment

- PyTorch Lightning Version: 2.5.2
- PyTorch Version: 2.19
- Python version: 3.12.9
- OS: Ubuntu 22.04
- CUDA/cuDNN version: 12.9
- GPU models and configuration:
- How you installed Lightning (conda, pip, source): pip

More info

No response

cc @lantiga


Labels: bug (Something isn't working), checkpointing (Related to checkpointing), ver: 2.5.x, waiting on author (Waiting on user action, correction, or update)
