I can not save checkpoints in checkpoints epochs #20638

@vanpe20

Bug description

When I train the model, checkpoints stop being saved after a certain epoch. No error is raised: the checkpoint step is silently skipped and training continues.

What version are you seeing the problem on?

v2.5

How to reproduce the bug

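# Imports this snippet relies on (not shown in the original report).
from omegaconf import OmegaConf
from pytorch_lightning import Trainer

# ckptdir, lightning_config, default_callbacks_cfg, trainer_config,
# trainer_kwargs, opt and instantiate_from_config are defined elsewhere
# in the training script.
default_modelckpt_cfg = {
    "target": "pytorch_lightning.callbacks.ModelCheckpoint",
    "params": {
        "dirpath": ckptdir,
        "filename": "{epoch:04}",
        "verbose": True,
        "save_last": False,
        "every_n_epochs": 1,
        "save_top_k": -1,  # save all checkpoints
    },
}

# User-supplied checkpoint settings override the defaults above.
modelckpt_cfg = lightning_config.modelcheckpoint
modelckpt_cfg = OmegaConf.merge(default_modelckpt_cfg, modelckpt_cfg)
default_callbacks_cfg["checkpoint_callback"] = modelckpt_cfg

if "callbacks" in lightning_config:
    callbacks_cfg = lightning_config.callbacks
else:
    callbacks_cfg = OmegaConf.create()
callbacks_cfg = OmegaConf.merge(default_callbacks_cfg, callbacks_cfg)

# Build every configured callback, including the ModelCheckpoint above.
trainer_kwargs["callbacks"] = [
    instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg
]
trainer = Trainer(**trainer_config, **trainer_kwargs, num_nodes=opt.num_nodes)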
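
Since no error is raised, it may help to pin down exactly which epochs are missing from `ckptdir`. The following is a minimal diagnostic sketch, not part of the original script. It assumes the `{epoch:04}` filename template produces files such as `epoch=0003.ckpt` (a bare `0003.ckpt` and a `-v1` style version suffix are also accepted). The helper names `saved_epochs` and `missing_epochs` are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: "epoch=0003.ckpt", optionally "0003.ckpt",
# optionally with a version suffix such as "-v1" before the extension.
_EPOCH_RE = re.compile(r"^(?:epoch=)?(\d+)(?:-v\d+)?\.ckpt$")


def saved_epochs(ckpt_dir):
    """Return the sorted epoch numbers that have a checkpoint file in ckpt_dir."""
    epochs = set()
    for path in Path(ckpt_dir).glob("*.ckpt"):
        match = _EPOCH_RE.match(path.name)
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(ckpt_dir, last_epoch):
    """Return the epochs in 0..last_epoch (inclusive) with no checkpoint file."""
    present = set(saved_epochs(ckpt_dir))
    return [epoch for epoch in range(last_epoch + 1) if epoch not in present]
```

Calling `missing_epochs(ckptdir, last_epoch)` after a run, with `last_epoch` set to the index of the last completed epoch, lists the epochs that were skipped, which can then be matched against the training log.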

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.2.2):
#- Python version (e.g., 3.10):
#- OS (e.g., Linux):

More info

No response
