Bug description
When I train the model, checkpoints stop being saved after a certain epoch. No error is raised: the save is silently skipped and training continues.
What version are you seeing the problem on?
v2.5
How to reproduce the bug
default_modelckpt_cfg = {
    "target": "pytorch_lightning.callbacks.ModelCheckpoint",
    "params": {
        "dirpath": ckptdir,
        "filename": "{epoch:04}",
        "verbose": True,
        "save_last": False,
        "every_n_epochs": 1,
        "save_top_k": -1,  # save all checkpoints
    }
}
modelckpt_cfg = lightning_config.modelcheckpoint
modelckpt_cfg = OmegaConf.merge(default_modelckpt_cfg, modelckpt_cfg)
default_callbacks_cfg["checkpoint_callback"] = modelckpt_cfg
if "callbacks" in lightning_config:
    callbacks_cfg = lightning_config.callbacks
else:
    callbacks_cfg = OmegaConf.create()
callbacks_cfg = OmegaConf.merge(default_callbacks_cfg, callbacks_cfg)
trainer_kwargs["callbacks"] = [
    instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]
trainer = Trainer(**trainer_config, **trainer_kwargs, num_nodes=opt.num_nodes)
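One way to pin down which epochs were skipped is to compare the files in the checkpoint directory against the epochs that were trained. The helper below is only a sketch: missing_checkpoints is a hypothetical name, and it assumes that the "{epoch:04}" filename from the config above ends up on disk as epoch=0000.ckpt, epoch=0001.ckpt, and so on (the documented ModelCheckpoint behaviour of prefixing the metric name). If the files are named differently, pass another pattern.

```python
from pathlib import Path


def missing_checkpoints(ckptdir, num_epochs, pattern="epoch={epoch:04}.ckpt"):
    """Return the epochs in [0, num_epochs) that have no checkpoint file.

    `pattern` is an assumption about the on-disk name produced by the
    "{epoch:04}" filename setting; adjust it if the files differ.
    """
    root = Path(ckptdir)
    return [
        epoch
        for epoch in range(num_epochs)
        if not (root / pattern.format(epoch=epoch)).is_file()
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_ckpts.py <ckptdir> <number of epochs trained>
    if len(sys.argv) == 3:
        print(missing_checkpoints(sys.argv[1], int(sys.argv[2])))
```

Running it against the ckptdir after training lists the epochs with no checkpoint, which shows whether saving stops entirely at one epoch or is skipped intermittently.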
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.2.2):
#- Python version (e.g., 3.10):
#- OS (e.g., Linux):
More info
No response