First checkpoint not being saved #19002
Unanswered
jwliu36
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi team,
I am trying to save all checkpoints as well as `last.ckpt` by setting `save_last=True` and `save_top_k=-1`. Additionally, I am using `AsyncCheckpointIO` to asynchronously upload checkpoints to local and S3 file paths.

However, I am running into an issue where the first `checkpoint-{epoch}-{step}.ckpt` is not saved; only `last.ckpt` is created. As the training job goes on, all subsequent `checkpoint-{epoch}-{step}.ckpt` files are saved into the same directory.

Can you point me to the method in the `ModelCheckpoint` class that I may need to override? Would it be `_save_last_checkpoint` (code ref) or `_should_skip_saving_checkpoint` (code ref)? If it is another method, please point me to the reference. Thank you!
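
To pin down which per-step files are actually missing, the symptom described above can be checked with a small standard-library script. This is only a sketch: the helper names `list_checkpoints` and `missing_steps` are hypothetical, and the filename pattern assumes either `checkpoint-3-1200.ckpt` or, if metric names are inserted into the filename, `checkpoint-epoch=3-step=1200.ckpt`. It only inspects a local directory, not S3.

```python
import re
import tempfile
from pathlib import Path

# Assumed filename shapes (not confirmed by the post):
#   checkpoint-3-1200.ckpt  or  checkpoint-epoch=3-step=1200.ckpt
CKPT_PATTERN = re.compile(r"^checkpoint-(?:epoch=)?(\d+)-(?:step=)?(\d+)\.ckpt$")


def list_checkpoints(dirpath):
    """Return (has_last, [(epoch, step), ...]) for the files in dirpath."""
    has_last = False
    found = []
    for path in Path(dirpath).iterdir():
        if path.name == "last.ckpt":
            has_last = True
            continue
        match = CKPT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return has_last, sorted(found)


def missing_steps(dirpath, expected_steps):
    """Return the expected global steps that have no per-step checkpoint file."""
    _, found = list_checkpoints(dirpath)
    saved = {step for _, step in found}
    return [step for step in expected_steps if step not in saved]
```

Calling `missing_steps(ckpt_dir, expected)` with the checkpoint directory and the list of steps at which a save was expected returns the steps for which no `checkpoint-{epoch}-{step}.ckpt` file exists, which makes it easy to confirm that only the first save is affected.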