Checkpoint on NaN #12306
Unanswered
agrimgupta92
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
I am trying to debug a NaN error that appears partway through training, i.e. after several epochs. Once I detect the NaN, I would like to save the model weights so I can inspect them later, but I could not find a way to do this. Any help would be appreciated. Note that I am using DDP, so I can't directly save the state dict.