save checkpoint issue in nested trainer #7669
Hi all, I am training a model on multiple GPUs, and this error occurs when the checkpoint is saved. When I train on a single GPU, everything works fine.
I wonder whether it is because I nest a trainer inside another model. The code for the nested trainer is as follows:
Many thanks in advance.
Replies: 1 comment 2 replies
Hi @eddiecong, the trainer is not pickleable, so you're probably correct that it has to do with your nested trainer code. You could override the on_save_checkpoint hook and then delete the trainer from the checkpoint, as suggested by this answer: #3444 (comment)
Hope that helps 😃
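
A minimal sketch of that suggestion, in case it helps. The hook name on_save_checkpoint and its single checkpoint-dict argument come from the LightningModule API. Everything else here is an assumption for illustration: the mixin name, the "trainer" key, and the guess that the nested trainer ends up either at the top level of the checkpoint or under "hyper_parameters". Where it actually lands depends on how the nested trainer was stored, so inspect the checkpoint dict first. The block uses only the standard library, with a lock object standing in for the unpicklable trainer.

```python
import pickle
import threading
from typing import Any, Dict


class DropNestedTrainerMixin:
    """Hypothetical mixin: list it before LightningModule in your model's bases."""

    # Assumed key under which the nested trainer was stored; adjust to your code.
    nested_trainer_key = "trainer"

    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Remove the nested trainer if it sits at the top level of the checkpoint.
        checkpoint.pop(self.nested_trainer_key, None)
        # Assumption: it may also have been captured with the hyperparameters.
        hparams = checkpoint.get("hyper_parameters")
        if isinstance(hparams, dict):
            hparams.pop(self.nested_trainer_key, None)


if __name__ == "__main__":
    # A lock cannot be pickled, so it stands in for the nested trainer here.
    fake_checkpoint = {
        "epoch": 3,
        "hyper_parameters": {"lr": 1e-3, "trainer": threading.Lock()},
    }
    DropNestedTrainerMixin().on_save_checkpoint(fake_checkpoint)
    pickle.dumps(fake_checkpoint)  # no longer raises after the hook has run
    print(sorted(fake_checkpoint["hyper_parameters"]))
```

In a real model, the class would be declared along the lines of `class MyModel(DropNestedTrainerMixin, pl.LightningModule)`, so that Lightning calls the hook when it builds the checkpoint. This only addresses what is written into the checkpoint; if the trainer is also held as an attribute on the module itself, that may need separate handling.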