Where do I save model checkpoint when training with ddp_spawn? #12269
Unanswered
ggonzalezp
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
The documentation says that when using `ddp_spawn` the model is not updated in the main process, only in the sub-processes, so it should be saved and then loaded again to obtain performance metrics. When I call `trainer.save_checkpoint(checkpoint_dir)` from the script in which I call `trainer.fit(self.distribution_fitting_module)`, the saved checkpoint contains the same state as before training (which is expected).

My question is then: where do I have to add `trainer.save_checkpoint(checkpoint_dir)` to save the updated model state? Would it be inside one of the functions of the PL module?