Where do I save model checkpoint when training with ddp_spawn? #12269
Unanswered
ggonzalezp
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
The documentation says that when using `ddp_spawn` the model is not updated in the main process, only in the sub-processes, so it should be saved and then loaded again to obtain performance metrics. When I call `trainer.save_checkpoint(checkpoint_dir)` from the script in which I call `trainer.fit(self.distribution_fitting_module)`, the saved checkpoint contains the same state as before training (which is expected).

My question is then: where do I have to add `trainer.save_checkpoint(checkpoint_dir)` to save the updated model state? Would it be inside one of the functions of the PL module?