model weight file corrupted when training on multi-gpus #12654
Answered by rohitgr7
kleinzcy asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
-
When I train on multiple GPUs, the model weight file gets corrupted. I suspect the cause is that multiple GPUs are saving the model weights at the same time. How can I make the checkpoint callback run on only one specific GPU?
Answered by rohitgr7 on Apr 7, 2022
Replies: 1 comment
-
Checkpoints are saved only on global rank 0, which means only one process writes the checkpoint file, even in multi-node training. The corruption is likely caused by something else.
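To illustrate the rank-0-only behavior the answer describes, here is a minimal sketch of the gating logic in plain Python. The helper name `save_checkpoint_rank_zero` is hypothetical, and reading the rank from the `RANK` environment variable is an assumption (it is the convention `torchrun` uses); Lightning itself determines the global rank internally and exposes utilities such as `rank_zero_only` rather than this exact function.

```python
import os

def save_checkpoint_rank_zero(state, path):
    # Hypothetical helper mirroring Lightning's behavior: only the process
    # with global rank 0 writes the checkpoint; every other rank returns
    # early and never touches the file, so no two writers can collide.
    rank = int(os.environ.get("RANK", "0"))  # assumed torchrun-style env var
    if rank != 0:
        return False  # non-zero ranks skip the write entirely
    with open(path, "w") as f:
        f.write(repr(state))  # stand-in for torch.save(state, path)
    return True

if __name__ == "__main__":
    # Simulate two ranks in one process: only rank 0 writes.
    os.environ["RANK"] = "1"
    print(save_checkpoint_rank_zero({"step": 100}, "model.ckpt"))  # False
    os.environ["RANK"] = "0"
    print(save_checkpoint_rank_zero({"step": 100}, "model.ckpt"))  # True
```

Since only rank 0 ever opens the file, concurrent writes cannot corrupt it; if the checkpoint is still corrupted, the problem is elsewhere (e.g. a full disk or an interrupted write).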
Answer selected by kleinzcy