model weight file corrupted when training on multi-gpus #12654
Answered by rohitgr7
kleinzcy asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
-
When I train on multiple GPUs, the model weight file gets corrupted. I suspect the cause is that multiple GPUs are saving the model weights at the same time. How can I make the checkpoint callback run on only one specific GPU?
Answered by rohitgr7 on Apr 7, 2022
Replies: 1 comment
-
Checkpoints are saved only on global rank 0, which means only one process writes the checkpoint file, even in multi-node training. The corruption is likely caused by something else.
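To illustrate the rank-0-only behavior the answer describes, here is a minimal sketch of the gating logic in plain Python. The helper name `save_checkpoint_rank_zero` is hypothetical, and reading the rank from the `RANK` environment variable is an assumption (it is the convention `torchrun` uses); Lightning itself determines the global rank internally and exposes utilities such as `rank_zero_only` rather than this exact function.

```python
import os

def save_checkpoint_rank_zero(state, path):
    # Hypothetical helper mirroring Lightning's behavior: only the process
    # with global rank 0 writes the checkpoint; every other rank returns
    # early and never touches the file, so no two writers can collide.
    rank = int(os.environ.get("RANK", "0"))  # assumed torchrun-style env var
    if rank != 0:
        return False  # non-zero ranks skip the write entirely
    with open(path, "w") as f:
        f.write(repr(state))  # stand-in for torch.save(state, path)
    return True

if __name__ == "__main__":
    # Simulate two ranks in one process: only rank 0 writes.
    os.environ["RANK"] = "1"
    print(save_checkpoint_rank_zero({"step": 100}, "model.ckpt"))  # False
    os.environ["RANK"] = "0"
    print(save_checkpoint_rank_zero({"step": 100}, "model.ckpt"))  # True
```

Since only rank 0 ever opens the file, concurrent writes cannot corrupt it; if the checkpoint is still corrupted, the problem is elsewhere (e.g. a full disk or an interrupted write).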
Answer selected by kleinzcy