docs/source-pytorch/common/checkpointing_intermediate.rst
5 additions & 3 deletions
@@ -167,9 +167,11 @@ In distributed training cases where a model is running across many machines, Lig
 trainer = Trainer(strategy="ddp")
 model = MyLightningModule(hparams)
 trainer.fit(model)
+
 # Saves only on the main process
+# Handles strategy-specific saving logic like XLA, FSDP, DeepSpeed etc.
 trainer.save_checkpoint("example.ckpt")
 
-Not using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` can lead to unexpected behavior and potential deadlock. Using other saving functions will result in all devices attempting to save the checkpoint. As a result, we highly recommend using the Trainer's save functionality.
-If using custom saving functions cannot be avoided, we recommend using the :func:`~lightning.pytorch.utilities.rank_zero.rank_zero_only` decorator to ensure saving occurs only on the main process. Note that this will only work if all ranks hold the exact same state and won't work when using
-model parallel distributed strategies such as deepspeed or sharded training.
+
+By using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` instead of ``torch.save``, you make your code agnostic to the distributed training strategy being used.
+It will ensure that checkpoints are saved correctly in a multi-process setting, avoiding race conditions, deadlocks and other common issues that normally require boilerplate code to handle properly.
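Below is a minimal, self-contained sketch of the behavior the new wording describes: every rank calls ``trainer.save_checkpoint`` and the strategy writes the file only from the main process. It assumes PyTorch Lightning 2.x with the ``lightning.pytorch`` namespace used in the doc references; the toy module, dataset, and the helper ``save_raw_state_dict`` (illustrating the ``rank_zero_only`` fallback the removed paragraph mentioned) are illustrative stand-ins, not part of the docs.

# Minimal sketch (assumed setup, not part of the documented example):
# toy module and dataset, CPU DDP with 2 processes.
import torch
from torch.utils.data import DataLoader, TensorDataset

import lightning.pytorch as pl
from lightning.pytorch.utilities.rank_zero import rank_zero_only


class MyLightningModule(pl.LightningModule):
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(4, hidden)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()  # dummy loss, just to make fit() run

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)
    trainer = pl.Trainer(
        strategy="ddp",
        accelerator="cpu",
        devices=2,
        max_epochs=1,
        logger=False,
        enable_checkpointing=False,
    )
    model = MyLightningModule()
    trainer.fit(model, data)

    # Recommended: every rank calls this, but the strategy handles the
    # strategy-specific saving logic and only the main process writes,
    # so there are no race conditions or deadlocks to handle by hand.
    trainer.save_checkpoint("example.ckpt")

    # Fallback described by the previous wording of this page: guard a custom
    # save with ``rank_zero_only``. This only works when all ranks hold the
    # exact same state, i.e. not with model-parallel strategies such as
    # DeepSpeed or sharded training.
    @rank_zero_only
    def save_raw_state_dict(path: str) -> None:  # hypothetical helper
        torch.save(model.state_dict(), path)

    save_raw_state_dict("example_raw.pt")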