Commit e6b3ac2

Browse files
LucasLLC and svekars authored
Update recipes_source/distributed_async_checkpoint_recipe.rst
Co-authored-by: Svetlana Karslioglu <[email protected]>
1 parent 25ea481 commit e6b3ac2

File tree

1 file changed: +2 −0 lines changed

recipes_source/distributed_async_checkpoint_recipe.rst

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
 Asynchronous Saving with Distributed Checkpoint (DCP)
 =====================================================
 
+**Author:** `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
+
 Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
