@@ -1,12 +1,13 @@
 Asynchronous Saving with Distributed Checkpoint (DCP)
 =====================================================
 
+**Author:** `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
+
 Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
 to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.
 
-**Author**: , `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
 
 .. grid:: 2
 
@@ -156,9 +157,12 @@ If the above optimization is still not performant enough, you can take advantage
 Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
 checkpoint requests users can take advantage of direct memory access to speed up this copy.
 
-.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
-peak memory pressure being sustained through the application life.
+.. note::
+   The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without
+   the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as
+   checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps,
+   leading to the same
+   peak memory pressure being sustained through the application life.
 
 
 .. code-block:: python
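The tutorial this diff edits is built around ``torch.distributed.checkpoint.async_save``, which blocks only for the in-memory copy of the state dict and persists it to storage in the background, returning a future. As a rough mental model only — this is a stdlib-only toy, not DCP, and ``toy_async_save`` is a hypothetical name invented here — the pattern can be sketched like this:

```python
import copy
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# One background worker, analogous to the checkpointing thread/process
# that dcp.async_save hands the persistence work to.
_executor = ThreadPoolExecutor(max_workers=1)


def toy_async_save(state_dict, path):
    """Toy model of asynchronous checkpointing (NOT the DCP API):
    block only for the in-memory snapshot, then persist it from a
    background thread and return a future."""
    snapshot = copy.deepcopy(state_dict)  # the blocking in-memory copy

    def _persist():
        with open(path, "w") as f:
            json.dump(snapshot, f)  # slow storage I/O happens off the critical path
        return path

    return _executor.submit(_persist)  # training can resume immediately


# Usage: as in the tutorial's pattern, wait on the previous future
# before issuing the next checkpoint request.
ckpt = os.path.join(tempfile.mkdtemp(), "step10.json")
future = toy_async_save({"step": 10, "loss": 0.25}, ckpt)
future.result()  # ensure the prior checkpoint finished before the next one
```

The pinned-memory variant discussed in the second hunk changes one thing in this picture: instead of allocating a fresh snapshot buffer per request, a persistent (pinned) buffer is reused across checkpoints to speed up the copy, at the cost of holding that buffer's memory for the life of the application.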