Commit 0e3c3ec

formatting updates
1 parent 16a2f05 commit 0e3c3ec

2 files changed: +41 -12 lines changed


recipes_source/distributed_async_checkpoint_recipe.rst

Lines changed: 35 additions & 9 deletions
@@ -4,20 +4,36 @@ Asynchronous Saving with Distributed Checkpoint (DCP)
 Checkpointing is often a bottleneck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
-to show how this can be integrated quite easily with `torch.distributed.checkpoint.async_save`.
+to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.
 
+**Author**: , `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
 
-Notes on Asynchronous Checkpointing
+.. grid:: 2
+
+   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+      :class-card: card-prerequisites
+
+      * How to use DCP to generate checkpoints in parallel
+      * Effective strategies to optimize performance
+
+   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+      :class-card: card-prerequisites
+
+      * PyTorch v2.4.0 or later
+      * `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
+
+
+Asynchronous Checkpointing Overview
 ------------------------------------
-Before getting started with Asynchronous Checkpointing, it's important that we discuss some differences and limitations as compared to synchronous checkpointing.
+Before getting started with Asynchronous Checkpointing, it's important to understand its differences and limitations as compared to synchronous checkpointing.
 Specifically:
 
 * Memory requirements - Asynchronous checkpointing works by first copying models into internal CPU buffers.
   This is helpful since it ensures model and optimizer weights are not changing while the model is still checkpointing,
   but does raise CPU memory by a factor of checkpoint size times the number of processes on the host.
 
-* Checkpoint Management - Since checkpointing is Asynchronous, it is up to the user to manage concurrently run checkpoints. In general users can
-  employ their own management strategies by handling the future object returned form `async_save`. For most users, we recommend limiting
+* Checkpoint Management - Since checkpointing is asynchronous, it is up to the user to manage concurrently run checkpoints. In general, users can
+  employ their own management strategies by handling the future object returned from ``async_save``. For most users, we recommend limiting
   checkpoints to one asynchronous request at a time, avoiding additional memory pressure per request.
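
The memory note above implies, for example, that a 2 GB checkpoint on a host running 8 ranks can transiently add on the order of 16 GB of CPU memory while a save is being staged. The checkpoint-management recommendation of keeping at most one asynchronous request in flight, by waiting on the future returned from ``async_save``, might look like the following minimal sketch. It assumes a process group is already initialized, uses plain ``state_dict()`` calls for brevity (the full recipe uses FSDP and the ``torch.distributed.checkpoint.state_dict`` helpers), and the checkpoint path is hypothetical.

.. code-block:: python

    import torch.distributed.checkpoint as dcp

    CHECKPOINT_DIR = "checkpoint"  # hypothetical location

    def train_with_async_checkpointing(model, optimizer, dataloader, loss_fn, save_every=100):
        """Training loop sketch that keeps at most one checkpoint request in flight."""
        checkpoint_future = None
        for step, (inputs, labels) in enumerate(dataloader):
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            if step % save_every == 0:
                # Wait for the previous asynchronous save before starting a new one,
                # so only one CPU staging buffer is alive at a time.
                if checkpoint_future is not None:
                    checkpoint_future.result()
                state_dict = {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                }
                checkpoint_future = dcp.async_save(
                    state_dict, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}"
                )

        # Make sure the final request has completed before exiting.
        if checkpoint_future is not None:
            checkpoint_future.result()

Waiting on the future also surfaces any exception raised by the background save, which is otherwise easy to miss.
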
@@ -134,12 +150,12 @@ Speciically:
 
 Even more performance with Pinned Memory
 -----------------------------------------
-If the above optimization is still not performant enough, users may wish to take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging.
-Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaing a pinned memory buffer between
+If the above optimization is still not performant enough, you can take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging.
+Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
 checkpoint requests, users can take advantage of direct memory access to speed up this copy.
 
-Note: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained in between steps, leading to the same
+.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
+any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
 peak memory pressure being sustained through the application life.
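
A minimal sketch of how the pinned-memory staging described above might be wired up, assuming the ``cache_staged_state_dict`` option on ``FileSystemWriter`` (the path and helper name below are illustrative):

.. code-block:: python

    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint import FileSystemWriter

    def make_cached_saver(checkpoint_dir="checkpoint"):  # hypothetical path
        # Build the writer once and reuse it for every save, so the staging buffer
        # (pinned memory for CUDA tensors) is allocated a single time and kept
        # alive between checkpoint requests.
        writer = FileSystemWriter(checkpoint_dir, cache_staged_state_dict=True)

        def save_async(state_dict):
            # Stages into the cached buffer and persists in the background;
            # wait on the returned future before issuing the next save.
            return dcp.async_save(state_dict, storage_writer=writer)

        return save_async

Because the same writer holds the staging buffer for the life of the process, this trades sustained host memory for faster copies, which is exactly the trade-off the note above describes.
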
@@ -257,3 +273,13 @@ peak memory pressure being sustained through the application life.
         nprocs=world_size,
         join=True,
     )
+
+
+Conclusion
+----------
+In conclusion, we have learned how to use DCP's :func:`async_save` API to generate checkpoints off the critical training path. We've also learned about the
+additional memory and concurrency overhead introduced by using this API, as well as additional optimizations which utilize pinned memory to speed things up
+even further.
+
+- `Saving and loading models tutorial <https://pytorch.org/tutorials/beginner/saving_loading_models.html>`__
+- `Getting started with FullyShardedDataParallel tutorial <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__

recipes_source/distributed_checkpoint_recipe.rst

Lines changed: 6 additions & 3 deletions
@@ -330,14 +330,17 @@ Formats
 ----------
 One drawback not yet mentioned is that DCP saves checkpoints in a format which is inherently different from those generated using torch.save.
 This can be an issue when users wish to share models with users used to the torch.save format, or in general just want to add format flexibility
-to their applications. For this case, we provide the `format_utils` module in `torch.distributed.checkpoint.format_utils`.
+to their applications. For this case, we provide the ``format_utils`` module in ``torch.distributed.checkpoint.format_utils``.
 
 A command line utility is provided for the user's convenience, which uses the following format:
-`python -m torch.distributed.checkpoint.format_utils -m <checkpoint location> <location to write formats to> <mode>` where mode is one of `torch_to_dcp` or `dcp_to_torch`.
+.. code-block:: bash
 
-Alternatively, methods are also provided for users who may wish to convert checkpoints directly.
+    python -m torch.distributed.checkpoint.format_utils -m <checkpoint location> <location to write formats to> <mode>
+
+In the command above, ``mode`` is one of ``torch_to_dcp`` or ``dcp_to_torch``.
 
 
+Alternatively, methods are also provided for users who may wish to convert checkpoints directly.
 .. code-block:: python
 
     import os
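
For completeness, a minimal sketch of the direct conversion path, assuming the ``dcp_to_torch_save`` and ``torch_save_to_dcp`` helpers in ``format_utils`` and purely illustrative paths:

.. code-block:: python

    import torch.distributed.checkpoint.format_utils as format_utils

    # DCP checkpoint directory -> single torch.save-style file (paths are illustrative).
    format_utils.dcp_to_torch_save("checkpoint_dcp/", "model.pt")

    # torch.save-style file -> DCP checkpoint directory.
    format_utils.torch_save_to_dcp("model.pt", "checkpoint_dcp_roundtrip/")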
