recipes_source/distributed_async_checkpoint_recipe.rst (+35 -9)
@@ -4,20 +4,36 @@ Asynchronous Saving with Distributed Checkpoint (DCP)
 Checkpointing is often a bottleneck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
-to show how this can be integrated quite easily with `torch.distributed.checkpoint.async_save`.
+to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.
+* `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
+
+
+
 Asynchronous Checkpointing Overview
 ------------------------------------
-Before getting started with Asynchronous Checkpointing, it's important that we discuss some differences and limitations as compared to synchronous checkpointing.
+Before getting started with Asynchronous Checkpointing, it's important to understand its differences and limitations as compared to synchronous checkpointing.
 Specifically:

 * Memory requirements - Asynchronous checkpointing works by first copying models into internal CPU buffers.
   This is helpful since it ensures model and optimizer weights are not changing while the model is still checkpointing,
   but does raise CPU memory by a factor of checkpoint size times the number of processes on the host.

-* Checkpoint Management - Since checkpointing is Asynchronous, it is up to the user to manage concurrently run checkpoints. In general users can
-employ their own management strategies by handling the future object returned form `async_save`. For most users, we recommend limiting
+* Checkpoint Management - Since checkpointing is asynchronous, it is up to the user to manage concurrently running checkpoints. In general, users can
+  employ their own management strategies by handling the future object returned from ``async_save``. For most users, we recommend limiting
 checkpoints to one asynchronous request at a time, avoiding additional memory pressure per request.
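As a concrete reference for the checkpoint-management guidance in this hunk, here is a minimal sketch (editorial, not part of the diff) of ``async_save`` in a training loop that keeps at most one request in flight. The toy model, optimizer, step count, and ``CHECKPOINT_DIR`` are illustrative placeholders, and a process group is assumed to be initialized as in the recipe's full example:

.. code-block:: python

    import torch
    import torch.distributed.checkpoint as dcp

    CHECKPOINT_DIR = "checkpoint"  # hypothetical location

    model = torch.nn.Linear(16, 16)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    checkpoint_future = None
    for step in range(10):
        optimizer.zero_grad()
        model(torch.rand(8, 16)).sum().backward()
        optimizer.step()

        # Block on the previous request before issuing a new one, so only
        # one staged CPU copy of the state dict exists at any time.
        if checkpoint_future is not None:
            checkpoint_future.result()

        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        checkpoint_future = dcp.async_save(
            state_dict, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}"
        )

Waiting on the previous future before saving again is exactly the "one asynchronous request at a time" policy the recipe recommends; it caps the extra memory pressure at a single in-flight checkpoint.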
@@ -134,12 +150,12 @@ Specifically:

 Even more performance with Pinned Memory
 -----------------------------------------
-If the above optimization is still not performant enough, users may wish to take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging.
-Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaing a pinned memory buffer between
+If the above optimization is still not performant enough, you can take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging.
+Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
 checkpoint requests, users can take advantage of direct memory access to speed up this copy.

-Note: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained in between steps, leading to the same
+.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
+   any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
 peak memory pressure being sustained through the application life.
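To illustrate the staging pattern this hunk describes, a hedged sketch follows. It assumes the ``cache_staged_state_dict`` option on ``FileSystemWriter`` used by the recipe's full example, and it reuses one writer instance so the pinned staging buffer persists across requests:

.. code-block:: python

    import torch
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint import FileSystemWriter

    CHECKPOINT_DIR = "checkpoint"  # hypothetical location

    model = torch.nn.Linear(16, 16)

    # cache_staged_state_dict=True keeps the staging buffer (pinned for CUDA
    # tensors) alive between requests instead of re-allocating it each time,
    # so a single writer instance must persist for the whole training run.
    writer = FileSystemWriter(CHECKPOINT_DIR, cache_staged_state_dict=True)

    checkpoint_future = None
    for step in range(10):
        # ... training step elided ...
        if checkpoint_future is not None:
            checkpoint_future.result()
        checkpoint_future = dcp.async_save(
            {"model": model.state_dict()}, storage_writer=writer
        )

This is the trade-off the note above spells out: the reused writer avoids repeated buffer allocation and enables DMA-accelerated copies, at the cost of holding the staging buffer for the life of the application.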
@@ -257,3 +273,13 @@ peak memory pressure being sustained through the application life.
         nprocs=world_size,
         join=True,
     )
+
+
+Conclusion
+----------
+In conclusion, we have learned how to use DCP's :func:`async_save` API to generate checkpoints off the critical training path. We've also learned about the
+additional memory and concurrency overhead introduced by using this API, as well as additional optimizations which utilize pinned memory to speed things up
+even further.
+
+- `Saving and loading models tutorial <https://pytorch.org/tutorials/beginner/saving_loading_models.html>`__
+- `Getting started with FullyShardedDataParallel tutorial <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
recipes_source/distributed_checkpoint_recipe.rst (+6 -3)
@@ -330,14 +330,17 @@ Formats
 ----------
 One drawback not yet mentioned is that DCP saves checkpoints in a format which is inherently different from those generated using torch.save.
 This can be an issue when users wish to share models with users accustomed to the torch.save format, or in general just want to add format flexibility
-to their applications. For this case, we provide the `format_utils` module in `torch.distributed.checkpoint.format_utils`.
+to their applications. For this case, we provide the ``format_utils`` module in ``torch.distributed.checkpoint.format_utils``.

 A command line utility is provided for the user's convenience, which follows this format:
-`python -m torch.distributed.checkpoint.format_utils -m <checkpoint location> <location to write formats to> <mode>` where mode is one of `torch_to_dcp` or `dcp_to_torch`.
+.. code-block:: bash

-Alternatively, methods are also provided for users who may wish to convert checkpoints directly.
+    python -m torch.distributed.checkpoint.format_utils -m <checkpoint location> <location to write formats to> <mode>
+
+In the command above, ``mode`` is one of ``torch_to_dcp`` or ``dcp_to_torch``.


+Alternatively, methods are also provided for users who may wish to convert checkpoints directly.
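For that direct-conversion path, a short sketch using the ``dcp_to_torch_save`` and ``torch_save_to_dcp`` helpers from ``format_utils`` (the paths below are placeholders):

.. code-block:: python

    from torch.distributed.checkpoint.format_utils import (
        dcp_to_torch_save,
        torch_save_to_dcp,
    )

    # Convert a DCP checkpoint directory into a single torch.save file ...
    dcp_to_torch_save("dcp_checkpoint_dir", "model.pth")  # hypothetical paths

    # ... and convert a torch.save file back into a DCP checkpoint directory.
    torch_save_to_dcp("model.pth", "converted_dcp_checkpoint_dir")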