Checkpointing is often a bottleneck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.

One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
to show how this can be integrated quite easily with `torch.distributed.checkpoint.async_save`.
.. code-block:: python

    print(f"Running async checkpoint example on {world_size} devices.")
    mp.spawn(
        run_fsdp_checkpoint_save_example,
        args=(world_size,),
Even more performance with Pinned Memory
-----------------------------------------
If the above optimization is still not performant enough, users may wish to take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging.

Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
checkpoint requests, users can take advantage of direct memory access to speed up this copy.
Note: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained in between steps, leading to the same