Commit 54a6a30

formatting
1 parent 0e3c3ec commit 54a6a30

1 file changed

+3
-1
lines changed

recipes_source/distributed_async_checkpoint_recipe.rst

Lines changed: 3 additions & 1 deletion
@@ -30,7 +30,9 @@ Speciically:

 * Memory requirements - Asynchronous checkpointing works by first copying models into internal CPU-buffers.
 This is helpful since it ensures model and optimizer weights are not changing while the model is still checkpointing,
-but does raise CPU memory by a factor of checkpoint size times the number of process on the host.
+but does raise CPU memory by a factor of ``checkpoint_size_per_rank X number_of_ranks``. Additionally, users should take care to understand
+the memory constraints of their systems. Specifically, pinned memory implies the usage of ``page-lock`` memory, which can be scarce as compared to
+``pageable`` memory.

 * Checkpoint Management - Since checkpointing is asynchronous, it is up to the user to manage concurrently run checkpoints. In general, users can
 employ their own management strategies by handling the future object returned form ``async_save``. For most users, we recommend limiting
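
The ``checkpoint_size_per_rank X number_of_ranks`` expression added in this commit can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the 10 GiB shard size, and the 8 ranks are hypothetical, and it assumes ``number_of_ranks`` counts the ranks running on a single host, as the removed wording ("the number of process on the host") suggests.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def staging_memory_bytes(checkpoint_size_per_rank: int, number_of_ranks: int) -> int:
    """Rough estimate of the extra CPU memory taken by staging buffers.

    Hypothetical helper: it simply multiplies the per-rank checkpoint size
    (in bytes) by the number of ranks, mirroring the expression in the diff.
    """
    if checkpoint_size_per_rank < 0 or number_of_ranks < 0:
        raise ValueError("sizes and rank counts must be non-negative")
    return checkpoint_size_per_rank * number_of_ranks


# Assumed example values: 8 ranks on the host, each staging a 10 GiB shard.
extra = staging_memory_bytes(10 * GIB, 8)
print(f"{extra / GIB:.0f} GiB of additional CPU memory")
```

Under these assumed values the estimate comes to 80 GiB. If the staging buffers are pinned, that amount would come out of page-locked memory, which the added lines describe as scarcer than pageable memory.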