Add DataStates-LLM: Asynchronous Checkpointing Engine Support (#7166)
We are a team at Argonne National Laboratory working on low-overhead
asynchronous checkpointing approaches for LLMs and transformers. As part
of these efforts, we have developed DataStates-LLM, a library that we
would like to contribute to the DeepSpeed community:
https://github.com/datastates/datastates-llm
The key idea is to copy tensors from the GPU to host memory without
blocking during the forward and backward passes. We block only if these
copies have not finished by the start of the update phase. Meanwhile,
the tensors are flushed asynchronously from host memory to durable
storage (parallel file systems, local SSDs, etc.).
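The two-stage flow described above (background snapshot to host memory, then an asynchronous flush to storage) can be sketched in plain Python. This is only an illustration under stated assumptions, not the DataStates-LLM API: the class `LazyCheckpointSketch` and its methods `save`, `wait`, and `drain` are hypothetical names, and copying Python lists stands in for GPU-to-host tensor copies.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


class LazyCheckpointSketch:
    """Hypothetical two-stage checkpointer (not the DataStates-LLM API)."""

    def __init__(self):
        # One worker per stage so snapshots and flushes can overlap.
        self._copy_pool = ThreadPoolExecutor(max_workers=1)
        self._flush_pool = ThreadPoolExecutor(max_workers=1)
        self._copies = []
        self._flushes = []

    def save(self, state, path):
        # Stage 1: snapshot the state into a host-side buffer in the background.
        # Copying Python lists stands in for a GPU-to-host tensor copy.
        copy = self._copy_pool.submit(
            lambda: {k: list(v) for k, v in state.items()})
        self._copies.append(copy)
        # Stage 2: write the snapshot to storage once it is available.
        self._flushes.append(self._flush_pool.submit(self._flush, copy, path))

    @staticmethod
    def _flush(copy, path):
        with open(path, "wb") as f:
            pickle.dump(copy.result(), f)
            f.flush()
            os.fsync(f.fileno())

    def wait(self):
        """Block until all snapshots exist; the state may then be mutated."""
        for c in self._copies:
            c.result()
        self._copies.clear()

    def drain(self):
        """Block until all snapshots have been written to storage."""
        for fl in self._flushes:
            fl.result()
        self._flushes.clear()


def demo():
    state = {"w": [1.0, 2.0]}
    ckpt = LazyCheckpointSketch()
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        ckpt.save(state, path)   # returns immediately
        # ... forward and backward passes would run here (read-only) ...
        ckpt.wait()              # block only before the update phase
        state["w"][0] = 99.0     # the update phase mutates the state
        ckpt.drain()             # the checkpoint is now on disk
        with open(path, "rb") as f:
            return pickle.load(f)
```

In this sketch the checkpoint holds the pre-update values, because `wait()` returns only after the host-side snapshot is complete, while the slower write to storage continues in the background.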
To enable this capability, our initial implementation makes the
scheduler aware of checkpointing, calling a ckpt.wait() primitive before
starting the update phase. We illustrated this with the pipeline
scheduler. We are also considering a scheduler-independent solution that
integrates with DeepSpeed/Megatron and provides a hook for the start of
the update phase, which we can leverage to run ckpt.wait().
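As a rough sketch of the scheduler-independent variant, a training step could expose a list of callbacks that run just before the update phase, and the checkpoint engine could register its wait primitive there. All names below (`FakeCheckpoint`, `Trainer`, `pre_update_hooks`, `begin_copy`) are hypothetical and are not DeepSpeed, Megatron, or DataStates-LLM APIs; a timer thread simulates a copy that is still in flight while the forward and backward passes run.

```python
import threading


class FakeCheckpoint:
    """Hypothetical stand-in for an asynchronous checkpoint engine."""

    def __init__(self):
        self.snapshot = None
        self._done = threading.Event()
        self._done.set()  # nothing is pending initially

    def begin_copy(self, params, delay=0.05):
        # Start a background copy that completes after `delay` seconds.
        self._done.clear()

        def _copy():
            self.snapshot = list(params)
            self._done.set()

        threading.Timer(delay, _copy).start()

    def wait(self):
        # Block until the pending copy (if any) has finished.
        self._done.wait()


class Trainer:
    """Hypothetical training loop with a hook before the update phase."""

    def __init__(self, params):
        self.params = params
        self.pre_update_hooks = []

    def step(self, grads, lr=0.5):
        # The forward and backward passes would run here; they only read
        # self.params, so an in-flight copy stays consistent.
        for hook in self.pre_update_hooks:
            hook()
        # Update phase: parameters are mutated in place.
        for i, g in enumerate(grads):
            self.params[i] -= lr * g


ckpt = FakeCheckpoint()
trainer = Trainer([1.0, 2.0])
trainer.pre_update_hooks.append(ckpt.wait)  # run ckpt.wait() before updates

ckpt.begin_copy(trainer.params)
trainer.step([1.0, 2.0])
```

Without the registered hook, the update could overwrite the parameters before the copy finishes and the snapshot would mix old and new values.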
We appreciate your feedback and look forward to a collaboration in this
space.
---------
Signed-off-by: amaurya <[email protected]>
Co-authored-by: amaurya <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
This feature is not enabled by default. To enable it, set the following options in ds_config.json and download the [DataStates-LLM checkpointing library](https://github.com/DataStates/datastates-llm/). A detailed tutorial is available [here](../../docs/_tutorials/datastates-async-checkpointing.md).