Commit f9cfc28

Merge pull request #230433 from fbsolo-ms1/tutorial-for-SK
Build the new Nebula doc product.
2 parents 87c294b + 14c5df3 commit f9cfc28

File tree

5 files changed: +264 additions, -0 deletions
Lines changed: 261 additions & 0 deletions
@@ -0,0 +1,261 @@
---
title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula (Preview)
titleSuffix: Azure Machine Learning
description: Learn how Nebula can save time, resources, and money for large model training applications.
services: machine-learning
ms.service: machine-learning
ms.topic: reference
ms.custom: ----, ----, ----
author: ziqiwang
ms.author: ziqiwang
ms.date: 03/06/2023
ms.reviewer: franksolomon
---

# Large-model Checkpoint Optimization Matters (Preview)

Learn how to boost checkpoint speed and shrink checkpoint cost for large Azure Machine Learning training models.

## Overview

Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs with PyTorch. Nebula leverages the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.

To make Nebula available for your training jobs, import the `nebulaml` Python package in your script. Nebula is fully compatible with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.

> [!IMPORTANT]
> The `nebulaml` package is not available in the public PyPI Python package index. It's only available in the Azure Container for PyTorch (ACPT) curated environment on Azure Machine Learning. To avoid problems, don't try to install `nebulaml` from PyPI or with the `pip` command.

In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning to quickly checkpoint your model training jobs. You'll also learn how to view and manage Nebula checkpoint data, and how to resume your model training jobs from the last available checkpoint if Azure Machine Learning suffers interruption, failure, or termination.

> [!NOTE]
> Nebula is currently in preview. This means Nebula isn't production-ready, and at this time has no support as a generally-available product. Nebula will receive constant updates and improvements to its functions and features. Feel free to offer feedback and suggestions at [email protected].
>
> Visit [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) to learn more.

## Why checkpoint optimization for large model training matters

Machine learning models have become more complex because of growing data volumes and richer data formats. Training these complex models can become challenging because of GPU memory capacity limits and lengthy training times. As a result, complex model training on large datasets usually involves distributed training. However, distributed architectures often have unexpected faults and node failures, which become increasingly painful as model node counts increase.

Checkpoints can help deal with these problems. Periodic checkpoints snapshot the complete model state at a given time. After a failure, the system can use that snapshot to rebuild the model in its state at the time of the snapshot. The training process can then resume at a given epoch.

When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. However, the computation done between that checkpoint and the termination is wasted, because the training must re-execute operations to recover the unsaved, intermediate results. Shorter checkpoint intervals could reduce this waste. The following diagram shows the time cost to restore a training process from checkpoints:

:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::

However, the checkpoint save process itself generates large overhead. A TB-sized checkpoint save can often become a training process bottleneck: a synchronized checkpoint save blocks training for hours. Checkpoint-related overheads take up 12% of total training time on average, and can rise to 43% [(Maeng et al., 2021)](https://cs.stanford.edu/people/trippel/pubs/cpr-mlsys-21.pdf).

To summarize, large model checkpoint management involves heavy storage and job recovery time overheads. Frequent checkpoint saves, combined with training job resumptions from the latest available checkpoints, become a great challenge.
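To see why both the save cost and the interval matter, here's a back-of-the-envelope model (the numbers are illustrative, not measurements from this article): total checkpoint overhead is the time spent on synchronous saves plus the expected recompute after failures, where a uniformly random failure loses about half a checkpoint interval of work on average.

```python
# Back-of-the-envelope model of checkpoint overhead (illustrative numbers only).
# Overhead = time spent on synchronous saves + expected recompute after failures.
def checkpoint_overhead_hours(total_hours, interval_hours, save_hours, failures):
    """Total overhead for one run, assuming a failure loses ~half an
    interval of work on average (uniformly random failure times)."""
    num_saves = total_hours / interval_hours
    save_cost = num_saves * save_hours
    recompute = failures * interval_hours / 2
    return save_cost + recompute

# A 168-hour run with 2 failures: a slow 0.5 h synchronous save forces long
# intervals and heavy recompute; a near-instant save lets you checkpoint often.
slow_save = checkpoint_overhead_hours(168, 4, 0.5, 2)      # 21 + 4 = 25.0
fast_save = checkpoint_overhead_hours(168, 0.25, 0.001, 2) # ~0.92
print(slow_save, fast_save)
```

This is why cutting the save cost itself (as Nebula does) also shrinks recovery waste: cheap saves make short intervals affordable.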

## Nebula to the Rescue

To train large, distributed models, a reliable and efficient way to save and resume training progress - one that avoids data loss and wasted resources - is essential. Nebula reduces checkpoint save times and training GPU-hour demands. For large model Azure Machine Learning training jobs, Nebula offers faster and easier checkpoint management and saves. In turn, it helps shrink large-scale model training time demands.

Nebula can:

* **Boost checkpoint speeds as much as 1000 times** with a simple API that works asynchronously with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99%.

  :::image type="content" source="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::

  This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Hugging Face GPT2, GPT2-Large, and GPT2-XL training jobs. For the medium-sized Hugging Face GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.

  The checkpoint speed gain increases with model size and GPU count. For example, a test checkpoint save of 97 GB on 128 Nvidia A100 GPUs can shrink from 20 minutes to 1 second.

* **Reduce end-to-end large model training time and computation costs** through checkpoint overhead reduction, and through fewer GPU hours wasted on job recovery. Nebula saves checkpoints asynchronously, which unblocks the training process and shrinks end-to-end training time. It also allows for more frequent checkpoint saves, so you can resume training from the latest checkpoint after any interruption, and save the time and money otherwise wasted on job recovery and GPU training hours.

* **Provide full compatibility with PyTorch**. Nebula offers full compatibility with PyTorch, and full integration with distributed training frameworks, including DeepSpeed (>=0.7.3) and PyTorch Lightning (>=1.5.0). You can also use it with different Azure Machine Learning compute targets, such as AmlCompute or AKS.

* **Easily manage your checkpoints** with a Python package that helps list, get, save, and load your checkpoints. To show the checkpoint lifecycle, Nebula also provides comprehensive logs in Azure Machine Learning studio. You can choose to save your checkpoints to a local or remote storage location

  - Azure Blob Storage
  - Azure Data Lake Storage
  - NFS

  and access them at any time with a few lines of code.
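The asynchronous-save idea behind the first bullet can be sketched with Python's standard library. This is a conceptual sketch only, not Nebula's implementation: a background writer thread persists snapshots handed off through a queue, so the slow I/O no longer blocks the training loop.

```python
import copy
import queue
import threading

# Conceptual sketch of asynchronous checkpointing (not Nebula's internals):
# the training loop hands off a snapshot and continues immediately, while a
# background thread does the slow persistence work.
save_queue: queue.Queue = queue.Queue()
saved = []  # stand-in for durable storage

def writer():
    while True:
        tag, state = save_queue.get()
        if tag is None:          # shutdown sentinel
            break
        saved.append((tag, state))  # slow disk/remote I/O would happen here
        save_queue.task_done()

thread = threading.Thread(target=writer, daemon=True)
thread.start()

model_state = {"step": 0, "weights": [0.0]}
for step in range(1, 4):
    model_state["step"] = step
    model_state["weights"][0] += 0.1          # "training" work
    # Snapshot a copy so later training steps can't mutate the saved state.
    save_queue.put((f"step-{step}", copy.deepcopy(model_state)))

save_queue.join()                # wait until all snapshots are persisted
save_queue.put((None, None))     # stop the writer
thread.join()
print([tag for tag, _ in saved])  # → ['step-1', 'step-2', 'step-3']
```

The `deepcopy` matters: the snapshot must be decoupled from the live model state, or the background save would race with ongoing updates.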

## Prerequisites

* An Azure subscription and an Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md) for more information about workspace resource creation.
* An Azure Machine Learning compute target. See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation.
* A training script that uses **PyTorch**.
* The ACPT (Azure Container for PyTorch) curated environment. See [Curated environments](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image, and learn how to use the curated environment [here](./how-to-use-environments.md).
* An Azure Machine Learning script run configuration file. If you don't have one, you can follow [this resource](./how-to-set-up-training-targets.md).

## How to Use Nebula

Nebula provides a fast, easy checkpoint experience, right in your existing training script. Nebula use involves:

- [The ACPT environment](#using-acpt-environment)
- [Nebula initialization](#initializing-nebula)
- [API calls to save and load checkpoints](#call-apis-to-save-and-load-checkpoints)

### Using ACPT environment

[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a preinstalled, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.
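To use the curated environment, you reference the ACPT image when you submit your training job. A hypothetical command-job YAML might look like the following sketch; the environment name, version, and compute name are illustrative placeholders, so check the curated environments list for the exact identifier in your workspace:

```yaml
# Hypothetical job spec; environment and compute names are placeholders.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py
code: ./src
environment: azureml://registries/azureml/environments/<ACPT-CURATED-ENVIRONMENT-NAME>/versions/<VERSION>
compute: azureml:<YOUR-GPU-CLUSTER>
```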

### Initializing Nebula

To enable Nebula in the ACPT environment, you only need to modify your training script to import the `nebulaml` package, and then call the Nebula APIs in the appropriate places. You don't need to modify the Azure Machine Learning SDK or CLI, or any other steps you use to train your large model on the Azure Machine Learning platform.

Nebula needs initialization to run in your training script. At the initialization phase, specify the variables that determine the checkpoint save location and frequency, as shown in this code snippet:

```python
import nebulaml as nm
nm.init(persistent_storage_path=<YOUR STORAGE PATH>) # initialize Nebula
```

Nebula is integrated into DeepSpeed and PyTorch Lightning, so initialization is simple and easy. These [examples](#examples) show how to integrate Nebula into your training scripts.

### Call APIs to save and load checkpoints

Nebula provides APIs to handle checkpoint saves. You can use these APIs in your training scripts, similar to the PyTorch `torch.save()` API. These [examples](#examples) show how to use Nebula in your training scripts.
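To make the save/list/load lifecycle concrete before diving into the framework-specific examples, here is a tiny stand-in written with only the standard library. It mimics the *shape* of the Nebula calls (save under a tag, list checkpoints, load the latest); it is a hypothetical sketch, not the `nebulaml` package.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in that mirrors the shape of a save/list/load checkpoint
# lifecycle using plain pickle files; this is NOT the real nebulaml package.
class TinyCheckpointStore:
    def __init__(self, storage_path):
        self.storage_path = storage_path
        os.makedirs(storage_path, exist_ok=True)

    def save(self, tag, state):
        """Persist one checkpoint under a unique tag (step, epoch, etc.)."""
        with open(os.path.join(self.storage_path, f"{tag}.pkl"), "wb") as f:
            pickle.dump(state, f)

    def list_checkpoints(self):
        names = [n[:-4] for n in os.listdir(self.storage_path) if n.endswith(".pkl")]
        return sorted(names)

    def load_latest(self):
        tags = self.list_checkpoints()
        with open(os.path.join(self.storage_path, f"{tags[-1]}.pkl"), "rb") as f:
            return pickle.load(f)

store = TinyCheckpointStore(tempfile.mkdtemp())
store.save("epoch-1", {"weights": [0.1]})
store.save("epoch-2", {"weights": [0.2]})
print(store.list_checkpoints())   # → ['epoch-1', 'epoch-2']
print(store.load_latest())        # → {'weights': [0.2]}
```

The Nebula APIs below follow the same pattern, with the heavy lifting (async saves, consistency, remote storage) handled for you.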

### View your checkpointing histories

When your training job finishes, navigate to the **Job Name > Outputs + logs** pane. In the left panel, expand the **Nebula** folder, and select `checkpointHistories.csv` to see detailed information about Nebula checkpoint saves - duration, throughput, and checkpoint size.

:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" alt-text="Screenshot that shows metadata about the checkpoint saves.":::

## Examples

These examples show how to use Nebula with different framework types. You can choose the example that best fits your training script.

# [Using PyTorch Natively](#tab/PYTORCH)

To enable full Nebula compatibility with PyTorch-based training scripts, modify your training script as needed.

1. First, import the required `nebulaml` package:

   ```python
   # Import the Nebula package for fast checkpointing
   import nebulaml as nm
   ```

1. To initialize Nebula, call the `nm.init()` function in `main()`, as shown here:

   ```python
   # Initialize Nebula with variables that help Nebula know where
   # and how often to save your checkpoints
   persistent_storage_path = "/tmp/test"
   nm.init(persistent_storage_path, persistent_time_interval=2)
   ```

1. To save checkpoints, replace the original `torch.save()` statement to save your checkpoint with Nebula:

   ```python
   checkpoint = nm.Checkpoint()
   checkpoint.save(<'CKPT_NAME'>, model)
   ```

   > [!NOTE]
   > ``<'CKPT_NAME'>`` is the unique ID (tag) for the checkpoint. A tag is usually the number of steps, the epoch number, or any user-defined name. The optional ``<'NUM_OF_FILES'>`` parameter specifies the number of states you save for this tag.
1. Load the latest valid checkpoint, as shown here:

   ```python
   latest_ckpt = nm.get_latest_checkpoint()
   p0 = latest_ckpt.load(<'CKPT_NAME'>)
   ```

   Since a checkpoint or snapshot may contain many files, you can load one or more of them by name. With the latest checkpoint, the training state can be restored to the state saved by the last checkpoint.

Other APIs can handle checkpoint management:

- list all checkpoints
- get the latest checkpoint

```python
# Managing checkpoints
## List all checkpoints
ckpts = nm.list_checkpoints()
## Get the latest checkpoint path
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persistent_storage_path)
```

# [Using DeepSpeed](#tab/DEEPSPEED)

A training script based on DeepSpeed (>=0.7.3) can leverage Nebula if you enable Nebula in your `ds_config.json` configuration file, as shown:

```json
"nebula": {
    "enabled": true,
    "persistent_storage_path": "<YOUR STORAGE PATH>",
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true
}
```

This JSON snippet functions like the `nebulaml.init()` function.

Initialization through the `ds_config.json` file configuration enables Nebula, which enables checkpoint saves in turn. The original DeepSpeed save method, with the model checkpointing API `model_engine.save_checkpoint()`, then automatically uses Nebula. This avoids the need for code modification.
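For orientation, the `nebula` block above sits at the top level of `ds_config.json`, alongside your other DeepSpeed settings. The surrounding keys in this sketch are illustrative examples of a typical config, not requirements:

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": { "stage": 1 },
    "nebula": {
        "enabled": true,
        "persistent_storage_path": "<YOUR STORAGE PATH>",
        "persistent_time_interval": 100,
        "num_of_version_in_retention": 2,
        "enable_nebula_load": true
    }
}
```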

# [Using PyTorch Lightning](#tab/LIGHTNING)

PyTorch Lightning **(Nebula supports version >=1.5.0)** checkpoints automatically when `Trainer` is used. Since you would often save checkpoints with customized behaviors for fine-grained control, PyTorch Lightning provides two ways to save checkpoints: conditional saves with ``ModelCheckpoint()``, and manual saves with ``trainer.save_checkpoint()``. These techniques apply to PyTorch (>=0.15.0) training scripts.

If you use `ModelCheckpoint` to conditionally save your checkpoints, you can use `NebulaCallback` instead of `ModelCheckpoint` for initialization.

```python
# import Nebula package
import nebulaml as nm

# define NebulaCallback
config_params = dict()
config_params["persistent_storage_path"] = "<YOUR STORAGE PATH>"
config_params["persistent_time_interval"] = 10

nebula_checkpoint_callback = nm.NebulaCallback(
    ****, # Original ModelCheckpoint params
    config_params=config_params, # customize the config of init nebula
)
```

Next, add `nm.NebulaCheckpointIO()` as a plugin to your `Trainer`, and modify the `trainer.save_checkpoint()` storage parameters as shown:

```python
trainer = Trainer(plugins=[nm.NebulaCheckpointIO()], # add NebulaCheckpointIO as a plugin
                  callbacks=[nebula_checkpoint_callback]) # use NebulaCallback as a callback
```

If you use `trainer.save_checkpoint()` to manually save your checkpoints, you can use the `NebulaCheckpointIO` plugin in your `Trainer`, and modify the storage parameters in `trainer.save_checkpoint()` as follows:

```python
# import Nebula package
import nebulaml as nm

# initialize Nebula
nm.init(persistent_storage_path=<YOUR STORAGE PATH>)

trainer = Trainer(plugins=[nm.NebulaCheckpointIO()]) # add NebulaCheckpointIO as a plugin

# Saving checkpoints
storage_options = {}
storage_options['is_best'] = True
storage_options['persist_path'] = "/tmp/tier3/checkpoint"

trainer.save_checkpoint("example.ckpt",
    storage_options=storage_options, # customize the config of Nebula saving checkpoint
)
```
**Load Checkpoint**

Checkpoint loading is consistent with PyTorch and PyTorch Lightning; the only modification is to specify the storage path.

To load the latest checkpoint, ``MyLightningModule.load_from_checkpoint()`` still works, as shown:

```python
persistent_path = "/tmp/tier3/checkpoint"
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persistent_path)
model = MyLightningModule.load_from_checkpoint(latest_ckpt_path)
```

If you used ``nebula_checkpoint_callback`` in your ``Trainer()``, your original, unchanged script still works, as shown:

```python
trainer = Trainer(
    default_root_dir=tmpdir,
    max_steps=100,
    plugins=[NebulaCheckpointIO()],
    callbacks=[nebula_checkpoint_callback],
)
trainer.fit(model, ckpt_path="/path/example.ckpt")
```
## Next steps

* [Track ML experiments and models with MLflow](how-to-use-mlflow-cli-runs.md)
* [Log and view metrics](how-to-log-view-metrics.md)

articles/machine-learning/toc.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -357,6 +357,9 @@
       - name: Attach and Manage a Synapse Spark pool
         displayName: Attach and Manage a Synapse Spark pool
         href: how-to-manage-synapse-spark-pool.md
+      - name: Reference checkpoint performance with Nebula
+        displayName: Reference checkpoint performance with Nebula
+        href: reference-checkpoint-performance-with-Nebula.md
       - name: AKS and Azure Arc-enabled Kubernetes
         items:
         - name: What is Kubernetes compute target
```
