
Commit 068faf2

Ziqi requested these updates.
1 parent 1d77321 commit 068faf2

File tree: 2 files changed, +40 -44 lines

[937 KB binary file not shown]

articles/machine-learning/reference-checkpoint-performance-with-Nebula.md

Lines changed: 40 additions & 44 deletions
@@ -1,5 +1,5 @@
 ---
-title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula
+title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula (Preview)
 titleSuffix: Azure Machine Learning
 description: Learn how Nebula can save time, resources, and money for large model training applications
 services: machine-learning
@@ -19,15 +19,13 @@ Learn how to boost checkpoint speed and shrink checkpoint cost for large Azure M
 
 ## Overview
 
-Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. With Nebula, you can checkpoint distributed large-scale model training jobs with PyTorch. Nebula levers the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.
+Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs with PyTorch. Nebula leverages the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.
 
-To make Nebula available for your training jobs, import the `torch_nebula` python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.
+To make Nebula available for your training jobs, import the `nebulaml` Python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.
 
 > [!IMPORTANT]
 > The `torch-nebula` package is not available in the public PyPI python package index. This package is only available in the Azure Container for PyTorch (ACPT) curated environment on Azure Machine Learning. To avoid problems, please don't try to install `torch-nebula` from PyPI or with the `pip` command.
 
-To maintain stability and to avoid confusion, the next ACPT version release will rename this package to `nebula-ml`.
-
 In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if Azure Machine Learning suffers interruption, failure, or termination.
 
 > [!NOTE]
@@ -99,20 +97,20 @@ Nebula use involves:
 - [API calls to save and load checkpoints](#call-apis-to-save-and-load-checkpoints)
 
 ### Using ACPT environment
-[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, offers Nebula pre-installed. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to learn more about the curated enviroment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.
+[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a pre-installed, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.
 
 ### Initializing Nebula
 
-To enable Nebula in the ACPT environment, you must only modify your training script to import the `torch_nebula` package, and then call the Nebula APIs in the appropriate places. That's it. You can avoid Azure Machine Learning SDK or CLI modification. You can also avoid modification of other steps to train your large model on Azure Machine Learning Platform.
+To enable Nebula in the ACPT environment, you only need to modify your training script to import the `nebulaml` package, and then call the Nebula APIs in the appropriate places. No modification of the Azure Machine Learning SDK or CLI is needed, and the other steps to train your large model on the Azure Machine Learning platform stay unchanged.
 
 Nebula needs initialization to run in your training script. At the initialization phase, specify the variables that determine the checkpoint save location and frequency, as shown in this code snippet:
 
 ```python
-import torch_nebula as tn
-tn.init(persistent_storage_path=<YOUR STORAGE PATH>) # initialize Nebula
+import nebulaml as nm
+nm.init(persistent_storage_path=<YOUR STORAGE PATH>) # initialize Nebula
 ```
 
-We plan to integrate Nebula into some trainers, to make initialization simple and easy. If you use a distributed trainer like DeepSpeed, or PyTorch Lightning, this process becomes easier. See these [examples](#examples) to learn how to integrate Nebula in your training scripts.
+Nebula has been integrated into DeepSpeed and PyTorch Lightning, which makes initialization simple and easy. These [examples](#examples) show how to integrate Nebula into your training scripts.
 
 ### Call APIs to save and load checkpoints
 
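As a concrete illustration of this initialization, the following sketch fills in the placeholder with hypothetical values (the storage path and interval shown here are examples, not required settings):

```python
import nebulaml as nm

# Hypothetical values: any writable persistent storage path works here
persistent_storage_path = "/mnt/outputs/nebula-checkpoints"
nm.init(
    persistent_storage_path,
    persistent_time_interval=2,  # how often Nebula persists checkpoints
)
```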
@@ -131,48 +129,46 @@ These examples show how to use Nebula with different framework types. You can ch
 
 To enable full Nebula compatibility with PyTorch-based training scripts, modify your training script as needed.
 
-1. First, import the required `torch_nebula` package:
-
-   ```python
-   # Import the Nebula package for fast-checkpointing
-   import torch_nebula as tn
-   ```
-
-1. To initialize Nebula, call the `tn.init()` function in `main()`, as shown here:
+1. First, import the required `nebulaml` package:
+   ```python
+   # Import the Nebula package for fast-checkpointing
+   import nebulaml as nm
+   ```
 
-   ```python
-   # Initialize Nebula with variables that helps Nebula to know where and how often to save your checkpoints
-   persistent_storage_path="/tmp/test",
-   tn.init(persistent_storage_path, persistent_time_interval=2)
-   ```
+1. To initialize Nebula, call the `nm.init()` function in `main()`, as shown here:
+   ```python
+   # Initialize Nebula with variables that help Nebula know where and how often to save your checkpoints
+   persistent_storage_path = "/tmp/test"
+   nm.init(persistent_storage_path, persistent_time_interval=2)
+   ```
 
 1. To save checkpoints, replace the original `torch.save()` statement to save your checkpoint with Nebula:
 
-   ```python
-   checkpoint = tn.Checkpoint()
-   checkpoint.save(<'CKPT_NAME'>, model)
-   ```
-   > [!NOTE]
-   > ``<'CKPT_TAG_NAME'>`` is the unique ID for the checkpoint. A tag is usually the number of steps, the epoch number, or any user-defined name. The optional ``<'NUM_OF_FILES'>`` optional parameter specifies the state number which you would save for this tag.
+   ```python
+   checkpoint = nm.Checkpoint()
+   checkpoint.save(<'CKPT_NAME'>, model)
+   ```
+   > [!NOTE]
+   > ``<'CKPT_NAME'>`` is the unique ID for the checkpoint. A tag is usually the number of steps, the epoch number, or any user-defined name. The optional ``<'NUM_OF_FILES'>`` parameter specifies the number of states to save for this tag.
 
 1. Load the latest valid checkpoint, as shown here:
 
-   ```python
-   latest_ckpt = tn.get_latest_checkpoint()
-   p0 = latest_ckpt.load(<'CKPT_NAME'>)
-   ```
+   ```python
+   latest_ckpt = nm.get_latest_checkpoint()
+   p0 = latest_ckpt.load(<'CKPT_NAME'>)
+   ```
 
-   Since a checkpoint or snapshot may contain many files, you can load one or more of them by the name. In this way, the training state would become the last one saved.
+   Since a checkpoint or snapshot may contain many files, you can load one or more of them by name. Loading back the latest checkpoint restores the training state to the one saved by the last checkpoint.
 
-    Other APIs can handle checkpoint management
+   Other APIs can handle checkpoint management:
 
-    - list all checkpoints
-    - get latest checkpoints
+   - list all checkpoints
+   - get the latest checkpoint
 
 ```python
 # Managing checkpoints
 ## List all checkpoints
-ckpts = tn.list_checkpoints()
+ckpts = nm.list_checkpoints()
 ## Get Latest checkpoint path
 latest_ckpt_path = tn.get_latest_checkpoint_path("checkpoint", persisted_storage_path)
 ```
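Taken together, these steps suggest a save-and-resume pattern along these lines. This is a sketch only: the toy model, the tag names, the `None` check, and the assumption that `load()` returns the saved model state are illustrative, not confirmed API behavior.

```python
import nebulaml as nm
import torch.nn as nn

nm.init(persistent_storage_path="/tmp/test", persistent_time_interval=2)

model = nn.Linear(10, 2)  # hypothetical toy model

# Try to resume from the latest valid checkpoint
latest_ckpt = nm.get_latest_checkpoint()
if latest_ckpt is not None:  # assumption: returns None when no checkpoint exists
    model = latest_ckpt.load("epoch-0")  # hypothetical tag; returns what save() stored

for epoch in range(1, 3):
    # ... training steps would go here ...
    checkpoint = nm.Checkpoint()
    checkpoint.save(f"epoch-{epoch}", model)  # replaces torch.save()
```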
@@ -191,7 +187,7 @@ latest_ckpt_path = tn.get_latest_checkpoint_path("checkpoint", persisted_storage
 }
 ```
 
-This JSON snippets function works like the `torch_nebula.init()` function.
+This JSON snippet functions like the `nebulaml.init()` function.
 
 Initialization with `ds_config.json` file configuration enables Nebula, which enables checkpoint saves in turn. The original DeepSpeed save method, with the model checkpointing API `model_engine.save_checkpoint()`, automatically uses Nebula. This save method avoids the need for code modification.
 
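To make the no-code-change claim concrete, here's a sketch of an ordinary DeepSpeed save path that would pick Nebula up from `ds_config.json`. The toy model, save directory, and tag are hypothetical; the DeepSpeed calls themselves are the standard ones.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(10, 2)  # hypothetical toy model

# Standard DeepSpeed setup; Nebula is enabled through ds_config.json, not code
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# The unchanged DeepSpeed checkpointing API; with Nebula enabled in the
# config, this save is served by Nebula's fast checkpointing
model_engine.save_checkpoint("/tmp/checkpoints", tag="step100")
```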
@@ -202,7 +198,7 @@ latest_ckpt_path = tn.get_latest_checkpoint_path("checkpoint", persisted_storage
 
 ```python
 # import Nebula package
-import torch_nebula as tn
+import nebulaml as nm
 
 # define NebulaCallback
 config_params = dict()
@@ -226,12 +222,12 @@ latest_ckpt_path = tn.get_latest_checkpoint_path("checkpoint", persisted_storage
 
 ```python
 # import Nebula package
-import torch_nebula as tn
+import nebulaml as nm
 
 # initialize Nebula
-tn.init(persistent_storage_path=<YOUR STORAGE PATH>)
+nm.init(persistent_storage_path=<YOUR STORAGE PATH>)
 
-trainer = Trainer(plugins=[tn.NebulaCheckpointIO()]) # add NebulaCheckpointIO as a plugin
+trainer = Trainer(plugins=[nm.NebulaCheckpointIO()]) # add NebulaCheckpointIO as a plugin
 
 # Saving checkpoints
 storage_options = {}
@@ -250,7 +246,7 @@ To load the latest checkpoint, ``MyLightningModule.load_from_checkpoint()`` stil
 
 ```python
 persistent_path = "/tmp/tier3/checkpoint"
-latest_ckpt_path = torch_nebula.get_latest_checkpoint_path("checkpoint", persist_path)
+latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persistent_path)
 model = MyLightningModule.load_from_checkpoint(latest_ckpt_path)
 ```
 
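End to end, the Lightning integration might be wired up as in this sketch. `MyLightningModule` and `train_loader` are hypothetical stand-ins for your own module and dataloader; the Nebula calls mirror the snippets above.

```python
import nebulaml as nm
from pytorch_lightning import Trainer

nm.init(persistent_storage_path="/tmp/tier3/checkpoint")

# Route Lightning checkpoint I/O through Nebula
trainer = Trainer(plugins=[nm.NebulaCheckpointIO()])
trainer.fit(MyLightningModule(), train_loader)  # hypothetical module and dataloader
trainer.save_checkpoint("example.ckpt")

# Resume later from the latest checkpoint Nebula persisted
persistent_path = "/tmp/tier3/checkpoint"
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persistent_path)
model = MyLightningModule.load_from_checkpoint(latest_ckpt_path)
```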
