articles/machine-learning/reference-checkpoint-performance-with-Nebula.md
Lines changed: 40 additions & 44 deletions
@@ -1,5 +1,5 @@
 ---
-title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula
+title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula (Preview)
 titleSuffix: Azure Machine Learning
 description: Learn how Nebula can save time, resources, and money for large model training applications
 services: machine-learning
@@ -19,15 +19,13 @@ Learn how to boost checkpoint speed and shrink checkpoint cost for large Azure M
 ## Overview

-Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. With Nebula, you can checkpoint distributed large-scale model training jobs with PyTorch. Nebula levers the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.
+Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs with PyTorch. Nebula leverages the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.

-To make Nebula available for your training jobs, import the `torch_nebula` python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.
+To make Nebula available for your training jobs, import the `nebulaml` Python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.

 > [!IMPORTANT]
 > The `torch-nebula` package is not available in the public PyPI Python package index. This package is only available in the Azure Container for PyTorch (ACPT) curated environment on Azure Machine Learning. To avoid problems, please don't try to install `torch-nebula` from PyPI or with the `pip` command.

-To maintain stability and to avoid confusion, the next ACPT version release will rename this package to `nebula-ml`.
-
 In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if Azure Machine Learning suffers interruption, failure, or termination.

 > [!NOTE]
@@ -99,20 +97,20 @@ Nebula use involves:
 - [API calls to save and load checkpoints](#call-apis-to-save-and-load-checkpoints)

 ### Using ACPT environment

-[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, offers Nebula pre-installed. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to learn more about the curated enviroment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.
+[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a pre-installed, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.

 ### Initializing Nebula

-To enable Nebula in the ACPT environment, you must only modify your training script to import the `torch_nebula` package, and then call the Nebula APIs in the appropriate places. That's it. You can avoid Azure Machine Learning SDK or CLI modification. You can also avoid modification of other steps to train your large model on Azure Machine Learning Platform.
+To enable Nebula in the ACPT environment, you only need to modify your training script to import the `nebulaml` package, and then call the Nebula APIs in the appropriate places. You don't need to modify the Azure Machine Learning SDK or CLI, or any other steps used to train your large model on the Azure Machine Learning platform.

 Nebula needs initialization to run in your training script. At the initialization phase, specify the variables that determine the checkpoint save location and frequency, as shown in this code snippet:
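A minimal sketch of that initialization, assuming `persistent_storage_path` (where checkpoints are persisted) and `persistent_time_interval` (how often to persist them) as the parameter names rather than the exact snippet from the article:

```python
import nebulaml as nm  # available only in the ACPT curated environment

# Initialize Nebula once, early in main(). The parameter names below are
# assumptions: a durable location for checkpoints, and how often (in
# seconds) Nebula persists them.
nm.init(
    persistent_storage_path="/tmp/nebula_checkpoints",
    persistent_time_interval=2,
)
```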
-We plan to integrate Nebula into some trainers, to make initialization simple and easy. If you use a distributed trainer like DeepSpeed, or PyTorch Lightning, this process becomes easier. See these [examples](#examples) to learn how to integrate Nebula in your training scripts.
+Nebula has been integrated into DeepSpeed and PyTorch Lightning. As a result, initialization becomes simple and easy. These [examples](#examples) show how to integrate Nebula into your training scripts.

 ### Call APIs to save and load checkpoints
@@ -131,48 +129,46 @@ These examples show how to use Nebula with different framework types. You can ch
 To enable full Nebula compatibility with PyTorch-based training scripts, modify your training script as needed.

-1. First, import the required `torch_nebula` package:
-
-   ```python
-   # Import the Nebula package for fast-checkpointing
-   import torch_nebula as tn
-   ```
-
-1. To initialize Nebula, call the `tn.init()` function in `main()`, as shown here:
+1. First, import the required `nebulaml` package:
+
+   ```python
+   # Import the Nebula package for fast-checkpointing
+   import nebulaml as nm
+   ```

-   ```python
-   # Initialize Nebula with variables that helps Nebula to know where and how often to save your checkpoints
 1. To save checkpoints, replace the original `torch.save()` statement to save your checkpoint with Nebula:

-   ```python
-   checkpoint = tn.Checkpoint()
-   checkpoint.save(<'CKPT_NAME'>, model)
-   ```
-
-   > [!NOTE]
-   > ``<'CKPT_TAG_NAME'>`` is the unique ID for the checkpoint. A tag is usually the number of steps, the epoch number, or any user-defined name. The optional ``<'NUM_OF_FILES'>`` optional parameter specifies the state number which you would save for this tag.
+   ```python
+   checkpoint = nm.Checkpoint()
+   checkpoint.save(<'CKPT_NAME'>, model)
+   ```
+
+   > [!NOTE]
+   > ``<'CKPT_TAG_NAME'>`` is the unique ID for the checkpoint. A tag is usually the number of steps, the epoch number, or any user-defined name. The optional ``<'NUM_OF_FILES'>`` parameter specifies the number of states that you would save for this tag.
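As a usage sketch, a training loop might tag each checkpoint with the current step number. The model, optimizer, and tag format below are illustrative; only the `nm.Checkpoint().save()` call mirrors the API shown above:

```python
import torch
import nebulaml as nm  # available only in the ACPT curated environment

model = torch.nn.Linear(10, 10)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        # Tag the checkpoint with the step number, as the note above suggests
        checkpoint = nm.Checkpoint()
        checkpoint.save(f"global_step_{step}", model)
```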
 1. Load the latest valid checkpoint, as shown here:

-   ```python
-   latest_ckpt = tn.get_latest_checkpoint()
-   p0 = latest_ckpt.load(<'CKPT_NAME'>)
-   ```
+   ```python
+   latest_ckpt = nm.get_latest_checkpoint()
+   p0 = latest_ckpt.load(<'CKPT_NAME'>)
+   ```

-   Since a checkpoint or snapshot may contain many files, you can load one or more of them by the name. In this way, the training state would become the last one saved.
+   Since a checkpoint or snapshot may contain many files, you can load one or more of them by name. By loading back the latest checkpoint, the training state can be restored to the state saved by the last checkpoint.
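To resume after an interruption, the same calls can restore the last saved state. A minimal sketch, assuming `load()` returns the saved model state and that a checkpoint tagged `global_step_900` exists (both assumptions are illustrative, not confirmed by the article):

```python
import torch
import nebulaml as nm  # available only in the ACPT curated environment

model = torch.nn.Linear(10, 10)  # stand-in model for illustration

# Ask Nebula for the most recent valid checkpoint
latest_ckpt = nm.get_latest_checkpoint()
if latest_ckpt is not None:  # assumption: None when no checkpoint exists
    # Assumption: load() returns the state saved earlier under this tag
    state = latest_ckpt.load("global_step_900")
    model.load_state_dict(state)
```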
-This JSON snippets function works like the `torch_nebula.init()` function.
+This JSON snippet functions like the `nebulaml.init()` function.

 Initialization with the `ds_config.json` file configuration enables Nebula, which in turn enables checkpoint saves. The original DeepSpeed save method, with the model checkpointing API `model_engine.save_checkpoint()`, automatically uses Nebula. This save method avoids the need for code modification.
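For reference, the Nebula section of `ds_config.json` might look like the following sketch. The key names mirror the initialization parameters discussed above and are assumptions here, not a verified schema:

```json
{
  "nebula": {
    "enabled": true,
    "persistent_storage_path": "<YOUR STORAGE PATH>",
    "persistent_time_interval": 100
  }
}
```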