
Commit a7182ed

Merge pull request #232331 from ZikeiWong/ziqi/Nebula/03.28
Update Document with some wording and instruction.
2 parents 1b468d8 + 5555e27 commit a7182ed

articles/machine-learning/reference-checkpoint-performance-for-large-models.md

Lines changed: 37 additions & 30 deletions
@@ -9,61 +9,59 @@ ms.custom: ----, ----, ----
author: ziqiwang
ms.author: ziqiwang
-ms.date: 03/14/2023
-ms.reviewer: franksolomon
+ms.date: 03/28/2023
---

-# Large-model Checkpoint Optimization Matters (Preview)
+# Boost Checkpoint Speed and Reduce Cost with Nebula (Preview)

-Learn how to boost checkpoint speed and shrink checkpoint cost for large Azure Machine Learning training models.
+Learn how to boost checkpoint speed and reduce checkpoint cost for large Azure Machine Learning training models using Nebula.

## Overview

-Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs with PyTorch. Nebula levers the latest distributed computing technologies to shrink checkpoint times from hours to seconds - a potential 95% to 99.9% time savings. Large-scale training jobs especially benefit from Nebula checkpoint performance.
+Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs using PyTorch. By utilizing the latest distributed computing technologies, Nebula can reduce checkpoint times from hours to seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly benefit from Nebula’s performance.

To make Nebula available for your training jobs, import the `nebulaml` python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.

> [!IMPORTANT]
-> The `nebulaml` package is not available in the public PyPI python package index. This package is only available in the Azure Container for PyTorch (ACPT) curated environment on Azure Machine Learning. To avoid problems, please don't try to install `nebulaml` from PyPI, or the `pip` command.
+> The `nebulaml` package is not available on the public PyPI python package index. It is only available in the Azure Container for PyTorch (ACPT) curated environment on Azure Machine Learning. To avoid issues, do not attempt to install `nebulaml` from PyPI or using the `pip` command.

-In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if Azure Machine Learning suffers interruption, failure, or termination.
+In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if there is an interruption, failure, or termination of the Azure Machine Learning job.

> [!NOTE]
-> Nebula is currently under preview. This means Nebula is not production-ready. At this time, Nebula has no support as a generally-available product. Nebula will have constant updates and improvements to its functions and features. Feel free to offer feedback and suggestions to us at [email protected].
->
-> Please visit [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) to learn more.
+> Nebula is currently in preview. This means that it is not yet production-ready and does not have support as a generally available product. Nebula will have constant updates and improvements to its functions and features. We welcome your feedback and suggestions at [email protected].
+> For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

## Why checkpoint optimization for large model training matters

-Machine learning models have become more complex because of growing volumes of data, and the format of that data. Training these complex models can become challenging because of GPU memory capacity limits and lengthy training times. As a result, complex model training, on large datasets, usually involves distributed training. However, distributed architectures often have unexpected faults and node failures. These faults and node failures become increasingly painful as the machine learning model node counts increase.
+As data volumes grow and data formats become more complex, machine learning models have also become more sophisticated. Training these complex models can be challenging due to GPU memory capacity limits and lengthy training times. As a result, distributed training is often used when working with large datasets and complex models. However, distributed architectures can experience unexpected faults and node failures, which can become increasingly problematic as the number of nodes in a machine learning model increases.

-Checkpoints can help deal with these problems. Periodic checkpoints snapshot the complete model state at a given time. After a failure, the system can use that snapshot to rebuild the model in its state at the time of the snapshot. The training process can then resume at a given epoch.
+Checkpoints can help mitigate these issues by periodically saving a snapshot of the complete model state at a given time. In the event of a failure, this snapshot can be used to rebuild the model to its state at the time of the snapshot so that training can resume from that point.

-When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself is wasted, because the computation must re-execute operations to cover the unsaved, intermediate results. Shorter checkpoint intervals could solve this problem. The following diagram shows the time cost to restore a training process from checkpoints:
+When large model training operations experience failures or terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. However, any progress made between the checkpoint and termination is lost as computations must be re-executed to recover unsaved intermediate results. Shorter checkpoint intervals could help reduce this loss. The following diagram illustrates the time wasted between the last saved checkpoint and the termination:

:::image type="content" source="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::

-However, the checkpoint saves process itself generates large overheads. A TB-sized checkpoint save can often become a training process bottleneck. The synchronized checkpoint process blocks the training process for hours. Checkpoint-related overheads can take up 12% of total training time, on average, and can rise to 43% [(Maeng et al., 2021)](https://cs.stanford.edu/people/trippel/pubs/cpr-mlsys-21.pdf).
+However, the process of saving checkpoints itself can generate significant overhead. Saving a TB-sized checkpoint can often become a bottleneck in the training process, with the synchronized checkpoint process blocking training for hours. On average, checkpoint-related overheads can account for 12% of total training time and can rise to as much as 43% [(Maeng et al., 2021)](https://cs.stanford.edu/people/trippel/pubs/cpr-mlsys-21.pdf).

To summarize, large model checkpoint management involves heavy storage and job recovery time overheads. Frequent checkpoint saves, combined with training job resumptions from the latest available checkpoints, become a great challenge.

## Nebula to the Rescue

-To train large, distributed models, a reliable and efficient way to save and resume training progress that avoids data loss and waste of resources becomes helpful. Nebula reduces checkpoint save times and training GPU hour demands. For large model Azure Machine Learning training jobs, Nebula offers faster and easier checkpoint management and saves. In turn, it helps shrink large-scale model training time demands.
+To effectively train large distributed models, it is important to have a reliable and efficient way to save and resume training progress that minimizes data loss and waste of resources. Nebula helps reduce checkpoint save times and GPU hour demands for large model Azure Machine Learning training jobs by providing faster and easier checkpoint management.

-Nebula can
+With Nebula you can:

-* **Boost checkpoint speeds as much as 1000 times** with a simple API that asynchronously works with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99%.
+* **Boost checkpoint speeds by up to 1000 times** with a simple API that works asynchronously with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99%.

:::image type="content" source="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::

-This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Huggingface GPT2, GPT2-Large, and GPT-XL training jobs. For the medium-sized Huggingface GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.
+This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Hugging Face GPT2, GPT2-Large, and GPT-XL training jobs. For the medium-sized Hugging Face GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.

The checkpoint speed gain can still increase with model size and GPU numbers. For example, a test checkpoint save of 97 GB on 128 A100 Nvidia GPUs can shrink from 20 minutes to 1 second.

-* **Reduce end-to-end large model training time and computation costs** through checkpoint overhead reduction, and reduction of GPU hours wasted on job recovery. Nebula saves checkpoints asynchronously, and unblocks the training process, to shrink the end-to-end training time. It also allows for more frequent checkpoint saves. This way, you can resume your training from the latest checkpoint after any interruption, and save time and money wasted on job recovery and GPU training hours.
+* **Reduce end-to-end training time and computation costs for large models** by minimizing checkpoint overhead and reducing the number of GPU hours wasted on job recovery. Nebula saves checkpoints asynchronously, and unblocks the training process, to shrink the end-to-end training time. It also allows for more frequent checkpoint saves. This way, you can resume your training from the latest checkpoint after any interruption, and save time and money wasted on job recovery and GPU training hours.

-* **Provide full compatibility in PyTorch**. Nebula offers full compatibility with PyTorch, and offers full integration with distributed training frameworks, including DeepSpeed (>=0.7.3), and PyTorch-Lightning (>=1.5.0). You can also use it with different Azure Machine Learning compute targets, such as AmlCompute or AKS.
+* **Provide full compatibility with PyTorch**. Nebula offers full compatibility with PyTorch, and offers full integration with distributed training frameworks, including DeepSpeed (>=0.7.3), and PyTorch Lightning (>=1.5.0). You can also use it with different Azure Machine Learning compute targets, such as Azure Machine Learning Compute or AKS.

* **Easily manage your checkpoints** with a Python package that helps list, get, save, and load your checkpoints. To show the checkpoint lifecycle, Nebula also provides comprehensive logs on Azure Machine Learning studio. You can choose to save your checkpoints to a local or remote storage location.

@@ -78,23 +76,24 @@ Nebula can
* An Azure subscription and an Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md) for more information about workspace resource creation
* An Azure Machine Learning compute target. See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation
* A training script that uses **PyTorch**.
-* ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](resource-curated-environments.md#azure-container-for-pytorch-acpt) to obtain the ACPT image. Learn how to [use the curated environment](./how-to-use-environments.md)
+* ACPT-curated (Azure Container for PyTorch) environment. See [Curated environments](resource-curated-environments.md#azure-container-for-pytorch-acpt) to obtain the ACPT image. Learn how to [use the curated environment](./how-to-use-environments.md)
* An Azure Machine Learning script run configuration file. If you don’t have one, you can follow [this resource](./how-to-set-up-training-targets.md)

## How to Use Nebula

Nebula provides a fast, easy checkpoint experience, right in your existing training script.
-Nebula use involves:
-- [The ACPT environment](#using-acpt-environment)
-- [Nebula initialization](#initializing-nebula)
-- [API calls to save and load checkpoints](#call-apis-to-save-and-load-checkpoints)
+The steps to quickly start using Nebula include:
+- [Using ACPT environment](#using-acpt-environment)
+- [Initializing Nebula](#initializing-nebula)
+- [Calling APIs to save and load checkpoints](#calling-apis-to-save-and-load-checkpoints)

### Using ACPT environment

[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a preinstalled, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.

### Initializing Nebula

-To enable Nebula in the ACPT environment, you only need to modify your training script to import the `nebulaml` package, and then call the Nebula APIs in the appropriate places. You can avoid Azure Machine Learning SDK or CLI modification. You can also avoid modification of other steps to train your large model on Azure Machine Learning Platform.
+To enable Nebula with the ACPT environment, you only need to modify your training script to import the `nebulaml` package, and then call the Nebula APIs in the appropriate places. You can avoid Azure Machine Learning SDK or CLI modification. You can also avoid modification of other steps to train your large model on the Azure Machine Learning platform.

Nebula needs initialization to run in your training script. At the initialization phase, specify the variables that determine the checkpoint save location and frequency, as shown in this code snippet:
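The snippet itself sits outside this hunk. As a minimal sketch of what such an initialization can look like (the `nm.init()` entry point and its parameter names are assumptions drawn from the surrounding text, so confirm them against the full article):

```python
# Minimal sketch; parameter names are assumptions, not a confirmed API.
import nebulaml as nm  # preinstalled in the ACPT curated environment

persisted_storage_path = "/tmp/nebula"  # placeholder: local or remote checkpoint location

nm.init(
    persistent_storage_path=persisted_storage_path,  # where Nebula persists checkpoints
    persistent_time_interval=2,                      # assumed: how often to persist them
)
```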
@@ -105,7 +104,14 @@ Nebula needs initialization to run in your training script. At the initializatio

Nebula has been integrated into DeepSpeed and PyTorch Lightning. As a result, initialization becomes simple and easy. These [examples](#examples) show how to integrate Nebula into your training scripts.

-### Call APIs to save and load checkpoints
+> [!IMPORTANT]
+> Saving checkpoints with Nebula requires some memory to store checkpoints. Make sure that your available memory is large enough to hold at least three copies of a checkpoint.
+>
+> If there is not enough memory to hold checkpoints, set the environment variable `NEBULA_MEMORY_BUFFER_SIZE` in the command that starts your training to limit the amount of memory used per node when saving checkpoints. When this variable is set, Nebula uses that memory as the buffer for saving checkpoints. If the memory usage isn't limited, Nebula uses as much memory as possible to store the checkpoints.
+>
+> If multiple processes are running on the same node, the maximum memory available for saving checkpoints is half of the limit, divided by the number of processes; Nebula uses the other half for multi-process coordination. For example, to limit the memory usage per node to 200 MB, set the environment variable to `export NEBULA_MEMORY_BUFFER_SIZE=200000000` (in bytes, around 200 MB) in the command. In this case, Nebula only uses 200 MB of memory to store checkpoints on each node. If there are 4 processes running on the same node, Nebula uses 25 MB of memory per process to store checkpoints.
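For example, when submitting the job with the Azure Machine Learning Python SDK v2, the variable could be passed through the job definition. This is only a hedged sketch: the compute, environment, and script names are placeholders, and passing the variable via `environment_variables` is an assumption about how your job is set up; prepending `export NEBULA_MEMORY_BUFFER_SIZE=200000000` to the job command should work equally well.

```python
# Hypothetical SDK v2 job submission; compute, environment, and script names are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",
    command="python train.py",
    environment="<your-ACPT-curated-environment>",
    compute="<your-compute-target>",
    environment_variables={
        # Limit Nebula's per-node checkpoint buffer to roughly 200 MB (value is in bytes).
        "NEBULA_MEMORY_BUFFER_SIZE": "200000000",
    },
)
ml_client.create_or_update(job)
```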
+### Calling APIs to save and load checkpoints

Nebula provides APIs to handle checkpoint saves. You can use these APIs in your training scripts, similar to the PyTorch `torch.save()` API. These [examples](#examples) show how to use Nebula in your training scripts.
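As a rough illustration of the save-and-load flow: the exact calls aren't shown in this hunk; `nm.get_latest_checkpoint_path()` appears in the hunk headers of this file, while `nm.Checkpoint()` and loading the result with `torch.load()` are assumptions, so check the full examples in the article.

```python
import torch
import nebulaml as nm  # available only in the ACPT curated environment

# Assumes nm.init(...) was called earlier, as in the initialization sketch above.
model = torch.nn.Linear(4, 2)            # placeholder model
persisted_storage_path = "/tmp/nebula"   # placeholder storage location

# Assumed API shape: a Nebula checkpoint handle whose save() call mirrors
# torch.save() but runs asynchronously, so training isn't blocked.
checkpoint = nm.Checkpoint()
checkpoint.save("global_step_500", model.state_dict())

# get_latest_checkpoint_path() is referenced elsewhere in this file.
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage_path)
state_dict = torch.load(latest_ckpt_path)  # assumption: the persisted file is torch-loadable
model.load_state_dict(state_dict)
```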
@@ -168,7 +174,7 @@ latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage

# [Using DeepSpeed](#tab/DEEPSPEED)

-A training script based on DeepSpeed (>=0.7.3) can lever Nebula, if you enable Nebula in your `ds_config.json` configuration file, as shown:
+A training script based on DeepSpeed (>=0.7.3) can use Nebula, if you enable Nebula in your `ds_config.json` configuration file, as shown:

```python
"nebula": {
@@ -185,7 +191,7 @@ latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage
Initialization with `ds_config.json` file configuration enables Nebula, which enables checkpoint saves in turn. The original DeepSpeed save method, with the model checkpointing API `model_engine.save_checkpoint()`, automatically uses Nebula. This save method avoids the need for code modification.
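For orientation, the `"nebula"` section of a `ds_config.json` might be sketched as below, written here as the Python dict you would dump to JSON. Only the `"nebula"` key itself is visible in the hunk above; the nested key names and values are assumptions based on the behavior the article describes, so confirm them against the complete file.

```python
import json

# Sketch of a DeepSpeed config with a Nebula section; nested key names are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "nebula": {
        "enabled": True,                            # turn Nebula checkpointing on
        "persistent_storage_path": "/tmp/nebula",   # assumed: where checkpoints persist
        "persistent_time_interval": 100,            # assumed: persistence interval
        "num_of_version_in_retention": 2,           # assumed: how many versions to keep
        "enable_nebula_load": True,                 # assumed: Nebula-aware checkpoint loading
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# model_engine.save_checkpoint() then goes through Nebula automatically, per the paragraph above.
```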

# [Using PyTorch Lightning](#tab/LIGHTNING)
-Pytorch Lightning **(Nebula supports version >=1.5.0)** checkpoints automatically when Trainer is used. As you would often save checkpoints with customized behaviors for fine-grained control, Pytorch Lightning provides two ways to save checkpoint: conditional saves with ``ModelCheckpoint()``, and manual saves with ``trainer.save_checkpoint()``. These techniques apply to PyTorch (>=0.15.0) training scripts.
+PyTorch Lightning **(Nebula supports version >=1.5.0)** checkpoints automatically when Trainer is used. As you would often save checkpoints with customized behaviors for fine-grained control, PyTorch Lightning provides two ways to save checkpoints: conditional saves with ``ModelCheckpoint()``, and manual saves with ``trainer.save_checkpoint()``. These techniques apply to PyTorch (>=0.15.0) training scripts.

If you use `ModelCheckpoint` to conditionally save your checkpoints, you can use `NebulaCallback` instead of `ModelCheckpoint` for initialization.
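The hunk doesn't show the `NebulaCallback` arguments, so the constructor parameters in the following sketch are assumptions (as is importing it from `nebulaml`); treat it as an outline of the idea rather than the exact API.

```python
import nebulaml as nm
import pytorch_lightning as pl

# Assumption: NebulaCallback plays the role of ModelCheckpoint and takes the
# Nebula storage settings; check the article's full Lightning example.
nebula_callback = nm.NebulaCallback(
    every_n_epochs=1,                        # assumed conditional-save trigger
    persistent_storage_path="/tmp/nebula",   # assumed checkpoint location
)

trainer = pl.Trainer(max_epochs=3, callbacks=[nebula_callback])
# trainer.fit(MyLightningModule(), train_dataloaders=...)
# Manual saves via trainer.save_checkpoint() remain available as well.
```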
@@ -233,7 +239,8 @@ latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage
```
**Load Checkpoint**

-We load checkpoints consistent with PyTorch and Pytorch Lightning. The only modification specifies the storage path.
+We load checkpoints consistent with PyTorch and PyTorch Lightning. The only modification specifies the storage path.

To load the latest checkpoint, ``MyLightningModule.load_from_checkpoint()`` still works, as shown:
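The snippet itself falls outside this hunk. As a hedged sketch of what that looks like, where the storage-path lookup mirrors the `nm.get_latest_checkpoint_path()` call visible in the hunk headers and `MyLightningModule` and the path are placeholders:

```python
import nebulaml as nm

persisted_storage_path = "/tmp/nebula"  # placeholder storage location

# Look up the newest Nebula checkpoint, then load it with the usual Lightning call.
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage_path)
model = MyLightningModule.load_from_checkpoint(latest_ckpt_path)
```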