articles/machine-learning/reference-checkpoint-performance-with-Nebula.md
28 additions & 28 deletions
@@ -26,25 +26,25 @@ The Azure Container for PyTorch (ACPT) now offers **Nebula**, a fast, disk-less,
The Nebula API also offers a simple way to monitor and view checkpointing lifecycles. This API supports various model types, and ensures checkpoint consistency and reliability.
-In this document, you'll learn how to use Nebula with ACPT on Azure ML, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint in case of interruption, failure, or termination.
+In this document, you'll learn how to use Nebula with ACPT on Azure ML, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if the training job experiences interruption, failure, or termination.
## Why checkpoint optimization for large model training matters
Machine learning models have become more complex because of the format and growing size of data. Training these complex models can become challenging because of GPU memory capacity limits and lengthy training times. As a result, training complex models on large datasets usually involves distributed training. However, distributed architectures often experience faults and node failures. These faults and node failures become increasingly painful as the machine learning model node counts increase.
Checkpointing can help deal with these problems. A checkpoint periodically snapshots the complete model state at a given time. After a failure, the system can use that snapshot to rebuild the model in its state at the time of the snapshot. The training process can then resume at a given epoch.
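For instance, in plain PyTorch, a periodic snapshot and a resume step might look like the following minimal sketch (the file path and helper names are illustrative assumptions, not part of the article):

```python
import torch

CHECKPOINT_PATH = "checkpoint.pt"  # illustrative path, not from the article

def save_checkpoint(model, optimizer, epoch):
    # Snapshot the complete training state so a failed job can resume from here.
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        CHECKPOINT_PATH,
    )

def resume_from_checkpoint(model, optimizer):
    # Rebuild the model and optimizer from the last saved snapshot,
    # then continue training at the saved epoch.
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"]
```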
-However, systems need heavy resources to save checkpoints. Research shows that on average, checkpointing-related overheads can take up to 12% of total training time, and sometimes as much as 43% [1]. Additionally, when training large models, like GPT-3, TB-scale checkpoints are saved in a synchronized way. In these cases, serialization will stop the training process for a long time - 1 to 2 hours, maybe more for mounted storage.
+However, systems need heavy resources to save checkpoints. Research shows that on average, checkpointing-related overheads can take up to 12% of total training time, and sometimes as much as 43% [1]. Additionally, when training large models, like GPT-3, systems often save TB-scale checkpoints in a synchronized way. In these cases, serialization stops the training process for a long time - 1 to 2 hours, maybe more for mounted storage.
-When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself would be considered lost and wasted, because the computation must re-execute operations to cover the unsaved, intermediate results.
+When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself is wasted, because the computation must re-execute operations to cover the unsaved, intermediate results.
To summarize, large model checkpoint management involves heavy job recovery time and storage overheads.
:::image type="content" source="media/quickstart-spark-jobs/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::
## Nebula to the Rescue
-Nebula reduces checkpoint save and process recovery times, to reduce training GPU hour demands. In turn, this helps shrinks large-scale model training time demands. In this way, we expect Nebula to increase large model training process resilience and stability. After a training process failure, instead of a recover process that restarts from the very beginning, when nodes experience failures, Nebula allows for a recovery from a more recent checkpoint. This reduces both E2E training time, and AzureML GPU time resource demands when nodes fail.
+Nebula reduces checkpoint save and process recovery times, to reduce training GPU hour demands. In turn, this reduction helps shrink large-scale model training time demands. We can expect Nebula to increase large model training process resilience and stability after a training process failure. Instead of a recovery process that restarts from the very beginning when nodes experience failures, Nebula allows for a recovery from a more recent checkpoint. The recovery reduces both E2E training time and AzureML GPU time resource demands when nodes fail.
Nebula can
@@ -54,17 +54,17 @@ Nebula can
* **Shrink end-to-end training time and computation costs**. Nebula can help you complete large-scale model training jobs faster and cheaper by reducing checkpoint and recovery time demands.
-* **Reduce large model training costs** through checkpoint overhead reduction, and reduction of GPU hours wasted on job recovery. Nebula allows more frequent checkpoint saves, with zero impact on your training process or training accuracy. You can resume your training from the latest checkpoint in case of any interruption, and save your time and money.
+* **Reduce large model training costs** through checkpoint overhead reduction, and reduction of GPU hours wasted on job recovery. Nebula allows more frequent checkpoint saves, with zero effect on your training process or training accuracy. You can resume your training from the latest checkpoint if the training process suffers an interruption, and you'll save time and money.
* **Provide a more stable and resilient experience** training large models on Azure Machine Learning. Nebula avoids data loss and resource waste due to interruptions, which can improve the reliability and performance of your training process.
-* **Easily manage your checkpoints** with a Python package that helps list, get, save and load your checkpoints. To show the checkpointing lifecycle, Nebula also provides more comprehensive logs on Azure Machine Learning Studio. You can choose to save your checkpoints to a local or remote storage location
+* **Easily manage your checkpoints** with a Python package that helps list, get, save and load your checkpoints. To show the checkpointing lifecycle, Nebula also provides more comprehensive logs on Azure Machine Learning studio. You can choose to save your checkpoints to a local or remote storage location
- Azure Blob Storage
- Azure Data Lake Storage
- NFS
-and access them at anytime with a few lines of code.
+and access them at any time with a few lines of code.
**LARGER IMG_3 VERSION NEEDED**
@@ -92,9 +92,9 @@ Nebula offers full compatibility with any distributed training framework that su
See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation
-* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. Please visit[Curated environments](./resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
+* The required dependency included in an ACPT-curated (Azure Container for PyTorch) environment. See [Curated environments](./resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
-* An Azure ML script run config, which defines the
+* An Azure ML script run configuration file, which defines the
- source directory
- the entry script
- environment
@@ -107,13 +107,13 @@ To save checkpoints with Nebula, you must modify your training scripts in two wa
* Initialize Nebula
-At the initialization phase, specify the variables that will determine the checkpoint save location and frequency. A distributed trainer - like DeepSpeed - will make this process easier.
+At the initialization phase, specify the variables that determine the checkpoint save location and frequency. A distributed trainer - like DeepSpeed - makes this process easier.
* Call the save APIs to save the checkpoints
Similar to the way that the PyTorch `torch.save()` API works, Nebula provides checkpoint save APIs that you can use in your training scripts.
-You won't need to modify other steps to train your large model on Azure Machine Learning Platform. You'll only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2?tabs=cli#curated-environments)
+You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2?tabs=cli#curated-environments)
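As a rough illustration only, those two script changes might look like the sketch below. `torch_nebula` and its `init()` call are the names this article uses; the parameter names, the `Checkpoint` object, and its `save()` call are placeholder assumptions rather than a confirmed API surface - see the examples that follow for the exact calls.

```python
import torch
import torch_nebula as tn  # available in the ACPT curated environment

# Initialization phase: tell Nebula where to persist checkpoints and how often.
# The parameter names below are illustrative assumptions.
tn.init(persistent_storage_path="/tmp/nebula-checkpoints",
        persistent_time_interval=2)

model = torch.nn.Linear(4, 2)  # stand-in for your real model

# Save phase: call a Nebula save API in place of torch.save().
# The Checkpoint object and its save() signature are assumptions for illustration.
checkpoint = tn.Checkpoint()
checkpoint.save("global_step_1000", model)
```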
## Examples
@@ -171,9 +171,9 @@ You won't need to modify other steps to train your large model on Azure Machine
}
```
-This JSON snippets functions works like the `torch_nebula.init()` function.
+This JSON snippet works like the `torch_nebula.init()` function.
-Initialization with ds_config.json file configuration enables Nebula, so you can save checkpoints. The original DeepSpeed saving method `model_engine.save_checkpoint()`would automatically leverage Nebula, which avoids the need for code modification.
+Initialization with the ds_config.json file configuration enables Nebula, so you can save checkpoints. The original DeepSpeed saving method `model_engine.save_checkpoint()` automatically uses Nebula, which avoids the need for code modification.
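For context, a minimal sketch of an unmodified DeepSpeed save call is shown below; with Nebula enabled in `ds_config.json`, this same call is expected to route the checkpoint through Nebula. The tiny model, the config path, and the save directory and tag are placeholders.

```python
import torch
import deepspeed

model = torch.nn.Linear(4, 2)  # stand-in for your real model

# ds_config.json is the DeepSpeed configuration shown above, with the Nebula
# section enabled.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# The unchanged DeepSpeed save call; no Nebula-specific code is needed here.
model_engine.save_checkpoint(save_dir="checkpoints", tag="epoch_1")
```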
* Example 3 - PyTorch Lightning
@@ -223,7 +223,7 @@ You won't need to modify other steps to train your large model on Azure Machine
-adding tn.NebulaCheckpointIO() in your Trainer as a plugin will enable Nebula to save and load checkpoints.
+Adding `tn.NebulaCheckpointIO()` to your Trainer as a plugin enables Nebula to save and load checkpoints.
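A minimal sketch of that wiring is shown below; `MyLightningModule` and the trainer arguments are placeholders, and `tn.NebulaCheckpointIO()` is the plugin named in this article.

```python
import torch_nebula as tn
from pytorch_lightning import Trainer

# Hand checkpoint reads and writes over to Nebula through Lightning's plugin system.
trainer = Trainer(
    max_epochs=3,
    plugins=[tn.NebulaCheckpointIO()],
)

trainer.fit(MyLightningModule())         # MyLightningModule is a placeholder
trainer.save_checkpoint("example.ckpt")  # saved and loaded through Nebula
```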
@@ -233,11 +233,11 @@ If the training script is based on DeepSpeed (>=0.7.3), you can enjoy Nebula by
## Why MLflow?
-MLflow, with over 13 million monthly downloads, has become the standard platform for end-to-end MLOps, enabling teams of all sizes to track, share, package and deploy any model for batch or real-time inference. By integrating with MLflow, your training code will not need to hold any specific code related to Azure Machine Learning, achieving true portability and seamless integration with other open-source platforms.
+MLflow, with over 13 million monthly downloads, has become the standard platform for end-to-end MLOps, enabling teams of all sizes to track, share, package, and deploy any model for batch or real-time inference. Because of the MLflow integration, your training code doesn't need any code specific to Azure Machine Learning, which achieves true portability and seamless integration with other open-source platforms.
## Prepare for migrating to MLflow
-To use MLflow tracking, you will need to install `mlflow` and `azureml-mlflow` Python packages. All Azure Machine Learning environments have these packages already available for you but you will need to include them if creating your own environment.
+To use MLflow tracking, you must install the `mlflow` and `azureml-mlflow` Python packages. All Azure Machine Learning environments have these packages already available for you, but you must include them if you create your own environment.
-Azure Machine Learning allows users to perform tracking in training jobs running on your workspace or running remotely (tracking experiments running outside Azure Machine Learning). If performing remote tracking, you will need to indicate the workspace you want to connect MLflow to.
+Azure Machine Learning allows users to perform tracking in training jobs running on your workspace or running remotely (tracking experiments running outside Azure Machine Learning). For remote tracking, you must indicate the workspace that you want to connect MLflow to.
# [Azure Machine Learning compute](#tab/aml)
-You are already connected to your workspace when running on Azure Machine Learning compute.
+You're already connected to your workspace when running on Azure Machine Learning compute.
# [Remote compute](#tab/remote)
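On a remote machine, the configuration typically amounts to pointing MLflow at the workspace tracking URI before any logging call. A minimal sketch, assuming the `azure-ai-ml` and `azure-identity` packages and placeholder workspace identifiers:

```python
# pip install mlflow azureml-mlflow azure-ai-ml azure-identity
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Retrieve the workspace's MLflow tracking URI and hand it to MLflow.
workspace = ml_client.workspaces.get("<WORKSPACE_NAME>")
mlflow.set_tracking_uri(workspace.mlflow_tracking_uri)
```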
@@ -262,7 +262,7 @@ You are already connected to your workspace when running on Azure Machine Learni
**Configure authentication**
-Once the tracking is configured, you'll also need to configure how the authentication needs to happen to the associated workspace. By default, the Azure Machine Learning plugin for MLflow will perform interactive authentication by opening the default browser to prompt for credentials. Refer to[Configure MLflow for Azure Machine Learning: Configure authentication](how-to-use-mlflow-configure-tracking.md#configure-authentication) for more ways to configure authentication for MLflow in Azure Machine Learning workspaces.
+Once you configure the tracking, you must also configure how authentication to the associated workspace happens. By default, the Azure Machine Learning plugin for MLflow handles interactive authentication by opening the default browser to prompt for credentials. See [Configure MLflow for Azure Machine Learning: Configure authentication](how-to-use-mlflow-configure-tracking.md#configure-authentication) for more ways to configure authentication for MLflow in Azure Machine Learning workspaces.
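For unattended jobs where an interactive browser prompt isn't possible, one common alternative is service principal authentication through environment variables; the variable names below follow the azure-identity conventions and the values are placeholders.

```python
import os

# Service principal credentials for the workspace's tenant; replace the
# placeholders with values from your own app registration.
os.environ["AZURE_TENANT_ID"] = "<TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<CLIENT_SECRET>"
```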
-* The string will be logged as an _artifact_, not as a metric. In Azure Machine Learning studio, the value will be displayed in the __Outputs + logs__ tab.
+* The string logs as an _artifact_, not as a metric. In Azure Machine Learning studio, the value displays in the __Outputs + logs__ tab.
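The logging call itself falls outside this hunk; with MLflow, logging a string as an artifact is typically done with `mlflow.log_text`, for example (the text and file name are placeholders):

```python
import mlflow

with mlflow.start_run():
    # The text is stored as an artifact file, not as a metric.
    mlflow.log_text("sample string text", "string.txt")
```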
### Log an image to a PNG or JPEG file
@@ -348,7 +348,7 @@ __SDK v2 with MLflow__
mlflow.log_artifact("Azure.png")
```
-The image is logged as an artifact and will appear in the __Images__ tab in Azure Machine Learning Studio.
+The image logs as an artifact, and appears in the Azure Machine Learning studio __Images__ tab.
### Log a matplotlib.pyplot
@@ -372,7 +372,7 @@ ax.plot([0, 1], [2, 3])
mlflow.log_figure(fig, "sample_pyplot.png")
```
-* The image is logged as an artifact and will appear in the __Images__ tab in Azure Machine Learning Studio.
+* The image logs as an artifact, and appears in the Azure Machine Learning studio __Images__ tab.
* The `mlflow.log_figure` method is __experimental__.