Commit 65e1350

Cleaned detected grammar / style issues . . .
1 parent a8dfc40 commit 65e1350

1 file changed: +28 −28 lines changed

articles/machine-learning/reference-checkpoint-performance-with-Nebula.md

Lines changed: 28 additions & 28 deletions
@@ -26,25 +26,25 @@ The Azure Container for PyTorch (ACPT) now offers **Nebula**, a fast, disk-less,

The Nebula API also offers a simple way to monitor and view checkpointing lifecycles. This API supports various model types, and ensures checkpoint consistency and reliability.

-In this document, you'll learn how to use Nebula with ACPT on Azure ML, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint in case of interruption, failure, or termination.
+In this document, you'll learn how to use Nebula with ACPT on Azure ML, to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if Azure ML suffers interruption, failure, or termination.

## Why checkpoint optimization for large model training matters

Machine learning models have become more complex because of the format and growing size of data. Training these complex models can become challenging because of GPU memory capacity limits and lengthy training times. As a result, training complex models on large datasets usually involves distributed training. However, distributed architectures often enough have faults and node failures. These faults and node failures become increasingly painful as the machine learning model node counts increase.

Checkpointing can help deal with these problems. Checkpoint periodically snapshots the complete model state at a given time. After a failure, the system can use that snapshot to rebuild the model in its state at the time of the snapshot. The training process can then resume at a given epoch.

-However, systems need heavy resources to save checkpoints. Research shows that on average, checkpointing-related overheads can take up to 12% of total training time, and sometimes as much as 43% [1]. Additionally, when training large models, like GPT-3, TB-scale checkpoints are saved in a synchronized way. In these cases, serialization will stop the training process for a long time - 1 to 2 hours, maybe more for mounted storage.
+However, systems need heavy resources to save checkpoints. Research shows that on average, checkpointing-related overheads can take up to 12% of total training time, and sometimes as much as 43% [1]. Additionally, when training large models, like GPT-3, systems often save TB-scale checkpoints in a synchronized way. In these cases, serialization stops the training process for a long time - 1 to 2 hours, maybe more for mounted storage.

-When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself would be considered lost and wasted, because the computation must re-execute operations to cover the unsaved, intermediate results.
+When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself is wasted, because the computation must re-execute operations to cover the unsaved, intermediate results.

To summarize, large model checkpoint management involves heavy job recover time and storage overheads.

:::image type="content" source="media/quickstart-spark-jobs/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::

## Nebula to the Rescue

-Nebula reduces checkpoint save and process recovery times, to reduce training GPU hour demands. In turn, this helps shrinks large-scale model training time demands. In this way, we expect Nebula to increase large model training process resilience and stability. After a training process failure, instead of a recover process that restarts from the very beginning, when nodes experience failures, Nebula allows for a recovery from a more recent checkpoint. This reduces both E2E training time, and AzureML GPU time resource demands when nodes fail.
+Nebula reduces checkpoint save and process recovery times, to reduce training GPU hour demands. In turn, this reduction helps shrink large-scale model training time demands. We can expect Nebula to increase large model training process resilience and stability after a training process failure. Instead of a recovery process that restarts from the very beginning, Nebula allows for a recovery from a more recent checkpoint when nodes experience failures. The recovery reduces both E2E training time, and AzureML GPU time resource demands when nodes fail.

Nebula can

@@ -54,17 +54,17 @@ Nebula can

* **Shrink end-to-end training time and computation costs**. Nebula can help you complete large-scale model training jobs faster and cheaper by reducing checkpoint and recovery time demands.

-* **Reduce large model training costs** through checkpoint overhead reduction, and reduction of GPU hours wasted on job recovery. Nebula allows more frequent checkpoint saves, with zero impact on your training process or training accuracy. You can resume your training from the latest checkpoint in case of any interruption, and save your time and money.
+* **Reduce large model training costs** through checkpoint overhead reduction, and reduction of GPU hours wasted on job recovery. Nebula allows more frequent checkpoint saves, with zero effect on your training process or training accuracy. You can resume your training from the latest checkpoint if the training process suffers an interruption, and you'll save time and money.

* **Provide a more stable and resilient experience** training large models on Azure Machine Learning. Nebula avoids data loss and resource waste due to interruptions, which can improve the reliability and performance of your training process.

-* **Easily manage your checkpoints** with a Python package that helps list, get, save and load your checkpoints. To show the checkpointing lifecycle, Nebula also provides more comprehensive logs on Azure Machine Learning Studio. You can choose to save your checkpoints to a local or remote storage location
+* **Easily manage your checkpoints** with a Python package that helps list, get, save and load your checkpoints. To show the checkpointing lifecycle, Nebula also provides more comprehensive logs on Azure Machine Learning studio. You can choose to save your checkpoints to a local or remote storage location

- Azure Blob Storage
- Azure Data Lake Storage
- NFS

-and access them at anytime with a few lines of code.
+and access them at any time with a few lines of code.
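As a rough illustration of that management workflow, the sketch below lists and reloads checkpoints. The function names `list_checkpoints()` and `get_latest_checkpoint()` are assumptions for illustration only, not confirmed API; the Examples section later in the article shows the exact calls.

```python
# Hypothetical sketch only: the torch_nebula function names below are
# assumptions for illustration, not the documented API surface.
import torch_nebula as tn

# Enumerate the checkpoints Nebula has persisted for this job (assumed API).
for ckpt in tn.list_checkpoints():
    print(ckpt)

# Fetch and load the most recent checkpoint to resume training (assumed API).
latest = tn.get_latest_checkpoint()
if latest is not None:
    state_dict = latest.load()
```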

**LARGER IMG_3 VERSION NEEDED**

@@ -92,9 +92,9 @@ Nebula offers full compatibility with any distributed training framework that su

See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation

-* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. Please visit [Curated environments](./resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
+* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](./resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)

-* An Azure ML script run config, which defines the
+* An Azure ML script run configuration file, which defines the
- source directory
- the entry script
- environment
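For orientation, the pieces in that last prerequisite could be wired together with the Azure ML Python SDK v1 roughly as follows; the directory, script, environment, and compute names are placeholder assumptions, not values from this article:

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads a local config.json for your workspace

# Placeholder name; substitute the ACPT curated environment you obtained.
env = Environment.get(ws, name="<ACPT-curated-environment-name>")

src = ScriptRunConfig(
    source_directory="./src",      # source directory
    script="train.py",             # entry script
    environment=env,               # ACPT curated environment
    compute_target="gpu-cluster",  # compute target from the prerequisites
)

Experiment(ws, "nebula-checkpointing").submit(src)
```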
@@ -107,13 +107,13 @@ To save checkpoints with Nebula, you must modify your training scripts in two wa

* Initialize Nebula

-At the initialization phase, specify the variables that will determine the checkpoint save location and frequency. A distributed trainer - like DeepSpeed - will make this process easier.
+At the initialization phase, specify the variables that determine the checkpoint save location and frequency. A distributed trainer - like DeepSpeed - makes this process easier.

* Call the save APIs to save the checkpoints

Similar to the way that the PyTorch `torch.save()` API works, Nebula provides checkpoint save APIs that you can use in your training scripts.

-You won't need to modify other steps to train your large model on Azure Machine Learning Platform. You'll only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2?tabs=cli#curated-environments)
+You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2?tabs=cli#curated-environments)
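Put together, the two modifications could look roughly like the sketch below. `torch_nebula.init()` appears in the article's own examples, but the parameter names and the `Checkpoint().save()` call here are assumptions for illustration; the Examples section that follows shows the exact signatures.

```python
import torch
import torch_nebula as tn

# 1. Initialize Nebula once before the training loop; the parameter
#    names below are assumptions for illustration only.
tn.init(persistent_storage_path="/outputs/nebula", persistent_time_interval=2)

model = torch.nn.Linear(10, 2)  # stand-in for your real model

for epoch in range(3):
    # ... forward/backward/optimizer steps ...

    # 2. Call a Nebula save API where you would normally call torch.save();
    #    the Checkpoint/save names are assumptions for illustration only.
    ckpt = tn.Checkpoint()
    ckpt.save(f"epoch-{epoch}", model.state_dict())
```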


## Examples
@@ -171,9 +171,9 @@ You won't need to modify other steps to train your large model on Azure Machine
}
```

-This JSON snippets functions works like the `torch_nebula.init()` function.
+This JSON snippet works like the `torch_nebula.init()` function.

-Initialization with ds_config.json file configuration enables Nebula, so you can save checkpoints. The original DeepSpeed saving method `model_engine.save_checkpoint()` would automatically leverage Nebula, which avoids the need for code modification.
+Initialization with ds_config.json file configuration enables Nebula, so you can save checkpoints. The original DeepSpeed saving method `model_engine.save_checkpoint()` automatically uses Nebula, which avoids the need for code modification.
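In other words, once the Nebula block sits in ds_config.json, a DeepSpeed training script keeps its usual save call. A minimal sketch, assuming your ds_config.json already carries the standard DeepSpeed settings plus the Nebula section:

```python
import torch
import deepspeed

model = torch.nn.Linear(10, 2)  # stand-in for your real model

# deepspeed.initialize() reads ds_config.json, including the Nebula settings.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# The unchanged DeepSpeed save call; with Nebula enabled in ds_config.json,
# this checkpoint is handled by Nebula without further code changes.
model_engine.save_checkpoint("checkpoints/", tag="epoch-0")
```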

* Example 3 - PyTorch Lightning

@@ -223,7 +223,7 @@ You won't need to modify other steps to train your large model on Azure Machine



-adding tn.NebulaCheckpointIO() in your Trainer as a plugin will enable Nebula to save and load checkpoints.
+Adding `tn.NebulaCheckpointIO()` to your Trainer as a plugin enables Nebula to save and load checkpoints.
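For instance, a PyTorch Lightning script might wire the plugin in as shown below; the tiny model and random data are placeholders used only to keep the sketch self-contained:

```python
import torch
import pytorch_lightning as pl
import torch_nebula as tn

class TinyModel(pl.LightningModule):
    """Minimal stand-in for your real LightningModule."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 2)),
    batch_size=16,
)

# Passing tn.NebulaCheckpointIO() as a plugin routes checkpoint save and load
# through Nebula; everything else stays standard Lightning.
trainer = pl.Trainer(max_epochs=1, plugins=[tn.NebulaCheckpointIO()])
trainer.fit(TinyModel(), train_loader)
trainer.save_checkpoint("nebula-ckpt.ckpt")
```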



@@ -233,11 +233,11 @@ If the training script is based on DeepSpeed (>=0.7.3), you can enjoy Nebula by

## Why MLflow?

-MLflow, with over 13 million monthly downloads, has become the standard platform for end-to-end MLOps, enabling teams of all sizes to track, share, package and deploy any model for batch or real-time inference. By integrating with MLflow, your training code will not need to hold any specific code related to Azure Machine Learning, achieving true portability and seamless integration with other open-source platforms.
+MLflow, with over 13 million monthly downloads, has become the standard platform for end-to-end MLOps, enabling teams of all sizes to track, share, package and deploy any model for batch or real-time inference. Because of the MLflow integration, your training code doesn't need any code specific to Azure Machine Learning, which achieves true portability and seamless integration with other open-source platforms.

## Prepare for migrating to MLflow

-To use MLflow tracking, you will need to install `mlflow` and `azureml-mlflow` Python packages. All Azure Machine Learning environments have these packages already available for you but you will need to include them if creating your own environment.
+To use MLflow tracking, you must install the `mlflow` and `azureml-mlflow` Python packages. All Azure Machine Learning environments already have these packages available for you, but you must include them if you create your own environment.

```bash
pip install mlflow azureml-mlflow
@@ -248,11 +248,11 @@ pip install mlflow azureml-mlflow

## Connect to your workspace

-Azure Machine Learning allows users to perform tracking in training jobs running on your workspace or running remotely (tracking experiments running outside Azure Machine Learning). If performing remote tracking, you will need to indicate the workspace you want to connect MLflow to.
+Azure Machine Learning allows users to perform tracking in training jobs running on your workspace or running remotely (tracking experiments running outside Azure Machine Learning). For remote tracking, you must indicate the workspace that you want to connect MLflow to.
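One way to do that from a remote machine, assuming the v1 `azureml-core` SDK and a downloaded config.json, is sketched below; the tabs that follow walk through the supported setups:

```python
import mlflow
from azureml.core import Workspace

# Load the workspace from a local config.json, then point MLflow at its
# tracking URI so runs land in the Azure Machine Learning workspace.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

mlflow.set_experiment("remote-tracking-example")  # placeholder experiment name
with mlflow.start_run():
    mlflow.log_metric("sample_metric", 1.0)
```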

# [Azure Machine Learning compute](#tab/aml)

-You are already connected to your workspace when running on Azure Machine Learning compute.
+You're already connected to your workspace when running on Azure Machine Learning compute.

# [Remote compute](#tab/remote)

@@ -262,7 +262,7 @@ You are already connected to your workspace when running on Azure Machine Learni

**Configure authentication**

-Once the tracking is configured, you'll also need to configure how the authentication needs to happen to the associated workspace. By default, the Azure Machine Learning plugin for MLflow will perform interactive authentication by opening the default browser to prompt for credentials. Refer to [Configure MLflow for Azure Machine Learning: Configure authentication](how-to-use-mlflow-configure-tracking.md#configure-authentication) for more ways to configure authentication for MLflow in Azure Machine Learning workspaces.
+Once you configure the tracking, you must also configure how authentication to the associated workspace happens. By default, the Azure Machine Learning plugin for MLflow handles interactive authentication by opening the default browser and prompting for credentials. See [Configure MLflow for Azure Machine Learning: Configure authentication](how-to-use-mlflow-configure-tracking.md#configure-authentication) for more ways to configure authentication for MLflow in Azure Machine Learning workspaces.

[!INCLUDE [configure-mlflow-auth](../../includes/machine-learning-mlflow-configure-auth.md)]

@@ -332,7 +332,7 @@ __SDK v2 with MLflow__
mlflow.log_text("sample_string_text", "string.txt")
```

-* The string will be logged as an _artifact_, not as a metric. In Azure Machine Learning studio, the value will be displayed in the __Outputs + logs__ tab.
+* The string logs as an _artifact_, not as a metric. In Azure Machine Learning studio, the value appears in the __Outputs + logs__ tab.

### Log an image to a PNG or JPEG file

@@ -348,7 +348,7 @@ __SDK v2 with MLflow__
mlflow.log_artifact("Azure.png")
```

-The image is logged as an artifact and will appear in the __Images__ tab in Azure Machine Learning Studio.
+The image logs as an artifact, and appears in the Azure Machine Learning studio __Images__ tab.

### Log a matplotlib.pyplot

@@ -372,7 +372,7 @@ ax.plot([0, 1], [2, 3])
mlflow.log_figure(fig, "sample_pyplot.png")
```

-* The image is logged as an artifact and will appear in the __Images__ tab in Azure Machine Learning Studio.
+* The image logs as an artifact, and appears in the Azure Machine Learning studio __Images__ tab.
* The `mlflow.log_figure` method is __experimental__.


@@ -414,7 +414,7 @@ metrics = {"sample_table.col1": 5, "sample_table.col2": 10}
mlflow.log_metrics(metrics)
```

-* Metrics do not render as a table in Azure Machine Learning studio.
+* Metrics don't render as a table in Azure Machine Learning studio.
* Text values are not supported.
* Logged as an _artifact_, not as a metric.

@@ -447,7 +447,7 @@ mlflow.log_artifact("table.json")
```

* Logs metrics for each column.
-* Metrics do not render as a table in Azure Machine Learning studio.
+* Metrics don't render as a table in Azure Machine Learning studio.
* Text values are not supported.
* Logged as an _artifact_, not as a metric.

@@ -477,7 +477,7 @@ ACCURACY_TABLE = '{"schema_type": "accuracy_table", "schema_version": "v1", "dat
mlflow.log_dict(ACCURACY_TABLE, 'mlflow_accuracy_table.json')
```

-* Metrics do not render as an accuracy table in Azure Machine Learning studio.
+* Metrics don't render as an accuracy table in Azure Machine Learning studio.
* Logged as an _artifact_, not as a metric.
* The `mlflow.log_dict` method is _experimental_.

@@ -501,7 +501,7 @@ CONF_MATRIX = '{"schema_type": "confusion_matrix", "schema_version": "v1", "data
mlflow.log_dict(CONF_MATRIX, 'mlflow_confusion_matrix.json')
```

-* Metrics do not render as a confusion matrix in Azure Machine Learning studio.
+* Metrics don't render as a confusion matrix in Azure Machine Learning studio.
* Logged as an _artifact_, not as a metric.
* The `mlflow.log_dict` method is _experimental_.

@@ -525,7 +525,7 @@ PREDICTIONS = '{"schema_type": "predictions", "schema_version": "v1", "data": {"
mlflow.log_dict(PREDICTIONS, 'mlflow_predictions.json')
```

-* Metrics do not render as a confusion matrix in Azure Machine Learning studio.
+* Metrics don't render as a confusion matrix in Azure Machine Learning studio.
* Logged as an _artifact_, not as a metric.
* The `mlflow.log_dict` method is _experimental_.

@@ -549,7 +549,7 @@ RESIDUALS = '{"schema_type": "residuals", "schema_version": "v1", "data": {"bin_
mlflow.log_dict(RESIDUALS, 'mlflow_residuals.json')
```

-* Metrics do not render as a confusion matrix in Azure Machine Learning studio.
+* Metrics don't render as a confusion matrix in Azure Machine Learning studio.
* Logged as an _artifact_, not as a metric.
* The `mlflow.log_dict` method is _experimental_.
