Commit afd0fa7

Fix detected issues . . .
1 parent 65e1350 commit afd0fa7

File tree: 1 file changed (+8, -9 lines)

articles/machine-learning/reference-checkpoint-performance-with-Nebula.md

Lines changed: 8 additions & 9 deletions

@@ -4,7 +4,6 @@ titleSuffix: Azure Machine Learning
 description: Learn how Nebula can save time, resources, and money for large model training applications
 services: machine-learning
 ms.service: machine-learning
-ms.subservice: ----
 ms.topic: reference
 ms.custom: ----, ----, ----

@@ -40,7 +39,7 @@ When large model training operations experience failures and terminations, data

 To summarize, large model checkpoint management involves heavy job recover time and storage overheads.

-:::image type="content" source="media/quickstart-spark-jobs/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::
+:::image type="content" source="media/quickstart-spark-jobs/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::

 ## Nebula to the Rescue

@@ -50,7 +49,7 @@ Nebula can

 * **Boost checkpoint speeds as much as 1000 times** with a simple API that asynchronously works with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99.5%.

-:::image type="content" source="media/quickstart-spark-jobs/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-Nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::
+:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-Nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::

 * **Shrink end-to-end training time and computation costs**. Nebula can help you complete large-scale model training jobs faster and cheaper by reducing checkpoint and recovery time demands.

@@ -68,16 +67,16 @@ and access them at any time with a few lines of code.

 **LARGER IMG_3 VERSION NEEDED**

-:::image type="content" source="media/quickstart-spark-jobs/IMG_3.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_3.png" alt-text="LARGER IMG_3 VERSION NEEDED":::
+:::image type="content" source="media/quickstart-spark-jobs/IMG_3.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_3.png" alt-text="LARGER IMG_3 VERSION NEEDED":::

 **LARGER IMG_4 VERSION NEEDED**

-:::image type="content" source="media/quickstart-spark-jobs/IMG_4.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_4.png" alt-text="LARGER IMG_3 VERSION NEEDED":::
+:::image type="content" source="media/quickstart-spark-jobs/IMG_4.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_4.png" alt-text="LARGER IMG_3 VERSION NEEDED":::


 **LARGER IMG_5 VERSION NEEDED**

-:::image type="content" source="media/quickstart-spark-jobs/IMG_5.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_5.png" alt-text="LARGER IMG_5 VERSION NEEDED":::
+:::image type="content" source="media/quickstart-spark-jobs/IMG_5.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_5.png" alt-text="LARGER IMG_5 VERSION NEEDED":::

 Nebula offers full compatibility with any distributed training framework that supports PyTorch, and any compute target that supports ACPT. Nebula is designed to work with different distributed training strategies. You can use Nebula with PyTorch, PyTorch Lightning, DeepSpeed, and more. You can also use it with different Azure Machine Learning compute target, such as AmlCompute or AKS.

@@ -92,7 +91,7 @@ Nebula offers full compatibility with any distributed training framework that su

 See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation

-* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](./resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
+* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)

 * An Azure ML script run configuration file, which defines the
   - source directory
@@ -101,7 +100,7 @@ Nebula offers full compatibility with any distributed training framework that su
   - compute target
 for your model training job. To create a compute target, see [this resource](./how-to-set-up-training-targets.md)

-## Next steps
+## Save Nebula checkpoints

 To save checkpoints with Nebula, you must modify your training scripts in two ways. That's it:

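The prerequisites in the hunks above call out an Azure ML script run configuration that ties a source directory, an environment, and a compute target to the training job. A minimal sketch of such a configuration with the Azure ML SDK v1 `ScriptRunConfig` class follows; the script name, curated-environment name, and compute name are placeholders rather than values taken from this file.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

# Connect to the workspace (reads the local config.json)
ws = Workspace.from_config()

# Placeholder names: point these at your own ACPT curated environment and compute target
env = Environment.get(workspace=ws, name="<acpt-curated-environment>")
compute_target = "<your-compute-target>"

# The run configuration: source directory, entry script, compute target, and environment
src = ScriptRunConfig(
    source_directory="./src",   # folder containing the training script
    script="train.py",          # training entry point
    compute_target=compute_target,
    environment=env,
)

# Submit the training job
run = Experiment(workspace=ws, name="nebula-checkpoint-demo").submit(src)
run.wait_for_completion(show_output=True)
```

Per the prerequisite list, pointing `environment` at an ACPT curated environment is what brings the Nebula dependency into the job.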
@@ -113,7 +112,7 @@ To save checkpoints with Nebula, you must modify your training scripts in two wa

 Similar to the way that the PyTorch `torch.save()` API works, Nebula provides checkpoint save APIs that you can use in your training scripts.

-You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2?tabs=cli#curated-environments)
+You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](how-to-manage-environments-v2?tabs=cli#curated-environments)


 ## Examples
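The hunk above includes the article's note that Nebula's checkpoint save APIs work like PyTorch's `torch.save()`. For reference, the plain synchronous PyTorch pattern that a Nebula-enabled script replaces at the same point looks roughly like this; the model, optimizer, and file name are illustrative, and the exact Nebula calls aren't shown in this diff.

```python
import torch
import torch.nn as nn

# Illustrative model and optimizer; any torch.nn.Module is checkpointed the same way
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch = 5  # example value

# Standard synchronous checkpointing with torch.save(): training blocks while this writes
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    f"checkpoint_epoch{epoch}.pt",
)

# A Nebula-enabled script would invoke Nebula's save API at this point instead, so the
# checkpoint is persisted asynchronously; see the article's Examples section for the exact calls.
```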
