---
description: Learn how Nebula can save time, resources, and money for large model training applications
services: machine-learning
ms.service: machine-learning
ms.subservice: ----
ms.topic: reference
ms.custom: ----, ----, ----
---
To summarize, large model checkpoint management involves heavy job recovery times and storage overhead.
:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time wasted on duplicated data training.":::
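To make those overheads concrete, here is a back-of-the-envelope sketch. Every number in it (model size, save time, checkpoint interval) is an assumed, illustrative figure, not a measurement from this article:

```python
# Rough illustration of checkpoint overheads; every number below is an
# assumption for the sake of the example, not a measured figure.

def checkpoint_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Size of one full-precision (32-bit) checkpoint in gigabytes."""
    return n_params * bytes_per_param / 1e9

def wasted_hours_per_failure(save_minutes: float, interval_hours: float) -> float:
    """Average work lost per failure: half a checkpoint interval of training,
    plus the time spent writing the checkpoint itself."""
    return interval_hours / 2 + save_minutes / 60

# A hypothetical 175-billion-parameter model stored in 32-bit floats:
size = checkpoint_size_gb(175e9)   # 700.0 GB per checkpoint
lost = wasted_hours_per_failure(save_minutes=30, interval_hours=4)

print(f"{size:.0f} GB per checkpoint, ~{lost:.1f} h lost per failure")
```

Even with these modest assumptions, each failure costs hours of recomputation, and each checkpoint consumes hundreds of gigabytes of storage.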
## Nebula to the Rescue
Nebula can:
* **Boost checkpoint speeds as much as 1,000 times** with a simple API that works asynchronously with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99.5%.
:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-Nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::
* **Shrink end-to-end training time and computation costs.** Nebula can help you complete large-scale model training jobs faster and cheaper by reducing checkpoint and recovery time demands.
**LARGER IMG_3 VERSION NEEDED**
:::image type="content" source="media/quickstart-spark-jobs/IMG_3.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_3.png" alt-text="LARGER IMG_3 VERSION NEEDED":::
**LARGER IMG_4 VERSION NEEDED**
:::image type="content" source="media/quickstart-spark-jobs/IMG_4.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_4.png" alt-text="LARGER IMG_4 VERSION NEEDED":::
**LARGER IMG_5 VERSION NEEDED**
:::image type="content" source="media/quickstart-spark-jobs/IMG_5.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_5.png" alt-text="LARGER IMG_5 VERSION NEEDED":::
Nebula offers full compatibility with any distributed training framework that supports PyTorch, and with any compute target that supports ACPT. Nebula is designed to work with different distributed training strategies. You can use Nebula with PyTorch, PyTorch Lightning, DeepSpeed, and more. You can also use it with different Azure Machine Learning compute targets, such as AmlCompute or AKS.
See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation.
* The required dependency, included in the ACPT (Azure Container for PyTorch) curated environment. See [Curated environments](./resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image, and learn how to use the curated environment [here](./how-to-use-environments.md).
* An Azure ML script run configuration file, which defines the
  - source directory
  - compute target

  for your model training job. To create a compute target, see [this resource](./how-to-set-up-training-targets.md).
## Save Nebula checkpoints
To save checkpoints with Nebula, you only need to modify your training scripts in two ways:
Similar to the way that the PyTorch `torch.save()` API works, Nebula provides checkpoint save APIs that you can use in your training scripts.
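This excerpt doesn't show Nebula's actual save calls, so the following is only a minimal standard-library sketch of the underlying idea - asynchronous checkpointing, where `save()` returns immediately and a background thread performs the write so training can keep running. The `AsyncCheckpointer` class, its method names, and the file layout are inventions for this illustration, not Nebula's API:

```python
import pickle
import tempfile
import threading
from pathlib import Path

class AsyncCheckpointer:
    """Toy asynchronous checkpointer: save() returns immediately and the
    actual write happens on a background thread, so training can continue."""

    def __init__(self, directory: str):
        self.directory = Path(directory)
        self._thread = None

    def save(self, name: str, state: dict) -> None:
        snapshot = dict(state)  # shallow copy, so training may keep mutating state
        self.wait()             # allow at most one write in flight at a time
        self._thread = threading.Thread(
            target=self._write, args=(self.directory / name, snapshot)
        )
        self._thread.start()

    def _write(self, path: Path, snapshot: dict) -> None:
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()

# Usage: checkpoint every step without blocking the (simulated) training loop.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = AsyncCheckpointer(tmp)
    state = {"step": 0, "weights": [0.0, 0.0]}
    for step in range(3):
        state["step"] = step
        ckpt.save(f"ckpt-{step}.pkl", state)   # returns immediately
    ckpt.wait()                                # flush the last write before exit
    with open(Path(tmp) / "ckpt-2.pkl", "rb") as f:
        print(pickle.load(f)["step"])          # prints 2
```

The key property this sketch shares with the behavior described above is that the training loop is never blocked for the duration of the write, only for the (cheap) in-memory snapshot.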
You don't need to modify other steps to train your large model on the Azure Machine Learning platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](./how-to-manage-environments-v2.md?tabs=cli#curated-environments).