articles/machine-learning/reference-checkpoint-performance-with-Nebula.md
Lines changed: 9 additions & 9 deletions
@@ -39,7 +39,7 @@ When large model training operations experience failures and terminations, data
To summarize, large model checkpoint management involves heavy job recover time and storage overheads.
-:::image type="content" source="media/quickstart-spark-jobs/checkpoint-time-flow-diagram.png" lightbox="media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::
+:::image type="content" source="./media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-with-Nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time waste of duplicated data training.":::
## Nebula to the Rescue
@@ -65,18 +65,18 @@ Nebula can
and access them at any time with a few lines of code.
-**LARGER IMG_3 VERSION NEEDED**
+**LARGER img-3 VERSION NEEDED**
-:::image type="content" source="media/quickstart-spark-jobs/IMG_3.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_3.png" alt-text="LARGER IMG_3 VERSION NEEDED":::
+:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/img-3.png" lightbox="media/reference-checkpoint-performance-with-Nebula/img-3.png" alt-text="LARGER img-3 VERSION NEEDED":::
-**LARGER IMG_4 VERSION NEEDED**
+**LARGER img-4 VERSION NEEDED**
-:::image type="content" source="media/quickstart-spark-jobs/IMG_4.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_4.png" alt-text="LARGER IMG_3 VERSION NEEDED":::
+:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/img-4.png" lightbox="media/reference-checkpoint-performance-with-Nebula/img-4.png" alt-text="LARGER img-4 VERSION NEEDED":::
-**LARGER IMG_5 VERSION NEEDED**
+**LARGER img-5 VERSION NEEDED**
-:::image type="content" source="media/quickstart-spark-jobs/IMG_5.png" lightbox="media/reference-checkpoint-performance-with-Nebula/IMG_5.png" alt-text="LARGER IMG_5 VERSION NEEDED":::
+:::image type="content" source="media/reference-checkpoint-performance-with-Nebula/img-5.png" lightbox="media/reference-checkpoint-performance-with-Nebula/img-5.png" alt-text="LARGER img-5 VERSION NEEDED":::
Nebula offers full compatibility with any distributed training framework that supports PyTorch, and any compute target that supports ACPT. Nebula is designed to work with different distributed training strategies. You can use Nebula with PyTorch, PyTorch Lightning, DeepSpeed, and more. You can also use it with different Azure Machine Learning compute target, such as AmlCompute or AKS.
@@ -91,7 +91,7 @@ Nebula offers full compatibility with any distributed training framework that su
See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation
-* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](resource-curated-environments#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
+* The required dependency included in an ACPT-curated (Azure Container for Pytorch) environment. See [Curated environments](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to obtain the ACPT image. Learn how to use the curated environment [here](./how-to-use-environments.md)
* An Azure ML script run configuration file, which defines the
 - source directory
@@ -112,7 +112,7 @@ To save checkpoints with Nebula, you must modify your training scripts in two wa
Similar to the way that the PyTorch `torch.save()` API works, Nebula provides checkpoint save APIs that you can use in your training scripts.
-You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](how-to-manage-environments-v2?tabs=cli#curated-environments)
+You don't need to modify other steps to train your large model on Azure Machine Learning Platform. You only need to use the [Azure Container PyTorch (ACPT) curated environment](how-to-manage-environments-v2.md?tabs=cli#curated-environments)