articles/machine-learning/reference-checkpoint-performance-with-Nebula.md
## Nebula to the Rescue
To train large, distributed models, you need a reliable and efficient way to save and resume training progress without data loss or wasted resources. Nebula reduces checkpoint save times and the GPU hours a training job demands. For large-model Azure Machine Learning training jobs, Nebula offers faster, easier checkpoint management, which in turn shrinks overall training time.
Nebula can
and access them at any time with a few lines of code.
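To make the lifecycle concrete, here is a minimal, framework-agnostic sketch of the save-and-resume pattern that checkpointing enables. It uses plain `pickle` purely as an illustration; it is not Nebula's API, and the helper names are hypothetical:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, step, ckpt_dir):
    # Persist the full training state for a given step.
    path = os.path.join(ckpt_dir, f"ckpt_{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def load_latest_checkpoint(ckpt_dir):
    # Pick the checkpoint with the highest step number and restore it.
    names = sorted(os.listdir(ckpt_dir),
                   key=lambda n: int(n.split("_")[1].split(".")[0]))
    with open(os.path.join(ckpt_dir, names[-1]), "rb") as f:
        return pickle.load(f)

ckpt_dir = tempfile.mkdtemp()
for step in range(3):
    save_checkpoint({"step": step, "loss": 1.0 / (step + 1)}, step, ckpt_dir)

# After an interruption, training resumes from the last saved state.
resumed = load_latest_checkpoint(ckpt_dir)
print(resumed["step"])
```

Nebula applies the same save/resume idea, but moves the save off the critical path and manages checkpoint storage and retrieval for you.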
* An Azure subscription and an Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md) for more information about workspace resource creation.
Nebula use involves:
- [API calls to save and load checkpoints](#call-apis-to-save-and-load-checkpoints)
### Using ACPT environment
[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a preinstalled, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image.
### Initializing Nebula
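As a sketch, initialization looks like the following. The parameter name reflects the Nebula preview documentation, but treat the exact signature as an assumption, and `<YOUR STORAGE PATH>` is a placeholder for your checkpoint storage location:

```python
# Nebula ships preinstalled in the ACPT environment.
import nebulaml as nm

# Initialize Nebula once, near the start of the training script.
# <YOUR STORAGE PATH> is a placeholder, not literal code.
nm.init(persistent_storage_path=<YOUR STORAGE PATH>)
```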
```python
p0 = latest_ckpt.load(<'CKPT_NAME'>)
```
Because a checkpoint or snapshot can contain many files, you can load one or more of them by name. Loading the latest checkpoint restores the training state to the point of the last save.