articles/machine-learning/reference-checkpoint-performance-for-large-models.md
2 additions & 6 deletions
@@ -12,13 +12,13 @@ ms.author: ziqiwang
ms.date: 03/28/2023
---
-# Boost Checkpoint Speed and Reduce Cost with Nebula (Preview)
+# Boost Checkpoint Speed and Reduce Cost with Nebula
Learn how to boost checkpoint speed and reduce checkpoint cost for large Azure Machine Learning training models using Nebula.
## Overview
-Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs using PyTorch. By utilizing the latest distributed computing technologies, Nebula can reduce checkpoint times from hours to seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly benefit from Nebula's performance.
+**Nebula** is a fast, simple, disk-less, model-aware checkpoint tool in Azure Container for PyTorch (ACPT). Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs using PyTorch. By utilizing the latest distributed computing technologies, Nebula can reduce checkpoint times from hours to seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly benefit from Nebula's performance.
To make Nebula available for your training jobs, import the `nebulaml` python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.
@@ -27,10 +27,6 @@ To make Nebula available for your training jobs, import the `nebulaml` python pa
In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if there is interruption, failure or termination of Azure Machine Learning.
-> [!NOTE]
-> Nebula is currently in preview. This means that it is not yet production-ready and does not have support as a generally available product. Nebula will have constant updates and improvements to its functions and features. We welcome your feedback and suggestions at [email protected].
-> For more information, please visit [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) to learn more.
-
## Why checkpoint optimization for large model training matters
As data volumes grow and data formats become more complex, machine learning models have also become more sophisticated. Training these complex models can be challenging due to GPU memory capacity limits and lengthy training times. As a result, distributed training is often used when working with large datasets and complex models. However, distributed architectures can experience unexpected faults and node failures, which can become increasingly problematic as the number of nodes in a machine learning model increases.