Commit e07b084

Remove Nebula Preview label
1 parent 546d20e commit e07b084

File tree

1 file changed (+2, -6 lines)


articles/machine-learning/reference-checkpoint-performance-for-large-models.md

Lines changed: 2 additions & 6 deletions
@@ -12,13 +12,13 @@ ms.author: ziqiwang
 ms.date: 03/28/2023
 ---

-# Boost Checkpoint Speed and Reduce Cost with Nebula (Preview)
+# Boost Checkpoint Speed and Reduce Cost with Nebula

 Learn how to boost checkpoint speed and reduce checkpoint cost for large Azure Machine Learning training models using Nebula.

 ## Overview

-Azure Container for PyTorch (ACPT) now includes **Nebula**, a fast, simple, disk-less, model-aware checkpoint tool. Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs using PyTorch. By utilizing the latest distributed computing technologies, Nebula can reduce checkpoint times from hours to seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly benefit from Nebula's performance.
+**Nebula** is a fast, simple, disk-less, model-aware checkpoint tool in Azure Container for PyTorch (ACPT). Nebula offers a simple, high-speed checkpointing solution for distributed large-scale model training jobs using PyTorch. By utilizing the latest distributed computing technologies, Nebula can reduce checkpoint times from hours to seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly benefit from Nebula's performance.

 To make Nebula available for your training jobs, import the `nebulaml` python package in your script. Nebula has full compatibility with different distributed PyTorch training strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a simple way to monitor and view checkpoint lifecycles. The APIs support various model types, and ensure checkpoint consistency and reliability.

@@ -27,10 +27,6 @@ To make Nebula available for your training jobs, import the `nebulaml` python pa

 In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning to quickly checkpoint your model training jobs. Additionally, you'll learn how to view and manage Nebula checkpoint data. You'll also learn how to resume the model training jobs from the last available checkpoint if there is interruption, failure or termination of Azure Machine Learning.

-> [!NOTE]
-> Nebula is currently in preview. This means that it is not yet production-ready and does not have support as a generally available product. Nebula will have constant updates and improvements to its functions and features. We welcome your feedback and suggestions at [email protected].
-> For more information, please visit [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) to learn more.
-
 ## Why checkpoint optimization for large model training matters

 As data volumes grow and data formats become more complex, machine learning models have also become more sophisticated. Training these complex models can be challenging due to GPU memory capacity limits and lengthy training times. As a result, distributed training is often used when working with large datasets and complex models. However, distributed architectures can experience unexpected faults and node failures, which can become increasingly problematic as the number of nodes in a machine learning model increases.
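The article text in this diff says Nebula is enabled by importing the `nebulaml` package in the training script. As a rough illustration of that checkpoint-save pattern, here is a minimal sketch: the `nm.init(...)` and `nm.Checkpoint().save(...)` names follow the shape described in the ACPT documentation but should be treated as assumptions, and the pickle fallback (used when `nebulaml` is not installed, i.e. outside the ACPT environment) is purely illustrative.

```python
import os
import pickle

try:
    import nebulaml as nm  # available inside the ACPT curated environment
    HAVE_NEBULA = True
except ImportError:
    HAVE_NEBULA = False

def save_checkpoint(state, name, storage_path="/tmp/nebula_ckpts"):
    """Persist a training-state dict under `name`, via Nebula when available."""
    os.makedirs(storage_path, exist_ok=True)
    if HAVE_NEBULA:
        # Assumed Nebula API shape: one-time init pointing at persistent
        # storage, then a named save of the model/training state.
        nm.init(persistent_storage_path=storage_path)
        nm.Checkpoint().save(name, state)
    else:
        # Illustrative fallback only; not part of Nebula itself.
        with open(os.path.join(storage_path, name), "wb") as f:
            pickle.dump(state, f)
    return os.path.join(storage_path, name)

path = save_checkpoint({"epoch": 3, "loss": 0.42}, "ckpt-epoch3.pkl")
```

In a real ACPT job the save would typically be called at the end of each epoch or on a time interval, which is where Nebula's seconds-scale checkpoint writes pay off.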
