Commit 93c09b6

Merge pull request #230673 from fbsolo-ms1/tutorial-for-SK

Edits & redirect for the Nebula doc.

2 parents 6122b1a + 0d1c968

File tree

6 files changed: +13 additions, -8 deletions

articles/machine-learning/.openpublishing.redirection.machine-learning.json
Lines changed: 5 additions & 0 deletions

@@ -1,5 +1,10 @@
 {
   "redirections": [
+    {
+      "source_path_from_root": "/articles/machine-learning/reference-checkpoint-performance-with-Nebula.md",
+      "redirect_url": "/articles/machine-learning/reference-checkpoint-performance-for-large-models",
+      "redirect_document_id": true
+    },
     {
       "source_path_from_root": "/articles/machine-learning/referencemanaged-online-endpoints-vm-sku-list.md",
       "redirect_url": "/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list",

articles/machine-learning/reference-checkpoint-performance-with-Nebula.md renamed to articles/machine-learning/reference-checkpoint-performance-for-large-models.md
Lines changed: 5 additions & 5 deletions

@@ -1,5 +1,5 @@
 ---
-title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula (Preview)
+title: Optimize Checkpoint Performance for Large Models
 titleSuffix: Azure Machine Learning
 description: Learn how Nebula can save time, resources, and money for large model training applications
 services: machine-learning
@@ -9,7 +9,7 @@ ms.custom: ----, ----, ----
 
 author: ziqiwang
 ms.author: ziqiwang
-ms.date: 03/06/2023
+ms.date: 03/14/2023
 ms.reviewer: franksolomon
 ---
 
@@ -41,7 +41,7 @@ Checkpoints can help deal with these problems. Periodic checkpoints snapshot the
 
 When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the process between the checkpoint and the termination itself is wasted, because the computation must re-execute operations to cover the unsaved, intermediate results. Shorter checkpoint intervals could solve this problem. The following diagram shows the time cost to restore a training process from checkpoints:
 
-:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::
+:::image type="content" source="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::
 
 However, the checkpoint saves process itself generates large overheads. A TB-sized checkpoint save can often become a training process bottleneck. The synchronized checkpoint process blocks the training process for hours. Checkpoint-related overheads can take up 12% of total training time, on average, and can rise to 43% [(Maeng et al., 2021)](https://cs.stanford.edu/people/trippel/pubs/cpr-mlsys-21.pdf).
 
@@ -55,7 +55,7 @@ Nebula can
 
 * **Boost checkpoint speeds as much as 1000 times** with a simple API that asynchronously works with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99%.
 
-:::image type="content" source="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::
+:::image type="content" source="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::
 
 This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Huggingface GPT2, GPT2-Large, and GPT-XL training jobs. For the medium-sized Huggingface GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.
 
@@ -112,7 +112,7 @@ Nebula provides APIs to handle checkpoint saves. You can use these APIs in your
 ### View your checkpointing histories
 When your training job finishes, navigate to the Job `Name> Outputs + logs` pane. In the left panel, expand the **Nebula** folder, and select `checkpointHistories.csv` to see detailed information about Nebula checkpoint saves - duration, throughput, and checkpoint size.
 
-:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" alt-text="Screenshot that shows metadata about the checkpoint saves.":::
+:::image type="content" source="./media/reference-checkpoint-performance-for-large-models/checkpoint-save-metadata.png" lightbox="./media/reference-checkpoint-performance-for-large-models/checkpoint-save-metadata.png" alt-text="Screenshot that shows metadata about the checkpoint saves.":::
 
 ## Examples
 
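The article text in this diff describes checkpointing that "asynchronously works with your training process": take a fast in-memory snapshot, return control to training, and persist in the background. The sketch below illustrates that general idea only; it is a generic, hypothetical example and is not the Nebula API (`save_checkpoint_async`, the state layout, and the pickle format are all assumptions):

```python
import copy
import pickle
import threading

def save_checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot state synchronously (fast), then persist it in the background.

    Generic illustration of asynchronous checkpointing; Nebula's real API differs.
    """
    # Training can resume as soon as this deep copy returns.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        # Slow I/O happens off the training thread.
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t

# Usage: kick off a save, keep training, join before shutdown.
state = {"step": 100, "weights": [0.1, 0.2]}
t = save_checkpoint_async(state, "ckpt.pkl")
t.join()
```

The design point is that only the in-memory copy blocks the training loop; the disk write overlaps with subsequent training steps.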

articles/machine-learning/toc.yml
Lines changed: 3 additions & 3 deletions

@@ -357,9 +357,6 @@
 - name: Attach and Manage a Synapse Spark pool
   displayName: Attach and Manage a Synapse Spark pool
   href: how-to-manage-synapse-spark-pool.md
-- name: Reference checkpoint performance with Nebula
-  displayName: Reference checkpoint performance with Nebula
-  href: reference-checkpoint-performance-with-nebula.md
 - name: AKS and Azure Arc-enabled Kubernetes
   items:
   - name: What is Kubernetes compute target
@@ -503,6 +500,9 @@
 - name: Debug jobs and monitor training progress
   displayName: automl
   href: how-to-interactive-jobs.md
+- name: Optimize Checkpoint Performance for Large Models
+  displayName: Optimize Checkpoint Performance for Large Models
+  href: reference-checkpoint-performance-for-large-models.md
 - name: Train with the Python SDK
   items:
   - name: Tune hyperparameters
