articles/machine-learning/reference-checkpoint-performance-for-large-models.md
5 additions & 5 deletions
@@ -1,5 +1,5 @@
 ---
-title: Optimize Checkpoint Performance for Large Model Training Jobs with Nebula (Preview)
+title: Optimize Checkpoint Performance for Large Models
 titleSuffix: Azure Machine Learning
 description: Learn how Nebula can save time, resources, and money for large model training applications
 services: machine-learning
@@ -9,7 +9,7 @@ ms.custom: ----, ----, ----

 author: ziqiwang
 ms.author: ziqiwang
-ms.date: 03/06/2023
+ms.date: 03/14/2023
 ms.reviewer: franksolomon
 ---

@@ -41,7 +41,7 @@ Checkpoints can help deal with these problems. Periodic checkpoints snapshot the

 When large model training operations experience failures and terminations, data scientists and researchers can restore the training process from a previously saved checkpoint. Unfortunately, the computation performed between that checkpoint and the termination is wasted, because training must re-execute operations to recover the unsaved, intermediate results. Shorter checkpoint intervals could solve this problem. The following diagram shows the time cost to restore a training process from checkpoints:

-:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::
+:::image type="content" source="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" lightbox="./media/reference-checkpoint-performance-for-large-models/checkpoint-time-flow-diagram.png" alt-text="Screenshot that shows the time cost to restore a training process from checkpoints.":::

 However, the checkpoint save process itself generates large overhead. A TB-sized checkpoint save can often become a training process bottleneck, because the synchronized checkpoint process blocks training for hours. Checkpoint-related overhead takes up 12% of total training time on average, and can rise to 43% [(Maeng et al., 2021)](https://cs.stanford.edu/people/trippel/pubs/cpr-mlsys-21.pdf).

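To make that 12% overhead figure concrete, here is a back-of-the-envelope calculation. The job length, per-save duration, and save count below are invented for illustration only; they are not from the article:

```python
def checkpoint_overhead_fraction(total_hours: float, ckpt_hours_each: float, num_ckpts: int) -> float:
    """Fraction of wall-clock time spent blocked in synchronous checkpoint saves."""
    return (ckpt_hours_each * num_ckpts) / total_hours

# Assumed numbers: a 100-hour training job that saves a large
# checkpoint 6 times, blocking 2 hours per synchronous save.
frac = checkpoint_overhead_fraction(100.0, 2.0, 6)
print(f"{frac:.0%}")  # → 12%, matching the average overhead cited above
```

Shortening the checkpoint interval raises `num_ckpts` and pushes this fraction up, which is exactly the tension that asynchronous checkpointing targets.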
@@ -55,7 +55,7 @@ Nebula can

 * **Boost checkpoint speeds as much as 1,000 times** with a simple API that works asynchronously with your training process. Nebula can reduce checkpoint times from hours to seconds - a potential reduction of 95% to 99%.

-:::image type="content" source="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-with-nebula/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::
+:::image type="content" source="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" lightbox="media/reference-checkpoint-performance-for-large-models/nebula-checkpoint-time-savings.png" alt-text="Screenshot that shows the time savings benefit of Nebula.":::

 This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Huggingface GPT2, GPT2-Large, and GPT2-XL training jobs. For the medium-sized Huggingface GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.

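The asynchronous pattern behind such speedups - snapshot the state quickly in the training loop, then persist it on a background worker so training continues - can be sketched generically in plain Python. This is an illustration of the technique only, not Nebula's actual implementation; `AsyncCheckpointer` and its methods are hypothetical names:

```python
import copy
import pickle
import threading

class AsyncCheckpointer:
    """Generic async checkpointing: copy state in the training thread (fast),
    then write it to disk on a background thread (slow) without blocking."""

    def __init__(self):
        self._worker = None

    def save(self, state: dict, path: str) -> None:
        # Fast, blocking part: snapshot the state in memory.
        snapshot = copy.deepcopy(state)
        # Allow at most one in-flight save, then persist in the background.
        self.wait()
        self._worker = threading.Thread(target=self._write, args=(snapshot, path))
        self._worker.start()

    @staticmethod
    def _write(snapshot: dict, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    def wait(self) -> None:
        # Block until any in-flight save has finished.
        if self._worker is not None:
            self._worker.join()

ckpt = AsyncCheckpointer()
ckpt.save({"step": 100, "weights": [0.1, 0.2]}, "ckpt_100.pkl")
# ... training continues here while the write happens ...
ckpt.wait()  # ensure the write finished before exiting
```

The training loop only pays for the in-memory copy; the expensive disk (or remote storage) write overlaps with subsequent training steps, which is how checkpoint times drop from the duration of the write to the duration of the snapshot.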
@@ -112,7 +112,7 @@ Nebula provides APIs to handle checkpoint saves. You can use these APIs in your
 ### View your checkpointing histories
 When your training job finishes, navigate to the Job `Name > Outputs + logs` pane. In the left panel, expand the **Nebula** folder, and select `checkpointHistories.csv` to see detailed information about Nebula checkpoint saves - duration, throughput, and checkpoint size.

-:::image type="content" source="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" lightbox="./media/reference-checkpoint-performance-with-nebula/checkpoint-save-metadata.png" alt-text="Screenshot that shows metadata about the checkpoint saves.":::
+:::image type="content" source="./media/reference-checkpoint-performance-for-large-models/checkpoint-save-metadata.png" lightbox="./media/reference-checkpoint-performance-for-large-models/checkpoint-save-metadata.png" alt-text="Screenshot that shows metadata about the checkpoint saves.":::
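To post-process a downloaded `checkpointHistories.csv` programmatically, Python's standard `csv` module is enough. The column names and sample values below are assumptions for illustration; check the actual header of your file:

```python
import csv
import io

# Assumed layout for illustration; the real file's header may differ.
sample = io.StringIO(
    "checkpoint_name,duration_seconds,throughput_gbps,size_gb\n"
    "ckpt_100,1.2,17.2,20.6\n"
    "ckpt_200,1.1,18.7,20.6\n"
)

rows = list(csv.DictReader(sample))
avg_duration = sum(float(r["duration_seconds"]) for r in rows) / len(rows)
print(f"saves: {len(rows)}, mean duration: {avg_duration:.2f}s")
```

Replacing the `io.StringIO` sample with `open("checkpointHistories.csv")` reads the real file the same way.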