
Commit b7c1b86

Merge pull request #276373 from v-thepet/debug2
243805 - Debug pipeline performance
2 parents 7dfb01e + ff2fbb2

File tree

4 files changed: 36 additions, 56 deletions
---
title: Debug pipeline performance issues
titleSuffix: Azure Machine Learning
description: Learn how to debug pipeline performance issues by using the profile feature in Azure Machine Learning studio.
ms.reviewer: lagayhar
author: zhanxia
ms.author: zhanxia
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
ms.date: 05/24/2024
ms.custom: designer, pipeline UI
---

# Use profiling to debug pipeline performance issues

The profiling feature in Azure Machine Learning studio can help you debug pipeline performance issues such as hanging or long durations. Profiling lists the duration of each pipeline step and provides a Gantt chart for visualization. You can see the time spent on each job status and quickly find the steps that take longer than expected.

## Find the node that runs the longest overall

1. On the **Jobs** page of Azure Machine Learning studio, select the job name to open the job detail page.
1. In the action bar, select **View profiling**. Profiling works only for root pipelines. The profiler page can take a few minutes to load.

    :::image type="content" source="./media/how-to-debug-pipeline-performance/view-profiling-detail.png" alt-text="Screenshot showing the pipeline at root level with the View profiling button highlighted." lightbox="./media/how-to-debug-pipeline-performance/view-profiling.png":::

To identify the step that takes the longest, view the Gantt chart at the top of the profiler page. The length of each bar in the Gantt chart shows how long the step takes, so the step with the longest bar took the most time.

:::image type="content" source="./media/how-to-debug-pipeline-performance/critical-path.png" alt-text="Screenshot showing the Gantt chart and the critical path." lightbox="./media/how-to-debug-pipeline-performance/critical-path.png":::

The Gantt chart has the following views:

- **Critical path** is the sequence of steps that determines the job's total duration. This view is shown by default. Only step jobs that have a dependency appear in a pipeline's critical path.
- **Flatten view** shows all step jobs, so it shows more nodes than the critical path view.
- **Compact view** shows only step jobs that take longer than 30 seconds.
- **Hierarchical view** shows all jobs, including pipeline component jobs and step jobs.
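
To build intuition for what the critical path view highlights, here's a minimal sketch of the underlying idea: the critical path is the dependency chain of steps whose summed durations determine the pipeline job's total duration. The step names, durations, and dependencies below are hypothetical, not taken from any real pipeline.

```python
# Hypothetical illustration of a critical path over pipeline step
# dependencies: the chain of dependent steps with the largest total duration.
def critical_path(durations, deps):
    """durations: {step: seconds}; deps: {step: [upstream step names]}."""
    finish = {}  # earliest possible finish time of each step

    def resolve(step):
        if step not in finish:
            start = max((resolve(up) for up in deps.get(step, [])), default=0)
            finish[step] = start + durations[step]
        return finish[step]

    for step in durations:
        resolve(step)

    # Walk backward from the step that finishes last, always following the
    # upstream step that finishes latest (the one that gated the start).
    path, step = [], max(finish, key=finish.get)
    while step is not None:
        path.append(step)
        ups = deps.get(step, [])
        step = max(ups, key=finish.get) if ups else None
    return path[::-1], max(finish.values())

durations = {"prep_data": 60, "train": 300, "evaluate": 45, "register": 10}
deps = {"train": ["prep_data"], "evaluate": ["train"], "register": ["evaluate"]}
path, total = critical_path(durations, deps)
# path is ["prep_data", "train", "evaluate", "register"]; total is 415
```

This is only a conceptual model; the profiler computes and displays the critical path for you.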

You can also determine durations by using the table at the bottom of the profiler page. When you select a row in the table, the corresponding node is highlighted in the Gantt chart, and vice versa.

- In the table, you can sort by the **Total duration** column to find the longest-running nodes. Reused nodes are denoted with the recycling icon.
- Select the **View details** icon next to a node name in the table to open the node detail pane, which shows parameters, inputs and outputs, command code, logs, and other information.

  :::image type="content" source="./media/how-to-debug-pipeline-performance/detail-page-from-log-icon.png" alt-text="Screenshot highlighting the log icon and showing the detail page." lightbox="./media/how-to-debug-pipeline-performance/detail-page-from-log-icon.png":::

- To export the duration table, select **Export CSV** at the upper right of the profiler page.

  :::image type="content" source="./media/how-to-debug-pipeline-performance/export-csv.png" alt-text="Screenshot showing Export CSV in profiling." lightbox="./media/how-to-debug-pipeline-performance/export-csv.png":::

## Find the node that runs the longest in each status

Besides total duration, you can also sort the duration table by the duration of each status. For example, you can sort by the **Preparing** column to see which step spends the most time on image building, then open that step's detail pane to check whether image building failed because of a timeout issue.

## Address status duration issues

The following table presents the definition of each job status, the estimated time the status takes, and suggestions for addressing issues with that status.

| Status | Definition | Time estimation | Next steps |
|--------|------------|-----------------|------------|
| Not started | The job is submitted from the client and accepted by Azure Machine Learning services. Most time is spent in service scheduling and preprocessing. | If there's no backend service issue, this time should be short. | Open a support case via the Azure portal. |
| Preparing | The job is pending preparation of its dependencies, for example environment image building. | If you're using a curated or registered custom environment, this time should be short. | Check the image building log. |
| Inqueue | The job is pending compute resource allocation. The duration of this stage mainly depends on the status of your compute cluster. | If you're using a cluster with enough compute resources, this time should be short. | Increase the maximum number of nodes on the target compute, move the job to a less busy compute, or modify the job priority to get more compute resources for the job. |
| Running | The job is executing on the remote compute. This stage consists of: <br> 1. Runtime preparation, such as image pulling, Docker startup, and data mounting or downloading. <br> 2. User script execution. | This status is expected to be the most time consuming. | 1. Check the source code for any user error. <br> 2. View the monitoring tab for compute metrics like CPU, memory, and networking to identify any bottlenecks. <br> 3. If the job is running, try online debugging with [interactive endpoints](how-to-interactive-jobs.md), or debug your code locally. |
| Finalizing | The job is in post-processing after execution completes. Time spent in this stage is mainly for post-processes like uploading output, uploading metrics and logs, and cleaning up resources. | This time is expected to be short for command jobs, but it might be long for parallel run step (PRS) or Message Passing Interface (MPI) jobs, because for a distributed job the finalizing status lasts from the first node starting finalizing to the last node finishing. | Change your step job output mode from upload to mount if you find unexpectedly long finalizing times, or open a support case via the Azure portal. |
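
As the **Finalizing** row suggests, switching a step's output from upload mode to mount mode can shorten long finalizing times, because outputs are written directly to storage instead of being uploaded after the step completes. A hypothetical pipeline job YAML fragment (the step and output names are placeholders) might look like this:

```yaml
jobs:
  train_step:
    outputs:
      model_output:
        type: uri_folder
        mode: rw_mount  # write directly to storage instead of mode: upload
```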

## Related content

- [Build pipelines with components by using the Python SDK v2](./how-to-create-component-pipeline-python.md)
- [Build pipelines with components by using the Azure Machine Learning CLI](./how-to-create-component-pipelines-cli.md)
- [What is an Azure Machine Learning component?](./concept-component.md)