
Commit d532dab

committed
acro
1 parent 22e7815 commit d532dab

File tree

1 file changed: +14 −20 lines changed


articles/machine-learning/how-to-debug-pipeline-performance.md

Lines changed: 14 additions & 20 deletions
@@ -14,21 +14,17 @@ ms.custom: designer, pipeline UI
 ---

 # Use profiling to debug pipeline performance issues

-The profiling feature can help you debug Azure Machine Learning pipeline performance issues such as hanging or long pole issues. Profiling lists the duration of each step in a pipeline and provides a Gantt chart for visualization.
-
-Profiling enables you to:
-
-- Quickly find which nodes take a longer time than expected.
-- Identify the job time spent on each status.
+The profiling feature in Azure Machine Learning studio can help you debug pipeline performance issues such as hanging or long durations. Profiling lists the duration of each pipeline step and provides a Gantt chart for visualization. You can see the time spent on each job status and quickly find steps that take longer than expected.
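
If you prefer to check step durations outside the studio UI, a minimal sketch with the Azure Machine Learning Python SDK v2 (azure-ai-ml) can list the child jobs of a root pipeline job and print their statuses. The workspace details and the pipeline job name below are placeholders.

```python
# Minimal sketch (assumes the azure-ai-ml and azure-identity packages are installed).
# Workspace details and the pipeline job name are placeholders.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# List the steps (child jobs) of a root pipeline job and print their statuses.
for child in ml_client.jobs.list(parent_job_name="<root-pipeline-job-name>"):
    print(child.display_name, child.status)
```
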
 ## Find the node that runs the longest overall

 1. On the **Jobs** page of Azure Machine Learning studio, select the job name to open the job detail page.
-1. In the action bar, select **View profiling**. Profiling only works for a root level pipeline. It can take a few minutes to load the next page.
+1. In the action bar, select **View profiling**. Profiling works only for root pipelines. It can take a few minutes to load the profiler page.

-:::image type="content" source="./media/how-to-debug-pipeline-performance/view-profiling-detail.png" alt-text="Screenshot showing the pipeline at root level with the View profiling button highlighted." lightbox= "./media/how-to-debug-pipeline-performance/view-profiling.png":::
+:::image type="content" source="./media/how-to-debug-pipeline-performance/view-profiling-detail.png" alt-text="Screenshot showing the pipeline at root level with the View profiling button highlighted." lightbox= "./media/how-to-debug-pipeline-performance/view-profiling.png":::

-To find the step that takes the longest, view the Gantt chart on the profiler page. The length of each bar in the Gantt chart shows how long the step takes. The step with the longest bar length takes the most time.
+To identify the step that takes the longest, view the Gantt chart on the profiler page. The length of each bar in the Gantt chart shows how long the step takes. The step with the longest bar took the most time.

 :::image type="content" source="./media/how-to-debug-pipeline-performance/critical-path.png" alt-text="Screenshot showing the Gantt chart and the critical path." lightbox= "./media/how-to-debug-pipeline-performance/critical-path.png":::
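
The job detail page from step 1 can also be reached programmatically. A hedged sketch, reusing the `ml_client` from the sketch above, fetches the root pipeline job and prints its studio URL, which opens the same page where **View profiling** appears; the job name is a placeholder.

```python
# Sketch only; reuses ml_client from the previous example.
# The job name is a placeholder.
pipeline_job = ml_client.jobs.get("<root-pipeline-job-name>")

# studio_url points at the job detail page in Azure Machine Learning studio,
# where View profiling is available for root pipeline jobs.
print(pipeline_job.status)
print(pipeline_job.studio_url)
```
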

@@ -46,27 +42,25 @@ You can also determine durations by using the table at the bottom of the profile

 :::image type="content" source="./media/how-to-debug-pipeline-performance/detail-page-from-log-icon.png" alt-text="Screenshot highlighting the log icon and showing the detail page." lightbox= "./media/how-to-debug-pipeline-performance/detail-page-from-log-icon.png":::

-## Find the node that runs the longest in each status
+To export the duration table, select **Export CSV** at upper right on the profiler page.

-Besides total duration, you can also sort the duration table by durations for each status. For example, you can sort by the **Preparing** column to see which step spends the most time on image building. You can open the detail pane for that step to see whether image building fails because of timeout issues.
+:::image type="content" source="./media/how-to-debug-pipeline-performance/export-csv.png" alt-text="Screenshot showing Export CSV in profiling." lightbox= "./media/how-to-debug-pipeline-performance/export-csv.png":::
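
After you download the CSV, you can inspect it locally. The following is a rough sketch using pandas; the file name and the column names (for example, "Total duration" and "Preparing") are assumptions about the export format, so adjust them to match the actual headers in your file.

```python
# Rough sketch: inspect the exported duration table locally with pandas.
# The file name and column names below are assumptions; check the actual CSV headers first.
import pandas as pd

durations = pd.read_csv("profiling-export.csv")
print(durations.columns.tolist())  # confirm the real column names

# Example: sort steps by total duration, or by time spent in a single status.
print(durations.sort_values("Total duration", ascending=False).head())
print(durations.sort_values("Preparing", ascending=False).head())
```
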

-## Download the duration table
-
-To export the duration table, select **Export CSV** at upper right on the profiler page.
+## Find the node that runs the longest in each status

-:::image type="content" source="./media/how-to-debug-pipeline-performance/export-csv.png" alt-text="Screenshot showing Export CSV in profiling." lightbox= "./media/how-to-debug-pipeline-performance/export-csv.png":::
+Besides total duration, you can also sort the duration table by the time spent in each status. For example, you can sort by the **Preparing** column to see which step spends the most time on image building. You can open the detail pane for that step to see whether image building fails because of timeout issues.

## Address status duration issues

-The following table presents the definition of each job status, the time it's expected to take, and suggested next steps to address issues with that status.
+The following table presents the definition of each job status, the estimated time it takes, and suggestions for addressing issues with that status.

 | Status | Definition | Time estimation | Next steps |
 |------|--------------|-------------|----------|
-| Not started | The job is submitted from the client and accepted in Azure Machine Learning services. Time is mainly spent in service scheduling and preprocessing. | If there's no backend service issue, this time should be short.| Open a support case via the Azure portal. |
+| Not started | The job is submitted from the client and accepted in Azure Machine Learning services. Most time is spent in service scheduling and preprocessing. | If there's no backend service issue, this time should be short. | Open a support case via the Azure portal. |
 |Preparing | In this status, the job is pending for preparation of job dependencies, for example environment image building.| If you're using a curated or registered custom environment, this time should be short. | Check the image building log. |
-|Inqueue | The job is pending for compute resource allocation. Time spent in this stage mainly depends on the status of your compute cluster.| If you're using a cluster with enough compute resource, this time should be short. | Check with the workspace admin whether to increase the max nodes of the target compute, change the job to another less busy compute, or modify job priority to get more compute resources for the job. |
-|Running | The job is executing on the remote compute. Time spent in this stage is mainly in two parts: <br> Runtime preparation: image pulling, docker starting and data preparation (mount or download). <br> User script execution. | This status is expected to be the most time consuming. | 1. Check the source code for any user error. <br> 2. View the monitoring tab for compute metrics like CPU, memory, and networking to identify the bottleneck. <br> 3. If the job is running, try online debug with [interactive endpoints](how-to-interactive-jobs.md), or locally debug your code. |
-| Finalizing | Job is in post-processing after execution completes. Time spent in this stage is mainly for post processes like uploading output, uploading metrics and logs, and cleaning up resources.| Time is expected to be short for command jobs, but might be long for Polygenic risk score (PRS) or Message Passing Interface (MPI) jobs because for distributed jobs, finalizing is from the first node starting finalizing to the last node done finalizing. | Change your step job output mode from upload to mount if you find unexpected long finalizing time, or open a support case via the Azure portal. |
+| Inqueue | The job is pending for compute resource allocation. The duration of this stage mainly depends on the status of your compute cluster. | If you're using a cluster with enough compute resources, this time should be short. | Increase the max nodes of the target compute, change the job to a less busy compute, or modify the job priority to get more compute resources for the job. |
+| Running | The job is executing on the remote compute. This stage consists of runtime preparation, such as image pulling, Docker starting, and data mounting or download, followed by user script execution. | This status is expected to be the most time consuming. | 1. Check the source code for any user error. <br> 2. View the monitoring tab for compute metrics like CPU, memory, and networking to identify any bottlenecks. <br> 3. If the job is running, try online debugging with [interactive endpoints](how-to-interactive-jobs.md), or debug your code locally. |
+| Finalizing | The job is in post-processing after execution completes. Time spent in this stage is mainly for post processes like uploading output, uploading metrics and logs, and cleaning up resources. | Time is expected to be short for command jobs. The duration might be long for polygenic risk score (PRS) or Message Passing Interface (MPI) jobs, because for distributed jobs, finalizing lasts from the first node starting to the last node finishing. | Change your step job output mode from upload to mount if you find unexpectedly long finalizing time, or open a support case via the Azure portal. |
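
For the **Inqueue** and **Finalizing** suggestions, a hedged sketch with SDK v2 might look like the following. It reuses `ml_client` from the earlier sketch; the compute name, environment reference, and paths are placeholders, and `rw_mount` is used as the mount-style output mode in place of `upload`.

```python
# Sketch only; names, paths, and the environment reference are placeholders.
from azure.ai.ml import Output, command

# Inqueue: raise the maximum node count of the target compute cluster
# so queued jobs can be allocated resources sooner.
cluster = ml_client.compute.get("cpu-cluster")
cluster.max_instances = 8
ml_client.compute.begin_create_or_update(cluster).result()

# Finalizing: write step output with a mount mode instead of upload,
# so results stream to storage during the run rather than in one final upload.
train_step = command(
    name="train_step",
    code="./src",
    command="python train.py --output_dir ${{outputs.model_dir}}",
    environment="azureml:my-training-env:1",
    compute="cpu-cluster",
    outputs={"model_dir": Output(type="uri_folder", mode="rw_mount")},
)
```
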

## Related content
