articles/machine-learning/how-to-use-pipeline-ui.md (166 additions, 2 deletions)

services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
ms.date: 12/22/2022
ms.custom: designer, event-tier1-build-2022
---

Then you can drag and drop either built-in components or custom components to the canvas.

Now you've built your pipeline. Select the **Submit** button above the canvas and configure your pipeline job.

:::image type="content" source="./media/how-to-use-pipeline-ui/submit-pipeline.png" alt-text="Screenshot showing the pipeline job setup with the Submit button highlighted." lightbox="./media/how-to-use-pipeline-ui/submit-pipeline.png":::

After you submit your pipeline job, you'll see a submitted job list in the left pane, which shows all the pipeline jobs you've created from the current pipeline draft in the same session. A notification also pops up from the notification center. You can select the pipeline job link in the submission list or in the notification to check the pipeline job's status or to debug it.

After cloning, you can also see which pipeline job it was cloned from by selecting **Show lineage** on the job detail page.

You can edit your pipeline and then submit again. After submitting, you can see the lineage between the job you submitted and the original job by selecting **Show lineage** on the job detail page.

## Compare different pipelines to debug failure or other unexpected issues (preview)

Pipeline comparison identifies the differences (including topology, component properties, and job properties) between multiple jobs. For example, you can compare a successful pipeline and a failed pipeline, which helps you find which modifications made your pipeline fail.

There are two major scenarios where pipeline comparison can help with debugging:

- Debug your failed pipeline job by comparing it to a completed one.
- Debug your failed node in a pipeline by comparing it to a similar completed one.

To enable this feature:

1. Navigate to the Azure Machine Learning studio UI.
2. Select **Manage preview features** (megaphone icon) among the icons on the top right of the screen.
3. In the **Manage preview features** panel, toggle on the **Compare pipeline jobs to debug failures or unexpected issues** feature.

:::image type="content" source="./media/how-to-use-pipeline-ui/enable-preview.png" alt-text="Screenshot of manage preview features toggled on." lightbox="./media/how-to-use-pipeline-ui/enable-preview.png":::

### How to debug your failed pipeline job by comparing it to a completed one

During iterative model development, you may have a baseline pipeline that you then modify, for example by changing a parameter, a dataset, or a compute resource. If your new pipeline fails, you can use pipeline comparison to identify what changed by comparing it to the baseline pipeline, which can help you figure out why it failed. A minimal sketch of what such a baseline and modified pair might look like in code follows.

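For instance, a baseline pipeline and a modified copy that differs in a single parameter might be submitted with the Azure Machine Learning Python SDK v2 as in the following sketch. The component file, input data path, compute name, and parameter/output names (`train.yml`, `nyc_taxi_data`, `cpu-cluster`, `learning_rate`, `model_output`) are illustrative assumptions, not values from this article.

```python
# Hypothetical sketch: submit a baseline pipeline and a modified copy that
# differs only in one parameter, so the two jobs can later be compared in the
# studio UI. Names, paths, and the component interface are assumptions.
from azure.ai.ml import MLClient, Input, dsl, load_component
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Assumed component with a `training_data` input and a `learning_rate` parameter.
train_model = load_component(source="./components/train.yml")

@dsl.pipeline(default_compute="cpu-cluster")  # assumed compute cluster name
def train_pipeline(learning_rate: float):
    train_step = train_model(
        training_data=Input(type="uri_folder", path="azureml:nyc_taxi_data@latest"),
        learning_rate=learning_rate,
    )
    return {"trained_model": train_step.outputs.model_output}

baseline_job = train_pipeline(learning_rate=0.01)  # baseline
modified_job = train_pipeline(learning_rate=0.5)   # the only change from the baseline

ml_client.jobs.create_or_update(baseline_job, experiment_name="compare-demo")
ml_client.jobs.create_or_update(modified_job, experiment_name="compare-demo")
```

If the modified job fails, the two runs can then be added to the comparison candidate list in the studio as described below.
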
#### Compare a pipeline with its parent

The first step in debugging is to locate the failed node and check its logs.

For example, you may get an error message showing that your pipeline failed due to an out-of-memory error. If your pipeline was cloned from a completed parent pipeline, you can use pipeline comparison to see what has changed.

1. Select **Show lineage**.
1. Select the link under "Cloned From". This opens a new browser tab with the parent pipeline.

:::image type="content" source="./media/how-to-use-pipeline-ui/cloned-from.png" alt-text="Screenshot showing the Cloned From link, with the Show lineage button from the previous step highlighted." lightbox="./media/how-to-use-pipeline-ui/cloned-from.png":::

1. Select **Add to compare** on both the failed pipeline and the parent pipeline. This adds them to the comparison candidate list.

:::image type="content" source="./media/how-to-use-pipeline-ui/comparison-list.png" alt-text="Screenshot showing the comparison list with a parent and child pipeline added." lightbox="./media/how-to-use-pipeline-ui/comparison-list.png":::

### Compare topology

Once the two pipelines are added to the comparison list, you have two options: **Compare detail** and **Compare graph**. **Compare graph** lets you compare pipeline topology.

**Compare graph** shows the topology changes between pipelines A and B. Nodes that exist only in pipeline A are highlighted in red and marked "A only". Nodes that exist only in pipeline B are in green and marked "B only". Shared nodes are in gray. If a shared node differs between the two pipelines, the change is shown at the top of the node.

There are three categories of changes, with summaries viewable on the detail page: parameter changes, input source changes, and pipeline component changes. A pipeline component change means there's a topology change inside the component or a parameter change on an inner node; select the folder icon on the pipeline component node to drill into the details. Other changes can be spotted by viewing the colored nodes in the compare graph.

:::image type="content" source="./media/how-to-use-pipeline-ui/parameter-changed.png" alt-text="Screenshot showing a changed parameter and the component information tab." lightbox="./media/how-to-use-pipeline-ui/parameter-changed.png":::

### Compare pipeline meta info and properties

If you investigate the dataset difference and find that data or topology doesn't seem to be the root cause of the failure, you can also check the pipeline details, like pipeline parameters, outputs, or run settings.

**Compare graph** compares pipeline topology, while **Compare detail** compares pipeline properties such as meta info and settings.

To access the detail comparison, go to the comparison list and select **Compare details**, or select **Show compare details** on the pipeline comparison page.

You'll see *Pipeline properties* and *Run properties*.

- Pipeline properties include pipeline parameters, run and output settings, and so on.
- Run properties include job status, submit time, duration, and so on.

The following screenshot shows an example of using the detail comparison, where the default compute setting might have been the reason for the failure.

:::image type="content" source="./media/how-to-use-pipeline-ui/compute.png" alt-text="Screenshot showing the comparison overview of the default compute." lightbox="./media/how-to-use-pipeline-ui/compute.png":::

To quickly check the topology comparison, select the pipeline name and select **Compare graph**.

:::image type="content" source="./media/how-to-use-pipeline-ui/compare-graph.png" alt-text="Screenshot of detail comparison with compare graph highlighted." lightbox="./media/how-to-use-pipeline-ui/compare-graph.png":::

### How to debug your failed node in a pipeline by comparing it to a similar completed node

If you only updated node properties and changed nothing else in the pipeline, you can debug the node by comparing it with other jobs submitted from the same component.

#### Find the job to compare with

1. Find a successful job to compare with by viewing all runs submitted from the same component.
1. Right-click the failed node and select *View jobs*. This gives you a list of all the jobs.

:::image type="content" source="./media/how-to-use-pipeline-ui/view-jobs.png" alt-text="Screenshot that shows a failed node with view jobs highlighted." lightbox="./media/how-to-use-pipeline-ui/view-jobs.png":::

1. Choose a completed job as the comparison target.
1. After you've found a failed job and a completed job to compare, add the two jobs to the comparison candidate list.
1. For the failed node, right-click it and select *Add to compare*.
1. For the completed job, go to its parent pipeline and locate the completed job. Then select *Add to compare*.
1. Once the two jobs are in the comparison list, select **Compare detail** to show the differences.

### Share the comparison results

To share your comparison results, select **Share** and copy the link. For example, you might find that a dataset difference could have led to the failure, but you aren't a dataset specialist; you can share the comparison result with a data engineer on your team.

:::image type="content" source="./media/how-to-use-pipeline-ui/share.png" alt-text="Screenshot showing the share button and the link you should copy." lightbox="./media/how-to-use-pipeline-ui/share.png":::

## View profiling to debug pipeline performance issues (preview)

Profiling (preview) can help you debug pipeline performance issues such as hangs and unexpectedly long-running steps. Profiling lists the duration of each step in a pipeline and provides a Gantt chart for visualization.

Profiling enables you to:

- Quickly find which node takes longer than expected.
- Identify the time the job spends in each status.

To enable this feature:

1. Navigate to the Azure Machine Learning studio UI.
2. Select **Manage preview features** (megaphone icon) among the icons on the top right of the screen.
3. In the **Manage preview features** panel, toggle on the **View profiling to debug pipeline performance issues** feature.

### How to find the node that runs the longest overall

1. On the Jobs page, select the job name to enter the job detail page.
1. In the action bar, select **View profiling**. Profiling only works for a root-level pipeline. It can take a few minutes to load the next page.

:::image type="content" source="./media/how-to-use-pipeline-ui/view-profiling.png" alt-text="Screenshot showing the pipeline at root level with the view profiling button highlighted." lightbox="./media/how-to-use-pipeline-ui/view-profiling.png":::

1. After the profiler loads, you'll see a Gantt chart. By default, the critical path of the pipeline is shown. The critical path is the subsequence of steps that determines the pipeline job's total duration (a minimal sketch of how it can be computed from step durations appears at the end of this section).

:::image type="content" source="./media/how-to-use-pipeline-ui/critical-path.png" alt-text="Screenshot showing the Gantt chart and the critical path." lightbox="./media/how-to-use-pipeline-ui/critical-path.png":::

1. To find the step that takes the longest, you can either view the Gantt chart or the table below it.

In the Gantt chart, the length of each bar shows how long a step takes; longer bars mean more time. You can also filter the table below by "total duration". When you select a row in the table, the corresponding node is shown in the Gantt chart; when you select a bar in the Gantt chart, the corresponding row is highlighted in the table.

In the table, reuse is denoted with the recycling icon.

If you select the log icon next to the node name, the detail page opens, showing parameters, code, outputs, logs, and so on.

:::image type="content" source="./media/how-to-use-pipeline-ui/detail-page-from-log-icon.png" alt-text="Screenshot highlighting the log icon and showing the detail page." lightbox="./media/how-to-use-pipeline-ui/detail-page-from-log-icon.png":::

If you're trying to shorten the queue time for a node, you can increase the number of compute nodes or modify the job priority to get more compute resources for that node.

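The studio computes the critical path for you, but to make the definition concrete, here's a minimal, self-contained sketch that computes the critical path of a toy step graph from per-step durations. The step names, durations, and dependency structure are all made-up illustrations, not values from this article.

```python
# Toy illustration of a critical path: the chain of dependent steps whose
# durations add up to the pipeline's total runtime. All values are made up.
from functools import lru_cache

durations = {"prep": 120, "train_a": 600, "train_b": 90, "evaluate": 60}  # seconds
dependencies = {  # step -> steps it depends on
    "prep": [],
    "train_a": ["prep"],
    "train_b": ["prep"],
    "evaluate": ["train_a", "train_b"],
}

@lru_cache(maxsize=None)
def finish_time(step: str) -> int:
    """Earliest finish time of a step, assuming unlimited parallelism."""
    deps = dependencies[step]
    return durations[step] + (max(finish_time(d) for d in deps) if deps else 0)

# The critical path ends at the step with the latest finish time; walk backwards
# through the predecessor that finishes last.
step = max(durations, key=finish_time)
path = [step]
while dependencies[step]:
    step = max(dependencies[step], key=finish_time)
    path.append(step)
path.reverse()

print("critical path:", " -> ".join(path))            # prep -> train_a -> evaluate
print("total duration:", finish_time(path[-1]), "s")  # 780 s
```

In this toy graph, speeding up `train_b` wouldn't shorten the pipeline at all; only the steps on the critical path (`prep`, `train_a`, `evaluate`) determine the total duration, which is why the Gantt chart highlights that path by default.
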
### How to find the node that runs the longest in each status

Besides the total duration, you can also sort by the duration of each status. For example, you can sort by *Preparing* duration to see which step spends the most time on image building. Then you can open the detail page to check whether image building failed because of a timeout issue.

#### What to do if a duration issue is identified

Statuses and definitions:

| Status | What does it mean? | Time estimation | Next step |
|------|--------------|-------------|----------|
| Not started | The job is submitted from the client side and accepted by Azure Machine Learning services. Time spent in this stage is mainly in Azure Machine Learning service scheduling and preprocessing. | If there's no backend service issue, this time should be very short. | Open a support case via the Azure portal. |
| Preparing | In this status, the job is waiting for its dependencies to be prepared, for example, for the environment image to be built. | If you're using a curated or registered custom environment, this time should be very short. | Check the image build log. |
| Inqueue | The job is waiting for compute resources to be allocated. Time spent in this stage mainly depends on the status of your compute cluster (or the job yield policy, for scope jobs). | If you're using a cluster with enough compute resources, this time should be short. | Check with your workspace admin whether to increase the max nodes of the target compute, or move the job to another, less busy compute. |
| Running | The job is executing on remote compute. Time spent in this stage falls mainly into two parts: <br> Runtime preparation: image pulling, Docker startup, and data preparation (mount or download). <br> User script execution. | This status is expected to be the most time-consuming one. | 1. Go to the source code and check whether there's a user error. <br> 2. View the monitoring tab of compute metrics (CPU, memory, networking, and so on) to identify the bottleneck. <br> 3. Try online debugging with [interactive endpoints](how-to-interactive-jobs.md) while the job is running, or debug your code locally. |
| Finalizing | The job is in post-processing after execution completes. Time spent in this stage is mainly for post-processes such as output uploading, metric/log uploading, and resource cleanup. | This stage is short for a command job, but it might be very long for a PRS/MPI job because, for a distributed job, the finalizing status lasts from when the first node starts finalizing until the last node finishes. | If you find an unexpectedly long finalizing time, change your step's output mode from upload to mount (see the sketch after this table), or open a support case via the Azure portal. |

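As one illustration of the "upload to mount" suggestion in the Finalizing row, the following sketch shows how a step output's mode might be set with the Azure Machine Learning Python SDK v2. The component file, compute name, and output name (`score.yml`, `cpu-cluster`, `predictions`) are assumptions, and `rw_mount` is used here as the mount-style output mode.

```python
# Hypothetical sketch: declare a step output with a mount-style mode instead of
# "upload", which can shorten the Finalizing stage for jobs that write many files.
# The component, output name, and compute name are illustrative assumptions.
from azure.ai.ml import MLClient, Output, dsl, load_component
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

score = load_component(source="./components/score.yml")  # assumed component

@dsl.pipeline(default_compute="cpu-cluster")  # assumed compute cluster name
def scoring_pipeline():
    score_step = score()
    # Write the output through a mount instead of uploading it after the run ends.
    score_step.outputs.predictions = Output(
        type="uri_folder",
        mode="rw_mount",  # instead of "upload"
    )
    return {"predictions": score_step.outputs.predictions}

pipeline_job = scoring_pipeline()
ml_client.jobs.create_or_update(pipeline_job, experiment_name="finalizing-demo")
```
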
Along with profiling, you can also use the *Outputs + logs* tab (on the details page) and, for PRS/MPI jobs, the monitoring metrics enabled by the Common Runtime.

### Different views of the Gantt chart

- Critical path
  - You'll see only the step jobs on the pipeline's critical path (jobs that have a dependency).
  - By default, the critical path of the pipeline job is shown.
- Flatten view
  - You'll see all step jobs.
  - In this view, you'll see more nodes than in the critical path view.
- Compact view
  - You'll only see step jobs that run longer than 30 seconds.
- Structured view
  - You'll see all jobs, including pipeline component jobs and step jobs.

### Download the duration table

To export the table, select **Export CSV**.

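If you want to analyze the exported durations outside the studio, a short sketch like the following could load the CSV with pandas. The file name and column names (`pipeline-durations.csv`, `Step name`, `Total duration (s)`) are assumptions; check the header of your actual export before relying on them.

```python
# Minimal sketch: inspect the exported duration table offline. The file name and
# column names are assumptions; confirm them against your actual CSV export.
import pandas as pd

df = pd.read_csv("pipeline-durations.csv")
print(df.columns.tolist())  # confirm the real column names first

duration_col = "Total duration (s)"  # replace with the actual duration column name
top_steps = df.sort_values(duration_col, ascending=False).head(5)
print(top_steps[["Step name", duration_col]])  # the five longest-running steps
```
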
## Next steps

In this article, you learned the key features for creating, exploring, and debugging a pipeline in the studio UI. To learn more about how you can use pipelines, see the following articles: