Commit d973b06

touchups
1 parent a653338 commit d973b06

File tree

1 file changed: +18 -18 lines changed


articles/machine-learning/how-to-use-parallel-job-in-pipeline.md

Lines changed: 18 additions & 18 deletions
@@ -1,7 +1,7 @@
---
title: How to use parallel jobs in pipelines
titleSuffix: Azure Machine Learning
-description: Learn how to run parallel jobs in Azure Machine Learning pipelines by using the CLI v2 and Python SDK v2.
+description: Learn how to configure and run parallel jobs in Azure Machine Learning pipelines by using the CLI v2 and Python SDK v2.
services: machine-learning
ms.service: azure-machine-learning
ms.subservice: mlops
@@ -71,12 +71,12 @@ The following sections describe how to prepare the parallel job.

A parallel job requires one major input to be split and processed in parallel. The major input data format can be either tabular data or a list of files.

-Different data formats have different input types, input modes, and data division methods. The following table illustrates the options:
+Different data formats have different input types, input modes, and data division methods. The following table describes the options:

| Data format | Input type | Input mode | Data division method |
|: ---------- |: ------------- |: ------------- |: --------------- |
-| File list | `mltable` or `uri_folder` | ro_mount or download | By size (number of files) or by partition |
-| Tabular data | `mltable` | direct | By size (estimated physical size) or by partition |
+| File list | `mltable` or `uri_folder` | `ro_mount` or `download` | By size (number of files) or by partition |
+| Tabular data | `mltable` | `direct` | By size (estimated physical size) or by partition |
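
As a rough, editorial sketch of these input options (assuming the Python SDK v2 `azure-ai-ml` package; the paths shown are hypothetical placeholders), the major input might be declared like this:

```python
# Illustrative only; assumes the azure-ai-ml (SDK v2) Input class.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# File list: mltable or uri_folder input, mounted read-only or downloaded
file_list_input = Input(
    path="azureml://datastores/workspaceblobstore/paths/input_images/",  # hypothetical path
    type=AssetTypes.URI_FOLDER,
    mode=InputOutputModes.RO_MOUNT,  # or InputOutputModes.DOWNLOAD
)

# Tabular data: mltable input with direct mode
tabular_input = Input(
    path="./data/my_mltable_folder",  # hypothetical folder containing an MLTable file
    type=AssetTypes.MLTABLE,
    mode=InputOutputModes.DIRECT,
)
```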

> [!NOTE]
> If you use tabular `mltable` as your major input data, you need to:
@@ -154,20 +154,20 @@ The following code declares the `job_data_path` as input, binds it to the `input

Azure Machine Learning parallel job exposes many optional settings that can automatically control the job without manual intervention. The following table describes these settings.

-| Key | Type | Description | Allowed values | Default value | Set in attribute | Set in program arguments |
-|--|--|--|--|--|--|--|
-| `mini_batch_error_threshold` | integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code. | `[-1, int.max]` | `-1`, meaning ignore all failed mini-batches | `mini_batch_error_threshold` | N/A |
-| `mini_batch_max_retries` | integer | Number of retries when the mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation. | `[0, int.max]` | `2` | `retry_settings.max_retries` | N/A |
-| `mini_batch_timeout` | integer | Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. | `(0, 259200]` | `60` | `retry_settings.timeout` | N/A |
-| `item_error_threshold` | integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. | `[-1, int.max]` | `-1`, meaning ignore all failures during parallel job. | N/A | `--error_threshold` |
-| `allowed_failed_percent` | integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. | `[0, 100]` | `100` | N/A | `--allowed_failed_percent` |
-| `overhead_timeout` | integer | Timeout in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. | `(0, 259200]` | `600` | N/A | `--task_overhead_timeout` |
-| `progress_update_timeout` | integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | `(0, 259200]` | Dynamically calculated by other settings. | N/A | `--progress_update_timeout` |
-| `first_task_creation_timeout` | integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | `(0, 259200]` | `600` | N/A | --first_task_creation_timeout |
-| `logging_level` | string | The level of logs to dump to user log files. | `INFO`, `WARNING`, or `DEBUG` | `INFO` | `logging_level` | N/A |
-| `append_row_to` | string | Aggregate all returns from each run of the mini-batch and output it into this file. May refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}` | | | `task.append_row_to` | N/A |
+| Key | Type | Description | Allowed values | Default value | Set in attribute or program argument |
+|--|--|--|--|--|--|
+| `mini_batch_error_threshold` | integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code. | `[-1, int.max]` | `-1`, meaning ignore all failed mini-batches | Attribute `mini_batch_error_threshold` |
+| `mini_batch_max_retries` | integer | Number of retries when the mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation. | `[0, int.max]` | `2` | Attribute `retry_settings.max_retries` |
+| `mini_batch_timeout` | integer | Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. | `(0, 259200]` | `60` | Attribute `retry_settings.timeout` |
+| `item_error_threshold` | integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. | `[-1, int.max]` | `-1`, meaning ignore all failures during parallel job | Program argument `--error_threshold` |
+| `allowed_failed_percent` | integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. | `[0, 100]` | `100` | Program argument `--allowed_failed_percent` |
+| `overhead_timeout` | integer | Timeout in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. | `(0, 259200]` | `600` | Program argument `--task_overhead_timeout` |
+| `progress_update_timeout` | integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | `(0, 259200]` | Dynamically calculated by other settings | Program argument `--progress_update_timeout` |
+| `first_task_creation_timeout` | integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | `(0, 259200]` | `600` | Program argument `--first_task_creation_timeout` |
+| `logging_level` | string | The level of logs to dump to user log files. | `INFO`, `WARNING`, or `DEBUG` | `INFO` | Attribute `logging_level` |
+| `append_row_to` | string | Aggregate all returns from each run of the mini-batch and output it into this file. May refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}` | | | Attribute `task.append_row_to` |
| `copy_logs_to_parent` | string | Boolean option whether to copy the job progress, overview, and logs to the parent pipeline job. | `True` or `False` | `False` | N/A | `--copy_logs_to_parent` |
-| `resource_monitor_interval` | integer | Time interval in seconds to dump node resource usage (for example cpu or memory) to log folder under the *logs/sys/perf* path.<br><br>**Note:** Frequent dump resource logs slightly slow execution speed. Set this value to `0` to stop dumping resource usage. | `[0, int.max]` | `600` | N/A | `--resource_monitor_interval` |
+| `resource_monitor_interval` | integer | Time interval in seconds to dump node resource usage (for example cpu or memory) to log folder under the *logs/sys/perf* path.<br><br>**Note:** Frequent dump resource logs slightly slow execution speed. Set this value to `0` to stop dumping resource usage. | `[0, int.max]` | `600` | Program argument `--resource_monitor_interval` |
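
A minimal, editorial sketch of how the attribute and program-argument settings above might be wired up with the Python SDK v2 `parallel_run_function` (the component name, entry script, environment, and values here are hypothetical):

```python
# Illustrative only; assumes azure-ai-ml (SDK v2) parallel_run_function and RunFunction.
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.parallel import RunFunction, parallel_run_function

parallel_step = parallel_run_function(
    name="batch_score",  # hypothetical component name
    inputs=dict(job_data_path=Input(type=AssetTypes.MLTABLE)),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size="100",
    # Settings set as attributes
    mini_batch_error_threshold=5,
    logging_level="DEBUG",
    retry_settings=dict(max_retries=2, timeout=60),
    task=RunFunction(
        code="./src",                             # hypothetical source folder
        entry_script="score.py",                  # hypothetical entry script
        environment="azureml:my-parallel-env:1",  # hypothetical environment
        # Settings passed as program arguments
        program_arguments="--error_threshold 5 "
        "--allowed_failed_percent 30 "
        "--task_overhead_timeout 1200 "
        "--progress_update_timeout 600 "
        "--first_task_creation_timeout 600 "
        "--copy_logs_to_parent True "
        "--resource_monitor_interval 20 ",
        # Task attribute: aggregate per-mini-batch returns into the job output
        append_row_to="${{outputs.job_output_path}}",
    ),
)
```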

The following sample code updates these settings:

@@ -235,7 +235,7 @@ To view parallel job results, double-click the parallel step in the pipeline gra

:::image type="content" source="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png" alt-text="Screenshot of Azure Machine Learning studio showing the parallel job settings." lightbox ="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png":::

-To debug parallel job failure, select the **Outputs + logs** tab, expand the *logs* folder, and check *job_result.txt* to understand why the parallel job failed. For more details about the logging structure of parallel jobs, see *readme.txt* in the same folder.
+To debug parallel job failure, select the **Outputs + logs** tab, expand the *logs* folder, and check *job_result.txt* to understand why the parallel job failed. For information about the logging structure of parallel jobs, see *readme.txt* in the same folder.

:::image type="content" source="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-result.png" alt-text="Screenshot of Azure Machine Learning studio on the jobs tab showing the parallel job results." lightbox ="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-result.png":::