articles/machine-learning/how-to-use-parallel-job-in-pipeline.md
11 additions & 11 deletions
@@ -156,18 +156,18 @@ Azure Machine Learning parallel job exposes many optional settings that can auto

 | Key | Type | Description | Allowed values | Default value | Set in attribute or program argument |
 |--|--|--|--|--|--|
-|`mini_batch_error_threshold`| integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code. |`[-1, int.max]`|`-1`, meaning ignore all failed mini-batches | Attribute `mini_batch_error_threshold`|
+|`mini_batch_error_threshold`| integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code. |`[-1, int.max]`|`-1`, meaning ignore all failed mini-batches | Attribute `mini_batch_error_threshold`|
 |`mini_batch_max_retries`| integer | Number of retries when the mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation. |`[0, int.max]`|`2`| Attribute `retry_settings.max_retries`|
 |`mini_batch_timeout`| integer | Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. |`(0, 259200]`|`60`| Attribute `retry_settings.timeout`|
-|`item_error_threshold`| integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. |`[-1, int.max]`|`-1`, meaning ignore all failures during parallel job | Program argument`--error_threshold`|
-|`allowed_failed_percent`| integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. |`[0, 100]`|`100`| Program argument`--allowed_failed_percent`|
-|`overhead_timeout`| integer | Timeout in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. |`(0, 259200]`|`600`| Program argument`--task_overhead_timeout`|
-|`progress_update_timeout`| integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. |`(0, 259200]`| Dynamically calculated by other settings | Program argument`--progress_update_timeout`|
-|`first_task_creation_timeout`| integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. |`(0, 259200]`|`600`| Program argument`--first_task_creation_timeout`|
+|`item_error_threshold`| integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. |`[-1, int.max]`|`-1`, meaning ignore all failures during parallel job | Program argument<br>`--error_threshold`|
+|`allowed_failed_percent`| integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. |`[0, 100]`|`100`| Program argument<br>`--allowed_failed_percent`|
+|`overhead_timeout`| integer | Timeout in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. |`(0, 259200]`|`600`| Program argument<br>`--task_overhead_timeout`|
+|`progress_update_timeout`| integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. |`(0, 259200]`| Dynamically calculated by other settings | Program argument<br>`--progress_update_timeout`|
+|`first_task_creation_timeout`| integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. |`(0, 259200]`|`600`| Program argument<br>`--first_task_creation_timeout`|
 |`logging_level`| string | The level of logs to dump to user log files. |`INFO`, `WARNING`, or `DEBUG`|`INFO`| Attribute `logging_level`|
 |`append_row_to`| string | Aggregate all returns from each run of the mini-batch and output it into this file. May refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}`||| Attribute `task.append_row_to`|
-|`copy_logs_to_parent`| string | Boolean option whether to copy the job progress, overview, and logs to the parent pipeline job. |`True` or `False`|`False`|N/A |`--copy_logs_to_parent`|
-|`resource_monitor_interval`| integer | Time interval in seconds to dump node resource usage (for example cpu or memory) to log folder under the *logs/sys/perf* path.<br><br>**Note:** Frequent dump resource logs slightly slow execution speed. Set this value to `0` to stop dumping resource usage. |`[0, int.max]`|`600`| Program argument`--resource_monitor_interval`|
+|`copy_logs_to_parent`| string | Boolean option whether to copy the job progress, overview, and logs to the parent pipeline job. |`True` or `False`|`False`|Program argument<br>`--copy_logs_to_parent`|
+|`resource_monitor_interval`| integer | Time interval in seconds to dump node resource usage (for example cpu or memory) to log folder under the *logs/sys/perf* path.<br><br>**Note:** Frequent dump resource logs slightly slow execution speed. Set this value to `0` to stop dumping resource usage. |`[0, int.max]`|`600`| Program argument<br>`--resource_monitor_interval`|

 The following sample code updates these settings:

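Reviewer note: the failure rules described in the settings table for `mini_batch_error_threshold`, `item_error_threshold`, and `allowed_failed_percent` can be sketched in plain Python. This is only an illustration of the rules as the table words them, not the service's implementation and not the article's own sample. `MiniBatchResult`, `exceeds`, and `job_failed` are hypothetical names, and the sketch assumes each result already reflects the outcome after all retries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MiniBatchResult:
    """Outcome of one mini-batch (hypothetical record, not an SDK type)."""
    input_count: int      # items passed to run()
    return_count: int     # items returned by run()
    raised: bool = False  # True if run() raised an exception

    @property
    def failed(self) -> bool:
        # Failed if run() raised or returned fewer results than inputs.
        return self.raised or self.return_count < self.input_count

    @property
    def failed_items(self) -> int:
        # Gap between inputs and returns for this mini-batch.
        return max(self.input_count - self.return_count, 0)


def exceeds(count: int, threshold: int) -> bool:
    """A threshold of -1 disables the check; otherwise fail when count > threshold."""
    return threshold != -1 and count > threshold


def job_failed(
    results: List[MiniBatchResult],
    mini_batch_error_threshold: int = -1,
    item_error_threshold: int = -1,
    allowed_failed_percent: int = 100,
) -> bool:
    """Return True if any of the three limits from the table is exceeded."""
    failed_batches = sum(1 for r in results if r.failed)
    failed_items = sum(r.failed_items for r in results)
    failed_percent = 100.0 * failed_batches / len(results) if results else 0.0
    return (
        exceeds(failed_batches, mini_batch_error_threshold)
        or exceeds(failed_items, item_error_threshold)
        or failed_percent > allowed_failed_percent
    )


if __name__ == "__main__":
    sample = [
        MiniBatchResult(input_count=10, return_count=10),
        MiniBatchResult(input_count=10, return_count=7),
        MiniBatchResult(input_count=10, return_count=0, raised=True),
    ]
    # Two failed mini-batches and 13 failed items in total.
    print(job_failed(sample))                                # defaults ignore all failures
    print(job_failed(sample, mini_batch_error_threshold=1))  # 2 failed batches > 1
    print(job_failed(sample, item_error_threshold=13))       # 13 failed items is not > 13
```

With the defaults (`-1`, `-1`, `100`) no combination of failures marks the job as failed, which matches the "ignore all" wording in the table. Tightening any one of the three settings is enough to fail the job.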
@@ -241,6 +241,6 @@ To debug parallel job failure, select the **Outputs + logs** tab, expand the *lo