articles/machine-learning/how-to-use-parallel-job-in-pipeline.md (17 additions & 10 deletions)
@@ -23,7 +23,7 @@ For example, in a scenario where you're running an object detection model on a l
Machine learning engineers always have scale requirements on their training or inferencing tasks. For example, when a data scientist provides a single script to train a sales prediction model, machine learning engineers need to apply this training task to each individual data store. Challenges of this scale-out process include long execution times that cause delays, and unexpected issues that require manual intervention to keep the task running.
The core job of Azure Machine Learning parallelization is to split a single serial task into mini-batches and dispatch those mini-batches to multiple computes to execute in parallel. Parallel jobs significantly reduce end-to-end execution time and also handle errors automatically. Consider using an Azure Machine Learning parallel job to train many models on top of partitioned data or to accelerate large-scale batch inferencing tasks.
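The split-dispatch-gather flow described above can be pictured in plain Python. This is an illustrative simulation only; Azure Machine Learning performs this scheduling internally, and every name in the sketch is hypothetical:

```python
# Illustrative sketch: split a serial task into mini-batches and run
# them on multiple workers in parallel, then gather the results.
from concurrent.futures import ThreadPoolExecutor

def split_into_mini_batches(items, batch_size):
    """Split a flat list of work items into fixed-size mini-batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def run(mini_batch):
    """Stand-in for the per-mini-batch work (e.g., scoring a model)."""
    return [item * 2 for item in mini_batch]

items = list(range(10))
mini_batches = split_into_mini_batches(items, batch_size=3)

# Dispatch mini-batches to workers in parallel; pool.map preserves order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run, mini_batches))

flat = [r for batch in results for r in batch]
print(flat)  # each input item doubled, in the original order
```

Because the mini-batches are independent, failures can be retried per batch without rerunning the whole task, which is what enables the automatic error handling mentioned above.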
## Prerequisites
@@ -40,8 +40,12 @@ The core job of Azure Machine Learning parallelization is to split a single seri
- Install the [Azure Machine Learning SDK v2 for Python](/python/api/overview/azure/ai-ml-readme).
- Understand how to [create and run Azure Machine Learning pipelines and components with the Python SDK v2](how-to-create-component-pipeline-python.md).
---
## Create and run a pipeline with a parallel job step
An Azure Machine Learning parallel job can be used only as a step in a pipeline job.
# [Azure CLI](#tab/cli)
The following examples come from [Run a pipeline job using parallel job in pipeline](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/pipelines/iris-batch-prediction-using-parallel/) in the [Azure Machine Learning examples](https://github.com/Azure/azureml-examples) repository.
@@ -50,9 +54,11 @@ The following examples come from [Run a pipeline job using parallel job in pipel
The following examples come from the [Build a simple machine learning pipeline with parallel component](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/pipelines/1g_pipeline_with_parallel_nodes/pipeline_with_parallel_nodes.ipynb) notebook in the [Azure Machine Learning examples](https://github.com/Azure/azureml-examples) repository.
---
### Prepare for parallelization
This parallel job step requires preparation. In your parallel job definition, you need to set attributes that:
- Define and bind your input data.
- Set the data division method.
@@ -69,7 +75,7 @@ Different data formats have different input types, input modes, and data divisio
| Data format | Input type | Input mode | Data division method |
|--|--|--|--|
| File list |`mltable` or `uri_folder`| `ro_mount` or `download` | By size (number of files) or by partition |
| Tabular data |`mltable`| `direct` | By size (estimated physical size) or by partition |
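As a rough illustration of the "by partition" division method, rows that share a partition-key value land in the same mini-batch, so a single `run()` call can handle each partition (for example, one sales model per store). This plain-Python sketch uses invented column names; the service performs this division for you:

```python
# Hypothetical sketch of "by partition" data division: group rows by a
# partition key so each mini-batch holds exactly one partition's rows.
from collections import defaultdict

rows = [
    {"store": "A", "sales": 10},
    {"store": "B", "sales": 7},
    {"store": "A", "sales": 12},
    {"store": "B", "sales": 5},
]

def divide_by_partition(rows, key):
    """Return one mini-batch (list of rows) per distinct key value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return list(partitions.values())

mini_batches = divide_by_partition(rows, key="store")
print(len(mini_batches))  # one mini-batch per distinct store
```

"By size" division works analogously but slices the input into chunks of a fixed file count or estimated physical size instead of grouping by key.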
> [!NOTE]
@@ -126,8 +132,8 @@ To use the entry script, set the following two attributes in your parallel job d
| Attribute name | Type | Description |
|:--|:--|:--|
|`code`| string | Local path to the source code directory to upload and use for the job. |
|`entry_script`| string | The Python file that contains the implementation of predefined parallel functions. |
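A minimal sketch of what such an entry script can look like: the service calls `init()` once per worker process to set up shared state, then calls `run()` once per mini-batch, and `run()` should return one result per input item so that the output count matches the input count. The model here is a pure placeholder, not a real scoring implementation:

```python
# Minimal shape of a parallel-job entry script (illustrative only).
model = None

def init():
    """Called once per worker before any mini-batch; load shared state here."""
    global model
    model = lambda x: x.upper()  # placeholder for loading a real model

def run(mini_batch):
    """Called once per mini-batch; return one result per input item."""
    results = []
    for item in mini_batch:
        results.append(model(item))
    return results
```

Returning fewer results than input items counts those items as failed, which feeds into the error-threshold settings described later in this article.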
#### Examples
@@ -152,12 +158,12 @@ Azure Machine Learning parallel job exposes many settings that can automatically
|--|--|--|--|--|--|--|
|`mini_batch_error_threshold`| integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code.<br><br>`-1` is the default, meaning to ignore all failed mini-batches. |[-1, int.max]|`-1`|`mini_batch_error_threshold`| N/A |
|`mini_batch_max_retries`| integer | Number of retries when the mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation. |`[0, int.max]`|`2`|`retry_settings.max_retries`| N/A |
|`mini_batch_timeout`| integer | Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. |`(0, 259200]`|`60`|`retry_settings.timeout`| N/A |
|`item_error_threshold`| integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed.<br><br>Note: `-1` is the default, meaning to ignore all failures during the parallel job. |`[-1, int.max]`|`-1`| N/A |`--error_threshold`|
157
163
|`allowed_failed_percent`| integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. |`[0, 100]`|`100`| N/A |`--allowed_failed_percent`|
|`overhead_timeout`| integer | Timeout in seconds for initialization of each mini-batch, for example, to load mini-batch data and pass it to the `run()` function. |`(0, 259200]`|`600`| N/A |`--task_overhead_timeout`|
|`progress_update_timeout`| integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. |`(0, 259200]`| Dynamically calculated by other settings. | N/A |`--progress_update_timeout`|
|`first_task_creation_timeout`| integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. |`(0, 259200]`|`600`| N/A |`--first_task_creation_timeout`|
|`logging_level`| string | The level of logs to dump to user log files. |`INFO`, `WARNING`, or `DEBUG`|`INFO`|`logging_level`| N/A |
|`append_row_to`| string | Aggregate all returns from each run of the mini-batch and output them into this file. May refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}`. |||`task.append_row_to`| N/A |
|`copy_logs_to_parent`| string | Boolean option that specifies whether to copy the job progress, overview, and logs to the parent pipeline job. |`True` or `False`|`False`| N/A |`--copy_logs_to_parent`|
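To make the interplay of `mini_batch_max_retries` and `mini_batch_error_threshold` concrete, here is a simplified simulation of the documented semantics. This is not the service's actual scheduler code, and the function and variable names are invented:

```python
# Simplified simulation: each mini-batch gets one try plus up to
# mini_batch_max_retries retries; the job fails only when the count of
# failed mini-batches exceeds mini_batch_error_threshold (-1 = ignore all).
def run_job(mini_batches, run_fn, mini_batch_max_retries=2,
            mini_batch_error_threshold=-1):
    failed = 0
    for batch in mini_batches:
        for attempt in range(mini_batch_max_retries + 1):
            try:
                run_fn(batch)
                break  # mini-batch succeeded
            except Exception:
                if attempt == mini_batch_max_retries:
                    failed += 1  # all retries exhausted: mini-batch failed
    job_failed = (mini_batch_error_threshold >= 0
                  and failed > mini_batch_error_threshold)
    return failed, job_failed

calls = {"n": 0}
def flaky(batch):
    calls["n"] += 1
    if calls["n"] == 1:  # fail only on the very first attempt
        raise RuntimeError("transient error")

failed, job_failed = run_job([[1], [2]], flaky)
print(failed, job_failed)  # the retry absorbs the transient error: 0 False
```

The same pattern explains `allowed_failed_percent`, which compares the percentage rather than the count of failed mini-batches against its threshold.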
@@ -171,7 +177,8 @@ Sample code to update these settings: