Commit 4edadca

committed
edits
1 parent 12efb96 commit 4edadca


3 files changed: +24 −24 lines changed

articles/machine-learning/how-to-use-parallel-job-in-pipeline.md

Lines changed: 24 additions & 24 deletions
@@ -19,12 +19,12 @@ ms.custom: devx-track-python, sdkv2, cliv2, update-code1

 This article explains how to use the CLI v2 and Python SDK v2 to run parallel jobs in Azure Machine Learning pipelines. Parallel jobs accelerate job execution by distributing repeated tasks on powerful multinode compute clusters.

-For example, in a scenario where you're running an object detection model on a large set of images, Azure Machine Learning parallel jobs let you easily distribute your images to run custom code in parallel on a specific compute cluster. Parallelization can significantly reduce time cost. Azure Machine Learning parallel jobs can also simplify and automate your process to make it more efficient.
-
 Machine learning engineers always have scale requirements on their training or inferencing tasks. For example, when a data scientist provides a single script to train a sales prediction model, machine learning engineers need to apply this training task to each individual data store. Challenges of this scale-out process include long execution times that cause delays, and unexpected issues that require manual intervention to keep the task running.

 The core job of Azure Machine Learning parallelization is to split a single serial task into mini-batches and dispatch those mini-batches to multiple computes to execute in parallel. Parallel jobs significantly reduce end-to-end execution time and also handle errors automatically. Consider using Azure Machine Learning Parallel job to train many models on top of your partitioned data or to accelerate your large-scale batch inferencing tasks.

+For example, in a scenario where you're running an object detection model on a large set of images, Azure Machine Learning parallel jobs let you easily distribute your images to run custom code in parallel on a specific compute cluster. Parallelization can significantly reduce time cost. Azure Machine Learning parallel jobs can also simplify and automate your process to make it more efficient.
+
 ## Prerequisites

 - Have an Azure Machine Learning account and workspace.
@@ -58,18 +58,18 @@ The following examples come from the [Build a simple machine learning pipeline w

 ### Prepare for parallelization

-This parallel job step requires preparation. In your parallel job definition, you need to set attributes that:
+This parallel job step requires preparation. You need an entry script that implements the predefined functions. You also need to set attributes in your parallel job definition that:

 - Define and bind your input data.
 - Set the data division method.
 - Configure your compute resources.
 - Call the entry script.

-You also need an entry script that implements the predefined functions. The following sections describe how to prepare for the parallel job.
+The following sections describe how to prepare the parallel job.

 #### Declare the inputs and data division setting

-A parallel job requires one major input to be split and processed in parallel. The major input data can be either tabular data or a list of files.
+A parallel job requires one major input to be split and processed in parallel. The major input data format can be either tabular data or a list of files.

 Different data formats have different input types, input modes, and data division methods. The following table illustrates the options:
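As an aside on the data division methods this hunk describes, the sketch below illustrates (it is not the service's implementation) how a file-list input is divided into mini-batches when `mini_batch_size` is a file count; the file names are hypothetical:

```python
def split_into_mini_batches(files, mini_batch_size):
    """Divide a list of input file paths into mini-batches of at most
    mini_batch_size files each; the service dispatches each mini-batch
    to a worker on the compute cluster."""
    return [files[i:i + mini_batch_size]
            for i in range(0, len(files), mini_batch_size)]

batches = split_into_mini_batches([f"img_{i}.png" for i in range(7)], 3)
```

Note that for tabular input the same attribute is interpreted differently (a physical data size rather than a file count), as the article's options table describes.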

@@ -81,7 +81,7 @@ Different data formats have different input types, input modes, and data divisio

 > [!NOTE]
 > If you use tabular `mltable` as your major input data, you need to:
 >- Install the `mltable` library in your environment, as in line 9 of this [conda file](https://github.com/Azure/azureml-examples/blob/main/cli/jobs/parallel/1a_oj_sales_prediction/src/parallel_train/conda.yaml).
->- Have a *MLTable* specification file under your specified path with the `transformations: - read_delimited:` section filled out. For examples, see [Create a mltable data asset](how-to-create-register-data-assets.md#create-a-mltable-data-asset) .
+>- Have a *MLTable* specification file under your specified path with the `transformations: - read_delimited:` section filled out. For examples, see [Create and manage data assets](how-to-create-data-assets.md).

 You can declare your major input data with the `input_data` attribute in the parallel job YAML or Python, and bind the data with the defined `input` of your parallel job by using `${{inputs.<input name>}}`. Then you define the data division attribute for your major input depending on your data division method.

@@ -110,7 +110,7 @@ The entry script is a single Python file that implements the following three pre

 | Function name | Required | Description | Input | Return |
 | :------------ | -------- | :---------- | :---- | :----- |
 | `Init()` | Y | Common preparation before starting to run mini-batches. For example, use this function to load the model into a global object. | -- | -- |
-| `Run(mini_batch)` | Y | Implement main execution logic for mini-batches. | `mini_batch` is pandas dataframe if input data is a tabular data, or file path list if input data is a directory. | Dataframe, list, or tuple. |
+| `Run(mini_batch)` | Y | Implements main execution logic for mini-batches. | `mini_batch` is pandas dataframe if input data is a tabular data, or file path list if input data is a directory. | Dataframe, list, or tuple. |
 | `Shutdown()` | N | Optional function to do custom cleanups before returning the compute to the pool. | -- | -- |

 > [!IMPORTANT]
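As a companion to the function table in the hunk above, here is a minimal entry-script sketch for file-list input. The uppercase-basename "model" and the file names are hypothetical stand-ins, not part of this commit; real entry scripts load an actual model in the same place:

```python
import os

model = None  # global object populated once per worker in init()

def init():
    """Common preparation before any mini-batch runs, e.g. load the model."""
    global model
    # Hypothetical stand-in for loading a real model object.
    model = lambda path: os.path.basename(path).upper()

def run(mini_batch):
    """Main execution logic; mini_batch is a list of file paths for file input."""
    results = []
    for file_path in mini_batch:
        results.append(model(file_path))
    # Return one item per input so the input/return gap counted by
    # item_error_threshold stays at zero.
    return results

def shutdown():
    """Optional cleanup before the compute returns to the pool."""
    pass
```

For example, calling `init()` and then `run(["images/cat.jpg", "images/dog.png"])` yields one result per input file.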
@@ -128,7 +128,7 @@ See the following entry script examples:

 - [Image identification for a list of image files](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/Code/digit_identification.py)
 - [Iris classification for a tabular iris data](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/Code/iris_score.py)

-To use the entry script, set the following two attributes in your parallel job definition:
+To call the entry script, set the following two attributes in your parallel job definition:

 | Attribute name | Type | Description |
 |: ------------- | ---- |: ---------- |
@@ -139,8 +139,8 @@ To use the entry script, set the following two attributes in your parallel job d

 # [Azure CLI](#tab/cliv2)

-The following parallel job step declares the input type and mode, binds the input and sets the `mini_batch_size` data division attribute, and calls the entry script.
-:::code language="yaml" source="~/azureml-examples-main/cli/jobs/pipelines/iris-batch-prediction-using-parallel/pipeline.yml" range="14-41" highlight="5-8,18-19,20-22,32-33":::
+The following parallel job step declares the input type, mode, and data division method, binds the input, configures the compute, and calls the entry script.
+:::code language="yaml" source="~/azureml-examples-main/cli/jobs/pipelines/iris-batch-prediction-using-parallel/pipeline.yml" range="14-51" highlight="5-8,18-22,32-33":::

 # [Python](#tab/python)

@@ -156,20 +156,20 @@ Azure Machine Learning parallel job exposes many settings that can automatically

 | Key | Type | Description | Allowed values | Default value | Set in attribute | Set in program arguments |
 |--|--|--|--|--|--|--|
-| `mini_batch_error_threshold` | integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code.<br><br>`-1` is the default, meaning to ignore all failed mini-batches. | [-1, int.max] | `-1` | `mini_batch_error_threshold` | N/A |
+| `mini_batch_error_threshold` | integer | Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed.<br><br>The mini-batch is marked as failed if:<br>- The count of return from `run()` is less than the mini-batch input count.<br>- Exceptions are caught in custom `run()` code. | `[-1, int.max]` | `-1`, meaning ignore all failed mini-batches | `mini_batch_error_threshold` | N/A |
 | `mini_batch_max_retries` | integer | Number of retries when the mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation. | `[0, int.max]` | `2` | `retry_settings.max_retries` | N/A |
-| `mini_batch_timeout` | integer | Time-out in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. | `(0, 259200]` | `60` | `retry_settings.timeout` | N/A |
-| `item_error_threshold` | integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed.<br><br>Note: `-1` is the default, meaning to ignore all failures during parallel job. | `[-1, int.max]` | `-1` | N/A | `--error_threshold` |
+| `mini_batch_timeout` | integer | Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger retry. | `(0, 259200]` | `60` | `retry_settings.timeout` | N/A |
+| `item_error_threshold` | integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. | `[-1, int.max]` | `-1`, meaning ignore all failures during parallel job. | N/A | `--error_threshold` |
 | `allowed_failed_percent` | integer | Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count. | `[0, 100]` | `100` | N/A | `--allowed_failed_percent` |
-| `overhead_timeout` | integer | Time-out in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. | `(0, 259200]` | `600` | N/A | `--task_overhead_timeout` |
-| `progress_update_timeout` | integer | Time-out in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | `(0, 259200]` | Dynamically calculated by other settings. | N/A | `--progress_update_timeout` |
-| `first_task_creation_timeout` | integer | Time-out in seconds for monitoring the time between the job start and the run of the first mini-batch. | `(0, 259200]` | `600` | N/A | --first_task_creation_timeout |
+| `overhead_timeout` | integer | Timeout in seconds for initialization of each mini-batch. For example, load mini-batch data and pass it to the `run()` function. | `(0, 259200]` | `600` | N/A | `--task_overhead_timeout` |
+| `progress_update_timeout` | integer | Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | `(0, 259200]` | Dynamically calculated by other settings. | N/A | `--progress_update_timeout` |
+| `first_task_creation_timeout` | integer | Timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | `(0, 259200]` | `600` | N/A | --first_task_creation_timeout |
 | `logging_level` | string | The level of logs to dump to user log files. | `INFO`, `WARNING`, or `DEBUG` | `INFO` | `logging_level` | N/A |
 | `append_row_to` | string | Aggregate all returns from each run of the mini-batch and output it into this file. May refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}` | | | `task.append_row_to` | N/A |
 | `copy_logs_to_parent` | string | Boolean option whether to copy the job progress, overview, and logs to the parent pipeline job. | `True` or `False` | `False` | N/A | `--copy_logs_to_parent` |
 | `resource_monitor_interval` | integer | Time interval in seconds to dump node resource usage (for example cpu or memory) to log folder under the *logs/sys/perf* path.<br><br>**Note:** Frequent dump resource logs slightly slow execution speed. Set this value to `0` to stop dumping resource usage. | `[0, int.max]` | `600` | N/A | `--resource_monitor_interval` |

-Sample code to update these settings:
+The following sample code updates these settings:

 # [Azure CLI](#tab/cliv2)
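The two error thresholds in the settings table of this hunk interact as documented above. The following sketch only illustrates that documented decision logic; it is not the service's actual code, and the function name is hypothetical:

```python
def job_failed(failed_mini_batches, mini_batch_error_threshold,
               missing_items, item_error_threshold):
    """Return True if either documented threshold is exceeded.

    A threshold of -1 means ignore that kind of failure entirely.
    missing_items is the gap between mini-batch input counts and the
    counts returned from run(), summed over all mini-batches.
    """
    if mini_batch_error_threshold != -1 and failed_mini_batches > mini_batch_error_threshold:
        return True
    if item_error_threshold != -1 and missing_items > item_error_threshold:
        return True
    return False
```

With both defaults at `-1`, no amount of failures marks the job as failed; a count exactly equal to a non-negative threshold also passes, because the table says the job fails only when the count is *higher than* the threshold.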

@@ -199,7 +199,7 @@ First, import the required libraries, initiate the `ml_client` with proper crede

 [!Notebook-python[] (~/azureml-examples-main/sdk/python/jobs/pipelines/1g_pipeline_with_parallel_nodes/pipeline_with_parallel_nodes.ipynb?name=workspace)]

-Then, implement the parallel job by filling out the `parallel_run_function`:
+Then, implement the parallel job by completing the `parallel_run_function`:

 [!Notebook-python[] (~/azureml-examples-main/sdk/python/jobs/pipelines/1g_pipeline_with_parallel_nodes/pipeline_with_parallel_nodes.ipynb?name=parallel-job-for-tabular-data)]

@@ -231,16 +231,16 @@ Submit your pipeline job with parallel step by using the `jobs.create_or_update`

 After you submit a pipeline job, the SDK or CLI widget gives you a web URL link to the pipeline graph in the Azure Machine Learning studio UI.

-To view parallel job results, double-click the parallel step in the pipeline graph, select the **Parameters** tab in the details panel, expand **Run settings**, and check the **Parallel** section.
+To view parallel job results, double-click the parallel step in the pipeline graph, select the **Settings** tab in the details panel, expand **Run settings**, and then expand the **Parallel** section.

-:::image type="content" source="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png" alt-text="Screenshot of Azure Machine Learning studio on the jobs tab showing the parallel job settings." lightbox ="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png":::
+:::image type="content" source="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png" alt-text="Screenshot of Azure Machine Learning studio showing the parallel job settings." lightbox ="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-settings.png":::

-To debug parallel job failure, select the **Outputs + Logs** tab, expand the **logs** folder, and check *job_result.txt* to understand why the parallel job failed. For more detail about the logging structure of parallel jobs, see *readme.txt* in the same folder.
+To debug parallel job failure, select the **Outputs + logs** tab, expand the *logs* folder, and check *job_result.txt* to understand why the parallel job failed. For more details about the logging structure of parallel jobs, see *readme.txt* in the same folder.

 :::image type="content" source="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-result.png" alt-text="Screenshot of Azure Machine Learning studio on the jobs tab showing the parallel job results." lightbox ="./media/how-to-use-parallel-job-in-pipeline/screenshot-for-parallel-job-result.png":::

 ## Related content

-- For the detailed yaml schema of parallel job, see the [YAML reference for parallel job](reference-yaml-job-parallel.md).
-- For how to onboard your data into MLTABLE, see [Create a mltable data asset](how-to-create-register-data-assets.md#create-a-mltable-data-asset).
-- For how to regularly trigger your pipeline, see [how to schedule pipeline](how-to-schedule-pipeline-job.md).
+- [CLI (v2) parallel job YAML schema](reference-yaml-job-parallel.md).
+- [Create and manage data assets](how-to-create-data-assets.md).
+- [Schedule machine learning pipeline jobs](how-to-schedule-pipeline-job.md).
2 binary files changed: −85 bytes, −251 bytes
