Commit 4328374

Merge pull request #230495 from lgayhardt/pipelineupdate0322
Update how-to-use-parallel-job-in-pipeline.md
2 parents d64e8fb + 8e15126 commit 4328374

1 file changed: +18 -17 lines changed

articles/machine-learning/how-to-use-parallel-job-in-pipeline.md

Lines changed: 18 additions & 17 deletions
@@ -9,7 +9,7 @@ ms.topic: how-to
 author: alainli
 ms.author: alainli
 ms.reviewer: lagayhar
-ms.date: 09/27/2022
+ms.date: 03/13/2023
 ms.custom: devx-track-python, sdkv2, cliv2, event-tier1-build-2022
 ---
 
@@ -47,23 +47,24 @@ You should consider using Azure Machine Learning Parallel job if:
 
 Unlike other types of jobs, a parallel job requires preparation. Follow the next sections to prepare for creating your parallel job.
 
-### Declare the inputs to be distributed and partition setting
+### Declare the inputs to be distributed and data division setting
 
-Parallel job requires only one **major input data** to be split and processed with parallel. The major input data can be either tabular data or a set of files. Different input types can have a different partition method.
+Parallel job requires only one **major input data** to be split and processed with parallel. The major input data can be either tabular data or a set of files. Different input types can have a different data division method.
 
-The following table illustrates the relation between input data and partition setting:
+The following table illustrates the relation between input data and data division method:
 
-| Data format | Azure Machine Learning input type | Azure Machine Learning input mode | Partition method |
+| Data format | Azure Machine Learning input type | Azure Machine Learning input mode | Data division method |
 |: ---------- |: ------------- |: ------------- |: --------------- |
-| File list | `mltable` or<br>`uri_folder` | ro_mount or<br>download | By size (number of files) |
-| Tabular data | `mltable` | direct | By size (estimated physical size) |
+| File list | `mltable` or<br>`uri_folder` | ro_mount or<br>download | By size (number of files)<br>By partitions |
+| Tabular data | `mltable` | direct | By size (estimated physical size)<br>By partitions |
 
-You can declare your major input data with `input_data` attribute in parallel job YAML or Python SDK. And you can bind it with one of your defined `inputs` of your parallel job by using `${{inputs.<input name>}}`. Then to define the partition method for your major input.
+You can declare your major input data with `input_data` attribute in parallel job YAML or Python SDK. And you can bind it with one of your defined `inputs` of your parallel job by using `${{inputs.<input name>}}`. Then you need to define the data division method for your major input by filling different attribute:
 
-For example, you could set numbers to `mini_batch_size` to partition your data **by size**.
+| Data division method | Attribute name | Attribute type | Job example |
+|: ---------- |: ------------- |: ------------- |: --------------- |
+| By size | mini_batch_size | string | [Iris batch prediction](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/parallel/2a_iris_batch_prediction) |
+| By partitions | partition_keys | list of string | [Orange juice sales prediction](https://github.com/Azure/azureml-examples/blob/main/cli/jobs/parallel/1a_oj_sales_prediction) |
 
-- When using file list input, this value defines the number of files for each mini-batch.
-- When using tabular input, this value defines the estimated physical size for each mini-batch.
 
 # [Azure CLI](#tab/cliv2)
 
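Note on the hunk above: the two data division methods it introduces can be pictured with a minimal plain-Python sketch. This is only an illustration of the idea, not Azure Machine Learning's implementation. The helper names `split_by_size` and `split_by_keys`, the file names, and the sample rows are all hypothetical, and the grouping behaviour shown for `partition_keys` is an assumption based on the "By partitions" wording in the table.

```python
from collections import defaultdict
from itertools import islice


def split_by_size(files, mini_batch_size):
    """Hypothetical helper: yield mini-batches of at most `mini_batch_size` files."""
    iterator = iter(files)
    while batch := list(islice(iterator, mini_batch_size)):
        yield batch


def split_by_keys(rows, partition_keys):
    """Hypothetical helper: group rows so each mini-batch shares the same key values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in partition_keys)].append(row)
    return dict(groups)


# "By size" on a file list: the size is a number of files per mini-batch.
files = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
size_batches = list(split_by_size(files, 2))

# "By partitions" on tabular data: one mini-batch per distinct key combination.
rows = [
    {"store": 1, "brand": "x", "sales": 10},
    {"store": 1, "brand": "x", "sales": 12},
    {"store": 2, "brand": "y", "sales": 7},
]
key_batches = split_by_keys(rows, ["store", "brand"])
```

With size-based division the batch boundaries depend only on input order, whereas key-based division keeps related records together, which is what a per-store or per-brand model would need.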
@@ -81,7 +82,7 @@ Declare `job_data_path` as one of the inputs. Bind it to `input_data` attribute.
 
 ---
 
-Once you have the partition setting defined, you can configure parallel setting by using two attributes below:
+Once you have the data division setting defined, you can configure how many resources for your parallelization by filling two attributes below:
 
 | Attribute name | Type | Description | Default value |
 |:-|--|:-|--|
@@ -156,6 +157,9 @@ Sample code to set two attributes:
 > [!IMPORTANT]
 > If you want to parse arguments in Init() or Run(mini_batch) function, use "parse_known_args" instead of "parse_args" for avoiding exceptions. See the [iris_score](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/Code/iris_score.py) example for entry script with argument parser.
 
+> [!IMPORTANT]
+> If you use `mltable` as your major input data, you need to install 'mltable' library into your environment. See the line 9 of this [conda file](https://github.com/Azure/azureml-examples/blob/main/cli/jobs/parallel/1a_oj_sales_prediction/src/parallel_train/conda.yml) example.
+
 ### Consider automation settings
 
 Azure Machine Learning parallel job exposes numerous settings to automatically control the job without manual intervention. See the following table for the details.
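
Note on the `parse_known_args` advice in the hunk above: a minimal sketch of the difference, using only the standard `argparse` module. The argument names `--model_name` and `--extra_flag` are hypothetical; the assumption is that the entry script receives extra command-line arguments it did not declare itself.

```python
import argparse


def build_parser():
    # allow_abbrev=False avoids accidental prefix matches on undeclared flags.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--model_name", type=str, default="iris_model")
    return parser


# Hypothetical command line: one declared argument plus one the script does not know.
argv = ["--model_name", "my_model", "--extra_flag", "42"]

# parse_known_args returns the parsed namespace and the list of leftover arguments.
args, unknown = build_parser().parse_known_args(argv)

# parse_args rejects the same input: argparse prints a usage error and raises SystemExit.
try:
    build_parser().parse_args(argv)
    strict_parse_failed = False
except SystemExit:
    strict_parse_failed = True
```

Here `args.model_name` holds the declared value and `unknown` collects everything else, so the script can ignore arguments it does not own instead of exiting.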
@@ -255,11 +259,8 @@ To debug the failure of your parallel job, navigate to **Outputs + Logs** tab, e
 
 ## Parallel job in pipeline examples
 
-- Azure CLI + YAML:
-- [Iris prediction using parallel](https://github.com/Azure/azureml-examples/tree/sdk-preview/cli/jobs/pipelines/iris-batch-prediction-using-parallel) (tabular input)
-- [mnist identification using parallel](https://github.com/Azure/azureml-examples/tree/sdk-preview/cli/jobs/pipelines/mnist-batch-identification-using-parallel) (file list input)
-- SDK:
-- [Pipeline with parallel run function](https://github.com/Azure/azureml-examples/blob/sdk-preview/sdk/jobs/pipelines/1g_pipeline_with_parallel_nodes/pipeline_with_parallel_nodes.ipynb)
+- [Azure CLI + YAML example repository](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/parallel)
+- [SDK example repository](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/parallel)
 
 ## Next steps
 