Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. The ParallelRunStep is a high-performance and high-throughput way to generate inferences and processing data. It provides asynchronous capabilities out of the box.
20
+
Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. The ParallelRunStep is a high-performance and high-throughput way to generate inferences and process data. It provides parallelism capabilities out of the box.
21
21
22
22
With ParallelRunStep, it's straightforward to scale offline inferences to large clusters of machines on terabytes of structured or unstructured data, resulting in improved productivity and optimized cost.
23
23
@@ -36,13 +36,13 @@ In this article, you learn the following tasks:
36
36
37
37
* To manage your own environment and dependencies, see the [how-to guide](how-to-configure-environment.md) on configuring your own environment.
38
38
39
-
## Set up machine learning resources
39
+
## Set up resources
40
40
41
41
The following actions set up the resources that you need to run a batch inference pipeline:
42
42
43
43
- Create a datastore that points to a blob container that has images to inference.
44
-
- Set up data references as inputs and outputs for the batch inference pipeline step.
45
-
- Set up a compute cluster to run the batch inference step.
44
+
- Set up data references as inputs and outputs.
45
+
- Set up a compute cluster to run batch inference.
46
46
47
47
### Create a datastore with sample images
48
48
@@ -72,32 +72,40 @@ When you create your workspace, [Azure Files](https://docs.microsoft.com/azure/s
72
72
def_data_store = ws.get_default_datastore()
73
73
```
74
74
75
-
### Configure data inputs and outputs
75
+
### Configure inputs and outputs
76
76
77
-
Now you need to configure data inputs and outputs, including:
77
+
Now you need to configure inputs and outputs, including:
78
78
79
79
- The directory that contains the input images.
80
-
- The directory where the pre-trained model is stored.
81
-
- The directory that contains the labels.
82
-
- The directory for output.
80
+
- The directory for inference output.
83
81
84
-
[`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the inputs to the batch inference pipeline step.
85
-
86
-
You can also reference other datasets in your custom inference script. For example, you can use it to access labels in your script for labeling images by using `Dataset.register` and `Dataset.get_by_name`.
82
+
[`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the inputs. `FileDataset` provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.
87
83
88
84
For more information about Azure Machine Learning datasets, see [Create and access datasets (preview)](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets).
89
85
90
-
[`PipelineData`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) objects are used for transferring intermediate data between pipeline steps. In this example, you use it for inference outputs.
To use a dynamic data input when you run the batch inference pipeline, you can define the input `Dataset` as a [`PipelineParameter`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.graph.pipelineparameter?view=azure-ml-py). You can then specify the input dataset each time you resubmit the batch inference pipeline.
96
+
97
+
```python
98
+
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
99
+
from azureml.pipeline.core import PipelineParameter
[`PipelineData`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) objects are used for transferring intermediate data between pipeline steps. In this example, you use it for inference outputs.
106
+
107
+
```python
108
+
from azureml.pipeline.core import Pipeline, PipelineData
101
109
102
110
output_dir = PipelineData(name="inferences",
103
111
datastore=def_data_store,
@@ -182,7 +190,7 @@ model = Model.register(model_path="models/",
182
190
The script *must contain* two functions:
183
191
-`init()`: Use this function for any costly or common preparation for later inference. For example, use it to load the model into a global object. This function will be called only once at beginning of process.
184
192
-`run(mini_batch)`: The function will run for each `mini_batch` instance.
185
-
-`mini_batch`: Parallel run step will invoke run method and pass either a list or Pandas DataFrame as an argument to the method. Each entry in min_batch will be - a file path if input is a FileDataset, a Pandas DataFrame if input is a TabularDataset.
193
+
-`mini_batch`: ParallelRunStep invokes the `run()` method and passes either a list or a Pandas DataFrame as an argument. Each entry in `mini_batch` is a file path if the input is a `FileDataset`, or a Pandas DataFrame if the input is a `TabularDataset`.
186
194
-`response`: The `run()` method should return a Pandas DataFrame or an array. For the `append_row` output action, the returned elements are appended to the common output file. For `summary_only`, the contents of the elements are ignored. For all output actions, each returned element indicates one successful run of an input element in the input mini-batch. Make sure that the run result includes enough data to map each input to its result. Run output is written to the output file and is not guaranteed to be in order, so use a key in the output to map it back to the input.
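
The `init()` / `run(mini_batch)` contract described above can be tried locally with a small stand-in driver. This is only a sketch under stated assumptions, not how ParallelRunStep is implemented: `drive` is a hypothetical helper, `_placeholder_model` stands in for a real model, and the file paths are made up. It shows `init()` running once, `run()` running once per mini-batch of file paths, and each result carrying the input file name as a key so that unordered output can still be mapped back to its input.

```python
import os

model = None  # global object populated once by init()


def _placeholder_model(path):
    """Hypothetical stand-in for a real model: labels a file by its name length."""
    return "long" if len(os.path.basename(path)) > 8 else "short"


def init():
    """One-time, costly setup; a real script would load the registered model here."""
    global model
    model = _placeholder_model


def run(mini_batch):
    """Process one mini-batch; with FileDataset input each entry is a file path."""
    results = []
    for file_path in mini_batch:
        # Include the file name in every row so output can be mapped back to input.
        results.append("{}: {}".format(os.path.basename(file_path), model(file_path)))
    return results


def drive(file_paths, mini_batch_size):
    """Hypothetical local driver: call init() once, then run() per mini-batch."""
    init()
    output = []
    for start in range(0, len(file_paths), mini_batch_size):
        output.extend(run(file_paths[start:start + mini_batch_size]))
    return output


rows = drive(["/data/a.png", "/data/long_name_1.png", "/data/b.png"], mini_batch_size=2)
```

Because each row starts with the file name, a consumer of the output file does not need to rely on row order.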
187
195
188
196
```python
@@ -245,14 +253,20 @@ Now you have everything you need to build the pipeline.
245
253
246
254
### Prepare the run environment
247
255
248
-
First, specify the dependencies for your script. You use this object later when you create the pipeline step.
256
+
First, specify the dependencies for your script. You use this object later when you create the ParallelRunStep.
257
+
- Please always include **azureml-core** package.
258
+
- If your input is `FileDataset`, please include **azureml-dataprep[fuse]**.
259
+
- If your input is `TabularDataset`, please include **azureml-dataprep[pandas, fuse]**.
260
+
261
+
Because `FileDataset` is used in this example, you need to include the **azureml-dataprep[fuse]** package.
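
The package rules above can be written down as a small lookup. The helper below is a sketch: `required_pip_packages` is a hypothetical name, not part of the SDK, and the package strings are copied from the list above. The returned list is the kind of value you might pass as `pip_packages` when building the conda dependencies for the environment.

```python
def required_pip_packages(input_kind):
    """Return the pip packages listed above for a given input dataset type."""
    packages = ["azureml-core"]  # always required
    if input_kind == "FileDataset":
        packages.append("azureml-dataprep[fuse]")
    elif input_kind == "TabularDataset":
        packages.append("azureml-dataprep[pandas, fuse]")
    else:
        # Only the two dataset types named in the text are handled here.
        raise ValueError("unsupported input kind: {!r}".format(input_kind))
    return packages


# This example uses FileDataset as the input.
file_packages = required_pip_packages("FileDataset")
```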
249
262
250
263
```python
251
264
from azureml.core.environment import Environment
252
265
from azureml.core.conda_dependencies import CondaDependencies
253
266
from azureml.core.runconfig import DEFAULT_GPU_IMAGE
### Specify the parameters for your batch inference pipeline step
278
+
### Specify the parameters for ParallelRunStep using ParallelRunConfig
265
279
266
280
`ParallelRunConfig` is the main configuration for the newly introduced batch inference `ParallelRunStep` instance within the Azure Machine Learning pipeline. You use it to wrap your script and configure the necessary parameters, including all of the following:
267
281
-`entry_script`: A user script as a local file path that will be run in parallel on multiple nodes. If `source_directory` is present, use a relative path. Otherwise, use any path that's accessible on the machine.
-`source_directory`: Paths to folders that contain all files to execute on the compute target (optional).
277
291
-`compute_target`: Only `AmlCompute` is supported.
278
292
-`node_count`: The number of compute nodes to be used for running the user script.
279
-
-`process_count_per_node`: The number of processes per node.
293
+
-`process_count_per_node`: The number of processes per node. Best practice is to set it to the number of GPUs or CPUs that one node has (optional; default value is `1`).
280
294
-`environment`: The Python environment definition. You can configure it to use an existing Python environment or to set up a temporary environment for the experiment. The definition is also responsible for setting the required application dependencies (optional).
281
295
-`logging_level`: Log verbosity. Values in increasing verbosity are: `WARNING`, `INFO`, and `DEBUG`. (optional; the default value is `INFO`)
282
296
-`run_invocation_timeout`: The `run()` method invocation timeout in seconds. (optional; default value is `60`)
283
-
-`run_max_try`: Max call count for `run()` method against a mini batch in case of failure. A `run()` is failed if there's any system error, an exception, or timed out. (optional; default value is `3`)
297
+
-`run_max_try`: The maximum number of times `run()` is called for a mini-batch in case of failure. A `run()` call fails if there's a system error, an exception, or a timeout (optional; default value is `3`).
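
To make the `run_max_try` description concrete, here is a minimal local sketch of a retry loop. It is an illustration only and not the service's implementation: `call_with_retries` and `flaky_run` are hypothetical names, and only exceptions are modeled as failures (system errors and the `run_invocation_timeout` limit are left out).

```python
def call_with_retries(run, mini_batch, run_max_try=3):
    """Call run(mini_batch) up to run_max_try times; return (result, attempts used)."""
    last_error = None
    for attempt in range(1, run_max_try + 1):
        try:
            return run(mini_batch), attempt
        except Exception as error:  # any exception counts as a failed run()
            last_error = error
    # Every attempt failed: give up on this mini-batch and surface the last error.
    raise RuntimeError(
        "mini-batch failed after {} tries".format(run_max_try)
    ) from last_error


calls = {"count": 0}


def flaky_run(mini_batch):
    """Hypothetical run() that fails on its first call and succeeds afterwards."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise IOError("transient failure")
    return [item.upper() for item in mini_batch]


result, attempts = call_with_retries(flaky_run, ["a", "b"])
```

In this sketch a transient failure costs one extra call, while a mini-batch that keeps failing is abandoned once the limit is reached.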
284
298
285
299
```python
286
-
from azureml.contrib.pipeline.steps import ParallelRunConfig
300
+
from azureml.pipeline.steps import ParallelRunConfig
Create the pipeline step by using the script, environment configuration, and parameters. Specify the compute target that you already attached to your workspace as the target of execution for the script. Use `ParallelRunStep` to create the batch inference pipeline step, which takes all the following parameters:
302
316
-`name`: The name of the step, with the following naming restrictions: unique, 3-32 characters, and regex ^\[a-z\]([-a-z0-9]*[a-z0-9])?$.
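
The naming restrictions can be checked locally before you submit a pipeline. The sketch below assumes that the escaped brackets in the pattern above are Markdown escapes, so the intended expression is `^[a-z]([-a-z0-9]*[a-z0-9])?$`. `is_valid_step_name` is a hypothetical helper, and it does not check uniqueness, which depends on the other steps in the pipeline.

```python
import re

# Pattern from the naming restrictions, with the Markdown escapes removed (assumption).
STEP_NAME_PATTERN = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def is_valid_step_name(name):
    """Check the 3-32 character length limit and the character pattern."""
    if not 3 <= len(name) <= 32:
        return False
    # fullmatch requires the whole string to match, like the ^...$ anchors.
    return STEP_NAME_PATTERN.fullmatch(name) is not None


ok = is_valid_step_name("batch-scoring")
```

Names that contain uppercase letters, end with a hyphen, or fall outside the length limits are rejected.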
A batch inference job can take a long time to finish. This example monitors progress by using a Jupyter widget. You can also manage the job's progress by using: