
Commit 425fff3

committed
batch inference GA update draft
1 parent 0f87358 commit 425fff3

File tree

1 file changed

articles/machine-learning/how-to-use-parallel-run-step.md

Lines changed: 59 additions & 28 deletions
@@ -17,7 +17,7 @@ ms.custom: Build2019
# Run batch inference on large amounts of data by using Azure Machine Learning

[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

- Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. The ParallelRunStep is a high-performance and high-throughput way to generate inferences and processing data. It provides asynchronous capabilities out of the box.
+ Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. ParallelRunStep is a high-performance, high-throughput way to generate inferences and process data. It provides parallelism capabilities out of the box.

With ParallelRunStep, it's straightforward to scale offline inferences to large clusters of machines on terabytes of structured or unstructured data, resulting in improved productivity and optimized cost.

@@ -36,13 +36,13 @@ In this article, you learn the following tasks:
* To manage your own environment and dependencies, see the [how-to guide](how-to-configure-environment.md) on configuring your own environment.

- ## Set up machine learning resources
+ ## Set up resources

The following actions set up the resources that you need to run a batch inference pipeline:

- Create a datastore that points to a blob container that has images to run inference on.
- - Set up data references as inputs and outputs for the batch inference pipeline step.
- - Set up a compute cluster to run the batch inference step.
+ - Set up data references as inputs and outputs.
+ - Set up a compute cluster to run batch inference.

### Create a datastore with sample images
@@ -72,32 +72,40 @@ When you create your workspace, [Azure Files](https://docs.microsoft.com/azure/s
def_data_store = ws.get_default_datastore()
```
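The hunk above elides the step that registers the sample-images datastore (`mnist_blob`, which the dataset code below relies on). For orientation only, a minimal sketch of such a registration; the account, container, and datastore names are assumptions, not part of this commit:

```python
from azureml.core.datastore import Datastore

# Hypothetical registration of the blob container that holds the sample images.
# "pipelinedata"/"sampledata" follow the public AML sample data, but verify
# against the full article before relying on them.
mnist_blob = Datastore.register_azure_blob_container(ws,
                      datastore_name="mnist_datastore",
                      container_name="sampledata",
                      account_name="pipelinedata",
                      overwrite=True)
```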

- ### Configure data inputs and outputs
+ ### Configure inputs and outputs

- Now you need to configure data inputs and outputs, including:
+ Now you need to configure inputs and outputs, including:

- The directory that contains the input images.
- - The directory where the pre-trained model is stored.
- - The directory that contains the labels.
- - The directory for output.
+ - The directory for inference output.

- [`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the inputs to the batch inference pipeline step.
-
- You can also reference other datasets in your custom inference script. For example, you can use it to access labels in your script for labeling images by using `Dataset.register` and `Dataset.get_by_name`.
+ [`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the input. `FileDataset` lets you download or mount the files to your compute. By creating a dataset, you create a reference to the data source location; if you applied any subsetting transformations to the dataset, they're stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

For more information about Azure Machine Learning datasets, see [Create and access datasets (preview)](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets).

- [`PipelineData`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) objects are used for transferring intermediate data between pipeline steps. In this example, you use it for inference outputs.
```python
from azureml.core.dataset import Dataset

mnist_ds_name = 'mnist_sample_data'

path_on_datastore = mnist_blob.path('mnist/')
input_mnist_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
- registered_mnist_ds = input_mnist_ds.register(ws, mnist_ds_name, create_new_version=True)
- named_mnist_ds = registered_mnist_ds.as_named_input(mnist_ds_name)
+ ```
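The diff only shows the `FileDataset` input. If your data were tabular instead, the equivalent sketch would use `TabularDataset`; the CSV path below is purely illustrative:

```python
from azureml.core.dataset import Dataset

# Hypothetical tabular input: a delimited file on the default datastore.
# Each entry passed to run() would then be a Pandas DataFrame.
input_tabular_ds = Dataset.Tabular.from_delimited_files(
    path=(def_data_store, 'mnist/labels.csv'))
```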
+ To use a dynamic data input when you run the batch inference pipeline, you can define the input `Dataset` as a [`PipelineParameter`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.graph.pipelineparameter?view=azure-ml-py). Then you can specify a different input dataset each time you resubmit the batch inference pipeline.
+
+ ```python
+ from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
+ from azureml.pipeline.core import PipelineParameter
+
+ pipeline_param = PipelineParameter(name="mnist_param", default_value=input_mnist_ds)
+ input_mnist_ds_consumption = DatasetConsumptionConfig("mnist_param_config", pipeline_param).as_mount()
+ ```
+
+ [`PipelineData`](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) objects are used for transferring intermediate data between pipeline steps. In this example, you use it for inference outputs.
+
+ ```python
+ from azureml.pipeline.core import Pipeline, PipelineData

output_dir = PipelineData(name="inferences",
                          datastore=def_data_store,
@@ -182,7 +190,7 @@ model = Model.register(model_path="models/",
The script *must contain* two functions:
- `init()`: Use this function for any costly or common preparation for later inference. For example, use it to load the model into a global object. This function is called only once at the beginning of the process.
- `run(mini_batch)`: The function runs for each `mini_batch` instance.
- - `mini_batch`: Parallel run step will invoke run method and pass either a list or Pandas DataFrame as an argument to the method. Each entry in min_batch will be - a file path if input is a FileDataset, a Pandas DataFrame if input is a TabularDataset.
+ - `mini_batch`: ParallelRunStep invokes the `run()` method and passes either a list or a Pandas DataFrame as an argument. Each entry in `mini_batch` is a file path if the input is a `FileDataset`, or a Pandas DataFrame if the input is a `TabularDataset`.
- `response`: The `run()` method should return a Pandas DataFrame or an array. For the `append_row` output_action, the returned elements are appended into the common output file. For `summary_only`, the contents of the elements are ignored. For all output actions, each returned element indicates one successful run of an input element in the input mini-batch. Make sure that enough data is included in the run result to map the input to the output. Run output isn't guaranteed to be in order in the output file, so use a key in the output to map it back to the input.
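The hunk ends before the article's actual scoring script. As a sketch only (not the script this commit ships), an entry script for a `FileDataset` input could look like the following; the model name and inference logic are placeholders:

```python
import os
from azureml.core.model import Model

def init():
    # Costly, one-time preparation, done once per process:
    # resolve the registered model's local path into a global.
    global model_path
    model_path = Model.get_model_path(model_name="mnist")  # placeholder model name

def run(mini_batch):
    # With a FileDataset input, mini_batch is a list of file paths.
    results = []
    for image_path in mini_batch:
        # Real inference would load the image and call the model here.
        # Include a key (the file name) so output rows map back to inputs.
        results.append("{}: processed".format(os.path.basename(image_path)))
    # For output_action="append_row", each element becomes a line in the common output file.
    return results
```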
@@ -245,14 +253,20 @@ Now you have everything you need to build the pipeline.
### Prepare the run environment

- First, specify the dependencies for your script. You use this object later when you create the pipeline step.
+ First, specify the dependencies for your script. You use this object later when you create the ParallelRunStep.
+ - Always include the **azureml-core** package.
+ - If your input is a `FileDataset`, include **azureml-dataprep[fuse]**.
+ - If your input is a `TabularDataset`, include **azureml-dataprep[pandas, fuse]**.
+
+ Because `FileDataset` is used in this example, you need to include the **azureml-dataprep[fuse]** package.
```python
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_GPU_IMAGE

- batch_conda_deps = CondaDependencies.create(pip_packages=["tensorflow==1.13.1", "pillow"])
+ batch_conda_deps = CondaDependencies.create(pip_packages=["tensorflow==1.13.1", "pillow",
+                                                           "azureml-core", "azureml-dataprep[fuse]"])

batch_env = Environment(name="batch_environment")
batch_env.python.conda_dependencies = batch_conda_deps
@@ -261,7 +275,7 @@ batch_env.docker.base_image = DEFAULT_GPU_IMAGE
batch_env.spark.precache_packages = False
```

- ### Specify the parameters for your batch inference pipeline step
+ ### Specify the parameters for ParallelRunStep using ParallelRunConfig

`ParallelRunConfig` is the major configuration for the newly introduced batch inference `ParallelRunStep` instance within the Azure Machine Learning pipeline. You use it to wrap your script and configure the necessary parameters, including all of the following:
- `entry_script`: A user script as a local file path that will be run in parallel on multiple nodes. If `source_directory` is present, use a relative path. Otherwise, use any path that's accessible on the machine.
@@ -276,14 +290,14 @@ batch_env.spark.precache_packages = False
- `source_directory`: Paths to folders that contain all files to execute on the compute target (optional).
- `compute_target`: Only `AmlCompute` is supported.
- `node_count`: The number of compute nodes to be used for running the user script.
- - `process_count_per_node`: The number of processes per node.
+ - `process_count_per_node`: The number of processes per node. As a best practice, set it to the number of GPUs or CPUs one node has (optional; default value is `1`).
- `environment`: The Python environment definition. You can configure it to use an existing Python environment or to set up a temporary environment for the experiment. The definition is also responsible for setting the required application dependencies (optional).
- `logging_level`: Log verbosity. Values in increasing verbosity are: `WARNING`, `INFO`, and `DEBUG`. (optional; the default value is `INFO`)
- `run_invocation_timeout`: The `run()` method invocation timeout in seconds. (optional; default value is `60`)
- - `run_max_try`: Max call count for `run()` method against a mini batch in case of failure. A `run()` is failed if there's any system error, an exception, or timed out. (optional; default value is `3`)
+ - `run_max_try`: The maximum number of `run()` calls for a mini-batch in case of failure. A `run()` call fails if there's a system error, an exception, or a timeout (optional; default value is `3`).
```python
- from azureml.contrib.pipeline.steps import ParallelRunConfig
+ from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
@@ -293,10 +307,10 @@ parallel_run_config = ParallelRunConfig(
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
-     node_count=4)
+     node_count=2)
```
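The diff elides part of the parameter list and several arguments of the call (the hunks skip from `entry_script` to `source_directory`). A fuller configuration sketch, with illustrative values for the arguments not visible in this diff:

```python
parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script="digit_identification.py",  # assumed script name, not confirmed by this diff
    mini_batch_size="5",        # illustrative: 5 files per mini-batch for a FileDataset
    error_threshold=10,         # illustrative: tolerate up to 10 failed items
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2)
```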

- ### Create the pipeline step
+ ### Create the ParallelRunStep

Create the pipeline step by using the script, environment configuration, and parameters. Specify the compute target that you already attached to your workspace as the target of execution for the script. Use `ParallelRunStep` to create the batch inference pipeline step, which takes all of the following parameters:
- `name`: The name of the step, with the following naming restrictions: unique, 3-32 characters, and regex ^\[a-z\]([-a-z0-9]*[a-z0-9])?$.
@@ -315,12 +329,11 @@ parallelrun_step = ParallelRunStep(
    parallel_run_config=parallel_run_config,
    inputs=[input_mnist_ds_consumption],
    output=output_dir,
-     arguments=[],
    allow_reuse=True
)
```

- >[!Note]
+ > [!NOTE]
> `models`, `tags` and `properties` are removed from `ParallelRunStep`. You can directly load the models in your Python script.
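For orientation, here is how the assembled step might read once the first lines elided by the hunk are restored; the `name` value is an illustrative one that satisfies the naming restrictions above:

```python
from azureml.pipeline.steps import ParallelRunStep

parallelrun_step = ParallelRunStep(
    name="batch-inference-step",  # illustrative; must be unique, 3-32 chars, and match the regex
    parallel_run_config=parallel_run_config,
    inputs=[input_mnist_ds_consumption],
    output=output_dir,
    allow_reuse=True
)
```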
### Submit the pipeline
@@ -337,7 +350,7 @@ pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
pipeline_run = Experiment(ws, 'digit_identification').submit(pipeline)
```

- ## Monitor the parallel run job
+ ## Monitor the batch inference job

A batch inference job can take a long time to finish. This example monitors progress by using a Jupyter widget. You can also manage the job's progress by using:

@@ -351,6 +364,24 @@ RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)
```
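Not part of this commit, but useful once the run finishes: with `output_action="append_row"`, results typically land in a single file (by default `parallel_run_step.txt`) inside the `inferences` output. A retrieval sketch, assuming those defaults:

```python
import glob
import pandas as pd

# Find the ParallelRunStep child run and download its "inferences" output.
batch_run = pipeline_run.find_step_run(parallelrun_step.name)[0]
batch_output = batch_run.get_output_data(output_dir.name)
batch_output.download(local_path="inference_results")

# The local folder layout varies, so search for the default results file.
result_file = glob.glob("inference_results/**/parallel_run_step.txt", recursive=True)[0]
df = pd.read_csv(result_file, delimiter=":", header=None)  # delimiter depends on what run() returned
```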

+ ## Resubmit a batch inference pipeline run with a different dataset
+
+ You can resubmit a run with a different dataset without having to create an entirely new experiment.
+
+ ```python
+ path_on_datastore = mnist_blob.path('mnist/0.png')
+ single_image_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
+ single_image_ds._ensure_saved(ws)
+
+ pipeline_run_2 = experiment.submit(pipeline,
+                                    pipeline_parameters={"mnist_param": single_image_ds,
+                                                         "batch_size_param": "1",
+                                                         "process_count_param": 1}
+ )
+
+ pipeline_run_2.wait_for_completion(show_output=True)
+ ```
## Next steps

To see this process working end to end, try the [batch inference notebook](https://aka.ms/batch-inference-notebooks).
