
Commit daa27a1

committed
Batch Inference GA doc update draft
1 parent 9d2b9dd commit daa27a1

File tree

1 file changed

+13
-15
lines changed


articles/machine-learning/how-to-use-parallel-run-step.md

Lines changed: 13 additions & 15 deletions
@@ -7,19 +7,19 @@ ms.service: machine-learning
 ms.subservice: core
 ms.topic: tutorial
-ms.reviewer: trbye, jmartens, larryfr, vaidyas
-ms.author: vaidyas
-author: vaidya-s
-ms.date: 01/15/2020
-ms.custom: Ignite2019
+ms.reviewer: trbye, jmartens, larryfr
+ms.author: tracych
+author: tracychms
+ms.date: 04/15/2020
+ms.custom: Build2019
 ---

 # Run batch inference on large amounts of data by using Azure Machine Learning
 [!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

-Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. The ParallelRunStep capability described here is in public preview. It's a high-performance and high-throughput way to generate inferences and processing data. It provides asynchronous capabilities out of the box.
+Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. ParallelRunStep is a high-performance, high-throughput way to generate inferences and process data. It provides asynchronous capabilities out of the box.

-With ParallelRunStep, it's straightforward to scale offline inferences to large clusters of machines on terabytes of production data resulting in improved productivity and optimized cost.
+With ParallelRunStep, it's straightforward to scale offline inference to large clusters of machines over terabytes of structured or unstructured data, improving productivity and optimizing cost.

 In this article, you learn the following tasks:

@@ -34,7 +34,7 @@ In this article, you learn the following tasks:

 * For a guided quickstart, complete the [setup tutorial](tutorial-1st-experiment-sdk-setup.md) if you don't already have an Azure Machine Learning workspace or notebook virtual machine.

-* To manage your own environment and dependencies, see the [how-to guide](how-to-configure-environment.md) on configuring your own environment. Run `pip install azureml-sdk[notebooks] azureml-pipeline-core azureml-contrib-pipeline-steps` in your environment to download the necessary dependencies.
+* To manage your own environment and dependencies, see the [how-to guide](how-to-configure-environment.md) on configuring your own environment.

 ## Set up machine learning resources

@@ -83,9 +83,6 @@ Now you need to configure data inputs and outputs, including:

 [`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the inputs to the batch inference pipeline step.

-> [!NOTE]
-> `FileDataset` support in batch inference is restricted to Azure Blob storage for now.
-
 You can also reference other datasets in your custom inference script. For example, you can use it to access labels in your script for labeling images by using `Dataset.register` and `Dataset.get_by_name`.

 For more information about Azure Machine Learning datasets, see [Create and access datasets (preview)](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets).
@@ -274,14 +271,16 @@ batch_env.spark.precache_packages = False
 - `error_threshold`: The number of record failures for `TabularDataset` and file failures for `FileDataset` that should be ignored during processing. If the error count for the entire input goes above this value, the job will be aborted. The error threshold is for the entire input and not for individual mini-batches sent to the `run()` method. The range is `[-1, int.max]`. A value of `-1` indicates ignoring all failures during processing.
 - `output_action`: One of the following values indicates how the output will be organized:
   - `summary_only`: The user script will store the output. `ParallelRunStep` will use the output only for the error threshold calculation.
-  - `append_row`: For all input files, only one file will be created in the output folder, and all outputs will be appended to it, separated by line. The file name will be `parallel_run_step.txt`.
+  - `append_row`: For all input files, only one file will be created in the output folder, and all outputs will be appended to it, separated by line. The file name is configurable; the default is `parallel_run_step.txt`.
+  - `append_row_file_name`: The output file name for the `append_row` output action (optional).
 - `source_directory`: Paths to folders that contain all files to execute on the compute target (optional).
 - `compute_target`: Only `AmlCompute` is supported.
 - `node_count`: The number of compute nodes to be used for running the user script.
 - `process_count_per_node`: The number of processes per node.
 - `environment`: The Python environment definition. You can configure it to use an existing Python environment or to set up a temporary environment for the experiment. The definition is also responsible for setting the required application dependencies (optional).
 - `logging_level`: Log verbosity. Values in increasing verbosity are: `WARNING`, `INFO`, and `DEBUG`. (optional; the default value is `INFO`)
 - `run_invocation_timeout`: The `run()` method invocation timeout in seconds. (optional; the default value is `60`)
+- `run_max_try`: The maximum number of `run()` calls for a mini-batch in case of failure. A `run()` call fails if it raises an exception, hits a system error, or times out. (optional; the default value is `3`)

 ```python
 from azureml.contrib.pipeline.steps import ParallelRunConfig
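
The `run_max_try` and `error_threshold` semantics described above can be sketched in plain Python. This is not the SDK's implementation, only a hypothetical simulation of the documented behavior: each mini-batch's `run()` is retried up to `run_max_try` times, and the job aborts once total failures exceed `error_threshold` (unless `error_threshold` is `-1`, which ignores all failures):

```python
def process_mini_batches(mini_batches, run_fn, run_max_try=3, error_threshold=0):
    """Hypothetical sketch (NOT the azureml SDK) of ParallelRunStep's
    retry/abort accounting for mini-batch failures."""
    failed_items = 0
    results = []
    for batch in mini_batches:
        for _attempt in range(run_max_try):
            try:
                results.extend(run_fn(batch))
                break  # mini-batch succeeded; stop retrying
            except Exception:
                continue  # retry up to run_max_try times
        else:
            # All attempts failed: count every item in the batch as failed.
            failed_items += len(batch)
            if error_threshold != -1 and failed_items > error_threshold:
                raise RuntimeError("error threshold exceeded; aborting job")
    return results
```

Note that the threshold applies to the running total across the entire input, not to any single mini-batch, which matches the `error_threshold` description above.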
@@ -301,9 +300,9 @@ parallel_run_config = ParallelRunConfig(

 Create the pipeline step by using the script, environment configuration, and parameters. Specify the compute target that you already attached to your workspace as the target of execution for the script. Use `ParallelRunStep` to create the batch inference pipeline step, which takes all the following parameters:
 - `name`: The name of the step, with the following naming restrictions: unique, 3-32 characters, and regex ^\[a-z\]([-a-z0-9]*[a-z0-9])?$.
-- `models`: Zero or more model names already registered in the Azure Machine Learning model registry.
 - `parallel_run_config`: A `ParallelRunConfig` object, as defined earlier.
 - `inputs`: One or more single-typed Azure Machine Learning datasets.
+- `side_inputs`: One or more reference datasets used as side inputs (optional).
 - `output`: A `PipelineData` object that corresponds to the output directory.
 - `arguments`: A list of arguments passed to the user script (optional).
 - `allow_reuse`: Whether the step should reuse previous results when run with the same settings/inputs. If this parameter is `False`, a new run will always be generated for this step during pipeline execution. (optional; the default value is `True`.)
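
The `name` restrictions quoted above can be checked locally before submitting the pipeline. A minimal sketch using the length bound and regex from the parameter list (uniqueness within the pipeline can't be verified locally, so it's omitted; the helper name is illustrative):

```python
import re

# Pattern quoted in the `name` parameter description above.
NAME_PATTERN = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")

def is_valid_step_name(name):
    """Check the 3-32 character bound and the documented regex."""
    return 3 <= len(name) <= 32 and bool(NAME_PATTERN.match(name))
```

For example, `is_valid_step_name("batch-mnist")` passes, while names with uppercase letters or a trailing hyphen do not.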
@@ -313,7 +312,6 @@ from azureml.contrib.pipeline.steps import ParallelRunStep

 parallelrun_step = ParallelRunStep(
     name="batch-mnist",
-    models=[model],
     parallel_run_config=parallel_run_config,
     inputs=[named_mnist_ds],
     output=output_dir,
@@ -323,7 +321,7 @@ parallelrun_step = ParallelRunStep(
 ```

 >[!Note]
-> The above step depends on `azureml-contrib-pipeline-steps`, as described in [Prerequisites](#prerequisites).
+> `models`, `tags`, and `properties` have been removed from `ParallelRunStep`. You can load the models directly in your Python script.

 ### Submit the pipeline
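
Since `models` is no longer a `ParallelRunStep` parameter, the entry script loads the model itself. Below is an illustrative sketch only, assuming the usual entry-script contract (`init()` called once per worker process, `run(mini_batch)` called per mini-batch); the file name, model name, and stand-in model are hypothetical, not from this article:

```python
# digit_identification.py -- hypothetical entry-script sketch.

model = None

def init():
    global model
    # A real script might load a registered model here, for example:
    #     from azureml.core.model import Model
    #     model_path = Model.get_model_path("mnist-model")  # hypothetical name
    # This sketch uses a trivial stand-in so it stays self-contained.
    model = lambda file_path: "predicted:" + file_path

def run(mini_batch):
    # For a FileDataset input, mini_batch is a list of file paths.
    # With output_action="append_row", each returned value becomes one
    # line in the aggregated output file.
    return [model(file_path) for file_path in mini_batch]
```

Because the model is loaded in `init()` rather than passed to the step, it is loaded once per process instead of once per mini-batch.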
