Learn how to process large amounts of data asynchronously and in parallel by using Azure Machine Learning. ParallelRunStep is a high-performance, high-throughput way to generate inferences and process data. It provides asynchronous capabilities out of the box.
With ParallelRunStep, it's straightforward to scale offline inferences to large clusters of machines on terabytes of structured or unstructured data, resulting in improved productivity and optimized cost.
In this article, you learn the following tasks:
* For a guided quickstart, complete the [setup tutorial](tutorial-1st-experiment-sdk-setup.md) if you don't already have an Azure Machine Learning workspace or notebook virtual machine.
* To manage your own environment and dependencies, see the [how-to guide](how-to-configure-environment.md) on configuring your own environment.
## Set up machine learning resources
[`Dataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) is a class for exploring, transforming, and managing data in Azure Machine Learning. This class has two types: [`TabularDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and [`FileDataset`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py). In this example, you'll use `FileDataset` as the inputs to the batch inference pipeline step.
You can also reference other datasets in your custom inference script. For example, to access labels in a script that labels images, register the labels dataset with `Dataset.register` and then retrieve it with `Dataset.get_by_name`.
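As an illustration, a labels dataset might be registered once and then looked up by name from inside the inference script. This is a minimal sketch; the dataset name `image_labels` and the datastore path `labels/**` are placeholder assumptions, not names from this article:

```python
from azureml.core import Workspace, Dataset

# Load workspace settings from a local config.json (placeholder setup).
ws = Workspace.from_config()

# Register a file dataset of labels so scripts can look it up by name.
# "labels/**" is a placeholder for your own data location.
datastore = ws.get_default_datastore()
label_ds = Dataset.File.from_files(path=(datastore, "labels/**"))
label_ds.register(workspace=ws, name="image_labels", create_new_version=True)

# Inside the inference script, fetch the same dataset by its registered name.
labels = Dataset.get_by_name(ws, name="image_labels")
```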
For more information about Azure Machine Learning datasets, see [Create and access datasets (preview)](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets).
- `error_threshold`: The number of record failures for `TabularDataset` and file failures for `FileDataset` that should be ignored during processing. If the error count for the entire input goes above this value, the job is aborted. The error threshold applies to the entire input, not to individual mini-batches sent to the `run()` method. The range is `[-1, int.max]`. A value of `-1` indicates ignoring all failures during processing.
- `output_action`: One of the following values indicates how the output is organized:
  - `summary_only`: The user script stores the output. `ParallelRunStep` uses the output only for the error threshold calculation.
  - `append_row`: For all input files, only one file is created in the output folder, with all outputs appended and separated by line. The file name is configurable; the default is `parallel_run_step.txt`.
- `append_row_file_name`: The output file name to use for the `append_row` output action (optional).
- `source_directory`: Paths to folders that contain all files to execute on the compute target (optional).
- `compute_target`: Only `AmlCompute` is supported.
- `node_count`: The number of compute nodes used to run the user script.
- `process_count_per_node`: The number of processes per node.
- `environment`: The Python environment definition. You can configure it to use an existing Python environment or to set up a temporary environment for the experiment. The definition is also responsible for setting the required application dependencies (optional).
- `logging_level`: Log verbosity. Values in increasing verbosity are `WARNING`, `INFO`, and `DEBUG` (optional; the default is `INFO`).
- `run_invocation_timeout`: The `run()` method invocation timeout in seconds (optional; the default is `60`).
- `run_max_try`: The maximum number of calls to the `run()` method for a mini-batch in case of failure. A `run()` call fails if it raises an exception, times out, or hits a system error (optional; the default is `3`).
```python
from azureml.contrib.pipeline.steps import ParallelRunConfig
```
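Putting these parameters together, a configuration might look like the following sketch. Names such as `scripts_folder`, `batch_env`, and `compute_target` are placeholders for objects you create elsewhere in your own workspace, and the specific values shown are only illustrative:

```python
from azureml.contrib.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,    # placeholder: folder with the entry script
    entry_script="batch_scoring.py",    # placeholder: your user script
    error_threshold=10,                 # abort the job after 10 file failures
    output_action="append_row",         # collect all outputs into one file
    append_row_file_name="parallel_run_step.txt",
    environment=batch_env,              # placeholder: Environment with dependencies
    compute_target=compute_target,      # placeholder: an AmlCompute cluster
    node_count=2,
    process_count_per_node=1,
    run_invocation_timeout=60)
```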
Create the pipeline step by using the script, environment configuration, and parameters. Specify the compute target that you already attached to your workspace as the target of execution for the script. Use `ParallelRunStep` to create the batch inference pipeline step, which takes all the following parameters:
- `name`: The name of the step, with the following naming restrictions: unique, 3-32 characters, and matching the regex `^[a-z]([-a-z0-9]*[a-z0-9])?$`.
- `parallel_run_config`: A `ParallelRunConfig` object, as defined earlier.
- `inputs`: One or more single-typed Azure Machine Learning datasets.
- `side_inputs`: One or more reference datasets used as side inputs.
- `output`: A `PipelineData` object that corresponds to the output directory.
- `arguments`: A list of arguments passed to the user script (optional).
- `allow_reuse`: Whether the step should reuse previous results when run with the same settings/inputs. If this parameter is `False`, a new run is always generated for this step during pipeline execution (optional; the default is `True`).