
Commit 7ebe683

Merge pull request #215976 from santiagxf/santiagxf/aml-batch-general
AzureML Batch Inference landing page
2 parents: fd517c4 + ca11435

20 files changed: +1,105 −904 lines

articles/machine-learning/.openpublishing.redirection.machine-learning.json

Lines changed: 21 additions & 1 deletion
```diff
@@ -3499,6 +3499,26 @@
       "source_path_from_root": "/articles/machine-learning/how-to-use-private-python-packages.md",
       "redirect_url": "/azure/machine-learning/v1/how-to-use-private-python-packages",
       "redirect_document_id": true
+    },
+    {
+      "source_path_from_root": "/articles/machine-learning/how-to-use-batch-endpoint.md",
+      "redirect_url": "/azure/machine-learning/batch-inference/how-to-use-batch-endpoint",
+      "redirect_document_id": true
+    },
+    {
+      "source_path_from_root": "/articles/machine-learning/how-to-use-batch-endpoint-sdk-v2.md",
+      "redirect_url": "/azure/machine-learning/batch-inference/how-to-use-batch-endpoint",
+      "redirect_document_id": false
+    },
+    {
+      "source_path_from_root": "/articles/machine-learning/how-to-use-batch-endpoints-studio.md",
+      "redirect_url": "/azure/machine-learning/batch-inference/how-to-use-batch-endpoint",
+      "redirect_document_id": false
+    },
+    {
+      "source_path_from_root": "/articles/machine-learning/how-to-troubleshoot-batch-endpoints.md",
+      "redirect_url": "/azure/machine-learning/batch-inference/how-to-troubleshoot-batch-endpoints",
+      "redirect_document_id": true
     }
   ]
-}
+}
```

articles/machine-learning/batch-inference/how-to-deploy-model-custom-output.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -32,8 +32,8 @@ In any of those cases, Batch Deployments allow you to take control of the output
 [!INCLUDE [basic cli prereqs](../../../includes/machine-learning-cli-prereqs.md)]
 
 * A model registered in the workspace. In this tutorial, we'll use an MLflow model. Particularly, we are using the *heart condition classifier* created in the tutorial [Using MLflow models in batch deployments](how-to-mlflow-batch.md).
-* You must have an endpoint already created. If you don't, follow the instructions at [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md). This example assumes the endpoint is named `heart-classifier-batch`.
-* You must have a compute created where to deploy the deployment. If you don't, follow the instructions at [Create compute](../how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
+* You must have an endpoint already created. If you don't, follow the instructions at [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md). This example assumes the endpoint is named `heart-classifier-batch`.
+* You must have a compute created where to deploy the deployment. If you don't, follow the instructions at [Create compute](how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
 
 ## About this sample
 
@@ -145,7 +145,7 @@ Follow the next steps to create a deployment using the previous scoring script:
 2. MLflow models don't require you to indicate an environment or a scoring script when creating the deployments as it is created for you. However, in this case we are going to indicate a scoring script and environment since we want to customize how inference is executed.
 
    > [!NOTE]
-   > This example assumes you have an endpoint created with the name `heart-classifier-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md).
+   > This example assumes you have an endpoint created with the name `heart-classifier-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md).
 
 # [Azure ML CLI](#tab/cli)
```

articles/machine-learning/batch-inference/how-to-image-processing-batch.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -1,5 +1,5 @@
 ---
-title: "Image processing tasks with batch deployments"
+title: "Image processing with batch deployments"
 titleSuffix: Azure Machine Learning
 description: Learn how to deploy a model in batch endpoints that process images
 services: machine-learning
@@ -13,7 +13,7 @@ ms.reviewer: larryfr
 ms.custom: devplatv2
 ---
 
-# Image processing tasks with batch deployments
+# Image processing with batch deployments
 
 [!INCLUDE [ml v2](../../../includes/machine-learning-dev-v2.md)]
 
@@ -23,8 +23,8 @@ Batch Endpoints can be used for processing tabular data, but also any other file
 
 [!INCLUDE [basic cli prereqs](../../../includes/machine-learning-cli-prereqs.md)]
 
-* You must have an endpoint already created. If you don't please follow the instructions at [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md). This example assumes the endpoint is named `imagenet-classifier-batch`.
-* You must have a compute created where to deploy the deployment. If you don't please follow the instructions at [Create compute](../how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
+* You must have an endpoint already created. If you don't please follow the instructions at [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md). This example assumes the endpoint is named `imagenet-classifier-batch`.
+* You must have a compute created where to deploy the deployment. If you don't please follow the instructions at [Create compute](how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
 
 ## About the model used in the sample
 
@@ -169,7 +169,7 @@ One the scoring script is created, it's time to create a batch deployment for it
 1. Now, let create the deployment.
 
    > [!NOTE]
-   > This example assumes you have an endpoint created with the name `imagenet-classifier-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md).
+   > This example assumes you have an endpoint created with the name `imagenet-classifier-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md).
 
 # [Azure ML CLI](#tab/cli)
```

articles/machine-learning/batch-inference/how-to-nlp-processing-batch.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -1,5 +1,5 @@
 ---
-title: "NLP tasks with batch deployments"
+title: "Text processing with batch deployments"
 titleSuffix: Azure Machine Learning
 description: Learn how to use batch deployments to process text and output results.
 services: machine-learning
@@ -13,7 +13,7 @@ ms.reviewer: larryfr
 ms.custom: devplatv2
 ---
 
-# NLP tasks with batch deployments
+# Text processing with batch deployments
 
 [!INCLUDE [cli v2](../../../includes/machine-learning-dev-v2.md)]
 
@@ -23,8 +23,8 @@ Batch Endpoints can be used for processing tabular data, but also any other file
 
 [!INCLUDE [basic cli prereqs](../../../includes/machine-learning-cli-prereqs.md)]
 
-* You must have an endpoint already created. If you don't please follow the instructions at [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md). This example assumes the endpoint is named `text-summarization-batch`.
-* You must have a compute created where to deploy the deployment. If you don't please follow the instructions at [Create compute](../how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
+* You must have an endpoint already created. If you don't please follow the instructions at [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md). This example assumes the endpoint is named `text-summarization-batch`.
+* You must have a compute created where to deploy the deployment. If you don't please follow the instructions at [Create compute](how-to-use-batch-endpoint.md#create-compute). This example assumes the name of the compute is `cpu-cluster`.
 
 ## About the model used in the sample
 
@@ -146,7 +146,7 @@ One the scoring script is created, it's time to create a batch deployment for it
 2. Now, let create the deployment.
 
    > [!NOTE]
-   > This example assumes you have an endpoint created with the name `text-summarization-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](../how-to-use-batch-endpoint.md).
+   > This example assumes you have an endpoint created with the name `text-summarization-batch` and a compute cluster with name `cpu-cluster`. If you don't, please follow the steps in the doc [Use batch endpoints for batch scoring](how-to-use-batch-endpoint.md).
 
 # [Azure ML CLI](#tab/cli)
```

articles/machine-learning/batch-inference/how-to-secure-batch-endpoint.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -27,7 +27,7 @@ When deploying a machine learning model to a batch endpoint, you can secure thei
 All the batch endpoints created inside of a secure workspace are deployed as private batch endpoints by default. No further configuration is required.
 
 > [!IMPORTANT]
-> When working on a private link-enabled workspace, batch endpoints can be created and managed using Azure Machine Learning studio. However, they can't be invoked from the UI in studio. Please use the Azure ML CLI v2 instead for job creation. For more details about how to use it see [Invoke the batch endpoint to start a batch scoring job](../how-to-use-batch-endpoint.md#invoke-the-batch-endpoint-to-start-a-batch-scoring-job).
+> When working on a private link-enabled workspace, batch endpoints can be created and managed using Azure Machine Learning studio. However, they can't be invoked from the UI in studio. Please use the Azure ML CLI v2 instead for job creation. For more details about how to use it see [Invoke the batch endpoint to start a batch scoring job](how-to-use-batch-endpoint.md#invoke-the-batch-endpoint-to-start-a-batch-scoring-job).
 
 The following diagram shows how the networking looks for batch endpoints when deployed in a private workspace:
```

New file · Lines changed: 163 additions & 0 deletions

---
title: "Troubleshooting batch endpoints"
titleSuffix: Azure Machine Learning
description: Learn how to troubleshoot and diagnose errors with batch endpoint jobs
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
author: santiagxf
ms.author: fasantia
ms.date: 10/10/2022
ms.reviewer: larryfr
ms.custom: devplatv2
---

# Troubleshooting batch endpoints

[!INCLUDE [dev v2](../../../includes/machine-learning-dev-v2.md)]

Learn how to troubleshoot and solve, or work around, common errors you may come across when using [batch endpoints](how-to-use-batch-endpoint.md) for batch scoring.

## Understanding the logs of a batch scoring job

### Get logs

After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job runs asynchronously. There are two options to get the logs for a batch scoring job.

Option 1: Stream logs to a local console

You can run the following command to stream system-generated logs to your console. Only logs in the `azureml-logs` folder are streamed.

```azurecli
az ml job stream --name <job_name>
```

Option 2: View logs in studio

To get the link to the run in studio, run:

```azurecli
az ml job show --name <job_name> --query interaction_endpoints.Studio.endpoint -o tsv
```

1. Open the job in studio using the value returned by the above command.
1. Choose __batchscoring__.
1. Open the __Outputs + logs__ tab.
1. Choose the log(s) you wish to review.

### Understand log structure

There are two top-level log folders, `azureml-logs` and `logs`.

The file `~/azureml-logs/70_driver_log.txt` contains information from the controller that launches the scoring script.

Because of the distributed nature of batch scoring jobs, there are logs from several different sources. However, two combined files are created that provide high-level information:

- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. As the mini-batches end, the log records the results of the job. If the job failed, it shows the error message and where to start troubleshooting.

- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. This log covers task creation, progress monitoring, and the job result.

For a concise summary of errors in your script there is:

- `~/logs/user/error.txt`: This file summarizes the errors in your script.

For more information on errors in your script, there is:

- `~/logs/user/error/`: This folder contains full stack traces of exceptions thrown while loading and running the entry script.

When you need a full understanding of how each node executed the scoring script, look at the individual process logs for each node. The process logs can be found in the `sys/node` folder, grouped by worker nodes:

- `~/logs/sys/node/<ip_address>/<process_name>.txt`: This file provides detailed info about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:

  - The IP address and the PID of the worker process.
  - The total number of items, the number of successfully processed items, and the number of failed items.
  - The start time, duration, process time, and run method time.

You can also view the results of periodic checks of the resource usage for each node. The log files and setup files are in this folder:

- `~/logs/perf`: Set `--resource_monitor_interval` to change the checking interval in seconds. The default interval is `600`, which is approximately 10 minutes. To stop the monitoring, set the value to `0`. Each `<ip_address>` folder includes:

  - `os/`: Information about all running processes in the node. One check runs an operating system command and saves the result to a file. On Linux, the command is `ps`.
    - `%Y%m%d%H`: The subfolder name is the time to the hour.
    - `processes_%M`: The file name ends with the minute of the checking time.
  - `node_disk_usage.csv`: Detailed disk usage of the node.
  - `node_resource_usage.csv`: Resource usage overview of the node.
  - `processes_resource_usage.csv`: Resource usage overview of each process.

### How to log in the scoring script

You can use Python logging in your scoring script. Logs are stored in `logs/user/stdout/<node_id>/processNNN.stdout.txt`.

```python
import argparse
import logging

# Get the logging_level argument
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)

# Initialize the Python logger
logger = logging.getLogger(__name__)
logger.setLevel(args.logging_level.upper())
logger.info("Info log statement")
logger.debug("Debug log statement")
```
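The script above uses `parse_known_args` rather than `parse_args` because the batch runtime passes its own arguments to the entry script; unrecognized flags are collected instead of raising an error. A minimal standalone sketch of that behavior (the `--mini_batch_size` flag here is a hypothetical framework-injected argument, used only for illustration):

```python
import argparse

# Declare only the flag we care about; collect everything else.
parser = argparse.ArgumentParser(description="Argument parser.")
parser.add_argument("--logging_level", type=str, help="logging level")

# Simulated command line: our flag plus an extra, undeclared flag.
args, unknown_args = parser.parse_known_args(
    ["--logging_level", "debug", "--mini_batch_size", "10"]
)

print(args.logging_level.upper())  # DEBUG
print(unknown_args)                # ['--mini_batch_size', '10']
```

With plain `parse_args`, the extra flag would make the script exit with an error before your `init()` ever ran.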
## Common issues

The following section contains common problems and solutions you may see during batch endpoint development and consumption.

### No module named 'azureml'

__Reason__: Azure Machine Learning Batch Deployments require the package `azureml-core` to be installed.

__Solution__: Add `azureml-core` to your conda dependencies file.
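As a sketch, a conda environment file for the deployment could list the package like this (the environment name and the Python version pin are illustrative, not requirements):

```yaml
name: batch-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - azureml-core
```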
### Output already exists

__Reason__: Azure Machine Learning Batch Deployments can't overwrite the `predictions.csv` file generated in the output.

__Solution__: If you indicate an output location for the predictions, ensure the path leads to a non-existing file.

### The run() function in the entry script had timeout for [number] times

__Message logged__: `No progress update in [number] seconds. No progress update in this check. Wait [number] seconds since last update.`

__Reason__: Batch Deployments can be configured with a `timeout` value that indicates how long the deployment should wait for a single mini-batch to be processed. If the execution of the mini-batch takes longer than that value, the task is aborted. Aborted tasks can be retried up to a configurable maximum number of times. If the `timeout` occurs on every retry, the deployment job fails. These properties can be configured for each deployment.

__Solution__: Increase the `timeout` value of the deployment by updating the deployment. These properties are configured in the parameter `retry_settings`. By default, `timeout=30` and `retries=3` are configured. When deciding the value of `timeout`, take into consideration the number of files being processed in each mini-batch and the size of each of those files. You can also decrease the mini-batch size, resulting in more mini-batches that are smaller and quicker to execute.
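As a sketch, a batch deployment YAML might express these settings as follows. This assumes the CLI v2 batch deployment schema (where the retry count is spelled `max_retries`); the deployment name and the specific values are illustrative:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: my-batch-dpl
endpoint_name: my-batch-endpoint
mini_batch_size: 5      # fewer files per mini-batch makes each mini-batch quicker
retry_settings:
  max_retries: 3        # how many times an aborted mini-batch is retried
  timeout: 300          # seconds allowed per mini-batch before it's aborted
```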
### Dataset initialization failed

__Message logged__: `Dataset initialization failed: UserErrorException: Message: Cannot mount Dataset(id='xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', name='None', version=None). Source of the dataset is either not accessible or does not contain any data.`

__Reason__: The compute cluster where the deployment is running can't mount the storage where the data asset is located. The managed identity of the compute doesn't have permissions to perform the mount.

__Solution__: Ensure the identity associated with the compute cluster where your deployment is running has at least [Storage Blob Data Reader](../../role-based-access-control/built-in-roles.md#storage-blob-data-reader) access to the storage account. Only storage account owners can [change your access level via the Azure portal](../../storage/blobs/assign-azure-role-data-access.md).

### Data set node [code] references parameter dataset_param which doesn't have a specified value or a default value

__Message logged__: `Data set node [code] references parameter dataset_param which doesn't have a specified value or a default value.`

__Reason__: The input data asset provided to the batch endpoint isn't supported.

__Solution__: Ensure you are providing a data input that is supported for batch endpoints.

### User program failed with Exception: Run failed, please check logs for details

__Message logged__: `User program failed with Exception: Run failed, please check logs for details. You can check logs/readme.txt for the layout of logs.`

__Reason__: There was an error while running the `init()` or `run()` function of the scoring script.

__Solution__: Go to __Outputs + Logs__ and open the file at `logs > user > error > 10.0.0.X > process000.txt`. You'll see the error message generated by the `init()` or `run()` method.

### There is no succeeded mini batch item returned from run()

__Message logged__: `There is no succeeded mini batch item returned from run(). Please check 'response: run()' in https://aka.ms/batch-inference-documentation.`

__Reason__: The batch endpoint failed to provide data in the expected format to the `run()` method. This may be due to corrupted files being read or incompatibility of the input data with the signature of the model (MLflow).

__Solution__: To understand what may be happening, go to __Outputs + Logs__ and open the file at `logs > user > stdout > 10.0.0.X > process000.stdout.txt`. Look for error entries like `Error processing input file`. You should find details there about why the input file can't be correctly read.
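A scoring script can make this kind of failure easier to diagnose by logging each bad file and continuing, so healthy items in the mini-batch still succeed. A minimal sketch, assuming the `run(mini_batch)` contract of receiving a list of file paths and returning one result per succeeded item; the `score_file` helper is hypothetical and stands in for real model inference:

```python
import logging

logger = logging.getLogger(__name__)

def score_file(file_path: str) -> str:
    # Hypothetical per-file scoring; replace with real model inference.
    if file_path.endswith(".corrupt"):
        raise ValueError(f"cannot parse {file_path}")
    return f"{file_path},ok"

def run(mini_batch):
    # mini_batch is a list of file paths; return one entry per succeeded item.
    results = []
    for file_path in mini_batch:
        try:
            results.append(score_file(file_path))
        except Exception:
            # Log and skip, so one corrupted file doesn't fail the whole mini-batch.
            logger.exception("Error processing input file: %s", file_path)
    return results

print(run(["a.csv", "b.corrupt", "c.csv"]))  # ['a.csv,ok', 'c.csv,ok']
```

If every item in a mini-batch fails this way, `run()` returns an empty list and the job still surfaces the "no succeeded mini batch item" error, but the per-file log entries pinpoint which inputs were bad.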
