articles/machine-learning/how-to-troubleshoot-batch-endpoints.md
17 additions & 17 deletions
@@ -8,7 +8,7 @@ ms.subservice: inferencing
 ms.topic: troubleshooting-general
 author: msakande
 ms.author: mopeakande
-ms.date: 07/19/2024
+ms.date: 07/29/2024
 ms.reviewer: cacrest
 ms.custom: devplatv2
@@ -27,15 +27,15 @@ After you invoke a batch endpoint by using the Azure CLI or the REST API, the ba
 - **Option 1**: Stream job logs to a local console. Only logs in the _azureml-logs_ folder are streamed.

-Run the following command to stream system-generated logs to your console. Replace the `\<job_name>` parameter with the name of your batch scoring job:
+Run the following command to stream system-generated logs to your console. Replace the `<job_name>` parameter with the name of your batch scoring job:

 ```azurecli
 az ml job stream --name <job_name>
 ```

 - **Option 2**: View job logs in Azure Machine Learning studio.

-Run the following command to get the job link to use in the studio. Replace the `\<job_name>` parameter with the name of your batch scoring job:
+Run the following command to get the job link to use in the studio. Replace the `<job_name>` parameter with the name of your batch scoring job:

 ```azurecli
 az ml job show --name <job_name> --query services.Studio.endpoint -o tsv
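Reviewer note: for context on the two options above, a combined sketch that invokes an endpoint and then runs both commands; the endpoint name and data asset reference are hypothetical, and the workspace defaults are assumed to be configured:

```azurecli
# Invoke the batch endpoint and capture the resulting job name
# (endpoint and data asset names are hypothetical).
JOB_NAME=$(az ml batch-endpoint invoke --name my-batch-endpoint \
  --input azureml:my-unlabeled-data@latest --query name -o tsv)

# Option 1: stream the system-generated logs to the local console.
az ml job stream --name $JOB_NAME

# Option 2: print the studio link for the job.
az ml job show --name $JOB_NAME --query services.Studio.endpoint -o tsv
```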
@@ -49,7 +49,7 @@ After you invoke a batch endpoint by using the Azure CLI or the REST API, the ba
 ## Review log files

-Machine Learning provides several types of log files and other data files that you can use to help troubleshoot your batch scoring job.
+Azure Machine Learning provides several types of log files and other data files that you can use to help troubleshoot your batch scoring job.

 The two top-level folders for batch scoring logs are _azureml-logs_ and _logs_. Information from the controller that launches the scoring script is stored in the _~/azureml-logs/70\_driver\_log.txt_ file.
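Reviewer note: to browse both folders locally once the job has finished, one approach is to download the job's artifacts; a sketch, assuming the Azure ML CLI v2 with workspace defaults configured:

```azurecli
# Download all outputs and logs for the job into the current directory;
# the azureml-logs and logs folders arrive under the job's artifact root.
az ml job download --name <job_name> --all --download-path .
```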
@@ -59,8 +59,8 @@ The distributed nature of batch scoring jobs results in logs from different sour
 | File | Description |
 | --- | --- |
-|**~/logs/job_progress_overview.txt**| Provides high-level information about the current number of created mini-batches (also known as _tasks_) created and the current number of processed mini-batches. As processing for mini-batches comes to an end, the log records the results of the job. If the job fails, the log shows the error message and where to start the troubleshooting. |
-|**~/logs/sys/master_role.txt**|Supplies the principal node (also known as the _orchestrator_) view of the running job. This log includes information about the task creation, progress monitoring, and the job result. |
+|**~/logs/job_progress_overview.txt**| Provides high-level information about the current number of mini-batches (also known as _tasks_) created and the current number of processed mini-batches. As processing for mini-batches comes to an end, the log records the results of the job. If the job fails, the log shows the error message and where to start the troubleshooting. |
+|**~/logs/sys/master_role.txt**|Provides the principal node (also known as the _orchestrator_) view of the running job. This log includes information about the task creation, progress monitoring, and the job result. |

 ### Examine stack trace data for errors
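Reviewer note: after downloading the logs, a quick first pass over the two files in the table above might look like this; paths are relative to the downloaded _logs_ folder and assume the layout the table describes:

```azurecli
# Check overall mini-batch progress and the final job result.
cat logs/job_progress_overview.txt

# Check the orchestrator's view of task creation and monitoring.
cat logs/sys/master_role.txt
```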
@@ -69,13 +69,13 @@ Other files provide information about possible errors in your script:
 | File | Description |
 | --- | --- |
 |**~/logs/user/error.txt**| Provides a summary of errors in your script. |
-|**~/logs/user/error/\***|Supplies the full stack traces of exceptions thrown while loading and running the entry script. |
+|**~/logs/user/error/\***|Provides the full stack traces of exceptions thrown while loading and running the entry script. |

 ### Examine process logs per node

 For a complete understanding of how each node executes your score script, examine the individual process logs for each node. The process logs are stored in the _~/logs/sys/node_ folder and grouped by worker nodes.

-The folder contains an _\<ip\_address>/_ subfolder and a _\<process\_name>.txt_ file with detailed info about each mini-batch. The folder contents updates when a worker selects or completes the mini-batch. For each mini-batch, the log file includes:
+The folder contains an _\<ip\_address>/_ subfolder that contains a _\<process\_name>.txt_ file with detailed info about each mini-batch. The folder contents update when a worker selects or completes a mini-batch. For each mini-batch, the log file includes:

 - The IP address and the process ID (PID) of the worker process.
 - The total number of items, the number of successfully processed items, and the number of failed items.
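Reviewer note: a corresponding sketch for the error files in the table above, again assuming the downloaded folder layout:

```azurecli
# Read the error summary first, then drill into the full stack traces.
cat logs/user/error.txt
ls logs/user/error/
```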
@@ -94,7 +94,7 @@ The folder contains an _\<ip\_address>/_ subfolder about each mini-batch. The fo
 | File or Folder | Description |
 | --- | --- |
-|**os/**| Stores information about all running processes in the node. One check runs an operating system command and saves the result to a file. On Linux, the command is `ps`. The folder contains the following items: <br> - **%Y%m%d%H**: Contains one or more process check files. The subfolder name is the creation date and time of the check (Year, Month, Day, Hour). <br> **processes_%M**: Shows details about the process check. The file name ends with the check time (Minute) relative to the check creation time. |
+|**os/**| Stores information about all running processes in the node. One check runs an operating system command and saves the result to a file. On Linux, the command is `ps`. The folder contains the following items: <br> - **%Y%m%d%H**: Subfolder that contains one or more process check files. The subfolder name is the creation date and time of the check (Year, Month, Day, Hour). <br> - **processes_%M**: File within the subfolder. The file shows details about the process check. The file name ends with the check time (Minute) relative to the check creation time. |
 |**node_disk_usage.csv**| Shows the detailed disk usage of the node. |
 |**node_resource_usage.csv**| Supplies the resource usage overview of the node. |
 |**processes_resource_usage.csv**| Provides a resource usage overview of each process. |
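Reviewer note: based purely on the naming pattern in the table above, browsing a process check might look like the following; the date, hour, and minute values are hypothetical, and the exact parent path under the _\<ip\_address>/_ subfolder can vary:

```azurecli
# List the checks created on 2024-07-29 during hour 15 (hypothetical values).
ls <ip_address>/os/2024072915/

# Show the check taken at minute 30 of that hour.
cat <ip_address>/os/2024072915/processes_30
```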
@@ -128,13 +128,13 @@ The following sections describe common errors that can occur during batch endpoint development and consumption, and steps for resolution.
-### No azureml module in installation
+### No module named azureml

 Azure Machine Learning batch deployment requires the **azureml-core** package in the installation.

-**Message logged**: "No module named azureml."
+**Message logged**: "No module named `azureml`."

-**Reason**: The azureml-core package appears to be missing in the installation.
+**Reason**: The `azureml-core` package appears to be missing in the installation.

-**Solution**: Add the azureml-core package to your conda dependencies file.
+**Solution**: Add the `azureml-core` package to your conda dependencies file.

 ### No output in predictions file
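Reviewer note: a minimal conda dependencies file that satisfies the solution above might look like the following; the environment name and pinned Python version are placeholders:

```azurecli
# Write a minimal conda dependencies file that pulls in azureml-core via pip
# (environment name and pinned versions are illustrative).
cat > conda-env.yml <<'EOF'
name: batch-scoring-env
dependencies:
  - python=3.9
  - pip
  - pip:
      - azureml-core
EOF
```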
@@ -170,7 +170,7 @@ For batch deployment to succeed, the managed identity for the compute cluster mu
 **Solution**: Ensure the managed identity associated with the compute cluster where your deployment is running has at least [Storage Blob Data Reader](../role-based-access-control/built-in-roles.md#storage-blob-data-reader) access to the storage account. Only Azure Storage account owners can [change the access level in the Azure portal](../storage/blobs/assign-azure-role-data-access.md).

-### No mounted storage, no dataset initialization
+### Dataset initialization failed, can't mount dataset

 The batch deployment process requires mounted storage for the data asset. When the storage doesn't mount, the dataset can't be initialized.
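Reviewer note: one way to apply the solution above from the CLI; the cluster and storage account names are placeholders, workspace defaults are assumed to be configured, and the query path assumes a system-assigned identity:

```azurecli
# Look up the principal ID of the compute cluster's managed identity
# (cluster and storage account names are placeholders).
PRINCIPAL_ID=$(az ml compute show --name <cluster_name> \
  --query identity.principal_id -o tsv)

# Grant the identity read access to blob data in the storage account.
az role assignment create --assignee $PRINCIPAL_ID \
  --role "Storage Blob Data Reader" \
  --scope $(az storage account show --name <storage_account> --query id -o tsv)
```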
@@ -180,11 +180,11 @@ The batch deployment process requires mounted storage for the data asset. When t
 **Solution**: Ensure the managed identity associated with the compute cluster where your deployment is running has at least [Storage Blob Data Reader](../role-based-access-control/built-in-roles.md#storage-blob-data-reader) access to the storage account. Only Azure Storage account owners can [change the access level in the Azure portal](../storage/blobs/assign-azure-role-data-access.md).

-### No value for dataset_param parameter
+### The dataset_param parameter doesn't have a specified value or a default value

 During batch deployment, the data set node references the `dataset_param` parameter. For the deployment to proceed, the parameter must have an assigned value or a specified default value.

-**Message logged**: "Data set node [code] references parameter dataset_param, which doesn't have a specified value or a default value."
+**Message logged**: "Data set node [code] references parameter `dataset_param`, which doesn't have a specified value or a default value."

 **Reason**: The input data asset provided to the batch endpoint isn't supported.
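Reviewer note: for comparison, invoking the endpoint with a supported, registered data asset generally avoids this error; the endpoint and asset names are placeholders:

```azurecli
# Invoke with a registered data asset reference instead of an unsupported
# input type (endpoint and asset names are placeholders).
az ml batch-endpoint invoke --name <endpoint_name> \
  --input azureml:<data_asset_name>@latest
```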
@@ -272,7 +272,7 @@ For batch deployment to succeed, the batch endpoint must have at least one valid
 - Define the route with a deployment-specific header.

-## Review unsupported configurations and file types
+## Limitations and unsupported scenarios

 When you design machine learning deployment solutions that rely on batch endpoints, keep in mind that some configurations and scenarios aren't supported. The following sections identify unsupported workspaces and compute resources, and invalid types for input files.