-In this article, you'll learn how to troubleshoot common workload (including training jobs and endpoints) errors on the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
+In this article, you learn how to troubleshoot common workload errors (including training job and endpoint errors) on [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).

## Inference guide

The common endpoint errors on Kubernetes compute fall into two scopes: **compute scope** and **cluster scope**. Compute scope errors relate to the compute target, for example when the compute target isn't found or isn't accessible. Cluster scope errors relate to the underlying Kubernetes cluster, for example when the cluster isn't reachable or isn't found.

### Kubernetes compute errors

-Below is a list of error types in **compute scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can trouble shoot by following the guidelines:
+The following error types in **compute scope** might occur when you use Kubernetes compute to create online endpoints and online deployments for real-time model inference. You can troubleshoot them by following the guidelines:
@@ -33,7 +33,7 @@ Below is a list of error types in **compute scope** that you might encounter whe
#### ERROR: GenericComputeError

-The error message is as below:
+The error message is as follows:
```bash
Failed to get compute information.
@@ -44,7 +44,7 @@ This error should occur when system failed to get the compute information from t
* Check the Kubernetes cluster health.
* You can view the cluster health check report for any issues, for example, if the cluster is not reachable.
* You can go to your workspace portal to check the compute status.
-* Check if the instance types is information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
+* Check whether the instance type information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
* Try to detach and reattach the compute to the workspace if applicable.

> [!NOTE]
@@ -83,7 +83,7 @@ The error message is as follows:
```bash
The compute information is invalid.
```

-There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster.
+A compute target validation process runs when you deploy models to your Kubernetes cluster. This error occurs when the compute information is invalid, for example when the compute target isn't found or the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.

You can check the following items to troubleshoot the issue:
* Check whether the compute target you used is correct and exists in your workspace.
@@ -125,7 +125,7 @@ For AKS clusters:
* Check if the AKS cluster is shut down.
* If the cluster isn't running, you need to start the cluster first.
* Check whether the AKS cluster restricts access to selected networks by using authorized IP ranges.
-* If the AKS cluster has enabled authorized IP ranges, please make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. More information you can see this [document](how-to-deploy-kubernetes-extension.md#limitations).
+* If the AKS cluster has enabled authorized IP ranges, make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. For more information, see this [document](how-to-deploy-kubernetes-extension.md#limitations).
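
As a minimal sketch of the AKS checks above, the following helpers wrap `az aks show` queries. They assume the Azure CLI is installed and signed in. The function names are hypothetical, the resource group and cluster name are supplied by the caller, and the queried property names (`powerState.code` and `apiServerAccessProfile.authorizedIpRanges`) are assumed to match the output of your CLI version.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the original guide).
# Arguments for both functions: resource group, AKS cluster name.

# Print the cluster power state, for example "Running" or "Stopped".
aks_power_state() {
  az aks show --resource-group "$1" --name "$2" \
    --query "powerState.code" --output tsv
}

# Print the authorized IP ranges configured on the API server, if any.
aks_authorized_ip_ranges() {
  az aks show --resource-group "$1" --name "$2" \
    --query "apiServerAccessProfile.authorizedIpRanges" --output tsv
}
```

For example, `aks_power_state my-rg my-aks` reports whether the cluster is running, and `aks_authorized_ip_ranges my-rg my-aks` lists the ranges to compare against the Azure Machine Learning control plane IP ranges. Both `my-rg` and `my-aks` are placeholder names.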
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
@@ -170,7 +170,7 @@ You can check the following items to troubleshoot the issue:
#### ERROR: RefreshExtensionIdentityNotSet

-This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to re-install the extension to fix it.
+This error occurs when the extension is installed but the extension identity isn't correctly assigned. You can try to reinstall the extension to fix it.

> Note that this error applies only to managed clusters.
@@ -179,41 +179,46 @@ This error occurs when the extension is installed but the extension identity is
### How to check sslCertPemFile and sslKeyPemFile is correct?

-Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
+To surface any known errors, use the following commands to run a baseline check of your certificate and key. Expect the second command to return "RSA key ok" without prompting you for a password.

```bash
openssl x509 -in cert.pem -noout -text
openssl rsa -in key.pem -noout -check
```

-Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
+Run the commands to verify whether sslCertPemFile and sslKeyPemFile match:
+sslCertPemFile is the public certificate. It should include the certificate chain, in this order: the server certificate, the intermediate CA certificate, and the root CA certificate:
+* The server certificate: the certificate that the server presents to the client during the TLS handshake. It contains the server’s public key, domain name, and other information. The server certificate is signed by an intermediate certificate authority (CA) that vouches for the server’s identity.
+* The intermediate CA certificate: the certificate that the intermediate CA presents to the client to prove its authority to sign the server certificate. It contains the intermediate CA’s public key, name, and other information. The intermediate CA certificate is signed by a root CA that vouches for the intermediate CA’s identity.
+* The root CA certificate: the certificate that the root CA presents to the client to prove its authority to sign the intermediate CA certificate. It contains the root CA’s public key, name, and other information. The root CA certificate is self-signed and trusted by the client.
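
A minimal sketch of how the chain contents and the certificate/key pairing could be checked is shown here. It assumes an RSA key and that `openssl` is on the path. The function names are hypothetical, and `cert.pem` and `key.pem` stand in for sslCertPemFile and sslKeyPemFile.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the original guide).

# Count the certificates in a PEM bundle; the chain described above has three entries.
count_pem_certificates() {
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Print the subject and issuer of every certificate in the bundle, in file order.
print_chain_subjects() {
  openssl crl2pkcs7 -nocrl -certfile "$1" | openssl pkcs7 -print_certs -noout
}

# Succeed when the RSA modulus of the first certificate equals that of the private key.
# Arguments: certificate file, key file.
cert_matches_key() {
  [ "$(openssl x509 -in "$1" -noout -modulus)" = "$(openssl rsa -in "$2" -noout -modulus)" ]
}
```

For example, `count_pem_certificates cert.pem` prints the number of certificates in the file, `print_chain_subjects cert.pem` lets you confirm that each issuer matches the subject of the next entry, and `cert_matches_key cert.pem key.pem && echo "match"` reports whether the pair belongs together.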
## Training guide

-When the training job is running, you can check the job status in the workspace portal. When you encounter some abnormal job status, such as the job retried multiple times, or the job has been stuck in initializing state, or even the job has eventually failed, you can follow the guide below to troubleshoot the issue.
+When the training job is running, you can check the job status in the workspace portal. If you encounter an abnormal job status, such as a job that retried multiple times, is stuck in the initializing state, or eventually failed, you can follow this guide to troubleshoot the issue.

### Job retry debugging

-If the training job pod running in the cluster was terminated due to the node running to node OOM (out of memory), the job will be**automatically retried** to another available node.
+If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is **automatically retried** on another available node.

To further debug the root cause of the job retry, you can go to the workspace portal to check the job retry log.

-* Each retry log will be recorded in a new log folder with the format of "retry-<retry number\>"(such as: retry-001).
+* Each retry log is recorded in a new log folder with the format "retry-<retry number\>" (such as retry-001).

-Then you can get the retry job-node mapping information as mentioned above, to figure out which node the retry-job has been running on.
+Then you can get the retry job-node mapping information to figure out which node the retry job ran on.

:::image type="content" source="media/how-to-troubleshoot-kubernetes-compute/job-retry-log.png" alt-text="Screenshot of adding a new extension to the Azure Arc-enabled Kubernetes cluster from the Azure portal.":::

You can get job-node mapping information from the
**amlarc_cr_bootstrap.log** under the system_logs folder.

-The host name of the node which the job pod is running on will be indicated in this log, for example:
+The host name of the node that the job pod runs on is indicated in this log, for example:
```bash
++ echo 'Run on node: ask-agentpool-17631869-vmss0000'
@@ -257,7 +262,7 @@ If the error message is:
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
```
-Please check whether you have `enableTraining=True`set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
+Check whether `enableTraining=True` was set when you installed the Azure Machine Learning extension. For more details, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
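
One way to inspect the setting on an installed extension is sketched here. It assumes the Azure CLI with the `k8s-extension` CLI extension is installed and signed in, and that the setting is exposed under `configurationSettings` in the command output. The function names and all argument values are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the original guide).
# Arguments: resource group, cluster name, cluster type, extension name.

# Print the enableTraining setting reported for the extension.
extension_training_setting() {
  az k8s-extension show --resource-group "$1" --cluster-name "$2" \
    --cluster-type "$3" --name "$4" \
    --query "configurationSettings.enableTraining" --output tsv
}

# Succeed only when the reported setting is "True" or "true".
is_training_enabled() {
  case "$(extension_training_setting "$@")" in
    [Tt]rue) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_training_enabled my-rg my-cluster managedClusters my-extension || echo "training not enabled"` prints a message when the setting is missing or false. All four argument values are placeholders.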
### Job failed. 400
@@ -274,13 +279,13 @@ If you need to access Azure Container Registry (ACR) for Docker image, and to ac
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or to access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.

-In the above training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. So without this identity, the training job will fail and report missing account key or sas token. Take accessing storage account for example, if you don't specify a managed identity to your Kubernetes compute, the job fails with the following error message:
+In the above training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. For example, when accessing a storage account, if you don't specify a managed identity for your Kubernetes compute, the job fails with the following error message:

```bash
Unable to mount data store workspaceblobstore. Give either an account key or SAS token
```

-This is because machine learning workspace default storage account without any credentials is not accessible for training jobs in Kubernetes compute.
+The cause is that the machine learning workspace's default storage account isn't accessible to training jobs in Kubernetes compute without credentials.

To mitigate this issue, you can assign a managed identity to the compute in the compute attach step, or after the compute has been attached. For more details, see [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
@@ -292,13 +297,13 @@ If you need to access the AzureBlob for data upload or download in your training
Unable to upload project files to working directory in AzureBlob because the authorization failed.
```
-This is because the authorization failed when the job tries to upload the project files to the AzureBlob. You can check the following items to troubleshoot the issue:
+The cause is that authorization failed when the job tried to upload the project files to AzureBlob. You can check the following items to troubleshoot the issue:
* Make sure the storage account has enabled the exception “Allow Azure services on the trusted service list to access this storage account” and that the workspace is in the resource instances list.
* Make sure the workspace has a system-assigned managed identity.

## Private link issue

-We could use the method below to check private link setup by logging into one pod in the Kubernetes cluster and then check related network settings.
+You can check the private link setup by logging in to a pod in the Kubernetes cluster and then checking the related network settings.

* Find the workspace ID in the Azure portal, or get the ID by running `az ml workspace show` in the command line.
* Show all azureml-fe pods by running `kubectl get po -n azureml -l azuremlappname=azureml-fe`.
@@ -311,13 +316,13 @@ If you set up private link from VNet to workspace correctly, then the internal I
curl https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/post -X POST -x {proxy_address} -d {} -v -k
```
-If the proxy and workspace with private link is configured correctly, you can see it's trying to connect to an internal IP. This will return a response with http 401, which is expected when you don't provide token.
+When the proxy and workspace are correctly set up with a private link, you should see an attempt to connect to an internal IP. An HTTP 401 response is expected in this scenario if you don't provide a token.
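
To judge whether the address shown in the verbose `curl` output is internal, a small helper like the following can be used. This is a sketch that assumes "internal IP" means an RFC 1918 private IPv4 address; the function name is hypothetical, and the address has to be copied from the `curl -v` output by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original guide):
# succeed when the IPv4 address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*) return 0 ;;                                        # 10.0.0.0/8
    192.168.*) return 0 ;;                                   # 192.168.0.0/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;       # 172.16.0.0/12
    *) return 1 ;;
  esac
}
```

For example, `is_private_ipv4 10.0.0.4` succeeds, while `is_private_ipv4 52.0.0.1` fails, which would suggest the request isn't resolving through the private link.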
## Other known issues

### Kubernetes compute update does not take effect

-At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace will not take effect.
+At this time, the CLI v2 and SDK v2 don't allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace doesn't take effect.