In this article, you'll learn how to troubleshoot common workload (including training jobs and endpoints) errors on the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
17
+
In this article, you learn how to troubleshoot common workload (including training jobs and endpoints) errors on the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
18
18
19
19
## Inference guide
20
20
21
21
The common Kubernetes endpoint errors on Kubernetes compute are categorized into two scopes: **compute scope** and **cluster scope**. The compute scope errors are related to the compute target, such as the compute target is not found, or the compute target is not accessible. The cluster scope errors are related to the underlying Kubernetes cluster, such as the cluster itself is not reachable, or the cluster is not found.
22
22
23
23
### Kubernetes compute errors
24
24
25
-
Below is a list of error types in **compute scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can trouble shoot by following the guidelines:
25
+
The following are common error types in **compute scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference. You can troubleshoot them by following the guidelines:
@@ -33,7 +33,7 @@ Below is a list of error types in **compute scope** that you might encounter whe
33
33
34
34
35
35
#### ERROR: GenericComputeError
36
-
The error message is as below:
36
+
The error message is as follows:
37
37
38
38
```bash
39
39
Failed to get compute information.
@@ -44,7 +44,7 @@ This error should occur when system failed to get the compute information from t
44
44
* Check the Kubernetes cluster health.
45
45
* You can view the cluster health check report for any issues, for example, if the cluster is not reachable.
46
46
* You can go to your workspace portal to check the compute status.
47
-
* Check if the instance types is information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
47
+
* Check if the instance type information is correct. You can check the supported instance types in the [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) documentation.
48
48
* Try to detach and reattach the compute to the workspace if applicable.
49
49
50
50
> [!NOTE]
@@ -83,7 +83,7 @@ The error message is as follows:
83
83
```bash
84
84
The compute information is invalid.
85
85
```
86
-
There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster.
86
+
There is a compute target validation process when deploying models to your Kubernetes cluster. This error occurs when the compute information is invalid. For example, the compute target is not found, or the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.
87
87
88
88
You can check the following items to troubleshoot the issue:
89
89
* Check whether the compute target you used is correct and exists in your workspace.
@@ -125,7 +125,7 @@ For AKS clusters:
125
125
* Check if the AKS cluster is shut down.
126
126
* If the cluster isn't running, you need to start the cluster first.
127
127
* Check if the AKS cluster has enabled selected network by using authorized IP ranges.
128
-
* If the AKS cluster has enabled authorized IP ranges, please make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. More information you can see this [document](how-to-deploy-kubernetes-extension.md#limitations).
128
+
* If the AKS cluster has enabled authorized IP ranges, make sure all the **Azure Machine Learning control plane IP ranges** have been enabled for the AKS cluster. For more information, see this [document](how-to-deploy-kubernetes-extension.md#limitations).
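The AKS checks above can be sketched with the Azure CLI. This is a minimal sketch, assuming the `az` CLI is installed and signed in. The resource group and cluster names are placeholders, and the function names (`get_power_state`, `get_authorized_ip_ranges`, `describe_power_state`) are hypothetical helpers introduced here for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own values.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# Print the power state of the AKS cluster (for example "Running" or "Stopped").
get_power_state() {
  az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --query "powerState.code" --output tsv
}

# Print the authorized IP ranges configured on the API server, if any.
get_authorized_ip_ranges() {
  az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --query "apiServerAccessProfile.authorizedIpRanges" --output tsv
}

# Map a power state string to a suggested next step.
describe_power_state() {
  case "$1" in
    Running) echo "cluster is running" ;;
    Stopped) echo "cluster is stopped; start it with: az aks start" ;;
    *)       echo "unknown power state: $1" ;;
  esac
}

# Usage (requires access to the Azure subscription):
#   describe_power_state "$(get_power_state)"
#   get_authorized_ip_ranges
```

If `get_authorized_ip_ranges` prints nothing, authorized IP ranges are likely not enabled on the cluster, and the second check does not apply.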
129
129
130
130
131
131
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
@@ -170,7 +170,7 @@ You can check the following items to troubleshoot the issue:
170
170
171
171
#### ERROR: RefreshExtensionIdentityNotSet
172
172
173
-
This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to re-install the extension to fix it.
173
+
This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to reinstall the extension to fix it.
174
174
175
175
> Note that this error applies only to managed clusters.
176
176
@@ -179,46 +179,46 @@ This error occurs when the extension is installed but the extension identity is
179
179
180
180
181
181
### How to check sslCertPemFile and sslKeyPemFile is correct?
182
-
Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
182
+
To surface any known errors, use the following commands to run a baseline check for your cert and key. Expect the second command to return "RSA key ok" without prompting you for a password.
183
183
184
184
```bash
185
185
openssl x509 -in cert.pem -noout -text
186
186
openssl rsa -in key.pem -noout -check
187
187
```
188
188
189
-
Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
189
+
Run the following commands to verify whether sslCertPemFile and sslKeyPemFile match:
For sslCertPemFile, it is the public certificate. It should include the certificate chain which includes below certificates and should be in the sequence of the server certificate, the intermediate CA certificate and the root CA certificate:
197
-
* The server certificate: This is the certificate that the server presents to the client during the TLS handshake. It contains the server’s public key, domain name, and other information. The server certificate is signed by an intermediate certificate authority (CA) that vouches for the server’s identity.
198
-
* The intermediate CA certificate: This is the certificate that the intermediate CA presents to the client to prove its authority to sign the server certificate. It contains the intermediate CA’s public key, name, and other information. The intermediate CA certificate is signed by a root CA that vouches for the intermediate CA’s identity.
199
-
* The root CA certificate: This is the certificate that the root CA presents to the client to prove its authority to sign the intermediate CA certificate. It contains the root CA’s public key, name, and other information. The root CA certificate is self-signed and trusted by the client.
196
+
sslCertPemFile is the public certificate. It should include the full certificate chain, in this order: the server certificate, the intermediate CA certificate, and the root CA certificate:
197
+
* The server certificate: the certificate that the server presents to the client during the TLS handshake. It contains the server’s public key, domain name, and other information. The server certificate is signed by an intermediate certificate authority (CA) that vouches for the server’s identity.
198
+
* The intermediate CA certificate: the certificate that the intermediate CA presents to the client to prove its authority to sign the server certificate. It contains the intermediate CA’s public key, name, and other information. The intermediate CA certificate is signed by a root CA that vouches for the intermediate CA’s identity.
199
+
* The root CA certificate: the certificate that the root CA presents to the client to prove its authority to sign the intermediate CA certificate. It contains the root CA’s public key, name, and other information. The root CA certificate is self-signed and trusted by the client.
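The chain order described above can be inspected with a few `openssl` calls. This is a minimal sketch, assuming `openssl` is installed and that the certificate bundle and key are unencrypted PEM files. The function names (`count_certs`, `print_chain`, `key_matches_cert`) are hypothetical helpers, not part of any Azure tooling.

```bash
#!/usr/bin/env bash

# Count the certificates contained in a PEM bundle.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Print the subject and issuer of every certificate in the bundle, in file order.
print_chain() {
  openssl crl2pkcs7 -nocrl -certfile "$1" | openssl pkcs7 -print_certs -noout
}

# Succeed if the private key belongs to the first certificate in the bundle.
# Compares the public key embedded in the certificate with the one derived from the key.
key_matches_cert() {
  local cert_pub key_pub
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 1
  key_pub="$(openssl pkey -in "$2" -pubout)" || return 1
  [ "$cert_pub" = "$key_pub" ]
}

# Usage (cert.pem and key.pem are placeholder file names):
#   count_certs cert.pem        # a full chain as described here would contain 3
#   print_chain cert.pem
#   key_matches_cert cert.pem key.pem && echo "key matches certificate"
```

In the output of `print_chain`, a correctly ordered chain shows each certificate's issuer equal to the subject of the certificate that follows it, and the last (root) certificate with identical subject and issuer.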
200
200
201
201
202
202
## Training guide
203
203
204
-
When the training job is running, you can check the job status in the workspace portal. When you encounter some abnormal job status, such as the job retried multiple times, or the job has been stuck in initializing state, or even the job has eventually failed, you can follow the guide below to troubleshoot the issue.
204
+
When the training job is running, you can check the job status in the workspace portal. When you encounter an abnormal job status, such as the job being retried multiple times, being stuck in the initializing state, or eventually failing, you can follow this guide to troubleshoot the issue.
205
205
206
206
### Job retry debugging
207
207
208
-
If the training job pod running in the cluster was terminated due to the node running to node OOM (out of memory), the job will be**automatically retried** to another available node.
208
+
If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is **automatically retried** on another available node.
209
209
210
210
To further debug the root cause of the job retry, you can go to the workspace portal to check the job retry log.
211
211
212
-
* Each retry log will be recorded in a new log folder with the format of "retry-<retry number\>"(such as: retry-001).
212
+
* Each retry log is recorded in a new log folder with the format of "retry-<retry number\>" (such as: retry-001).
213
213
214
-
Then you can get the retry job-node mapping information as mentioned above, to figure out which node the retry-job has been running on.
214
+
Then you can get the retry job-node mapping information to figure out which node the retried job ran on.
215
215
216
216
:::image type="content" source="media/how-to-troubleshoot-kubernetes-compute/job-retry-log.png" alt-text="Screenshot of adding a new extension to the Azure Arc-enabled Kubernetes cluster from the Azure portal.":::
217
217
218
218
You can get job-node mapping information from the
219
219
**amlarc_cr_bootstrap.log** under system_logs folder.
220
220
221
-
The host name of the node which the job pod is running on will be indicated in this log, for example:
221
+
The host name of the node on which the job pod is running is indicated in this log, for example:
222
222
223
223
```bash
224
224
++ echo 'Run on node: ask-agentpool-17631869-vmss0000'
@@ -262,7 +262,7 @@ If the error message is:
262
262
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
263
263
```
264
264
265
-
Please check whether you have `enableTraining=True`set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
265
+
Check whether you set `enableTraining=True` when installing the Azure Machine Learning extension. For more details, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
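One way to inspect the setting on an installed extension is sketched below. It assumes the `az` CLI with the `k8s-extension` extension is available, that the cluster is an AKS cluster (`managedClusters`), and that the setting is exposed under `configurationSettings.enableTraining` in the command output. The extension, cluster, and resource group names are placeholders, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own values.
EXTENSION_NAME="my-azureml-extension"
CLUSTER_NAME="my-aks-cluster"
RESOURCE_GROUP="my-resource-group"

# Print the enableTraining configuration value of the installed extension.
# Assumption: the value is stored under configurationSettings.enableTraining.
get_enable_training() {
  az k8s-extension show \
    --name "$EXTENSION_NAME" \
    --cluster-type managedClusters \
    --cluster-name "$CLUSTER_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --query "configurationSettings.enableTraining" --output tsv
}

# Succeed if the given value means "enabled" (case-insensitive "true").
is_training_enabled() {
  local value
  value="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  [ "$value" = "true" ]
}

# Usage (requires access to the cluster's Azure resources):
#   if is_training_enabled "$(get_enable_training)"; then
#     echo "training is enabled"
#   else
#     echo "training is not enabled; reinstall or update the extension with enableTraining=True"
#   fi
```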
266
266
267
267
### Job failed. 400
268
268
@@ -279,13 +279,13 @@ If you need to access Azure Container Registry (ACR) for Docker image, and to ac
279
279
280
280
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.
281
281
282
-
In the above training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. So without this identity, the training job will fail and report missing account key or sas token. Take accessing storage account for example, if you don't specify a managed identity to your Kubernetes compute, the job fails with the following error message:
282
+
In the preceding training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. Take accessing a storage account, for example: if you don't assign a managed identity to your Kubernetes compute, the job fails with the following error message:
283
283
284
284
```bash
285
285
Unable to mount data store workspaceblobstore. Give either an account key or SAS token
286
286
```
287
287
288
-
This is because machine learning workspace default storage account without any credentials is not accessible for training jobs in Kubernetes compute.
288
+
The cause is that the default storage account of the machine learning workspace is not accessible to training jobs in Kubernetes compute without credentials.
289
289
290
290
To mitigate this issue, you can assign a managed identity to the compute in the compute attach step, or after the compute has been attached. For more details, see [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
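Assigning a system-assigned identity at attach time might look like the following sketch. It assumes the `az` CLI with the `ml` extension (CLI v2) and an AKS cluster. All variable values are placeholders, and the function names are hypothetical. Treat the linked article as the authoritative reference for the exact parameters.

```bash
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own values.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="my-resource-group"
WORKSPACE_NAME="my-workspace"
CLUSTER_NAME="my-aks-cluster"
COMPUTE_NAME="k8s-compute"

# Build the ARM resource ID of an AKS cluster from its parts.
aks_resource_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.ContainerService/managedClusters/%s\n' \
    "$1" "$2" "$3"
}

# Attach the cluster to the workspace with a system-assigned managed identity.
attach_with_system_identity() {
  az ml compute attach \
    --resource-group "$RESOURCE_GROUP" \
    --workspace-name "$WORKSPACE_NAME" \
    --type Kubernetes \
    --name "$COMPUTE_NAME" \
    --resource-id "$(aks_resource_id "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$CLUSTER_NAME")" \
    --identity-type SystemAssigned
}

# Usage (requires access to the workspace and cluster):
#   attach_with_system_identity
```

After the compute is attached, the identity still needs the appropriate role assignments on the storage account or container registry before training jobs can use it.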
291
291
@@ -297,13 +297,13 @@ If you need to access the AzureBlob for data upload or download in your training
297
297
Unable to upload project files to working directory in AzureBlob because the authorization failed.
298
298
```
299
299
300
-
This is because the authorization failed when the job tries to upload the project files to the AzureBlob. You can check the following items to troubleshoot the issue:
300
+
The cause is that authorization failed when the job tried to upload the project files to AzureBlob. You can check the following items to troubleshoot the issue:
301
301
* Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list.
302
302
* Make sure the workspace has a system assigned managed identity.
303
303
304
304
## Private link issue
305
305
306
-
We could use the method below to check private link setup by logging into one pod in the Kubernetes cluster and then check related network settings.
306
+
You can check the private link setup by logging into one pod in the Kubernetes cluster and then checking the related network settings.
307
307
308
308
* Find workspace ID in Azure portal or get this ID by running `az ml workspace show` in the command line.
309
309
* Show all azureml-fe pods by running `kubectl get po -n azureml -l azuremlappname=azureml-fe`.
@@ -316,13 +316,13 @@ If you set up private link from VNet to workspace correctly, then the internal I
316
316
curl https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/post -X POST -x {proxy_address} -d {} -v -k
317
317
```
318
318
319
-
If the proxy and workspace with private link is configured correctly, you can see it's trying to connect to an internal IP. This will return a response with http 401, which is expected when you don't provide token.
319
+
When the proxy and workspace are correctly set up with a private link, you should observe an attempt to connect to an internal IP. A response with an HTTP 401 status code is expected in this scenario if a token is not provided.
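The status check described here can be scripted so that only the HTTP status code is printed. This is a minimal sketch, assuming `curl` is available inside the pod. `probe_status` and `interpret_status` are hypothetical helper names, and the URL and proxy address are whatever you used in the `curl` command shown earlier.

```bash
#!/usr/bin/env bash

# Send an empty POST through the proxy and print only the HTTP status code.
# $1: request URL, $2: proxy address. curl prints 000 when no HTTP response arrives.
probe_status() {
  curl --silent --insecure --output /dev/null --write-out '%{http_code}' \
    --request POST --proxy "$2" --data '{}' "$1"
}

# Translate the status code into a short diagnosis.
interpret_status() {
  case "$1" in
    401) echo "reachable: 401 is expected without a token" ;;
    000) echo "no HTTP response: check proxy, DNS, and private link" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage (run inside the pod; both arguments are placeholders):
#   interpret_status "$(probe_status "$WORKSPACE_URL" "$PROXY_ADDRESS")"
```

This sketch does not show which IP address was contacted. Keep the `-v` flag from the original command when you need to confirm that the connection goes to an internal IP.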
320
320
321
321
## Other known issues
322
322
323
323
### Kubernetes compute update does not take effect
324
324
325
-
At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace will not take effect.
325
+
At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace does not take effect.