articles/machine-learning/how-to-deploy-kubernetes-extension.md: 1 addition, 1 deletion
@@ -59,7 +59,7 @@ You can use Azure Machine Learning CLI command `k8s-extension create` to deploy
59
59
|`allowInsecureConnections`|`True` or `False`, default `False`. **Can** be set to `True` to use inference HTTP endpoints for development or test purposes. |N/A| Optional | Optional |
60
60
|`inferenceRouterServiceType`|`loadBalancer`, `nodePort` or `clusterIP`. **Required** if `enableInference=True`. | N/A|**✓**|**✓**|
61
61
|`internalLoadBalancerProvider`| Currently applicable only to Azure Kubernetes Service (AKS) clusters. Set to `azure` to allow the inference router to use an internal load balancer. | N/A| Optional | Optional |
62
-
|`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md#configure-sslsecret). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional | Optional |
62
+
|`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional | Optional |
63
63
|`sslCname`|A TLS/SSL CNAME used by the inference HTTPS endpoint. **Required** if `allowInsecureConnections=False`.| N/A | Optional | Optional|
64
64
|`inferenceRouterHA`|`True` or `False`, default `True`. By default, the Azure Machine Learning extension deploys three inference router replicas for high availability, which requires at least three worker nodes in the cluster. Set to `False` if your cluster has fewer than three worker nodes; in that case, only one inference router service is deployed. | N/A| Optional | Optional |
65
65
|`nodeSelector`| By default, the deployed Kubernetes resources and your machine learning workloads are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to all nodes. To restrict the extension deployment and your training/inference workloads to specific nodes with labels `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1` and `nodeSelector.key2=value2`, respectively. | Optional| Optional | Optional |
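
As a rough sketch of what the `sslSecret` row describes, the following shell function renders a Kubernetes Secret manifest in the `azureml` namespace that holds `cert.pem` and `key.pem`. The function name `render_ssl_secret`, the `Opaque` secret type, and the file names in the usage comment are assumptions for illustration; the linked article remains the authoritative definition.

```bash
#!/usr/bin/env bash
# Sketch only: render_ssl_secret is a hypothetical helper, not part of the Azure CLI.
# It prints a Secret manifest whose data keys match the names in the sslSecret row.
render_ssl_secret() {
  local name="$1" cert_file="$2" key_file="$3"
  local cert_b64 key_b64

  # Refuse to render a manifest if either input file is missing or unreadable.
  [ -r "$cert_file" ] || return 1
  [ -r "$key_file" ] || return 1

  # Secret data values must be base64-encoded on a single line.
  cert_b64=$(base64 < "$cert_file" | tr -d '\n')
  key_b64=$(base64 < "$key_file" | tr -d '\n')

  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: azureml
type: Opaque
data:
  cert.pem: ${cert_b64}
  key.pem: ${key_b64}
EOF
}

# Example (assumed names): render_ssl_secret my-ssl-secret cert.pem key.pem | kubectl apply -f -
```

The secret name you choose here is the value you would then pass as `sslSecret`.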
In this article, you'll learn how to troubleshoot common problems you may encounter with using [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) for training jobs and model deployments.
17
+
In this article, you'll learn how to troubleshoot common workload errors, including errors in training jobs and endpoints, on [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
18
18
19
19
## Inference guide
20
20
21
-
### How to check sslCertPemFile and sslKeyPemFile is correct?
22
-
Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
23
-
24
-
```bash
25
-
openssl x509 -in cert.pem -noout -text
26
-
openssl rsa -in key.pem -noout -check
27
-
```
28
-
29
-
Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
Common Kubernetes endpoint errors on Kubernetes compute fall into two scopes: **compute scope** and **cluster scope**. Compute scope errors relate to the compute target, for example when the compute target isn't found or isn't accessible. Cluster scope errors relate to the underlying Kubernetes cluster, for example when the cluster isn't reachable or isn't found.
35
22
36
23
### Kubernetes compute errors
37
24
@@ -179,10 +166,27 @@ You can check the following items to troubleshoot the issue:
179
166
> [!TIP]
180
167
> For more troubleshooting guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
181
168
169
+
### How to check sslCertPemFile and sslKeyPemFile is correct?
170
+
Use the commands below to run a baseline check on your cert and key so that any known errors surface. Expect the second command to return "RSA key ok" without prompting you for a password.
171
+
172
+
```bash
173
+
openssl x509 -in cert.pem -noout -text
174
+
openssl rsa -in key.pem -noout -check
175
+
```
176
+
177
+
Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
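
One common way to run such a check, assuming an RSA key pair, is to compare the modulus of the certificate with the modulus of the private key; if the two differ, the files don't belong together. The helper name `cert_key_match` below is hypothetical, and this is a sketch of the general technique rather than the article's exact commands.

```bash
#!/usr/bin/env bash
# Sketch only: cert_key_match is a hypothetical helper, not part of any Azure tooling.
# It assumes an RSA certificate and an unencrypted RSA private key in PEM format.
# Returns 0 when the moduli match, 1 when they differ, 2 when a file can't be parsed.
cert_key_match() {
  local cert_file="$1" key_file="$2"
  local cert_mod key_mod

  # Both commands print a single line of the form "Modulus=<hex digits>".
  cert_mod=$(openssl x509 -in "$cert_file" -noout -modulus) || return 2
  key_mod=$(openssl rsa -in "$key_file" -noout -modulus) || return 2

  [ "$cert_mod" = "$key_mod" ]
}

# Example: cert_key_match cert.pem key.pem && echo "cert and key match"
```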
While the training job is running, you can check the job status in the workspace portal. If you see an abnormal job status, such as a job that retried multiple times, is stuck in the initializing state, or eventually failed, follow the guide below to troubleshoot the issue.
188
+
189
+
### Job retry debugging
186
190
187
191
If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is **automatically retried** on another available node.
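
When you suspect a node ran out of memory, a quick look at the node from the cluster side can help. The sketch below assumes `kubectl` is already configured for the cluster that runs the job; `inspect_node` is a hypothetical helper name, and the node name in the usage comment is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch only: inspect_node is a hypothetical helper, not part of the extension.
inspect_node() {
  local node="$1"
  # Ready/NotReady status, roles, and kubelet version for the node.
  kubectl get node "$node" -o wide || return 1
  # Conditions (for example MemoryPressure), allocated resources, and recent events.
  kubectl describe node "$node"
}

# Example (placeholder name): inspect_node <node-host-name>
```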
188
192
@@ -205,37 +209,45 @@ The host name of the node which the job pod is running on will be indicated in t
205
209
206
210
"ask-agentpool-17631869-vmss0000" represents the **node host name** running this job in your AKS cluster. Then you can access the cluster to check about the node status for further investigation.
If the job runs longer than you expected and if you find that your job pods are getting stuck in an Init state with this warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`, the issue might be occurring because Azure Machine Learning extension doesn't support download mode for input data.
213
216
214
-
```bash
215
-
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
216
-
```
217
+
To resolve this issue, change to mount mode for your input data.
217
218
218
-
Please check whether you have `enableTraining=True` set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
219
219
220
-
#### Unable to mount data store workspaceblobstore. Give either an account key or SAS token
220
+
### Common job failure errors
221
221
222
-
If you need to access Azure Container Registry (ACR) for Docker image, and Storage Account for training data, this issue should occur when the compute is not specified with a managed identity. This is because machine learning workspace default storage account without any credentials is not supported for training jobs.
222
+
Below is a list of common error types that you might encounter when using Kubernetes compute to create and run a training job. Follow the guidance for each to troubleshoot it:
223
223
224
-
To mitigate this issue, you can assign Managed Identity to the compute in compute attach step, or you can assign Managed Identity to the compute after it has been attached. More details could be found at [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
224
+
* [Job failed. 137](#job-failed-137)
225
+
* [Job failed. E45004](#job-failed-e45004)
226
+
* [Job failed. 400](#job-failed-400)
227
+
* [Give either an account key or SAS token](#give-either-an-account-key-or-sas-token)
#### Unable to upload project files to working directory in AzureBlob because the authorization failed
230
+
#### Job failed. 137
227
231
228
232
If the error message is:
229
233
230
234
```bash
231
-
Unable to upload project files to working directory in AzureBlob because the authorization failed.
235
+
Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n WARNING: This is a development server. Do not use it in a production deployment.\n Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
232
236
```
233
237
234
-
You can check the following items to troubleshoot the issue:
235
-
* Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list.
236
-
* Make sure the workspace has a system assigned managed identity.
238
+
Check your proxy settings and verify that 127.0.0.1 was added to `proxy-skip-range` when you ran `az connectedk8s connect`, as described in the [network configuration guidance](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
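
If you keep the skip range in a script or variable, a small check like the following can confirm that the loopback address is present before you connect the cluster. This is a minimal sketch: `skip_range_has_loopback` is a hypothetical helper, it assumes a comma-separated list, and it only looks for the literal entry `127.0.0.1` (it doesn't interpret CIDR ranges).

```bash
#!/usr/bin/env bash
# Sketch only: skip_range_has_loopback is a hypothetical helper.
# Returns 0 if the comma-separated list in $1 contains the exact entry 127.0.0.1.
skip_range_has_loopback() {
  local range
  # Drop whitespace, then wrap in commas so every entry is delimited on both sides.
  range=",$(printf '%s' "$1" | tr -d '[:space:]'),"
  case "$range" in
    *,127.0.0.1,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (placeholder value):
#   skip_range_has_loopback "10.0.0.0/16,kubernetes.default.svc,127.0.0.1" && echo "loopback is skipped"
```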
237
239
238
-
### Encountered an error when attempting to connect to the Azure Machine Learning token service
240
+
#### Job failed. E45004
241
+
242
+
If the error message is:
243
+
244
+
```bash
245
+
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
246
+
```
247
+
248
+
Check whether `enableTraining=True` was set when you installed the Azure Machine Learning extension. For more details, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
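
One way to inspect the installed extension is `az k8s-extension show`. The sketch below wraps it in a hypothetical helper, `show_training_flag`, and assumes the extension's settings appear under `configurationSettings` in the command output; adjust the query if your output is shaped differently. All names in the usage comment are placeholders.

```bash
#!/usr/bin/env bash
# Sketch only: show_training_flag is a hypothetical helper.
# Assumption: the extension's config is exposed as configurationSettings in the output.
show_training_flag() {
  local ext_name="$1" cluster_type="$2" cluster_name="$3" resource_group="$4"
  az k8s-extension show \
    --name "$ext_name" \
    --cluster-type "$cluster_type" \
    --cluster-name "$cluster_name" \
    --resource-group "$resource_group" \
    --query "configurationSettings.enableTraining" \
    --output tsv
}

# Example (placeholder names): show_training_flag my-extension managedClusters my-aks my-rg
```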
249
+
250
+
#### Job failed. 400
239
251
240
252
If the error message is:
241
253
@@ -244,23 +256,33 @@ Azure Machine Learning Kubernetes job failed. 400:{"Msg":"Encountered an error w
244
256
```
245
257
You can follow [Private Link troubleshooting section](#private-link-issue) to check your network settings.
246
258
247
-
### ServiceError
259
+
#### Give either an account key or SAS token
248
260
249
-
#### Job pod get stuck in Init state
261
+
If you need to access Azure Container Registry (ACR) for Docker images and a storage account for training data, this issue occurs when the compute isn't configured with a managed identity.
250
262
251
-
If the job runs longer than you expected and if you find that your job pods are getting stuck in an Init state with this warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`, the issue might be occurring because Azure Machine Learning extension doesn't support download mode for input data.
263
+
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.
252
264
253
-
To resolve this issue, change to mount mode for your input data.
265
+
In the above training scenario, this **computing identity** is necessary for Kubernetes compute to be used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. Take accessing a storage account, for example: if you don't specify a managed identity for your Kubernetes compute, the job fails with the following error message:
254
266
255
-
#### Azure Machine Learning Kubernetes job failed
267
+
```bash
268
+
Unable to mount data store workspaceblobstore. Give either an account key or SAS token
269
+
```
256
270
257
-
If the error message is:
271
+
This is because the machine learning workspace's default storage account isn't accessible to training jobs in Kubernetes compute without credentials.
272
+
273
+
To mitigate this issue, assign a managed identity to the compute during the compute attach step or after the compute has been attached. For more details, see [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
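
As an illustration of the attach-time option, the sketch below wraps `az ml compute attach` with a system-assigned identity. The helper name `attach_with_system_identity` is hypothetical, and every value in the usage comment is a placeholder; the linked article covers user-assigned identities and the other attach options.

```bash
#!/usr/bin/env bash
# Sketch only: attach_with_system_identity is a hypothetical helper.
# It attaches a Kubernetes cluster as compute with a system-assigned managed identity.
attach_with_system_identity() {
  local resource_group="$1" workspace="$2" compute_name="$3" cluster_id="$4"
  az ml compute attach \
    --resource-group "$resource_group" \
    --workspace-name "$workspace" \
    --type Kubernetes \
    --name "$compute_name" \
    --resource-id "$cluster_id" \
    --identity-type SystemAssigned
}

# Example (placeholder values):
#   attach_with_system_identity my-rg my-workspace k8s-compute "<cluster-resource-id>"
```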
274
+
275
+
#### AzureBlob authorization failed
276
+
277
+
If your training job on Kubernetes compute needs to access AzureBlob for data upload or download, the job might fail with the following error message:
258
278
259
279
```bash
260
-
Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n WARNING: This is a development server. Do not use it in a production deployment.\n Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
280
+
Unable to upload project files to working directory in AzureBlob because the authorization failed.
261
281
```
262
282
263
-
Check your proxy setting and check whether 127.0.0.1 was added to proxy-skip-range when using `az connectedk8s connect` by following this [network configuring](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
283
+
This error occurs because authorization failed when the job tried to upload the project files to AzureBlob. Check the following items to troubleshoot the issue:
284
+
* Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list.
285
+
* Make sure the workspace has a system assigned managed identity.
@@ -536,6 +547,35 @@ This is a list of reasons you might run into this error when creating/updating t
536
547
* The Kubernetes cluster has an improper network configuration. Check the proxy, network policy, or certificate.
537
548
* If you are using a private AKS cluster, you need to set up private endpoints for the ACR, storage account, and workspace in the AKS vnet.
538
549
550
+
### ERROR: TokenRefreshFailed
551
+
552
+
This error occurs when the extension can't get the principal credential from Azure because the Kubernetes cluster identity isn't set properly. Reinstall the [Azure Machine Learning extension](../machine-learning/how-to-deploy-kubernetes-extension.md) and try again.
553
+
554
+
555
+
### ERROR: GetAADTokenFailed
556
+
557
+
This error occurs when the Kubernetes cluster's request for an AAD token fails or times out. Check your network accessibility and try again.
558
+
559
+
* Follow [Configure required network traffic](../machine-learning/how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute) to check the outbound proxy and make sure the cluster can connect to the workspace.
560
+
* The workspace endpoint URL can be found in the online endpoint CRD in the cluster.
561
+
562
+
If your workspace is a private workspace with public network access disabled, the Kubernetes cluster should communicate with that workspace only through the private link.
563
+
564
+
* Check whether the workspace allows public network access. If it doesn't, an AKS cluster can't reach the private workspace over the public network, regardless of whether the cluster itself is public or private.
565
+
* For more information, see [Secure Azure Kubernetes Service inferencing environment](../machine-learning/how-to-secure-kubernetes-inferencing-environment.md#what-is-a-secure-aks-inferencing-environment).
566
+
567
+
### ERROR: ACRAuthenticationChallengeFailed
568
+
569
+
This error occurs when the Kubernetes cluster can't reach the workspace's ACR service to complete the authentication challenge. Check your network, especially the ACR public network access setting, and try again.
570
+
571
+
You can follow the troubleshooting steps in [GetAADTokenFailed](#error-getaadtokenfailed) to check the network.
572
+
573
+
### ERROR: ACRTokenExchangeFailed
574
+
575
+
This error occurs when the Kubernetes cluster fails to exchange the ACR token because the AAD token isn't authorized yet. Role assignment takes some time to take effect, so wait a moment and try again.
576
+
577
+
This failure might also be caused by too many requests to the ACR service at that time. It should be a transient error, so you can try again later.
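
For transient failures like these, "wait and try again" can be scripted. The following is a minimal sketch of a generic retry loop with a doubling delay; `retry_with_backoff` is a hypothetical helper rather than an Azure CLI feature, and the attempt count, delay, and file name in the usage comment are placeholders.

```bash
#!/usr/bin/env bash
# Sketch only: retry_with_backoff is a hypothetical helper, not an Azure CLI feature.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait before the next try
    attempt=$((attempt + 1))
  done
}

# Example (placeholder values):
#   retry_with_backoff 5 30 az ml online-deployment create --file deployment.yml
```

Doubling the delay keeps pressure off the ACR service while still giving role assignments time to propagate.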
578
+
539
579
### ERROR: ImagePullLoopBackOff
540
580
541
581
You might run into this error when creating or updating Kubernetes online deployments if the images can't be downloaded from the container registry, which results in an image pull failure.