Commit 78d25e8

Merge pull request #233845 from jiaochenlu/update-230407
update TSG of Kubernetes compute
2 parents 2828770 + 5bce82e commit 78d25e8

4 files changed

+109
-43
lines changed

articles/machine-learning/how-to-deploy-kubernetes-extension.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -59,7 +59,7 @@ You can use Azure Machine Learning CLI command `k8s-extension create` to deploy
5959
| `allowInsecureConnections` |`True` or `False`, default `False`. **Can** be set to `True` to use inference HTTP endpoints for development or test purposes. |N/A| Optional | Optional |
6060
| `inferenceRouterServiceType` |`loadBalancer`, `nodePort` or `clusterIP`. **Required** if `enableInference=True`. | N/A| **✓** | **✓** |
6161
| `internalLoadBalancerProvider` | Currently applicable only to Azure Kubernetes Service (AKS) clusters. Set to `azure` to allow the inference router to use an internal load balancer. | N/A| Optional | Optional |
62-
|`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md#configure-sslsecret). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional | Optional |
62+
|`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional | Optional |
6363
|`sslCname` |A TLS/SSL CNAME used by the inference HTTPS endpoint. **Required** if `allowInsecureConnections=False`. | N/A | Optional | Optional|
6464
| `inferenceRouterHA` |`True` or `False`, default `True`. By default, the Azure Machine Learning extension deploys three inference router replicas for high availability, which requires at least three worker nodes in a cluster. Set to `False` if your cluster has fewer than three worker nodes; in this case, only one inference router service is deployed. | N/A| Optional | Optional |
6565
|`nodeSelector` | By default, the deployed Kubernetes resources and your machine learning workloads are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to ALL nodes. If you want to restrict the extension deployment and your training/inference workloads to specific nodes with labels `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1` and `nodeSelector.key2=value2`, respectively. | Optional| Optional | Optional |
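
As an illustration of how the settings in this table fit together, here is a minimal sketch of an HTTPS-enabled inference deployment on an AKS cluster. The function name and all five arguments are placeholders that do not come from the article, and the `--config` values simply reuse the spellings from the table; check them against your CLI version before relying on them.

```bash
#!/usr/bin/env bash
# Sketch only: wraps `az k8s-extension create` with a few settings from the table.
# deploy_azureml_extension and its arguments are hypothetical names.
deploy_azureml_extension() {
  local ext_name="$1" cluster="$2" resource_group="$3" ssl_secret="$4" ssl_cname="$5"

  # sslSecret and sslCname are passed because allowInsecureConnections=False.
  # inferenceRouterHA=False assumes a cluster with fewer than three worker nodes.
  az k8s-extension create \
    --name "$ext_name" \
    --extension-type Microsoft.AzureML.Kubernetes \
    --cluster-type managedClusters \
    --cluster-name "$cluster" \
    --resource-group "$resource_group" \
    --scope cluster \
    --config enableInference=True \
             inferenceRouterServiceType=loadBalancer \
             allowInsecureConnections=False \
             sslSecret="$ssl_secret" \
             sslCname="$ssl_cname" \
             inferenceRouterHA=False
}
```

A call might look like `deploy_azureml_extension azureml-ext my-aks my-rg my-tls-secret ml.contoso.com`, assuming the Kubernetes secret already exists in the `azureml` namespace.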

articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md

Lines changed: 62 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -14,24 +14,11 @@ ms.custom: build-spring-2022, cliv2, sdkv2, event-tier1-build-2022
1414

1515
# Troubleshoot Kubernetes Compute
1616

17-
In this article, you'll learn how to troubleshoot common problems you may encounter with using [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) for training jobs and model deployments.
17+
In this article, you'll learn how to troubleshoot common workload errors (for training jobs and endpoints) on [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
1818

1919
## Inference guide
2020

21-
### How to check sslCertPemFile and sslKeyPemFile is correct?
22-
Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
23-
24-
```bash
25-
openssl x509 -in cert.pem -noout -text
26-
openssl rsa -in key.pem -noout -check
27-
```
28-
29-
Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
30-
31-
```bash
32-
openssl x509 -in cert.pem -noout -modulus | md5sum
33-
openssl rsa -in key.pem -noout -modulus | md5sum
34-
```
21+
Common Kubernetes endpoint errors on Kubernetes compute fall into two scopes: **compute scope** and **cluster scope**. Compute scope errors relate to the compute target, for example when the compute target isn't found or isn't accessible. Cluster scope errors relate to the underlying Kubernetes cluster, for example when the cluster itself isn't reachable or isn't found.
3522

3623
### Kubernetes compute errors
3724

@@ -179,10 +166,27 @@ You can check the following items to troubleshoot the issue:
179166
> [!TIP]
180167
> For more troubleshooting guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
181168
169+
### How to check that sslCertPemFile and sslKeyPemFile are correct
170+
Use the commands below to run a baseline check on your cert and key, which surfaces any known errors. Expect the second command to return "RSA key ok" without prompting you for a password.
171+
172+
```bash
173+
openssl x509 -in cert.pem -noout -text
174+
openssl rsa -in key.pem -noout -check
175+
```
176+
177+
Run the commands below to verify that sslCertPemFile and sslKeyPemFile match:
178+
179+
```bash
180+
openssl x509 -in cert.pem -noout -modulus | md5sum
181+
openssl rsa -in key.pem -noout -modulus | md5sum
182+
```
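
The two modulus commands can also be combined into a single pass/fail check. The sketch below is one possible way to do that, not part of the original guidance: `cert_matches_key` is a hypothetical helper, it compares the modulus strings directly instead of their `md5sum` digests, and like the commands above it only applies to RSA keys.

```bash
#!/usr/bin/env bash
# Returns 0 when the certificate and the RSA private key share the same modulus,
# 1 when they differ, and 2 when either file can't be parsed by openssl.
cert_matches_key() {
  local cert_file="$1" key_file="$2"
  local cert_modulus key_modulus

  # Capture each modulus separately so a parse failure isn't hidden by a pipeline.
  cert_modulus="$(openssl x509 -in "$cert_file" -noout -modulus 2>/dev/null)" || return 2
  key_modulus="$(openssl rsa -in "$key_file" -noout -modulus 2>/dev/null)" || return 2

  [ -n "$cert_modulus" ] && [ "$cert_modulus" = "$key_modulus" ]
}

# Example:
#   cert_matches_key cert.pem key.pem && echo "cert and key match"
```

Checking the openssl exit status before comparing avoids a false match when both files are unreadable and both outputs are empty.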
183+
182184

183185
## Training guide
184186

185-
### Job retry
187+
While a training job is running, you can check its status in the workspace portal. If you see an abnormal job status, such as a job that was retried multiple times, is stuck in the initializing state, or eventually failed, follow the guidance below to troubleshoot the issue.
188+
189+
### Job retry debugging
186190

187191
If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is **automatically retried** on another available node.
188192

@@ -205,37 +209,45 @@ The host name of the node which the job pod is running on will be indicated in t
205209
206210
"ask-agentpool-17631869-vmss0000" is the **host name of the node** running this job in your AKS cluster. You can then access the cluster to check the node status for further investigation.
207211
208-
### UserError
209212
210-
#### Azure Machine Learning Kubernetes job failed. E45004
213+
### Job pod gets stuck in Init state
211214
212-
If the error message is:
215+
If the job runs longer than you expected and if you find that your job pods are getting stuck in an Init state with this warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`, the issue might be occurring because Azure Machine Learning extension doesn't support download mode for input data.
213216

214-
```bash
215-
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
216-
```
217+
To resolve this issue, change to mount mode for your input data.
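
As a sketch of what mount mode can look like in practice, the helper below writes a command-job spec whose input sets `mode: ro_mount`. Everything in the YAML other than the `mode` key is a placeholder assumed for illustration (script name, data path, environment, compute name), and `write_mount_mode_job` is a hypothetical function name.

```bash
#!/usr/bin/env bash
# Writes an example job spec that mounts the input data instead of downloading it.
# The quoted heredoc keeps ${{inputs.training_data}} from being expanded by the shell.
write_mount_mode_job() {
  local out_file="$1"
  cat > "$out_file" <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --data ${{inputs.training_data}}
code: ./src
inputs:
  training_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/<your-data-path>
    mode: ro_mount   # read-only mount rather than download
environment: azureml:<your-environment>@latest
compute: azureml:<your-kubernetes-compute>
EOF
}

# Example (requires the Azure CLI ml extension):
#   write_mount_mode_job job.yml
#   az ml job create --file job.yml --resource-group <rg> --workspace-name <ws>
```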
217218

218-
Please check whether you have `enableTraining=True` set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
219219

220-
#### Unable to mount data store workspaceblobstore. Give either an account key or SAS token
220+
### Common job failure errors
221221

222-
If you need to access Azure Container Registry (ACR) for Docker image, and Storage Account for training data, this issue should occur when the compute is not specified with a managed identity. This is because machine learning workspace default storage account without any credentials is not supported for training jobs.
222+
Below is a list of common error types that you might encounter when using Kubernetes compute to create and run a training job, along with guidance for troubleshooting each one:
223223

224-
To mitigate this issue, you can assign Managed Identity to the compute in compute attach step, or you can assign Managed Identity to the compute after it has been attached. More details could be found at [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
224+
* [Job failed. 137](#job-failed-137)
225+
* [Job failed. E45004](#job-failed-e45004)
226+
* [Job failed. 400](#job-failed-400)
227+
* [Give either an account key or SAS token](#give-either-an-account-key-or-sas-token)
228+
* [AzureBlob authorization failed](#azureblob-authorization-failed)
225229

226-
#### Unable to upload project files to working directory in AzureBlob because the authorization failed
230+
#### Job failed. 137
227231

228232
If the error message is:
229233

230234
```bash
231-
Unable to upload project files to working directory in AzureBlob because the authorization failed.
235+
Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n WARNING: This is a development server. Do not use it in a production deployment.\n Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
232236
```
233237

234-
You can check the following items to troubleshoot the issue:
235-
* Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list.
236-
* Make sure the workspace has a system assigned managed identity.
238+
Check your proxy settings and verify that 127.0.0.1 was added to `--proxy-skip-range` when you ran `az connectedk8s connect`. For details, see [network configuration](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
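
If you keep the skip range in a script or variable, a quick check like the following can confirm the loopback address is present before you connect the cluster. This is only a sketch: `skip_range_has_loopback` is a hypothetical helper, it looks for the literal entry `127.0.0.1`, and it does not interpret CIDR ranges that might also cover that address.

```bash
#!/usr/bin/env bash
# Returns 0 if the comma-separated skip range contains the literal entry 127.0.0.1.
skip_range_has_loopback() {
  local skip_range="$1" entry
  local IFS=','
  for entry in $skip_range; do
    entry="${entry// /}"   # tolerate spaces after commas
    [ "$entry" = "127.0.0.1" ] && return 0
  done
  return 1
}

# Example (proxy URLs and range values are placeholders):
#   SKIP_RANGE="10.0.0.0/16,kubernetes.default.svc,127.0.0.1"
#   skip_range_has_loopback "$SKIP_RANGE" || echo "add 127.0.0.1 to the skip range"
#   az connectedk8s connect --name <cluster> --resource-group <rg> \
#     --proxy-https <https-proxy> --proxy-http <http-proxy> --proxy-skip-range "$SKIP_RANGE"
```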
237239

238-
### Encountered an error when attempting to connect to the Azure Machine Learning token service
240+
#### Job failed. E45004
241+
242+
If the error message is:
243+
244+
```bash
245+
Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
246+
```
247+
248+
Check whether you set `enableTraining=True` when you installed the Azure Machine Learning extension. For more details, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
249+
250+
#### Job failed. 400
239251

240252
If the error message is:
241253

@@ -244,23 +256,33 @@ Azure Machine Learning Kubernetes job failed. 400:{"Msg":"Encountered an error w
244256
```
245257
You can follow [Private Link troubleshooting section](#private-link-issue) to check your network settings.
246258

247-
### ServiceError
259+
#### Give either an account key or SAS token
248260

249-
#### Job pod get stuck in Init state
261+
If you need to access Azure Container Registry (ACR) for Docker images, or a storage account for training data, this issue occurs when the compute isn't configured with a managed identity.
250262

251-
If the job runs longer than you expected and if you find that your job pods are getting stuck in an Init state with this warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`, the issue might be occurring because Azure Machine Learning extension doesn't support download mode for input data.
263+
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.
252264

253-
To resolve this issue, change to mount mode for your input data.
265+
In this training scenario, the Kubernetes compute uses this **computing identity** as a credential for communication between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. For example, if you access a storage account without specifying a managed identity for your Kubernetes compute, the job fails with the following error message:
254266
255-
#### Azure Machine Learning Kubernetes job failed
267+
```bash
268+
Unable to mount data store workspaceblobstore. Give either an account key or SAS token
269+
```
256270
257-
If the error message is:
271+
This is because the workspace's default storage account isn't accessible to training jobs on Kubernetes compute without credentials.
272+
273+
To mitigate this issue, you can assign a managed identity to the compute during the compute attach step, or after the compute has been attached. For more details, see [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
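
For the attach-time option, a minimal sketch with the Azure CLI might look like the following. The wrapper name `attach_k8s_compute_with_identity` and its arguments are placeholders, and the example assumes a system-assigned identity; a user-assigned identity needs different options, so follow the linked article for that case.

```bash
#!/usr/bin/env bash
# Sketch: attach a Kubernetes cluster as a compute target with a
# system-assigned managed identity enabled at attach time.
attach_k8s_compute_with_identity() {
  local resource_group="$1" workspace="$2" compute_name="$3" cluster_resource_id="$4"

  az ml compute attach \
    --resource-group "$resource_group" \
    --workspace-name "$workspace" \
    --type Kubernetes \
    --name "$compute_name" \
    --resource-id "$cluster_resource_id" \
    --identity-type SystemAssigned
}

# Example (all values are placeholders):
#   attach_k8s_compute_with_identity my-rg my-workspace k8s-compute \
#     "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```

After attaching, the identity still needs the appropriate role assignments on the storage account and ACR before jobs can use it.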
274+
275+
#### AzureBlob authorization failed
276+
277+
If your training job on Kubernetes compute needs to access AzureBlob to upload or download data, it might fail with the following error message:
258278
259279
```bash
260-
Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n WARNING: This is a development server. Do not use it in a production deployment.\n Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
280+
Unable to upload project files to working directory in AzureBlob because the authorization failed.
261281
```
262282
263-
Check your proxy setting and check whether 127.0.0.1 was added to proxy-skip-range when using `az connectedk8s connect` by following this [network configuring](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
283+
This error occurs because authorization fails when the job tries to upload the project files to AzureBlob. You can check the following items to troubleshoot the issue:
284+
* Make sure the storage account has the exception “Allow Azure services on the trusted service list to access this storage account” enabled and the workspace is in the resource instances list.
285+
* Make sure the workspace has a system-assigned managed identity.
264286
265287
## Private link issue
266288

articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 41 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -510,14 +510,25 @@ Although we do our best to provide a stable and reliable service, sometimes thin
510510

511511
## Common errors specific to Kubernetes deployments
512512

513+
Errors related to identity and authentication:
513514
* [ACRSecretError](#error-acrsecreterror)
515+
* [TokenRefreshFailed](#error-tokenrefreshfailed)
516+
* [GetAADTokenFailed](#error-getaadtokenfailed)
517+
* [ACRAuthenticationChallengeFailed](#error-acrauthenticationchallengefailed)
518+
* [ACRTokenExchangeFailed](#error-acrtokenexchangefailed)
519+
520+
Errors related to crash loop backoff:
514521
* [ImagePullLoopBackOff](#error-imagepullloopbackoff)
515522
* [DeploymentCrashLoopBackOff](#error-deploymentcrashloopbackoff)
516523
* [KubernetesCrashLoopBackOff](#error-kubernetescrashloopbackoff)
517-
* [NamespaceNotFound](#error-namespacenotfound)
524+
525+
Errors related to the scoring script:
518526
* [UserScriptInitFailed](#error-userscriptinitfailed)
519527
* [UserScriptImportError](#error-userscriptimporterror)
520528
* [UserScriptFunctionNotFound](#error-userscriptfunctionnotfound)
529+
530+
Others:
531+
* [NamespaceNotFound](#error-namespacenotfound)
521532
* [EndpointAlreadyExists](#error-endpointalreadyexists)
522533
* [ScoringFeUnhealthy](#error-scoringfeunhealthy)
523534
* [ValidateScoringFailed](#error-validatescoringfailed)
@@ -536,6 +547,35 @@ This is a list of reasons you might run into this error when creating/updating t
536547
* The Kubernetes cluster has an improper network configuration. Check the proxy, network policy, or certificate.
537548
* If you are using a private AKS cluster, you must set up private endpoints for ACR, the storage account, and the workspace in the AKS virtual network.
538549

550+
### ERROR: TokenRefreshFailed
551+
552+
This error occurs when the extension can't get the principal credential from Azure because the Kubernetes cluster identity isn't set properly. Reinstall the [Azure Machine Learning extension](../machine-learning/how-to-deploy-kubernetes-extension.md) and try again.
553+
554+
555+
### ERROR: GetAADTokenFailed
556+
557+
This error occurs when the Kubernetes cluster's request for an AAD token fails or times out. Check your network accessibility, then try again.
558+
559+
* Follow [Configure required network traffic](../machine-learning/how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute) to check the outbound proxy, and make sure the cluster can connect to the workspace.
560+
* The workspace endpoint URL can be found in the online endpoint CRD in the cluster.
561+
562+
If your workspace is a private workspace with public network access disabled, the Kubernetes cluster should communicate with that workspace only through the private link.
563+
564+
* Check whether the workspace allows public network access. Whether the AKS cluster itself is public or private, it can't reach a private workspace without the private link.
565+
* For more information, see [Secure Azure Kubernetes Service inferencing environment](../machine-learning/how-to-secure-kubernetes-inferencing-environment.md#what-is-a-secure-aks-inferencing-environment).
566+
567+
### ERROR: ACRAuthenticationChallengeFailed
568+
569+
This error occurs when the Kubernetes cluster can't reach the workspace's ACR service to perform the authentication challenge. Check your network, especially the ACR public network access setting, then try again.
570+
571+
You can follow the troubleshooting steps in [GetAADTokenFailed](#error-getaadtokenfailed) to check the network.
572+
573+
### ERROR: ACRTokenExchangeFailed
574+
575+
This error occurs when the Kubernetes cluster fails to exchange the ACR token because the AAD token isn't authorized yet. Role assignment takes some time to take effect, so wait a moment and then try again.
576+
577+
This failure might also be caused by too many requests to the ACR service at that time. It should be a transient error, so you can try again later.
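
Because both causes are expected to clear on their own, the "wait, then try again" advice can be scripted as a retry with increasing delays. The helper below is a generic sketch, not an Azure feature: `retry_with_backoff` is a hypothetical name, and the attempt count and initial delay are values you choose.

```bash
#!/usr/bin/env bash
# Runs a command up to max_attempts times, doubling the delay after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1

  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up after the last allowed attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (deployment file and names are placeholders):
#   retry_with_backoff 5 30 az ml online-deployment create --file deployment.yml \
#     --resource-group <rg> --workspace-name <ws>
```

Doubling the delay also spaces out requests, which helps if the ACR service was rejecting calls because of request volume.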
578+
539579
### ERROR: ImagePullLoopBackOff
540580

541581
You might run into this error when creating or updating Kubernetes online deployments because the images can't be downloaded from the container registry, which causes the image pull to fail.

articles/machine-learning/toc.yml

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -493,7 +493,7 @@
493493
- name: Secure inferencing environment
494494
displayName: AKS, Arc Kubernetes, HTTPS, private IP, no-public IP, private link, private endpoint, inference
495495
href: how-to-secure-kubernetes-inferencing-environment.md
496-
- name: Secure Kubernetes online endpoint
496+
- name: Configure a secure online endpoint with TLS/SSL
497497
displayName: AKS, Arc Kubernetes, HTTPS, TLS, SSL, Cname, DNS, Certificate, inference
498498
href: how-to-secure-kubernetes-online-endpoint.md
499499
- name: Troubleshoot Azure Machine Learning extension
@@ -1174,6 +1174,10 @@
11741174
href: how-to-troubleshoot-online-endpoints.md
11751175
- name: Troubleshoot batch endpoints
11761176
href: how-to-troubleshoot-batch-endpoints.md
1177+
- name: Troubleshoot Kubernetes Compute
1178+
href: how-to-troubleshoot-kubernetes-compute.md
1179+
- name: Troubleshoot Azure Machine Learning extension
1180+
href: how-to-troubleshoot-kubernetes-extension.md
11771181
# v1
11781182
- name: Pipeline issues
11791183
items:
