Merge pull request #233845 from jiaochenlu/update-230407

PMEds28 · web-flow · commit 78d25e84e5c5 · 2023-04-10T09:43:53.000+01:00
update TSG of Kubernetes compute
diff --git a/articles/machine-learning/how-to-deploy-kubernetes-extension.md b/articles/machine-learning/how-to-deploy-kubernetes-extension.md
@@ -59,7 +59,7 @@ You can use Azure Machine Learning CLI command `k8s-extension create` to deploy
    | `allowInsecureConnections` |`True` or `False`, default `False`. **Can** be set to `True` to use inference HTTP endpoints for development or test purposes. |N/A| Optional |  Optional |
    | `inferenceRouterServiceType` |`loadBalancer`, `nodePort` or `clusterIP`.  **Required** if `enableInference=True`. | N/A| **&check;** |   **&check;** |
    | `internalLoadBalancerProvider` | This config is only applicable for Azure Kubernetes Service(AKS) cluster now. Set to `azure` to allow the inference router using internal load balancer.  | N/A| Optional |  Optional |
-   |`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md#configure-sslsecret). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional |  Optional |
+   |`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional |  Optional |
    |`sslCname` |An TLS/SSL CNAME is used by inference HTTPS endpoint. **Required** if `allowInsecureConnections=False`  |  N/A | Optional | Optional|
    | `inferenceRouterHA` |`True` or `False`, default `True`. By default, Azure Machine Learning extension will deploy three inference router replicas for high availability, which requires at least three worker nodes in a cluster. Set to `False` if your cluster has fewer than three worker nodes, in this case only one inference router service is deployed. | N/A| Optional |  Optional |
    |`nodeSelector` | By default, the deployed kubernetes resources and your machine learning workloads are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to ALL nodes. If you want to restrict the extension deployment and your training/inference workloads to specific nodes with label `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1`, `nodeSelector.key2=value2` correspondingly. | Optional| Optional |  Optional |
diff --git a/articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md b/articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md
@@ -14,24 +14,11 @@ ms.custom: build-spring-2022, cliv2, sdkv2, event-tier1-build-2022
 
 # Troubleshoot Kubernetes Compute
 
-In this article, you'll learn how to troubleshoot common problems you may encounter with using [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md) for training jobs and model deployments.
+In this article, you'll learn how to troubleshoot common workload errors (for training jobs and endpoints) on [Kubernetes compute](./how-to-attach-kubernetes-to-workspace.md).
 
 ## Inference guide
 
-### How to check sslCertPemFile and sslKeyPemFile is correct?
-Use the commands below to run a baseline check for your cert and key. This is to allow for any known errors to be surfaced. Expect the second command to return "RSA key ok" without prompting you for password.
-
-```bash
-openssl x509 -in cert.pem -noout -text
-openssl rsa -in key.pem -noout -check
-```
-
-Run the commands below to verify whether sslCertPemFile and sslKeyPemFile are matched:
-
-```bash
-openssl x509 -in cert.pem -noout -modulus | md5sum
-openssl rsa -in key.pem -noout -modulus | md5sum
-```
+Common endpoint errors on Kubernetes compute fall into two scopes: **compute scope** and **cluster scope**. Compute-scope errors relate to the compute target, for example when the compute target isn't found or isn't accessible. Cluster-scope errors relate to the underlying Kubernetes cluster, for example when the cluster itself isn't reachable or isn't found.
 
 ### Kubernetes compute errors
 
@@ -179,10 +166,27 @@ You can check the following items to troubleshoot the issue:
 > [!TIP]
    > More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
 
+### How to check sslCertPemFile and sslKeyPemFile is correct?
+Use the following commands to run a baseline check of your certificate and key, which surfaces any known errors. Expect the second command to return "RSA key ok" without prompting you for a password.
+
+```bash
+openssl x509 -in cert.pem -noout -text
+openssl rsa -in key.pem -noout -check
+```
+
+Run the following commands to verify that `sslCertPemFile` and `sslKeyPemFile` match. The two commands should print the same hash:
+
+```bash
+openssl x509 -in cert.pem -noout -modulus | md5sum
+openssl rsa -in key.pem -noout -modulus | md5sum
+```
+
 
 ## Training guide
 
-### Job retry
+While a training job is running, you can check its status in the workspace portal. If you see an abnormal job status, such as a job that was retried multiple times, is stuck in the initializing state, or eventually failed, follow the guidance below to troubleshoot the issue.
+
+### Job retry debugging
 
 If the training job pod running in the cluster was terminated due to the node running to node OOM (out of memory), the job will be **automatically retried** to another available node.
 
@@ -205,37 +209,45 @@ The host name of the node which the job pod is running on will be indicated in t
 
 "ask-agentpool-17631869-vmss0000" represents the **node host name** running this job in your AKS cluster. Then you can access the cluster to check about the node status for further investigation.
 
-### UserError
 
-#### Azure Machine Learning Kubernetes job failed. E45004
+### Job pod gets stuck in Init state
 
-If the error message is:
+If the job runs longer than expected and its pods are stuck in an Init state with the warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`, the issue might be occurring because the Azure Machine Learning extension doesn't support download mode for input data.
 
-```bash
-Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
-```
+To resolve this issue, change to mount mode for your input data.
 
-Please check whether you have `enableTraining=True` set when doing the Azure Machine Learning extension installation. More details could be found at [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md)
 
-#### Unable to mount data store workspaceblobstore. Give either an account key or SAS token
+### Common job failure errors
 
-If you need to access Azure Container Registry (ACR) for Docker image, and Storage Account for training data, this issue should occur when the compute is not specified with a managed identity. This is because machine learning workspace default storage account without any credentials is not supported for training jobs. 
+The following list contains common errors that you might encounter when you use Kubernetes compute to create and run a training job. Follow the guidance for each error to troubleshoot it:
 
-To mitigate this issue, you can assign Managed Identity to the compute in compute attach step, or you can assign Managed Identity to the compute after it has been attached. More details could be found at [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
+* [Job failed. 137](#job-failed-137)
+* [Job failed. E45004](#job-failed-e45004)
+* [Job failed. 400](#job-failed-400)
+* [Give either an account key or SAS token](#give-either-an-account-key-or-sas-token)
+* [AzureBlob authorization failed](#azureblob-authorization-failed)
 
-#### Unable to upload project files to working directory in AzureBlob because the authorization failed
+#### Job failed. 137
 
 If the error message is:
 
 ```bash
-Unable to upload project files to working directory in AzureBlob because the authorization failed. 
+Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n   WARNING: This is a development server. Do not use it in a production deployment.\n   Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
 ```
 
-You can check the following items to troubleshoot the issue:
-*  Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list. 
-*  Make sure the workspace has a system assigned managed identity.
+Check your proxy settings, and check whether 127.0.0.1 was added to `proxy-skip-range` when you ran `az connectedk8s connect`, by following the [network configuration guidance](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
 
-### Encountered an error when attempting to connect to the Azure Machine Learning token service
+#### Job failed. E45004
+
+If the error message is:
+
+```bash
+Azure Machine Learning Kubernetes job failed. E45004:"Training feature is not enabled, please enable it when install the extension."
+```
+
+Check whether `enableTraining=True` was set when you installed the Azure Machine Learning extension. For more information, see [Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster](how-to-deploy-kubernetes-extension.md).
+
+#### Job failed. 400
 
 If the error message is:
 
@@ -244,23 +256,33 @@ Azure Machine Learning Kubernetes job failed. 400:{"Msg":"Encountered an error w
 ```
 You can follow [Private Link troubleshooting section](#private-link-issue) to check your network settings.
 
-### ServiceError
+#### Give either an account key or SAS token
 
-#### Job pod get stuck in Init state
+If you need to access Azure Container Registry (ACR) for Docker images and a storage account for training data, this issue occurs when the compute isn't configured with a managed identity.
 
-If the job runs longer than you expected and if you find that your job pods are getting stuck in an Init state with this warning `Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched`,  the issue might be occurring because Azure Machine Learning extension doesn't support download mode for input data. 
+To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or to access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.
 
-To resolve this issue, change to mount mode for your input data.
+In this training scenario, the **computing identity** is the credential that Kubernetes compute uses to communicate between the ARM resource bound to the workspace and the Kubernetes computing cluster. Without this identity, the training job fails and reports a missing account key or SAS token. For example, if you access a storage account without specifying a managed identity for your Kubernetes compute, the job fails with the following error message:
 
-#### Azure Machine Learning Kubernetes job failed
+```bash
+Unable to mount data store workspaceblobstore. Give either an account key or SAS token
+```
 
-If the error message is:
+This error occurs because training jobs on Kubernetes compute can't access the workspace's default storage account without credentials.
+
+To mitigate this issue, assign a managed identity to the compute either during the compute attach step or after the compute has been attached. For more information, see [Assign Managed Identity to the compute target](how-to-attach-kubernetes-to-workspace.md#assign-managed-identity-to-the-compute-target).
+
+#### AzureBlob authorization failed
+
+If your training job on Kubernetes compute needs to access AzureBlob to upload or download data, the job might fail with the following error message:
 
 ```bash
-Azure Machine Learning Kubernetes job failed. 137:PodPattern matched: {"containers":[{"name":"training-identity-sidecar","message":"Updating certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-endpoint-server' (lazy loading)\n * Environment: production\n   WARNING: This is a development server. Do not use it in a production deployment.\n   Use a production WSGI server instead.\n * Debug mode: off\n * Running on http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}
+Unable to upload project files to working directory in AzureBlob because the authorization failed. 
 ```
 
-Check your proxy setting and check whether 127.0.0.1 was added to proxy-skip-range when using `az connectedk8s connect` by following this [network configuring](how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).
+This error occurs because authorization fails when the job tries to upload the project files to AzureBlob. Check the following items to troubleshoot the issue:
+*  Make sure the storage account has enabled the exceptions of “Allow Azure services on the trusted service list to access this storage account” and the workspace is in the resource instances list. 
+*  Make sure the workspace has a system assigned managed identity.
 
 ## Private link issue
 
diff --git a/articles/machine-learning/how-to-troubleshoot-online-endpoints.md b/articles/machine-learning/how-to-troubleshoot-online-endpoints.md
@@ -510,14 +510,25 @@ Although we do our best to provide a stable and reliable service, sometimes thin
 
 ## Common errors specific to Kubernetes deployments
 
+Errors related to identity and authentication:
 * [ACRSecretError](#error-acrsecreterror)
+* [TokenRefreshFailed](#error-tokenrefreshfailed)
+* [GetAADTokenFailed](#error-getaadtokenfailed)
+* [ACRAuthenticationChallengeFailed](#error-acrauthenticationchallengefailed)
+* [ACRTokenExchangeFailed](#error-acrtokenexchangefailed)
+
+Errors related to CrashLoopBackOff:
 * [ImagePullLoopBackOff](#error-imagepullloopbackoff)
 * [DeploymentCrashLoopBackOff](#error-deploymentcrashloopbackoff)
 * [KubernetesCrashLoopBackOff](#error-kubernetescrashloopbackoff)
-* [NamespaceNotFound](#error-namespacenotfound)
+
+Errors related to the scoring script:
 * [UserScriptInitFailed](#error-userscriptinitfailed)
 * [UserScriptImportError](#error-userscriptimporterror)
 * [UserScriptFunctionNotFound](#error-userscriptfunctionnotfound)
+
+Others:
+* [NamespaceNotFound](#error-namespacenotfound)
 * [EndpointAlreadyExists](#error-endpointalreadyexists)
 * [ScoringFeUnhealthy](#error-scoringfeunhealthy)
 * [ValidateScoringFailed](#error-validatescoringfailed)
@@ -536,6 +547,35 @@ This is a list of reasons you might run into this error when creating/updating t
 * The Kubernetes cluster has improper network configuration, please check the proxy, network policy or certificate.
   * If you are using a private AKS cluster, it is necessary to set up private endpoints for ACR, storage account, workspace in the AKS vnet. 
 
+### ERROR: TokenRefreshFailed
+
+This error occurs when the extension can't get the principal credential from Azure because the Kubernetes cluster identity isn't set properly. Reinstall the [Azure Machine Learning extension](../machine-learning/how-to-deploy-kubernetes-extension.md) and try again.
+
+
+### ERROR: GetAADTokenFailed
+
+This error occurs when the Kubernetes cluster's request for an AAD token fails or times out. Check your network accessibility, and then try again.
+
+* Follow [Configure required network traffic](../machine-learning/how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute) to check the outbound proxy, and make sure the cluster can connect to the workspace.
+* You can find the workspace endpoint URL in the online endpoint CRD in the cluster.
+
+If your workspace is a private workspace with public network access disabled, the Kubernetes cluster should communicate with it only through the private link.
+
+* Check whether the workspace allows public network access. Regardless of whether the AKS cluster itself is public or private, it can't access a private workspace except through the private link.
+* For more information, see [Secure Azure Kubernetes Service inferencing environment](../machine-learning/how-to-secure-kubernetes-inferencing-environment.md#what-is-a-secure-aks-inferencing-environment).
+
+### ERROR: ACRAuthenticationChallengeFailed
+
+This error occurs when the Kubernetes cluster can't reach the workspace's ACR service to perform the authentication challenge. Check your network, especially ACR public network access, and then try again.
+
+You can follow the troubleshooting steps in [GetAADTokenFailed](#error-getaadtokenfailed) to check the network.
+
+### ERROR: ACRTokenExchangeFailed
+
+This error occurs when the Kubernetes cluster fails to exchange the ACR token because the AAD token isn't authorized yet. Role assignment takes some time to take effect, so wait a moment and then try again.
+
+This failure might also be caused by too many requests to the ACR service at that time. It should be a transient error, so try again later.
+
 ### ERROR: ImagePullLoopBackOff
 
 The reason you might run into this error when creating/updating Kubernetes online deployments is because you can't download the images from the container registry, resulting in the images pull failure. 
diff --git a/articles/machine-learning/toc.yml b/articles/machine-learning/toc.yml
@@ -493,7 +493,7 @@
         - name: Secure inferencing environment
           displayName: AKS, Arc Kubernetes, HTTPS, private IP, no-public IP, private link, private endpoint, inference
           href: how-to-secure-kubernetes-inferencing-environment.md
-        - name: Secure Kubernetes online endpoint
+        - name: Configure a secure online endpoint with TLS/SSL
           displayName: AKS, Arc Kubernetes, HTTPS, TSL, SSL, Cname, DNS, Certificate, inference
           href: how-to-secure-kubernetes-online-endpoint.md
         - name: Troubleshoot Azure Machine Learning extension
@@ -1174,6 +1174,10 @@
       href: how-to-troubleshoot-online-endpoints.md
     - name: Troubleshoot batch endpoints
       href: how-to-troubleshoot-batch-endpoints.md
+    - name: Troubleshoot Kubernetes Compute
+      href: how-to-troubleshoot-kubernetes-compute.md
+    - name: Troubleshoot Azure Machine Learning extension
+      href: how-to-troubleshoot-kubernetes-extension.md
     # v1
     - name: Pipeline issues
       items: