Commit 9f46193

committed
update k8s compute TSG and log
1 parent 4b461ef commit 9f46193

8 files changed

+185
-19
lines changed

articles/machine-learning/how-to-deploy-kubernetes-extension.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,8 @@ In this article, you can learn:
4444
- [Disabling local accounts](../aks/managed-aad.md#disable-local-accounts) for AKS is **not supported** by Azure Machine Learning. When the AKS Cluster is deployed, local accounts are enabled by default.
4545
- If your AKS cluster has an [Authorized IP range enabled to access the API server](../aks/api-server-authorized-ip-ranges.md), enable the AzureML control plane IP ranges for the AKS cluster. The AzureML control plane is deployed across paired regions. Without access to the API server, the machine learning pods can't be deployed. Use the [IP ranges](https://www.microsoft.com/download/confirmation.aspx?id=56519) for both the [paired regions](../availability-zones/cross-region-replication-azure.md) when enabling the IP ranges in an AKS cluster.
4646
- Azure Machine Learning does not guarantee support for all preview stage features in AKS. For example, [Azure AD pod identity](../aks/use-azure-ad-pod-identity.md) is not supported.
47-
- If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
47+
- If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
48+
- Attaching an AKS cluster across subscriptions is currently not supported, which means that your AKS cluster must be in the same subscription as your workspace. As a workaround for cross-subscription scenarios, first connect the AKS cluster to Azure Arc, and then attach the resulting Arc-enabled Kubernetes resource.
4849

4950
## Review AzureML extension configuration settings
5051

articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md

Lines changed: 48 additions & 1 deletion
@@ -41,6 +41,8 @@ Below is a list of error types in **compute scope** that you might encounter whe
4141
* [ERROR: GenericComputeError](#error-genericcomputeerror)
4242
* [ERROR: ComputeNotFound](#error-computenotfound)
4343
* [ERROR: ComputeNotAccessible](#error-computenotaccessible)
44+
* [ERROR: InvalidComputeInformation](#error-invalidcomputeinformation)
45+
* [ERROR: InvalidComputeNoKubernetesConfiguration](#error-invalidcomputenokubernetesconfiguration)
4446

4547

4648
#### ERROR: GenericComputeError
@@ -71,7 +73,7 @@ Cannot find Kubernetes compute.
7173

7274
This error should occur when:
7375
* The system can't find the compute when create/update new online endpoint/deployment.
74-
* The compute of existing online endpoints/deployments have been removed.
76+
* The compute of existing online endpoints/deployments have been removed.
7577

7678
You can check the following items to troubleshoot the issue:
7779
* Try to recreate the endpoint and deployment.
@@ -87,12 +89,40 @@ The Kubernetes compute is not accessible.
8789

8890
This error should occur when the workspace MSI (managed identity) doesn't have access to the AKS cluster. You can check if the workspace MSI has the access to the AKS, and if not, you can follow this [document](how-to-identity-based-service-authentication.md) to manage access and identity.
8991

92+
#### ERROR: InvalidComputeInformation
93+
94+
The error message is as follows:
95+
96+
```bash
97+
The compute information is invalid.
98+
```
99+
When you deploy models to your Kubernetes cluster, the compute target goes through a validation process. This error occurs when the compute information is found to be invalid during validation, for example when the compute target isn't found, or when the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.
100+
101+
You can check the following items to troubleshoot the issue:
102+
* Check whether the compute target you used is correct and exists in your workspace.
103+
* Try to detach and reattach the compute to the workspace. See the additional notes on [reattach](#error-genericcomputeerror).
104+
105+
#### ERROR: InvalidComputeNoKubernetesConfiguration
106+
107+
The error message is as follows:
108+
109+
```bash
110+
The compute kubeconfig is invalid.
111+
```
112+
113+
This error occurs when the system can't find any configuration for connecting to the cluster, such as:
114+
* For an Arc-Kubernetes cluster, no Azure Relay configuration can be found.
115+
* For an AKS cluster, no AKS configuration can be found.
116+
117+
To rebuild the compute connection configuration in your cluster, try to detach and reattach the compute to the workspace. See the additional notes on [reattach](#error-genericcomputeerror).
118+
90119
### Kubernetes cluster error
91120

92121
Below is a list of error types in **cluster scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can trouble shoot by following the guideline:
93122

94123
* [ERROR: GenericClusterError](#error-genericclustererror)
95124
* [ERROR: ClusterNotReachable](#error-clusternotreachable)
125+
* [ERROR: ClusterNotFound](#error-clusternotfound)
96126

97127
#### ERROR: GenericClusterError
98128

@@ -132,6 +162,23 @@ For AKS clusters:
132162
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
133163
* Check if the Kubernetes API server is accessible by running `kubectl` command in cluster.
134164

165+
#### ERROR: ClusterNotFound
166+
167+
The error message is as follows:
168+
169+
```bash
170+
Cannot found Kubernetes cluster.
171+
```
172+
173+
This error should occur when the system cannot find the AKS/Arc-Kubernetes cluster.
174+
175+
You can check the following items to troubleshoot the issue:
176+
* First, check the cluster resource ID in the Azure portal to verify whether the Kubernetes cluster resource still exists and is running normally.
177+
* If the cluster exists and is running, try to detach and reattach the compute to the workspace. See the additional notes on [reattach](#error-genericcomputeerror).
178+
179+
> [!TIP]
180+
> For more troubleshooting guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
181+
135182

136183
## Training guide
137184

articles/machine-learning/how-to-troubleshoot-kubernetes-extension.md

Lines changed: 35 additions & 1 deletion
@@ -225,6 +225,40 @@ volcano-scheduler.conf: |
225225
- name: nodeorder
226226
- name: binpack
227227
```
228-
You need to use the same config settings as above, and disable `job/validate` webhook in the volcano admission, so that AzureML training workloads can perform properly.
228+
You need to use the same config settings as above. If your **volcano version is lower than 1.6**, you also need to disable the `job/validate` webhook in the volcano admission so that AzureML training workloads can run properly.
229+
230+
#### Volcano scheduler integration supporting cluster autoscaler
231+
As discussed in this [thread](https://github.com/volcano-sh/volcano/issues/2558), the **gang plugin** doesn't work well with the cluster autoscaler (CA) or with the node autoscaler in AKS.
232+
233+
In this case, you can use the following **no gang** volcano scheduler config when you use the cluster autoscaler:
234+
235+
```yaml
236+
volcano-scheduler.conf: |
237+
    actions: "enqueue, allocate, backfill"
238+
    tiers:
239+
    - plugins:
240+
      - name: sla
241+
        arguments:
242+
          sla-waiting-time: 1m
243+
    - plugins:
244+
      - name: conformance
245+
    - plugins:
246+
      - name: overcommit
247+
      - name: drf
248+
      - name: predicates
249+
      - name: proportion
250+
      - name: nodeorder
251+
      - name: binpack
252+
```
253+
254+
You also need to skip the resource validation when you install the extension, by configuring `--config amloperator.skipResourceValidation=true`.
255+
256+
> [!NOTE]
257+
> Because the gang plugin is removed, a deadlock can potentially occur when volcano schedules the job.
258+
>
259+
> * To avoid this situation, you can **use the same instance type across the jobs**.
260+
>
261+
> Note that you need to disable the `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**.
262+
229263
230264

articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 71 additions & 10 deletions
@@ -458,27 +458,78 @@ Retrying the operation after waiting several seconds up to a minute may allow it
458458

459459
### ERROR: NamespaceNotFound
460460

461-
The reason you might run into this error when using Kubernetes online endpoint is because the namespace your Kubernetes compute used is unavailable in your cluster.
461+
You might run into this error when creating or updating Kubernetes online endpoints because the namespace used by your Kubernetes compute is unavailable in your cluster.
462462

463463
You can check the Kubernetes compute in your workspace portal and check the namespace in your Kubernetes cluster. If the namespace is not available, you can detach the legacy compute and re-attach to create a new one, specifying a namespace that already exists in your cluster.
464464

465-
### ERROR: KubernetesCrashLoopBackOff
465+
### ERROR: EndpointAlreadyExists
466466

467-
Below is a list of reasons you might run into this error when using Kubernetes online endpoint:
468-
* There is an error in `score.py` and the container crashed when init your score code, please following [ERROR: ResourceNotReady](#error-resourcenotfound) part.
469-
* Your scoring process needs more memory that your deployment config limit is insufficient, you can try to update the deployment with a larger memory limit.
467+
You might run into this error when creating a Kubernetes online endpoint because the endpoint you're creating already exists in your cluster.
468+
469+
The endpoint name should be unique per workspace and per cluster, so in this case, create the endpoint with another name.
470+
471+
### ERROR: ScoringFeUnhealthy
472+
473+
You might run into this error when creating or updating a Kubernetes online endpoint or deployment because [Azureml-fe](how-to-kubernetes-inference-routing-azureml-fe.md), the system service running in the cluster, is not found or is unhealthy.
474+
475+
To troubleshoot this issue, you can reinstall or update the Azure Machine Learning extension in your cluster.
470476

471477
### ERROR: ACRSecretError
472478

473-
Below is a list of reasons you might run into this error when using Kubernetes online endpoint:
479+
Below is a list of reasons you might run into this error when creating/updating the Kubernetes online deployments:
474480

475481
* Role assignment has not yet been completed. In this case, please wait for a few seconds and try again later.
476-
* The Azure ARC (For Azure Arc Kubernetes cluster) or AMLArc extension (For AKS) is not properly installed or configured. Please try to check the Azure ARC or AMLArc extension configuration and status.
477-
* The Kubernetes cluster has improper network configuration, please check the proxy, network policy or certificate.
482+
* Azure Arc (for Azure Arc Kubernetes clusters) or the Azure Machine Learning extension (for AKS) is not properly installed or configured. Check the configuration and status of Azure Arc or the Azure Machine Learning extension.
483+
* The Kubernetes cluster has an improper network configuration. Check the proxy, network policy, or certificate.
484+
* If you are using a private AKS cluster, you must set up private endpoints for ACR, the storage account, and the workspace in the AKS vnet.
485+
486+
### ERROR: EndpointNotFound
487+
488+
You might run into this error when creating or updating a Kubernetes online deployment because the system can't find the endpoint resource for the deployment in the cluster. Create the deployment in an existing endpoint, or create the endpoint in your cluster first.
489+
490+
### ERROR: ValidateScoringFailed
491+
492+
You might run into this error when creating or updating a Kubernetes online deployment because the scoring request URL validation failed while the model deployment was being processed.
493+
494+
In this case, you can first check the endpoint URL and then try to re-deploy the deployment.
495+
496+
### ERROR: InvalidDeploymentSpec
497+
498+
You might run into this error when creating or updating a Kubernetes online deployment because the deployment spec is invalid.
499+
500+
In this case, you can check the error message.
501+
* Make sure the `instance count` is valid.
502+
* If you have enabled auto scaling, make sure the `minimum instance count` and `maximum instance count` are both valid.
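
The checks above can be sketched in Python. This is a minimal illustration, not the validation that the service actually performs: the helper name `check_instance_counts` and the specific rules (counts of at least 1, maximum not below minimum) are assumptions chosen for the example.

```python
from typing import List, Optional


def check_instance_counts(
    instance_count: int,
    min_instances: Optional[int] = None,
    max_instances: Optional[int] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means the values look sane.

    Hypothetical helper: the rules below are assumptions for illustration,
    not the rules that Azure Machine Learning enforces.
    """
    problems = []
    if instance_count < 1:
        problems.append("instance count must be at least 1")
    # Only check the auto scaling bounds when both are supplied.
    if min_instances is not None and max_instances is not None:
        if min_instances < 1:
            problems.append("minimum instance count must be at least 1")
        if max_instances < min_instances:
            problems.append("maximum instance count must not be below the minimum")
    return problems
```

Running a check like this before submitting the deployment spec can surface an obviously invalid value earlier than the service-side error would.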
503+
504+
### ERROR: ImagePullLoopBackOff
505+
506+
You might run into this error when creating or updating a Kubernetes online deployment because the images can't be downloaded from the container registry, which results in an image pull failure.
507+
508+
In this case, check the cluster network policy and the workspace container registry to confirm that the cluster can pull images from the container registry.
509+
510+
### ERROR: KubernetesCrashLoopBackOff
511+
512+
Below is a list of reasons you might run into this error when creating/updating the Kubernetes online endpoints/deployments:
513+
* One or more pods are stuck in CrashLoopBackoff status. Check whether the deployment log exists, and whether there are error messages in the log.
514+
* There is an error in `score.py` and the container crashed when initializing your score code. Follow the [ERROR: ResourceNotReady](#error-resourcenotready) section.
515+
* Your scoring process needs more memory than the limit in your deployment config allows. Try to update the deployment with a larger memory limit.
516+
517+
### ERROR: PodUnschedulable
518+
519+
Below is a list of reasons you might run into this error when creating/updating the Kubernetes online endpoints/deployments:
520+
* Unable to schedule pod to nodes, due to insufficient resources in your cluster.
521+
* No node matches the node affinity/selector.
522+
523+
To mitigate this error, refer to the following steps:
524+
* Check the `node selector` definition of the `instance type` you used, and `node label` configuration of your cluster nodes.
525+
* Check `instance type` and the node SKU size for AKS cluster or the node resource for Arc-Kubernetes cluster.
526+
* If the cluster is under-resourced, you can reduce the instance type resource requirement or use another instance type with a smaller resource requirement.
527+
* If the cluster has no more resources to meet the requirement of the deployment, delete some deployments to release resources.
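
To illustrate the node selector check, the following Python sketch applies the Kubernetes `nodeSelector` rule: a node is eligible only if its labels contain every key/value pair in the selector. The function name `nodes_matching_selector` and the sample labels are hypothetical, and in practice the label data would come from your cluster rather than from a hard-coded dictionary.

```python
from typing import Dict, List


def nodes_matching_selector(
    node_selector: Dict[str, str],
    nodes: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return names of nodes whose labels contain every selector key/value pair."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in node_selector.items())
    ]


# Hypothetical node labels, for illustration only.
sample_nodes = {
    "node-a": {"pool": "cpu"},
    "node-b": {"pool": "gpu", "zone": "1"},
}
print(nodes_matching_selector({"pool": "gpu"}, sample_nodes))
```

An empty result for the selector of your instance type means that no node can host the pod, which matches the "no node matches the node affinity/selector" cause listed above.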
528+
478529

479530
### ERROR: InferencingClientCallFailed
480531

481-
The reason you might run into this error when using Kubernetes online endpoint is because the k8s-extension of the Kubernetes cluster is not connectable.
532+
You might run into this error when creating or updating Kubernetes online endpoints or deployments because the k8s-extension of the Kubernetes cluster is not connectable.
482533

483534
In this case, you can detach and then **re-attach** your compute.
484535

@@ -520,6 +571,7 @@ Managed online endpoints have bandwidth limits for each endpoint. You find the l
520571

521572
When you access online endpoints with REST requests, the returned status codes adhere to the standards for [HTTP status codes](https://aka.ms/http-status-codes). Below are details about how endpoint invocation and prediction errors map to HTTP status codes.
522573

574+
#### Common error codes for managed online endpoints
523575
Below are common error codes when consuming managed online endpoints with REST requests:
524576

525577
| Status code| Reason phrase | Why this code might get returned |
@@ -533,11 +585,19 @@ Below are common error codes when consuming managed online endpoints with REST r
533585
| 429 | Rate-limiting | The number of requests per second reached the [limit](./how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints) of managed online endpoints.|
534586
| 500 | Internal server error | AzureML-provisioned infrastructure is failing. |
535587

588+
#### Common error codes for Kubernetes online endpoints
589+
536590
Below are common error codes when consuming Kubernetes online endpoints with REST requests:
537591

538592
| Status code| Reason phrase | Why this code might get returned |
539593
| --- | --- | --- |
594+
| 200 | OK | Your model executed successfully, within your latency bound. |
595+
| 401 | Unauthorized | You don't have permission to do the requested action, such as score, or your token is expired. |
596+
| 404 | Not found | The endpoint doesn't have any valid deployment with positive weight. |
597+
| 408 | Request timeout | The model execution took longer than the timeout supplied in `request_timeout_ms` under `request_settings` of your model deployment config.|
540598
| 409 | Conflict error | When an operation is already in progress, any new operation on that same online endpoint will respond with 409 conflict error. For example, If create or update online endpoint operation is in progress and if you trigger a new Delete operation it will throw an error. |
599+
| 424 | Model Error | If your model container returns a non-200 response, Azure returns a 424. Check the `Model Status Code` dimension under the `Requests Per Minute` metric on your endpoint's [Azure Monitor Metric Explorer](../azure-monitor/essentials/metrics-getting-started.md). Or check response headers `ms-azureml-model-error-statuscode` and `ms-azureml-model-error-reason` for more information. |
600+
| 429 | Too many pending requests | Your model is getting more requests than it can handle. At most 2 * `max_concurrent_requests_per_instance` * `instance_count` requests are allowed in parallel at any time. Additional requests are rejected. You can confirm these settings in your model deployment config under `request_settings` and `scale_settings`, respectively. If you're using auto-scaling, your model is getting requests faster than the system can scale up. With auto-scaling, you can try to resend requests with [exponential backoff](https://aka.ms/exponential-backoff). Doing so can give the system time to adjust. Apart from enabling auto-scaling, you can also increase the number of instances by using the [code](#how-to-prevent-503-status-codes) below. |
541601
| 502 | Has thrown an exception or crashed in the `run()` method of the score.py file | When there's an error in `score.py`, for example an imported package does not exist in the conda environment, a syntax error, or a failure in the `init()` method. You can follow [here](#error-resourcenotready) to debug the file. |
542602
| 503 | Receive large spikes in requests per second | The autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients may receive an HTTP status code 503. Even though the autoscaler reacts quickly, it takes AKS a significant amount of time to create more containers. You can follow [here](#how-to-prevent-503-status-codes) to prevent 503 status codes.|
543603
| 504 | Request has timed out | A 504 status code indicates that the request has timed out. The default timeout is 1 minute. You can increase the timeout or try to speed up the endpoint by modifying the score.py to remove unnecessary calls. If these actions don't correct the problem, you can follow [here](#error-resourcenotready) to debug the score.py file. The code may be in a non-responsive state or an infinite loop. |
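
As a rough illustration of the 429 row above, the following Python sketch computes the parallel request limit from the formula in the table and generates retry delays with exponential backoff. The function names, the default values, and the use of random jitter are assumptions made for this example; they are not part of any Azure SDK.

```python
import random
from typing import Iterator


def max_parallel_requests(
    max_concurrent_requests_per_instance: int, instance_count: int
) -> int:
    """Apply the limit from the table: 2 * max concurrent per instance * instances."""
    return 2 * max_concurrent_requests_per_instance * instance_count


def backoff_delays(
    retries: int, base_seconds: float = 1.0, cap_seconds: float = 30.0
) -> Iterator[float]:
    """Yield one delay per retry; the ceiling doubles each attempt up to the cap.

    Each delay is drawn uniformly between 0 and the current ceiling so that
    many clients don't retry at the same moment (an assumed design choice).
    """
    for attempt in range(retries):
        ceiling = min(cap_seconds, base_seconds * 2 ** attempt)
        yield random.uniform(0, ceiling)
```

A client would sleep for each yielded delay before resending a request that was rejected with 429, and stop retrying once the delays are exhausted.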
@@ -580,7 +640,7 @@ There are two things that can help prevent 503 status codes:
580640
```
581641

582642
> [!NOTE]
583-
> If you receive request spikes larger than the new minimum replicas can handle, you may receive 503s again. For example, as traffic to your endpoint increases, you may need to increase the minimum replicas.
643+
> If you receive request spikes larger than the new minimum replicas can handle, you may receive 503 again. For example, as traffic to your endpoint increases, you may need to increase the minimum replicas.
584644

585645
#### How to calculate instance count
586646
To increase the number of instances, you can calculate the required replicas by using the following code:
@@ -618,3 +678,4 @@ We recommend that you use Azure Functions, Azure Application Gateway, or any ser
618678
- [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
619679
- [Safe rollout for online endpoints](how-to-safely-rollout-online-endpoints.md)
620680
- [Online endpoint YAML reference](reference-yaml-endpoint-online.md)
681+
- [Troubleshoot Kubernetes compute](how-to-troubleshoot-kubernetes-compute.md)
