# articles/machine-learning/how-to-troubleshoot-online-endpoints.md
Below are common error codes when consuming Kubernetes online endpoints with REST requests:

| Status code | Reason phrase | Why this code might get returned |
| --- | --- | --- |
| 200 | OK | Your model executed successfully, within your latency bound. |
| 401 | Unauthorized | You don't have permission to do the requested action, such as score, or your token is expired. |
| 404 | Not found | The endpoint doesn't have any valid deployment with positive weight. |
| 408 | Request timeout | The model execution took longer than the timeout supplied in `request_timeout_ms` under `request_settings` of your model deployment config. |
| 409 | Conflict error | When an operation is already in progress, any new operation on the same online endpoint responds with a 409 conflict error. For example, if a create or update operation is in progress and you trigger a new delete operation, it throws an error. |
| 424 | Model error | If your model container returns a non-200 response, Azure returns a 424. Check the `Model Status Code` dimension under the `Requests Per Minute` metric in your endpoint's [Azure Monitor Metric Explorer](../azure-monitor/essentials/metrics-getting-started.md), or check the response headers `ms-azureml-model-error-statuscode` and `ms-azureml-model-error-reason` for more information. |
| 429 | Too many pending requests | Your model is getting more requests than it can handle. A maximum of 2 * `max_concurrent_requests_per_instance` * `instance_count` requests are allowed in parallel at any time; additional requests are rejected. You can confirm these settings in your model deployment config under `request_settings` and `scale_settings`, respectively. If you're using autoscaling, your model is getting requests faster than the system can scale up. With autoscaling, you can try to resend requests with [exponential backoff](https://aka.ms/exponential-backoff), which gives the system time to adjust. Apart from enabling autoscaling, you can also increase the number of instances by using the [code to prevent 503 status codes](#how-to-prevent-503-status-codes). |
| 502 | Exception or crash in the `run()` method of the score.py file | There's an error in `score.py`, for example an imported package that doesn't exist in the conda environment, a syntax error, or a failure in the `init()` method. Follow the steps in [ERROR: ResourceNotReady](#error-resourcenotready) to debug the file. |
| 503 | Large spikes in requests per second | The autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients may receive an HTTP status code 503. Even though the autoscaler reacts quickly, it takes AKS a significant amount of time to create more containers. See [how to prevent 503 status codes](#how-to-prevent-503-status-codes). |
| 504 | Request has timed out | A 504 status code indicates that the request has timed out. The default timeout setting is 5 seconds. You can increase the timeout or try to speed up the endpoint by modifying score.py to remove unnecessary calls. If these actions don't correct the problem, follow the steps in [ERROR: ResourceNotReady](#error-resourcenotready) to debug the score.py file; the code may be in a non-responsive state or an infinite loop. |
| 500 | Internal server error | Azure ML-provisioned infrastructure is failing. |

### How to prevent 503 status codes
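One way to avoid 429 and 503 responses is to provision enough instances for the expected load. Per the table above, at most 2 * `max_concurrent_requests_per_instance` * `instance_count` requests are served in parallel. A back-of-the-envelope sizing sketch (the traffic figures in the comments are hypothetical):

```python
import math

def max_parallel_requests(max_concurrent_requests_per_instance, instance_count):
    # Parallel-request limit from the status-code table; requests beyond
    # this bound are rejected with 429.
    return 2 * max_concurrent_requests_per_instance * instance_count

def instances_needed(peak_in_flight, max_concurrent_requests_per_instance):
    # Smallest instance_count whose limit covers the expected peak of
    # simultaneous in-flight requests (roughly peak RPS * mean latency
    # in seconds, by Little's law).
    return math.ceil(peak_in_flight / (2 * max_concurrent_requests_per_instance))

# Example: ~50 requests/s at ~0.5 s mean latency -> ~25 requests in flight.
capacity = max_parallel_requests(4, 2)
required = instances_needed(25, 4)
```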
- [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
- [Safe rollout for online endpoints](how-to-safely-rollout-online-endpoints.md)
# articles/machine-learning/reference-kubernetes.md
For AzureML extension deployment on an ARO or OCP cluster, grant privileged access to the AzureML service accounts, where:

> * `{EXTENSION-NAME}` is the extension name specified with the `az k8s-extension create --name` CLI command.
> * `{KUBERNETES-COMPUTE-NAMESPACE}` is the namespace of the Kubernetes compute specified when attaching the compute to the Azure Machine Learning workspace. Skip configuring `system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default` if `KUBERNETES-COMPUTE-NAMESPACE` is `default`.
## Collected log details

Some logs about AzureML workloads in the cluster are collected through extension components, such as status, metrics, and life cycle. The following list shows all the log details collected, including the type of logs collected and where they're sent or stored.

| Component | Description | Collected log details |
| --- | --- | --- |
| metrics-controller-manager | Manages the configuration for Prometheus. | Trace logs for the status of uploading training job and inference deployment metrics on CPU utilization and memory utilization. |
| relayserver | relayserver is only needed in an Arc-connected cluster and isn't installed in an AKS cluster. | relayserver works with Azure Relay to communicate with the cloud services. The logs contain request-level info from Azure Relay. |
## AzureML jobs connect with custom data storage
This tutorial helps illustrate how to integrate the Nginx Ingress Controller with the AzureML extension.

### Prerequisites
- [Deploy the AzureML extension](../machine-learning/how-to-deploy-kubernetes-extension.md) with `inferenceRouterServiceType=ClusterIP` and `allowInsecureConnections=True`, so that the Nginx Ingress Controller can handle TLS termination by itself instead of handing it over to [azureml-fe](../machine-learning/how-to-kubernetes-inference-routing-azureml-fe.md) when the service is exposed over HTTPS.
- For integrating with the **Nginx Ingress Controller**, you need a Kubernetes cluster with the Nginx Ingress Controller set up.
  - [**Create a basic controller**](../aks/ingress-basic.md): If you're starting from scratch, refer to these instructions.
- For integrating with **Azure Application Gateway**, you need a Kubernetes cluster with the Azure Application Gateway Ingress Controller set up.
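With these prerequisites in place, an Ingress resource can route traffic to the extension's inference router. A minimal sketch, not a definitive manifest: the `azureml` namespace, the `azureml-fe` service name, and port 80 are assumptions based on a default extension deployment, and host and TLS settings are omitted:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe-ingress   # hypothetical name for this sketch
  namespace: azureml         # assumed extension namespace; adjust to your deployment
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: azureml-fe  # the extension's inference router service
            port:
              number: 80
```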
More information about how to use an ARM template can be found in the ARM template documentation.

| Date | Version | Version description |
|---|---|---|
| Dec 27, 2022 | 1.1.17 | Move the Fluent-bit from DaemonSet to sidecars. Add MDC support. Refine error messages. Support cluster mode (Windows, Linux) jobs. Bug fixes. |
| Nov 29, 2022 | 1.1.16 | Add instance type validation by new *crd*. Support Tolerance. Shorten SVC Name. Workload Core hour. Multiple bug fixes and improvements. |
| Sep 13, 2022 | 1.1.10 | Bug fixes. |
| Aug 29, 2022 | 1.1.9 | Improved health check logic. Bug fixes. |
| Jun 23, 2022 | 1.1.6 | Bug fixes. |
| Jun 15, 2022 | 1.1.5 | Updated training to use new common runtime to run jobs. Removed Azure Relay usage for AKS extension. Removed Service Bus usage from the extension. Updated security context usage. Updated inference azureml-fe to v2. Updated to use Volcano as training job scheduler. Bug fixes. |
| Oct 14, 2021 | 1.0.37 | PV/PVC volume mount support in AMLArc training jobs. |
| Sept 16, 2021 | 1.0.29 | New regions available: WestUS, CentralUS, NorthCentralUS, KoreaCentral. Job queue explainability: see job queue details in AML Workspace Studio. Auto-killing policy: support `max_run_duration_seconds` in `ScriptRunConfig`; the system attempts to automatically cancel the run if it takes longer than the setting value. Performance improvements for cluster autoscaling support. Arc agent and ML extension deployment from on-premises container registry. |
| August 24, 2021 | 1.0.28 | Compute instance type is supported in job YAML. Assign managed identity to AMLArc compute. |