
Commit e7a2e89

Merge pull request #228543 from dem108/patch-19
Fix duplicate and reorder sections plus add clarity on a few errors
2 parents: b484d52 + 34180c9

File tree

1 file changed (+13, -24 lines)


articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 13 additions & 24 deletions
@@ -8,7 +8,7 @@ ms.subservice: mlops
 author: dem108
 ms.author: sehan
 ms.reviewer: mopeakande
-ms.date: 01/27/2023
+ms.date: 03/02/2023
 ms.topic: troubleshooting
 ms.custom: devplatv2, devx-track-azurecli, cliv2, event-tier1-build-2022, sdkv2, ignite-2022
 #Customer intent: As a data scientist, I want to figure out why my online endpoint deployment failed so that I can fix it.
@@ -265,10 +265,6 @@ This issue happens when the size of the model is larger than the available disk
 #### Memory quota
 This issue happens when the memory footprint of the model is larger than the available memory. Try a [SKU](reference-managed-online-endpoints-vm-sku-list.md) with more memory.
 
-#### Endpoint quota
-
-Try to delete some unused endpoints in this subscription. If all of your endpoints are actively in use, you can try [requesting an endpoint quota increase](how-to-manage-quotas.md#endpoint-quota-increases).
-
 #### Role assignment quota
 
 When you are creating a managed online endpoint, role assignment is required for the [managed identity](../active-directory/managed-identities-azure-resources/overview.md) to access workspace resources. If you've reached the [role assignment limit](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-rbac-limits), try to delete some unused role assignments in this subscription. You can check all role assignments in the Azure portal by navigating to the Access Control menu.
@@ -281,22 +277,6 @@ Try to delete some unused endpoints in this subscription. If all of your endpoin
 
 Due to a lack of Azure Machine Learning capacity in the region, the service has failed to provision the specified VM size. Retry later or try deploying to a different region.
 
-#### Endpoint quota
-
-Try to delete some unused endpoints in this subscription. If all of your endpoints are actively in use, you can try [requesting an endpoint quota increase](how-to-manage-quotas.md#endpoint-quota-increases).
-
-#### Region-wide VM capacity
-
-Due to a lack of Azure Machine Learning capacity in the region, the service has failed to provision the specified VM size. Retry later or try deploying to a different region.
-
-#### Kubernetes quota
-
-This issue happens when the requested CPU or memory couldn't be satisfied due to all nodes are unschedulable for this deployment, such as nodes are cordoned or nodes are unavailable.
-
-The error message will typically indicate which resource you need more of. For instance, if you see an error message detailing `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`, that means that the service requires GPUs and there are three nodes in the cluster that don't have sufficient GPUs. This can be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU-enabled SKU if you aren't, or changing your environment to not require GPUs.
-
-You can also try adjusting your request in the cluster, you can directly [adjust the resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
-
 #### Other quota
 
 To run the `score.py` provided as part of the deployment, Azure creates a container that includes all the resources that the `score.py` needs, and runs the scoring script on that container.
@@ -329,6 +309,14 @@ Use the **Endpoints** in the studio:
 1. Select the **Deployment logs** tab in the endpoint's details page.
 1. Use the dropdown to select the deployment whose log you want to see.
 
+#### Kubernetes quota
+
+This issue happens when the requested CPU or memory couldn't be satisfied due to all nodes are unschedulable for this deployment, such as nodes are cordoned or nodes are unavailable.
+
+The error message will typically indicate which resource you need more of. For instance, if you see an error message detailing `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`, that means that the service requires GPUs and there are three nodes in the cluster that don't have sufficient GPUs. This can be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU-enabled SKU if you aren't, or changing your environment to not require GPUs.
+
+You can also try adjusting your request in the cluster, you can directly [adjust the resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
+
 ---
 
 ### ERROR: BadArgument
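
For the Kubernetes quota guidance added in the hunk above, adjusting the resource request means editing the instance type the deployment uses. The snippet below is only a minimal sketch of an `InstanceType` custom resource, assuming the CRD schema described in how-to-manage-kubernetes-instance-types.md; the `apiVersion`, name, and resource values are illustrative and should be verified against your cluster.

```yaml
# Hypothetical instance type with smaller CPU/memory requests so pods can be
# scheduled on the available nodes; apply it to the attached cluster with kubectl.
apiVersion: amlarc.azureml.com/v1alpha1   # assumed API group/version, verify on your cluster
kind: InstanceType
metadata:
  name: smallinstancetype                 # illustrative name
spec:
  resources:
    requests:
      cpu: "500m"                         # lower request than the default instance type
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
      # add nvidia.com/gpu: 1 here only if the deployment actually needs a GPU
```

Referencing an instance type like this from the deployment, instead of the default one, lets the scheduler place the pod on nodes that can't satisfy the larger default request.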
@@ -451,12 +439,13 @@ Please check the pod status and logs to fix this issue, you can also try to upda
 
 To run the `score.py` provided as part of the deployment, Azure creates a container that includes all the resources that the `score.py` needs, and runs the scoring script on that container. The error in this scenario is that this container is crashing when running, which means scoring can't happen. This error happens when:
 
-- There's an error in `score.py`. Use `get-logs` to help diagnose common problems:
-    - A package that was imported but isn't in the conda environment.
+- There's an error in `score.py`. Use `get-logs` to diagnose common problems:
+    - A package that `score.py` tries to import isn't included in the conda environment.
     - A syntax error.
     - A failure in the `init()` method.
 - If `get-logs` isn't producing any logs, it usually means that the container has failed to start. To debug this issue, try [deploying locally](#deploy-locally) instead.
 - Readiness or liveness probes aren't set up correctly.
+- Container initialization is taking too long so that readiness or liveness probe fails beyond failure threshold. In this case, adjust [probe settings](reference-yaml-deployment-managed-online.md#probesettings) to allow longer time to initialize the container, or try a bigger VM SKU among [supported VM SKUs](reference-managed-online-endpoints-vm-sku-list.md) which will accelerate the initialization.
 - There's an error in the environment set up of the container, such as a missing dependency.
 - When you face `TypeError: register() takes 3 positional arguments but 4 were given` error, the error may be caused by the dependency between flask v2 and `azureml-inference-server-http`. See [FAQs for inference HTTP server](how-to-inference-server-http.md#1-i-encountered-the-following-error-during-server-startup) for more details.
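
To make the probe guidance added in this hunk concrete, here is a minimal sketch of the relevant fragment of a managed online deployment YAML, assuming the `liveness_probe`/`readiness_probe` keys documented in reference-yaml-deployment-managed-online.md#probesettings; the values are placeholders to tune against the container's actual start-up time.

```yaml
# Fragment of a managed online deployment YAML (sketch; values are placeholders).
# A longer initial_delay and a higher failure_threshold give a slow-starting
# container more time before the probes are treated as failed.
liveness_probe:
  initial_delay: 600     # seconds to wait before the first probe
  period: 10             # seconds between probes
  timeout: 10            # seconds before a single probe times out
  failure_threshold: 30  # consecutive failures tolerated before the container is restarted
readiness_probe:
  initial_delay: 10
  period: 10
  timeout: 10
  failure_threshold: 30
```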

@@ -678,7 +667,7 @@ These are common error codes when consuming managed online endpoints with REST r
 | 401 | Unauthorized | You don't have permission to do the requested action, such as score, or your token is expired. |
 | 404 | Not found | The endpoint doesn't have any valid deployment with positive weight. |
 | 408 | Request timeout | The model execution took longer than the timeout supplied in `request_timeout_ms` under `request_settings` of your model deployment config. |
-| 424 | Model Error | If your model container returns a non-200 response, Azure returns a 424. Check the `Model Status Code` dimension under the `Requests Per Minute` metric on your endpoint's [Azure Monitor Metric Explorer](../azure-monitor/essentials/metrics-getting-started.md). Or check response headers `ms-azureml-model-error-statuscode` and `ms-azureml-model-error-reason` for more information. |
+| 424 | Model Error | If your model container returns a non-200 response, Azure returns a 424. Check the `Model Status Code` dimension under the `Requests Per Minute` metric on your endpoint's [Azure Monitor Metric Explorer](../azure-monitor/essentials/metrics-getting-started.md). Or check response headers `ms-azureml-model-error-statuscode` and `ms-azureml-model-error-reason` for more information. If 424 comes with liveness or readiness probe failing, consider adjusting [probe settings](reference-yaml-deployment-managed-online.md#probesettings) to allow longer time to probe liveness or readiness of the container. |
 | 429 | Too many pending requests | Your model is getting more requests than it can handle. Azure Machine Learning allows maximum 2 * `max_concurrent_requests_per_instance` * `instance_count` requests in parallel at any time and rejects extra requests. You can confirm these settings in your model deployment config under `request_settings` and `scale_settings`, respectively. If you're using auto-scaling, this error means that your model is getting requests faster than the system can scale up. With auto-scaling, you can try to resend requests with [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff). Doing so can give the system time to adjust. Apart from enabling auto-scaling, you could also increase the number of instances by using the [code to calculate instance count](#how-to-calculate-instance-count). |
 | 429 | Rate-limiting | The number of requests per second reached the [limit](./how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints) of managed online endpoints. |
 | 500 | Internal server error | Azure Machine Learning-provisioned infrastructure is failing. |
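
For the 408 and 429 rows in the table above, the settings they reference live in the deployment configuration. The fragment below is a sketch only, with placeholder values; the field names follow the `request_settings` and `scale_settings` blocks the table mentions.

```yaml
# Fragment of a managed online deployment YAML (sketch; values are placeholders).
instance_count: 2                           # more instances raise the parallel-request ceiling
request_settings:
  request_timeout_ms: 90000                 # raise if requests legitimately take longer (408)
  max_concurrent_requests_per_instance: 1   # parallel requests allowed is about 2 * this * instance_count (429)
scale_settings:
  type: default                             # autoscaling is then driven by Azure Monitor autoscale rules
```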
