articles/machine-learning/how-to-troubleshoot-online-endpoints.md
Lines changed: 14 additions & 22 deletions
Original file line number
Diff line number
Diff line change
@@ -203,7 +203,6 @@ Below is a list of common deployment errors that are reported as part of the dep
203
203
204
204
*[ImageBuildFailure](#error-imagebuildfailure)
205
205
*[OutOfQuota](#error-outofquota)
206
-
*[OutOfCapacity](#error-outofcapacity)
207
206
*[BadArgument](#error-badargument)
208
207
*[ResourceNotReady](#error-resourcenotready)
209
208
*[ResourceNotFound](#error-resourcenotfound)
@@ -247,6 +246,7 @@ Below is a list of common resources that might run out of quota when using Azure
247
246
*[Memory](#memory-quota)
248
247
*[Role assignments](#role-assignment-quota)
249
248
*[Endpoints](#endpoint-quota)
249
+
*[Region-wide VM capacity](#region-wide-vm-capacity)
250
250
*[Other](#other-quota)
251
251
252
252
Additionally, below is a list of common resources that might run out of quota only for Kubernetes online endpoint:
@@ -261,34 +261,30 @@ A possible mitigation is to check if there are unused deployments that can be de
261
261
262
262
#### Disk quota
263
263
264
-
This issue happens when the size of the model is larger than the available disk space and the model is not able to be downloaded. Try a SKU with more disk space.
265
-
* Try a [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md) with more disk space.
266
-
* Try reducing image and model size.
264
+
This issue happens when the size of the model is larger than the available disk space and the model is not able to be downloaded. Try a [SKU](reference-managed-online-endpoints-vm-sku-list.md) with more disk space or reducing the image and model size.
267
265
268
266
#### Memory quota
269
-
This issue happens when the memory footprint of the model is larger than the available memory. Try a [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md) with more memory.<br>
270
-
271
-
#### Endpoint quota
272
-
273
-
Try to delete some unused endpoints in this subscription.
267
+
This issue happens when the memory footprint of the model is larger than the available memory. Try a [SKU](reference-managed-online-endpoints-vm-sku-list.md) with more memory.
274
268
275
269
#### Role assignment quota
276
270
277
-
When you are creating a managed online endpoint, role assignment is required for the [managed identity](../active-directory/managed-identities-azure-resources/overview.md) to access workspace resources. If you've reached the [role assignment limit](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-rbac-limits), try to delete some unused role assignments in this subscription. You can check all role assignments in the Azure portal by going to the Access Control menu.
271
+
When you are creating a managed online endpoint, role assignment is required for the [managed identity](../active-directory/managed-identities-azure-resources/overview.md) to access workspace resources. If you've reached the [role assignment limit](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-rbac-limits), try to delete some unused role assignments in this subscription. You can check all role assignments in the Azure portal by navigating to the Access Control menu.
278
272
279
-
#### Kubernetes quota
273
+
#### Endpoint quota
280
274
281
-
This issue happens when the requested CPU or memory couldn't be satisfied, such as nodes are cordoned or nodes are unavailable, which means all nodes are unschedulable.
275
+
Try to delete some unused endpoints in this subscription. If all of your endpoints are actively in use, you can try [requesting an endpoint quota increase](how-to-manage-quotas.md#endpoint-quota-increases).
282
276
283
-
Try to delete some unused endpoints in this subscription. Alternatively, follow [How to manage quotas](how-to-manage-quotas.md#endpoint-quota-increases) to request endpoint quota increase.
277
+
#### Region-wide VM capacity
284
278
285
-
Adjust your request in the cluster, you can directly [adjust resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
279
+
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to provision the specified VM size. Retry later or try deploying to a different region.
286
280
287
-
##### Container can't be scheduled
281
+
#### Kubernetes quota
282
+
283
+
This issue happens when the requested CPU, memory could not be provided. At times, nodes may be retained or unavailable, meaning that these nodes are unschedulable. When you are deploying a model to a Kubernetes compute target, Azure Machine Learning will attempt to schedule the service with the requested amount of resources. If there are no nodes available in the cluster with the appropriate amount of resources after 5 minutes, the deployment will fail. To work around this, try to delete some unused endpoints in this subscription. You can also address this error by either adding more nodes, changing the SKU of your nodes, or changing the resource requirements of your service.
288
284
289
-
When you are deploying a model to a Kubernetes compute target, Azure Machine Learning will attempt to schedule the service with the requested amount of resources. If there are no nodes available in the cluster with the appropriate amount of resources after 5 minutes, the deployment will fail. The failure message is `Couldn't Schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00`. You can address this error by either adding more nodes, changing the SKU of your nodes, or changing the resource requirements of your service.
285
+
The error message will typically indicate which resource you need more of. For instance, if you see an error message detailing `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`, that means that the service requires GPUs and there are three nodes in the cluster that don't have sufficient GPUs. This can be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU-enabled SKU if you aren't, or changing your environment to not require GPUs.
290
286
291
-
The error message will typically indicate which resource you need more of - for instance, if you see an error message indicating `0/3 nodes are available: 3 Insufficient nvidia.com/gpu` that means that the service requires GPUs and there are three nodes in the cluster that don't have available GPUs. This could be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU enabled SKU if you aren't or changing your environment to not require GPUs.
287
+
You can also try adjusting your request in the cluster, you can directly [adjust the resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
292
288
293
289
#### Other quota
294
290
@@ -324,10 +320,6 @@ Use the **Endpoints** in the studio:
324
320
325
321
---
326
322
327
-
### ERROR: OutOfCapacity
328
-
329
-
For managed online endpoint, the specified VM Size failed to provision due to a lack of Azure Machine Learning capacity. Retry later or try deploying to a different region.
330
-
331
323
### ERROR: BadArgument
332
324
333
325
Below is a list of reasons you might run into this error when using either managed online endpoint or Kubernetes online endpoint:
@@ -580,7 +572,7 @@ The reason you might run into this error when creating/updating Kubernetes onlin
580
572
581
573
In this case, you can detach and then **re-attach** your compute.
582
574
583
-
> [!NOTE]
575
+
> [!]
584
576
>
585
577
> To troubleshoot errors by re-attaching, please guarantee to re-attach with the exact same configuration as previously detached compute, such as the same compute name and namespace, otherwise you may encounter other errors.
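
The Kubernetes quota text in this diff says the scheduler message "will typically indicate which resource you need more of", quoting `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`. As a rough illustration only, the sketch below pulls the resource names and node counts out of a message of that shape. The function name `parse_unschedulable` is hypothetical, and the assumption that every shortage appears as `<count> Insufficient <resource>` is based solely on the one message quoted in the article; real scheduler output may contain other clauses that this ignores.

```python
import re


def parse_unschedulable(message: str) -> dict[str, int]:
    """Map each resource named in an 'N Insufficient <resource>' clause
    to the number of nodes reported as lacking it.

    Hypothetical helper: assumes the message shape quoted in the article.
    """
    pairs = re.findall(r"(\d+) Insufficient ([\w./-]+)", message)
    # Strip a trailing period in case the clause ends a sentence.
    return {resource.rstrip("."): int(count) for count, resource in pairs}


msg = "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"
print(parse_unschedulable(msg))  # {'nvidia.com/gpu': 3}
```

A result naming `nvidia.com/gpu` points toward the GPU-related mitigations listed in the article (adding nodes, switching to a GPU-enabled SKU, or removing the GPU requirement), while `cpu` or `memory` points toward adjusting the resource request of the instance type.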