Commit ed144bb

Merge pull request #226945 from vs-li/main
Online Endpoint TSG: Move out of region capacity section to quota
2 parents 8655f38 + e777112 commit ed144bb

1 file changed: +14 -22 lines changed


articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 14 additions & 22 deletions
@@ -203,7 +203,6 @@ Below is a list of common deployment errors that are reported as part of the dep
203203

204204
* [ImageBuildFailure](#error-imagebuildfailure)
205205
* [OutOfQuota](#error-outofquota)
206-
* [OutOfCapacity](#error-outofcapacity)
207206
* [BadArgument](#error-badargument)
208207
* [ResourceNotReady](#error-resourcenotready)
209208
* [ResourceNotFound](#error-resourcenotfound)
@@ -247,6 +246,7 @@ Below is a list of common resources that might run out of quota when using Azure
247246
* [Memory](#memory-quota)
248247
* [Role assignments](#role-assignment-quota)
249248
* [Endpoints](#endpoint-quota)
249+
* [Region-wide VM capacity](#region-wide-vm-capacity)
250250
* [Other](#other-quota)
251251

252252
Additionally, below is a list of common resources that might run out of quota only for Kubernetes online endpoint:
@@ -261,34 +261,30 @@ A possible mitigation is to check if there are unused deployments that can be de
261261

262262
#### Disk quota
263263

264-
This issue happens when the size of the model is larger than the available disk space and the model is not able to be downloaded. Try a SKU with more disk space.
265-
* Try a [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md) with more disk space.
266-
* Try reducing image and model size.
264+
This issue happens when the size of the model is larger than the available disk space and the model is not able to be downloaded. Try a [SKU](reference-managed-online-endpoints-vm-sku-list.md) with more disk space or reducing the image and model size.
267265

268266
#### Memory quota
269-
This issue happens when the memory footprint of the model is larger than the available memory. Try a [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md) with more memory.<br>
270-
271-
#### Endpoint quota
272-
273-
Try to delete some unused endpoints in this subscription.
267+
This issue happens when the memory footprint of the model is larger than the available memory. Try a [SKU](reference-managed-online-endpoints-vm-sku-list.md) with more memory.
274268

275269
#### Role assignment quota
276270

277-
When you are creating a managed online endpoint, role assignment is required for the [managed identity](../active-directory/managed-identities-azure-resources/overview.md) to access workspace resources. If you've reached the [role assignment limit](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-rbac-limits), try to delete some unused role assignments in this subscription. You can check all role assignments in the Azure portal by going to the Access Control menu.
271+
When you are creating a managed online endpoint, role assignment is required for the [managed identity](../active-directory/managed-identities-azure-resources/overview.md) to access workspace resources. If you've reached the [role assignment limit](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-rbac-limits), try to delete some unused role assignments in this subscription. You can check all role assignments in the Azure portal by navigating to the Access Control menu.
278272

279-
#### Kubernetes quota
273+
#### Endpoint quota
280274

281-
This issue happens when the requested CPU or memory couldn't be satisfied, such as nodes are cordoned or nodes are unavailable, which means all nodes are unschedulable.
275+
Try to delete some unused endpoints in this subscription. If all of your endpoints are actively in use, you can try [requesting an endpoint quota increase](how-to-manage-quotas.md#endpoint-quota-increases).
282276

283-
Try to delete some unused endpoints in this subscription. Alternatively, follow [How to manage quotas](how-to-manage-quotas.md#endpoint-quota-increases) to request endpoint quota increase.
277+
#### Region-wide VM capacity
284278

285-
Adjust your request in the cluster, you can directly [adjust resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
279+
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to provision the specified VM size. Retry later or try deploying to a different region.
286280

287-
##### Container can't be scheduled
281+
#### Kubernetes quota
282+
283+
This issue happens when the requested CPU or memory could not be provided. At times, nodes may be cordoned or unavailable, meaning that these nodes are unschedulable. When you are deploying a model to a Kubernetes compute target, Azure Machine Learning will attempt to schedule the service with the requested amount of resources. If there are no nodes available in the cluster with the appropriate amount of resources after 5 minutes, the deployment will fail. To work around this, try to delete some unused endpoints in this subscription. You can also address this error by adding more nodes, changing the SKU of your nodes, or changing the resource requirements of your service.
288284

289-
When you are deploying a model to a Kubernetes compute target, Azure Machine Learning will attempt to schedule the service with the requested amount of resources. If there are no nodes available in the cluster with the appropriate amount of resources after 5 minutes, the deployment will fail. The failure message is `Couldn't Schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00`. You can address this error by either adding more nodes, changing the SKU of your nodes, or changing the resource requirements of your service.
285+
The error message will typically indicate which resource you need more of. For instance, if you see an error message detailing `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`, that means that the service requires GPUs and there are three nodes in the cluster that don't have sufficient GPUs. This can be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU-enabled SKU if you aren't, or changing your environment to not require GPUs.
290286

291-
The error message will typically indicate which resource you need more of - for instance, if you see an error message indicating `0/3 nodes are available: 3 Insufficient nvidia.com/gpu` that means that the service requires GPUs and there are three nodes in the cluster that don't have available GPUs. This could be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU enabled SKU if you aren't or changing your environment to not require GPUs.
287+
You can also adjust your request in the cluster by directly [adjusting the resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
292288

293289
#### Other quota
294290

@@ -324,10 +320,6 @@ Use the **Endpoints** in the studio:
324320

325321
---
326322

327-
### ERROR: OutOfCapacity
328-
329-
For managed online endpoint, the specified VM Size failed to provision due to a lack of Azure Machine Learning capacity. Retry later or try deploying to a different region.
330-
331323
### ERROR: BadArgument
332324

333325
Below is a list of reasons you might run into this error when using either managed online endpoint or Kubernetes online endpoint:
@@ -580,7 +572,7 @@ The reason you might run into this error when creating/updating Kubernetes onlin
580572

581573
In this case, you can detach and then **re-attach** your compute.
582574

583-
> [!NOTE]
575+
> [!]
584576
>
585577
> To troubleshoot errors by re-attaching, please guarantee to re-attach with the exact same configuration as previously detached compute, such as the same compute name and namespace, otherwise you may encounter other errors.
586578
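
The Kubernetes quota text in this diff says the scheduler error message typically indicates which resource is short, using `0/3 nodes are available: 3 Insufficient nvidia.com/gpu` as its example. As a rough illustration only, the Python sketch below pulls the resource names and node counts out of a message of that shape. The function name `parse_insufficient_resources` and the regular expression are hypothetical helpers written for this note, not part of Azure Machine Learning or Kubernetes tooling, and the sketch assumes the message follows the `N Insufficient <resource>` pattern quoted in the article.

```python
import re

# Hypothetical pattern: matches clauses such as "3 Insufficient nvidia.com/gpu".
# Assumes the message format quoted in the troubleshooting article.
_INSUFFICIENT = re.compile(r"(\d+) Insufficient ([\w./-]+)")


def parse_insufficient_resources(message: str) -> dict:
    """Map each resource reported as insufficient to the number of nodes lacking it."""
    return {
        # Drop a trailing period in case the clause ends the sentence.
        name.rstrip("."): int(count)
        for count, name in _INSUFFICIENT.findall(message)
    }


if __name__ == "__main__":
    example = "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"
    for resource, nodes in parse_insufficient_resources(example).items():
        print(f"{nodes} node(s) lack enough {resource}")
```

A result that names `nvidia.com/gpu` points at the GPU mitigations described in the article (more nodes, a GPU-enabled SKU, or an environment that does not require GPUs), while `cpu` or `memory` points at adjusting the resource request of the instance type.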
