Commit 369542f

Merge pull request #231685 from jiaochenlu/update-230322
update k8s endpoint boundary
2 parents d74ba81 + fab324f commit 369542f

6 files changed, +101 -20 lines changed

articles/machine-learning/how-to-access-resources-from-endpoints-managed-identities.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ ms.custom: devplatv2, cliv2, event-tier1-build-2022, ignite-2022
2020

2121
Learn how to access Azure resources from your scoring script with an online endpoint and either a system-assigned managed identity or a user-assigned managed identity.
2222

23-
Managed endpoints allow Azure Machine Learning to manage the burden of provisioning your compute resource and deploying your machine learning model. Typically your model needs to access Azure resources such as the Azure Container Registry or your blob storage for inferencing; with a managed identity you can access these resources without needing to manage credentials in your code. [Learn more about managed identities](../active-directory/managed-identities-azure-resources/overview.md).
23+
Both managed endpoints and Kubernetes endpoints allow Azure Machine Learning to manage the burden of provisioning your compute resource and deploying your machine learning model. Typically your model needs to access Azure resources such as the Azure Container Registry or your blob storage for inferencing; with a managed identity you can access these resources without needing to manage credentials in your code. [Learn more about managed identities](../active-directory/managed-identities-azure-resources/overview.md).
2424

2525
This guide assumes you don't have a managed identity, a storage account or an online endpoint. If you already have these components, skip to the [give access permission to the managed identity](#give-access-permission-to-the-managed-identity) section.
2626

@@ -147,6 +147,8 @@ This guide assumes you don't have a managed identity, a storage account or an on
147147
## Limitations
148148
149149
* The identity for an endpoint is immutable. During endpoint creation, you can associate it with a system-assigned identity (default) or a user-assigned identity. You can't change the identity after the endpoint has been created.
150+
* If your Azure Container Registry (ACR) and blob storage are configured as private, that is, behind a virtual network, then access from the Kubernetes endpoint should go over the private link, regardless of whether your workspace is public or private. For more details about private link settings, see [How to secure workspace vnet](./how-to-secure-workspace-vnet.md#azure-container-registry).
151+
150152
151153
## Configure variables for deployment
152154
@@ -555,7 +557,7 @@ Then, get the Principal ID of the System-assigned managed identity:
555557
556558
[!notebook-python[] (~/azureml-examples-main/sdk/python/endpoints/online/managed/managed-identities/online-endpoints-managed-identity-sai.ipynb?name=6-get-sai-details)]
557559
558-
Next, give assign the `Storage Blob Data Reader` role to the endpoint. The Role Definition is retrieved by name and passed along with the Principal ID of the endpoint. The role is applied at the scope of the storage account created above and allows the endpoint to read the file.
560+
Next, assign the `Storage Blob Data Reader` role to the endpoint. The Role Definition is retrieved by name and passed along with the Principal ID of the endpoint. The role is applied at the scope of the storage account created above and allows the endpoint to read the file.
559561
560562
[!notebook-python[] (~/azureml-examples-main/sdk/python/endpoints/online/managed/managed-identities/online-endpoints-managed-identity-sai.ipynb?name=6-give-permission-user-storage-account)]
561563
@@ -777,4 +779,5 @@ Delete the User-assigned managed identity:
777779
* To see which compute resources you can use, see [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md).
778780
* For more on costs, see [View costs for an Azure Machine Learning managed online endpoint](how-to-view-online-endpoints-costs.md).
779781
* For information on monitoring endpoints, see [Monitor managed online endpoints](how-to-monitor-online-endpoints.md).
780-
* For limitations for managed endpoints, see [Manage and increase quotas for resources with Azure Machine Learning](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints).
782+
* For limitations for managed endpoints, see [Manage and increase quotas for resources with Azure Machine Learning: managed online endpoints](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints).
783+
* For limitations for Kubernetes endpoints, see [Manage and increase quotas for resources with Azure Machine Learning: Kubernetes online endpoints](how-to-manage-quotas.md#azure-machine-learning-kubernetes-online-endpoints).

articles/machine-learning/how-to-attach-kubernetes-to-workspace.md

Lines changed: 42 additions & 3 deletions
@@ -14,6 +14,8 @@ ms.custom: build-spring-2022, cliv2, sdkv2, event-tier1-build-2022
1414

1515
# Attach a Kubernetes cluster to Azure Machine Learning workspace
1616

17+
[!INCLUDE [dev v2](../../includes/machine-learning-dev-v2.md)]
18+
1719
Once Azure Machine Learning extension is deployed on AKS or Arc Kubernetes cluster, you can attach the Kubernetes cluster to Azure Machine Learning workspace and create compute targets for ML professionals to use.
1820

1921
## Prerequisites
@@ -66,9 +68,7 @@ We support two ways to attach a Kubernetes cluster to Azure Machine Learning wor
6668

6769
### [Azure CLI](#tab/cli)
6870

69-
[!INCLUDE [CLI v2](../../includes/machine-learning-CLI-v2.md)]
70-
71-
The following commands show how to attach an AKS and Azure Arc-enabled Kubernetes cluster, and use it as a compute target with managed identity enabled.
71+
The following CLI v2 commands show how to attach an AKS cluster or an Azure Arc-enabled Kubernetes cluster, and use it as a compute target with managed identity enabled.
7272

7373
**AKS cluster**
7474

@@ -114,6 +114,45 @@ Attaching a Kubernetes cluster makes it available to your workspace for training
114114
In the Kubernetes clusters tab, the initial state of your cluster is *Creating*. When the cluster is successfully attached, the state changes to *Succeeded*. Otherwise, the state changes to *Failed*.
115115

116116
:::image type="content" source="media/how-to-attach-arc-kubernetes/kubernetes-creating.png" alt-text="Screenshot of attached settings for configuration of Kubernetes cluster.":::
117+
118+
### [Azure SDK](#tab/sdk)
119+
120+
The following Python SDK v2 code shows how to attach an AKS cluster or an Azure Arc-enabled Kubernetes cluster, and use it as a compute target with managed identity enabled.
121+
122+
**AKS cluster**
123+
124+
```python
from azure.ai.ml import load_compute

# ml_client is assumed to be an authenticated azure.ai.ml.MLClient for your workspace.
# For an AKS cluster, the resource_id looks like:
# '/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerService/managedClusters/<CLUSTER_NAME>'
compute_params = [
    {"name": "<COMPUTE_NAME>"},
    {"type": "kubernetes"},
    {
        "resource_id": "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerService/managedClusters/<CLUSTER_NAME>"
    },
]
# Build the compute definition from the overrides, then attach it and wait for completion.
k8s_compute = load_compute(source=None, params_override=compute_params)
ml_client.begin_create_or_update(k8s_compute).result()
```
138+
139+
**Arc Kubernetes cluster**
140+
141+
```python
from azure.ai.ml import load_compute

# ml_client is assumed to be an authenticated azure.ai.ml.MLClient for your workspace.
# For an Arc-connected cluster, the resource_id looks like:
# '/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Kubernetes/connectedClusters/<CLUSTER_NAME>'
compute_params = [
    {"name": "<COMPUTE_NAME>"},
    {"type": "kubernetes"},
    {
        "resource_id": "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Kubernetes/connectedClusters/<CLUSTER_NAME>"
    },
]
# Build the compute definition from the overrides, then attach it and wait for completion.
k8s_compute = load_compute(source=None, params_override=compute_params)
ml_client.begin_create_or_update(k8s_compute).result()
```
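
The AKS and Arc snippets differ only in the resource ID they pass. As a minimal sketch, a small helper can build that ID and the override list in one place. The names `cluster_resource_id` and `build_compute_params` are hypothetical and not part of `azure.ai.ml`. The sketch assumes that Arc-enabled clusters are registered under the `Microsoft.Kubernetes/connectedClusters` resource type and AKS clusters under `Microsoft.ContainerService/managedClusters`.

```python
# Hypothetical helpers for illustration; they are not part of the azure.ai.ml SDK.

# Assumed ARM resource types for the two supported cluster kinds.
RESOURCE_TYPES = {
    "aks": "Microsoft.ContainerService/managedClusters",
    "arc": "Microsoft.Kubernetes/connectedClusters",
}


def cluster_resource_id(subscription_id, resource_group, cluster_name, kind):
    """Return the ARM resource ID for an AKS ("aks") or Arc ("arc") cluster."""
    if kind not in RESOURCE_TYPES:
        raise ValueError(f"kind must be one of {sorted(RESOURCE_TYPES)}, got {kind!r}")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/{RESOURCE_TYPES[kind]}/{cluster_name}"
    )


def build_compute_params(compute_name, resource_id):
    """Return the override list in the shape used by the snippets above."""
    return [
        {"name": compute_name},
        {"type": "kubernetes"},
        {"resource_id": resource_id},
    ]
```

The resulting list could then be passed as `params_override` to `load_compute`, exactly as in the snippets above.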
117156

118157
---
119158

articles/machine-learning/how-to-deploy-kubernetes-extension.md

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ We list four typical extension deployment scenarios for reference. To deploy ext
100100

101101
For Azure Machine Learning extension deployment on AKS cluster, make sure to specify `managedClusters` value for `--cluster-type` parameter. Run the following Azure CLI command to deploy Azure Machine Learning extension:
102102
```azurecli
103-
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceLoadBalancerHA=False --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
103+
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceRouterHA=False --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
104104
```
105105

106106
- **Use Arc Kubernetes cluster outside of Azure for a quick proof of concept, to run training jobs only**

articles/machine-learning/how-to-manage-quotas.md

Lines changed: 17 additions & 3 deletions
@@ -125,6 +125,20 @@ To determine the current usage for an endpoint, [view the metrics](how-to-monito
125125

126126
To request an exception from the Azure Machine Learning product team, use the steps in the [Request quota increases](#request-quota-increases).
127127

128+
### Azure Machine Learning Kubernetes online endpoints
129+
130+
Azure Machine Learning Kubernetes online endpoints have the limits described in the following table.
131+
132+
| **Resource** | **Limit** |
| --- | --- |
| Endpoint name | Same as [managed online endpoint](#azure-machine-learning-managed-online-endpoints) |
| Deployment name | Same as [managed online endpoint](#azure-machine-learning-managed-online-endpoints) |
| Number of endpoints per subscription | 50 |
| Number of deployments per subscription | 200 |
| Number of deployments per endpoint | 20 |
| Max request time-out at endpoint level | 300 seconds |
140+
141+
The sum of Kubernetes online endpoints and managed online endpoints under each subscription cannot exceed 50. Similarly, the sum of Kubernetes online deployments and managed online deployments under each subscription cannot exceed 200.
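
The combined limits can be illustrated with a short sketch. The function name `quota_violations` and its parameters are hypothetical, and the counts are assumed to come from your own inventory of endpoints and deployments. The numeric limits are the ones stated in this section.

```python
# Illustrative only: the limits are the values stated in this section.
MAX_ENDPOINTS_PER_SUBSCRIPTION = 50
MAX_DEPLOYMENTS_PER_SUBSCRIPTION = 200


def quota_violations(managed_endpoints, kubernetes_endpoints,
                     managed_deployments, kubernetes_deployments):
    """Return a list of messages for each combined subscription limit exceeded."""
    violations = []

    # Managed and Kubernetes online endpoints share one per-subscription limit.
    total_endpoints = managed_endpoints + kubernetes_endpoints
    if total_endpoints > MAX_ENDPOINTS_PER_SUBSCRIPTION:
        violations.append(
            f"{total_endpoints} endpoints exceed the limit of "
            f"{MAX_ENDPOINTS_PER_SUBSCRIPTION} per subscription"
        )

    # Managed and Kubernetes online deployments share one limit as well.
    total_deployments = managed_deployments + kubernetes_deployments
    if total_deployments > MAX_DEPLOYMENTS_PER_SUBSCRIPTION:
        violations.append(
            f"{total_deployments} deployments exceed the limit of "
            f"{MAX_DEPLOYMENTS_PER_SUBSCRIPTION} per subscription"
        )

    return violations
```

For example, 30 managed endpoints plus 21 Kubernetes endpoints would be reported as a violation, even though neither count exceeds 50 on its own.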
128142

129143
### Azure Machine Learning pipelines
130144
[Azure Machine Learning pipelines](concept-ml-pipelines.md) have the following limits.
@@ -219,11 +233,11 @@ When you're requesting a quota increase, select the service that you have in min
219233

220234
1. Scroll to **Machine Learning Service: Virtual Machine Quota**.
221235

222-
:::image type="content" source="./media/how-to-manage-quotas/virtual-machine-quota.png" lightbox="./media/how-to-manage-quotas/virtual-machine-quota.png" alt-text="Screenshot of the VM quota details form.":::
236+
:::image type="content" source="./media/how-to-manage-quotas/virtual-machine-quota.png" lightbox="./media/how-to-manage-quotas/virtual-machine-quota.png" alt-text="Screenshot of the VM quota details.":::
223237

224-
2. Under **Additonal Details** specify the request details with the number of additional vCPUs required to run your Machine Learning Endpoint.
238+
2. Under **Additional Details** specify the request details with the number of additional vCPUs required to run your Machine Learning Endpoint.
225239

226-
:::image type="content" source="./media/how-to-manage-quotas/vm-quota-request-additional-info.png" lightbox="./media/how-to-manage-quotas/vm-quota-request-additional-info.png" alt-text="Screenshot of the VM quota additional details form.":::
240+
:::image type="content" source="./media/how-to-manage-quotas/vm-quota-request-additional-info.png" lightbox="./media/how-to-manage-quotas/vm-quota-request-additional-info.png" alt-text="Screenshot of the VM quota additional details.":::
227241

228242
> [!NOTE]
229243
> [Free trial subscriptions](https://azure.microsoft.com/offers/ms-azr-0044p) are not eligible for limit or quota increases. If you have a free trial subscription, you can upgrade to a [pay-as-you-go](https://azure.microsoft.com/offers/ms-azr-0003p/) subscription. For more information, see [Upgrade Azure free trial to pay-as-you-go](../cost-management-billing/manage/upgrade-azure-subscription.md) and [Azure free account FAQ](https://azure.microsoft.com/free/free-account-faq).

articles/machine-learning/how-to-troubleshoot-online-endpoints.md

Lines changed: 17 additions & 10 deletions
@@ -208,7 +208,7 @@ This is a list of common deployment errors that are reported as part of the depl
208208
* [ResourceNotFound](#error-resourcenotfound)
209209
* [OperationCanceled](#error-operationcanceled)
210210

211-
If you are creating or updating a Kubernetes online deployment, you can see [Common errors specific to Kubernetes deployments](#).
211+
If you are creating or updating a Kubernetes online deployment, you can see [Common errors specific to Kubernetes deployments](#common-errors-specific-to-kubernetes-deployments).
212212

213213

214214
### ERROR: ImageBuildFailure
@@ -273,6 +273,21 @@ When you are creating a managed online endpoint, role assignment is required for
273273

274274
Try to delete some unused endpoints in this subscription. If all of your endpoints are actively in use, you can try [requesting an endpoint quota increase](how-to-manage-quotas.md#endpoint-quota-increases).
275275

276+
Kubernetes online endpoints are also subject to an endpoint quota boundary at the cluster level. For details, see the [Kubernetes online endpoint quota](how-to-manage-quotas.md#azure-machine-learning-kubernetes-online-endpoints) section.
277+
278+
#### Kubernetes quota
279+
280+
This issue happens when the requested CPU or memory can't be satisfied because all nodes are unschedulable for this deployment, for example when nodes are cordoned or unavailable.
281+
282+
The error message typically indicates which resource is insufficient in the cluster. For example, `OutOfQuota: Kubernetes unschedulable. Details:0/1 nodes are available: 1 Too many pods...` means that there are too many pods in the cluster and not enough resources to deploy the new model based on your request.
283+
284+
You can try the following mitigations to address this issue:
285+
* IT ops who maintain the Kubernetes cluster can add more nodes or remove unused pods to free up resources.
286+
* Machine learning engineers who deploy models can reduce the resource request of the deployment:
287+
    * If you define the resource request directly in the deployment configuration through the resource section, reduce the requested CPU or memory.
288+
    * If you use `instance type` to define resources for model deployment, contact IT ops to adjust the instance type resource configuration. For more detail, see [How to manage Kubernetes instance type](how-to-manage-kubernetes-instance-types.md).
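
When triaging these failures, it can help to pull the scheduler details out of the error text. The following is a rough sketch, assuming the message keeps the `N/M nodes are available: <reason>` shape of the example above. The actual format may vary, and `parse_unschedulable` is a hypothetical helper, not part of any Azure SDK.

```python
import re

# Matches the "<available>/<total> nodes are available: <reason>" fragment
# seen in the example message. The pattern is an assumption based on that one example.
_NODES_PATTERN = re.compile(r"(\d+)/(\d+) nodes are available: (.+)")


def parse_unschedulable(message):
    """Return node counts and the scheduler reason, or None if the text doesn't match."""
    match = _NODES_PATTERN.search(message)
    if match is None:
        return None
    return {
        "available_nodes": int(match.group(1)),
        "total_nodes": int(match.group(2)),
        "reason": match.group(3).strip(),
    }


example = (
    "OutOfQuota: Kubernetes unschedulable. "
    "Details:0/1 nodes are available: 1 Too many pods..."
)
details = parse_unschedulable(example)
```

A result with zero available nodes means no node could accept the pod, so the remedy is one of the mitigations listed above.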
289+
290+
276291
#### Region-wide VM capacity
277292

278293
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to provision the specified VM size. Retry later or try deploying to a different region.
@@ -309,15 +324,7 @@ Use the **Endpoints** in the studio:
309324
1. Select the **Deployment logs** tab in the endpoint's details page.
310325
1. Use the dropdown to select the deployment whose log you want to see.
311326

312-
#### Kubernetes quota
313-
314-
This issue happens when the requested CPU or memory couldn't be satisfied due to all nodes are unschedulable for this deployment, such as nodes are cordoned or nodes are unavailable.
315-
316-
The error message will typically indicate which resource you need more of. For instance, if you see an error message detailing `0/3 nodes are available: 3 Insufficient nvidia.com/gpu`, that means that the service requires GPUs and there are three nodes in the cluster that don't have sufficient GPUs. This can be addressed by adding more nodes if you're using a GPU SKU, switching to a GPU-enabled SKU if you aren't, or changing your environment to not require GPUs.
317-
318-
You can also try adjusting your request in the cluster, you can directly [adjust the resource request of the instance type](how-to-manage-kubernetes-instance-types.md).
319-
320-
---
327+
----
321328

322329
### ERROR: BadArgument
323330

articles/machine-learning/v1/how-to-enable-data-collection.md

Lines changed: 18 additions & 0 deletions
@@ -28,6 +28,24 @@ Once collection is enabled, the data you collect helps you:
2828

2929
* Retrain your model with the collected data.
3030

31+
## Limitations
32+
33+
* The model data collection feature works only with the Ubuntu 18.04 image.
34+
35+
>[!IMPORTANT]
36+
>
37+
> As of 03/10/2023, the Ubuntu 18.04 image is now deprecated. **Support for Ubuntu 18.04 images will be dropped starting January 2023 when it reaches EOL on April 30, 2023.**
38+
>
39+
> The model data collection (MDC) feature is incompatible with any image other than Ubuntu 18.04, so it is no longer available after the Ubuntu 18.04 image is deprecated.
40+
>
41+
> For more information, see:
42+
> * [openmpi3.1.2-ubuntu18.04 release-notes](https://github.com/Azure/AzureML-Containers/blob/master/base/cpu/openmpi3.1.2-ubuntu18.04/release-notes.md)
43+
> * [data science virtual machine release notes](../data-science-virtual-machine/release-notes.md#september-20-2022)
44+
45+
>[!NOTE]
46+
>
47+
> The data collection feature is currently in preview. Preview features are not recommended for production workloads.
48+
3149
## What is collected and where it goes
3250

3351
The following data can be collected:
