Commit 8f311e8

Merge pull request #222715 from jiaochenlu/update-230102
Update k8s compute TSG and log info
2 parents 2ed27d6 + a2d4608 commit 8f311e8

10 files changed, with 390 additions and 33 deletions.

articles/machine-learning/how-to-deploy-kubernetes-extension.md

Lines changed: 5 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -43,7 +43,9 @@ In this article, you can learn:
4343
- [Disabling local accounts](../aks/managed-aad.md#disable-local-accounts) for AKS is **not supported** by Azure Machine Learning. When the AKS Cluster is deployed, local accounts are enabled by default.
4444
- If your AKS cluster has an [Authorized IP range enabled to access the API server](../aks/api-server-authorized-ip-ranges.md), enable the AzureML control plane IP ranges for the AKS cluster. The AzureML control plane is deployed across paired regions. Without access to the API server, the machine learning pods can't be deployed. Use the [IP ranges](https://www.microsoft.com/download/confirmation.aspx?id=56519) for both the [paired regions](../availability-zones/cross-region-replication-azure.md) when enabling the IP ranges in an AKS cluster.
4545
- Azure Machine Learning does not guarantee support for all preview stage features in AKS. For example, [Azure AD pod identity](../aks/use-azure-ad-pod-identity.md) is not supported.
46-
- If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
46+
- If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
47+
- Attaching an AKS cluster across subscriptions isn't currently supported; your AKS cluster must be in the same subscription as your workspace.
48+
- To work around a cross-subscription requirement, first connect the AKS cluster to Azure Arc, and then attach the resulting Arc-enabled Kubernetes resource.
4749

4850

4951
## Review AzureML extension configuration settings
@@ -60,7 +62,7 @@ You can use AzureML CLI command `k8s-extension create` to deploy AzureML extensi
6062
|`sslSecret`| The name of the Kubernetes secret in the `azureml` namespace. This config is used to store `cert.pem` (PEM-encoded TLS/SSL cert) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when ``allowInsecureConnections`` is set to `False`. For a sample YAML definition of `sslSecret`, see [Configure sslSecret](./how-to-secure-kubernetes-online-endpoint.md#configure-sslsecret). Use this config or a combination of `sslCertPemFile` and `sslKeyPemFile` protected config settings. |N/A| Optional | Optional |
6163
|`sslCname` |A TLS/SSL CNAME used by the inference HTTPS endpoint. **Required** if `allowInsecureConnections=False` | N/A | Optional | Optional|
6264
| `inferenceRouterHA` |`True` or `False`, default `True`. By default, AzureML extension will deploy three inference router replicas for high availability, which requires at least three worker nodes in a cluster. Set to `False` if your cluster has fewer than three worker nodes, in this case only one inference router service is deployed. | N/A| Optional | Optional |
63-
|`nodeSelector` | By default, the deployed kubernetes resources are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to ALL nodes. If you want to restrict the extension deployment to specific nodes with label `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1`, `nodeSelector.key2=value2` correspondingly. | Optional| Optional | Optional |
65+
|`nodeSelector` | By default, the deployed Kubernetes resources and your machine learning workloads are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to ALL nodes. To restrict the extension deployment and your training/inference workloads to specific nodes with labels `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1` and `nodeSelector.key2=value2` respectively. | Optional| Optional | Optional |
6466
|`installNvidiaDevicePlugin` | `True` or `False`, default `False`. [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin#nvidia-device-plugin-for-kubernetes) is required for ML workloads on NVIDIA GPU hardware. By default, AzureML extension deployment won't install the NVIDIA Device Plugin, regardless of whether the Kubernetes cluster has GPU hardware. Set this value to `True` to install it, but make sure to fulfill the [Prerequisites](https://github.com/NVIDIA/k8s-device-plugin#prerequisites). | Optional |Optional |Optional |
6567
|`installPromOp`|`True` or `False`, default `True`. AzureML extension needs prometheus operator to manage prometheus. Set to `False` to reuse the existing prometheus operator. For more information about reusing the existing prometheus operator, refer to [reusing the prometheus operator](./how-to-troubleshoot-kubernetes-extension.md#prometheus-operator)| Optional| Optional | Optional |
6668
|`installVolcano`| `True` or `False`, default `True`. AzureML extension needs volcano scheduler to schedule the job. Set to `False` to reuse existing volcano scheduler. For more information about reusing the existing volcano scheduler, refer to [reusing volcano scheduler](./how-to-troubleshoot-kubernetes-extension.md#volcano-scheduler) | Optional| N/A | Optional |
@@ -83,7 +85,7 @@ If you plan to deploy AzureML extension for real-time inference workload and wan
8385
* Type `LoadBalancer`. Exposes `azureml-fe` externally using a cloud provider's load balancer. To specify this value, ensure that your cluster supports load balancer provisioning. Note most on-premises Kubernetes clusters might not support external load balancer.
8486
* Type `NodePort`. Exposes `azureml-fe` on each Node's IP at a static port. You'll be able to contact `azureml-fe`, from outside of cluster, by requesting `<NodeIP>:<NodePort>`. Using `NodePort` also allows you to set up your own load balancing solution and TLS/SSL termination for `azureml-fe`.
8587
* Type `ClusterIP`. Exposes `azureml-fe` on a cluster-internal IP, and it makes `azureml-fe` only reachable from within the cluster. For `azureml-fe` to serve inference requests coming outside of cluster, it requires you to set up your own load balancing solution and TLS/SSL termination for `azureml-fe`.
86-
* To ensure high availability of `azureml-fe` routing service, AzureML extension deployment by default creates three replicas of `azureml-fe` for clusters having three nodes or more. If your cluster has **less than 3 nodes**, set `inferenceLoadbalancerHA=False`.
88+
* To ensure high availability of the `azureml-fe` routing service, AzureML extension deployment by default creates three replicas of `azureml-fe` for clusters that have three or more nodes. If your cluster has **fewer than three nodes**, set `inferenceRouterHA=False`.
8789
* You also want to consider using **HTTPS** to restrict access to model endpoints and secure the data that clients submit. For this purpose, you would need to specify either `sslSecret` config setting or combination of `sslKeyPemFile` and `sslCertPemFile` config-protected settings.
8890
* By default, AzureML extension deployment expects config settings for **HTTPS** support. For development or testing purposes, **HTTP** support is conveniently provided through config setting `allowInsecureConnections=True`.
8991
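The inference settings described above interact with each other, so checking them before running the extension command can be useful. The following is a minimal sketch, not part of the AzureML tooling: the function name `check_inference_config` is hypothetical, and merging `--config` and `--config-protected` settings into one dict is a simplifying assumption. The defaults follow the table and bullets above (`inferenceRouterHA` defaults to `True`, and HTTPS is expected unless `allowInsecureConnections=True`).

```python
def check_inference_config(config, node_count):
    """Return a list of problems found in a dict of extension config settings.

    `config` is a hypothetical merged view of --config and --config-protected
    key/value pairs; `node_count` is the number of worker nodes in the cluster.
    """
    problems = []

    def is_true(key, default):
        # CLI settings arrive as strings, so compare case-insensitively.
        return str(config.get(key, default)).lower() == "true"

    # Three router replicas need at least three worker nodes.
    if is_true("inferenceRouterHA", True) and node_count < 3:
        problems.append("set inferenceRouterHA=False: fewer than three nodes")

    # HTTPS is expected unless insecure connections are explicitly allowed.
    if not is_true("allowInsecureConnections", False):
        has_secret = "sslSecret" in config
        has_pem = "sslCertPemFile" in config and "sslKeyPemFile" in config
        if not (has_secret or has_pem):
            problems.append("HTTPS needs sslSecret or sslCertPemFile + sslKeyPemFile")
        if "sslCname" not in config:
            problems.append("sslCname is required when allowInsecureConnections=False")

    return problems


if __name__ == "__main__":
    # A two-node test cluster with no TLS settings reports three problems.
    for problem in check_inference_config({}, node_count=2):
        print(problem)
```

An empty result means the settings are consistent with the rules quoted above; it doesn't guarantee that the deployment itself succeeds.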

articles/machine-learning/how-to-manage-kubernetes-instance-types.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,14 @@ and [resources](https://kubernetes.io/docs/concepts/configuration/manage-resourc
2424

2525
In short, a `nodeSelector` lets you specify which node a pod should run on. The node must have a corresponding label. In the `resources` section, you can set the compute resources (CPU, memory and NVIDIA GPU) for the pod.
2626

27+
>[!IMPORTANT]
28+
>
29+
> If you have [specified a nodeSelector when deploying the AzureML extension](./how-to-deploy-kubernetes-extension.md#review-azureml-extension-configuration-settings), the nodeSelector will be applied to all instance types. This means that:
30+
> - For each instance type you create, the specified nodeSelector should be a subset of the extension-specified nodeSelector.
31+
> - If you use an instance type **with nodeSelector**, the workload will run on any node matching both the extension-specified nodeSelector and the instance type-specified nodeSelector.
32+
> - If you use an instance type **without a nodeSelector**, the workload will run on any node matching the extension-specified nodeSelector.
33+
34+
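The node-matching rules in the note above can be illustrated with a small model. This is a sketch under assumptions, not how the AzureML extension or the Kubernetes scheduler is implemented: the helper names `matches` and `eligible_nodes` and the sample labels are hypothetical. It relies only on the standard Kubernetes rule that a node satisfies a `nodeSelector` when it carries every listed label.

```python
def matches(node_labels, selector):
    """A node satisfies a nodeSelector when it has every key=value pair in it."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes, extension_selector, instance_type_selector=None):
    """Return names of nodes that satisfy both selectors.

    An instance type without a nodeSelector is treated as an empty selector,
    so only the extension-specified selector applies.
    """
    instance_type_selector = instance_type_selector or {}
    return [
        name
        for name, labels in nodes.items()
        if matches(labels, extension_selector)
        and matches(labels, instance_type_selector)
    ]


if __name__ == "__main__":
    # Hypothetical cluster: node name -> labels.
    nodes = {
        "node-a": {"pool": "ml", "gpu": "true"},
        "node-b": {"pool": "ml"},
        "node-c": {"pool": "web", "gpu": "true"},
    }
    print(eligible_nodes(nodes, {"pool": "ml"}))
    print(eligible_nodes(nodes, {"pool": "ml"}, {"gpu": "true"}))
```

In this model, `node-c` is never selected because it doesn't match the extension-specified selector, whatever the instance type asks for.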
2735
## Default instance type
2836

2937
By default, a `defaultinstancetype` with the following definition is created when you attach a Kubernetes cluster to an AzureML workspace:

articles/machine-learning/how-to-secure-kubernetes-online-endpoint.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -169,7 +169,7 @@ TLS/SSL certificates expire and must be renewed. Typically, this happens every y
169169
If you directly configured the PEM files in the extension deployment command before, you need to run the extension update command and specify the new PEM file's path:
170170

171171
```azurecli
172-
az k8s-extension update --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
172+
az k8s-extension update --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config sslCname=<ssl cname> --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
173173
```
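If certificate renewal is scripted, the update command above can be assembled as an argument list rather than a hand-edited string. The sketch below is an illustration only: the function `build_cert_update_command` and all sample values are hypothetical, and the flags simply mirror the command shown in this change rather than being checked against the CLI reference.

```python
import shlex


def build_cert_update_command(extension_name, cluster_name, resource_group,
                              ssl_cname, cert_pem_path, key_pem_path):
    """Build the `az k8s-extension update` argument list for renewing PEM files."""
    return [
        "az", "k8s-extension", "update",
        "--name", extension_name,
        "--extension-type", "Microsoft.AzureML.Kubernetes",
        # sslCname is passed as a regular config setting.
        "--config", f"sslCname={ssl_cname}",
        # The PEM file paths are protected config settings.
        "--config-protected",
        f"sslCertPemFile={cert_pem_path}",
        f"sslKeyPemFile={key_pem_path}",
        "--cluster-type", "managedClusters",
        "--cluster-name", cluster_name,
        "--resource-group", resource_group,
        "--scope", "cluster",
    ]


if __name__ == "__main__":
    command = build_cert_update_command(
        "my-extension", "my-aks", "my-rg",
        "ml.example.com", "./cert.pem", "./key.pem",
    )
    # Print a shell-safe rendering for review; pass the list to
    # subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids quoting problems when a path or name contains spaces.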
174174

175175
## Disable TLS
@@ -181,7 +181,7 @@ To disable TLS for a model deployed to Kubernetes:
181181
1. Run the following Azure CLI command in your Kubernetes cluster, and then perform an update. This command assumes that you're using AKS.
182182

183183
```azurecli
184-
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableInference=True inferenceRouterServiceType=LoadBalancer allowInsercureconnection=True --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
184+
az k8s-extension update --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
185185
```
186186

187187
> [!WARNING]

articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md

Lines changed: 49 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,8 @@ Below is a list of error types in **compute scope** that you might encounter whe
4141
* [ERROR: GenericComputeError](#error-genericcomputeerror)
4242
* [ERROR: ComputeNotFound](#error-computenotfound)
4343
* [ERROR: ComputeNotAccessible](#error-computenotaccessible)
44+
* [ERROR: InvalidComputeInformation](#error-invalidcomputeinformation)
45+
* [ERROR: InvalidComputeNoKubernetesConfiguration](#error-invalidcomputenokubernetesconfiguration)
4446

4547

4648
#### ERROR: GenericComputeError
@@ -71,7 +73,7 @@ Cannot find Kubernetes compute.
7173

7274
This error occurs when:
7375
* The system can't find the compute when you create or update an online endpoint or deployment.
74-
* The compute of existing online endpoints/deployments have been removed.
76+
* The compute of existing online endpoints or deployments has been removed.
7577

7678
You can check the following items to troubleshoot the issue:
7779
* Try to recreate the endpoint and deployment.
@@ -87,12 +89,40 @@ The Kubernetes compute is not accessible.
8789

8890
This error occurs when the workspace MSI (managed identity) doesn't have access to the AKS cluster. Check whether the workspace MSI has access to the AKS cluster; if it doesn't, follow this [document](how-to-identity-based-service-authentication.md) to manage access and identity.
8991

92+
#### ERROR: InvalidComputeInformation
93+
94+
The error message is as follows:
95+
96+
```bash
97+
The compute information is invalid.
98+
```
99+
The compute target is validated when you deploy models to your Kubernetes cluster. This error occurs when that validation finds the compute information invalid, for example when the compute target isn't found or when the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.
100+
101+
You can check the following items to troubleshoot the issue:
102+
* Check whether the compute target you used is correct and exists in your workspace.
103+
* Try to detach and reattach the compute to the workspace. See the additional notes on [reattaching](#error-genericcomputeerror).
104+
105+
#### ERROR: InvalidComputeNoKubernetesConfiguration
106+
107+
The error message is as follows:
108+
109+
```bash
110+
The compute kubeconfig is invalid.
111+
```
112+
113+
This error occurs when the system can't find any configuration for connecting to the cluster, for example:
114+
* For an Arc-enabled Kubernetes cluster, no Azure Relay configuration can be found.
115+
* For an AKS cluster, no AKS configuration can be found.
116+
117+
To rebuild the compute connection configuration for your cluster, try to detach and reattach the compute to the workspace. See the additional notes on [reattaching](#error-genericcomputeerror).
118+
90119
### Kubernetes cluster error
91120

92121
Below is a list of error types in **cluster scope** that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can troubleshoot by following the guidelines:
93122

94123
* [ERROR: GenericClusterError](#error-genericclustererror)
95124
* [ERROR: ClusterNotReachable](#error-clusternotreachable)
125+
* [ERROR: ClusterNotFound](#error-clusternotfound)
96126

97127
#### ERROR: GenericClusterError
98128

@@ -112,7 +142,7 @@ For AKS clusters:
112142

113143

114144
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
115-
1. Check if the Kubernetes API server is accessible by running `kubectl` command in cluster.
145+
* Check if the Kubernetes API server is accessible by running `kubectl` command in cluster.
116146

117147
#### ERROR: ClusterNotReachable
118148

@@ -132,6 +162,23 @@ For AKS clusters:
132162
For an AKS cluster or an Azure Arc enabled Kubernetes cluster:
133163
* Check if the Kubernetes API server is accessible by running `kubectl` command in cluster.
134164

165+
#### ERROR: ClusterNotFound
166+
167+
The error message is as follows:
168+
169+
```bash
170+
Cannot found Kubernetes cluster.
171+
```
172+
173+
This error occurs when the system can't find the AKS or Arc-enabled Kubernetes cluster.
174+
175+
You can check the following items to troubleshoot the issue:
176+
* First, check the cluster resource ID in the Azure portal to verify that the Kubernetes cluster resource still exists and is running normally.
177+
* If the cluster exists and is running, try to detach and reattach the compute to the workspace. See the additional notes on [reattaching](#error-genericcomputeerror).
178+
179+
> [!TIP]
180+
> For more troubleshooting guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
181+
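For scripted triage, the compute-scope and cluster-scope guidance above can be kept as a simple lookup table. The sketch below is an illustration only: the names `FIRST_CHECKS` and `first_checks` are hypothetical, the steps are paraphrased from the sections above, and accepting an optional `ERROR:` prefix is an assumption about how the error type is passed in.

```python
# Error types mapped to the first checks suggested in the sections above.
FIRST_CHECKS = {
    "ComputeNotAccessible": [
        "Check whether the workspace managed identity has access to the AKS cluster.",
    ],
    "InvalidComputeInformation": [
        "Confirm the compute target is correct and exists in the workspace.",
        "Detach and reattach the compute to the workspace.",
    ],
    "InvalidComputeNoKubernetesConfiguration": [
        "Detach and reattach the compute to the workspace.",
    ],
    "ClusterNotFound": [
        "Check in the Azure portal that the cluster resource still exists and is running.",
        "If it does, detach and reattach the compute to the workspace.",
    ],
}


def first_checks(error_type):
    """Return suggested checks for an error type such as 'ERROR: ClusterNotFound'."""
    key = error_type.replace("ERROR:", "").strip()
    # Unknown error types return an empty list instead of raising.
    return FIRST_CHECKS.get(key, [])


if __name__ == "__main__":
    for step in first_checks("ERROR: ClusterNotFound"):
        print("-", step)
```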
135182

136183
## Training guide
137184

articles/machine-learning/how-to-troubleshoot-kubernetes-extension.md

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -225,6 +225,47 @@ volcano-scheduler.conf: |
225225
- name: nodeorder
226226
- name: binpack
227227
```
228-
You need to use the same config settings as above, and disable `job/validate` webhook in the volcano admission, so that AzureML training workloads can perform properly.
228+
You need to use the same config settings as above. If your **volcano version is lower than 1.6**, you also need to disable the `job/validate` webhook in the volcano admission so that AzureML training workloads can perform properly.
229+
230+
#### Volcano scheduler integration supporting cluster autoscaler
231+
As discussed in this [thread](https://github.com/volcano-sh/volcano/issues/2558), the **gang plugin** doesn't work well with the cluster autoscaler (CA) or with the node autoscaler in AKS.
232+
233+
If you use the volcano scheduler that comes with the AzureML extension by setting `installVolcano=true`, the extension applies a default scheduler config that enables the **gang** plugin to prevent job deadlock. As a result, the cluster autoscaler (CA) in an AKS cluster isn't supported with the volcano scheduler installed by the extension.
234+
235+
If you want the AKS cluster autoscaler to work normally in this case, update the extension to set the `volcanoScheduler.schedulerConfigMap` parameter and point it to a custom volcano scheduler config that omits the **gang** plugin, for example:
236+
237+
```yaml
238+
volcano-scheduler.conf: |
239+
    actions: "enqueue, allocate, backfill"
240+
    tiers:
241+
    - plugins:
242+
      - name: sla
243+
        arguments:
244+
          sla-waiting-time: 1m
245+
    - plugins:
246+
      - name: conformance
247+
    - plugins:
248+
      - name: overcommit
249+
      - name: drf
250+
      - name: predicates
251+
      - name: proportion
252+
      - name: nodeorder
253+
      - name: binpack
254+
```
255+
256+
To use this config in your AKS cluster, follow these steps:
257+
1. Create a ConfigMap with the above config in the `azureml` namespace. This namespace is generally created when you install the AzureML extension.
258+
1. Set `volcanoScheduler.schedulerConfigMap=<configmap name>` in the extension config to apply this ConfigMap. You also need to skip resource validation when installing the extension by setting `amloperator.skipResourceValidation=true`. For example:
259+
```azurecli
260+
az k8s-extension update --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config volcanoScheduler.schedulerConfigMap=<configmap name> amloperator.skipResourceValidation=true --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
261+
```
262+
263+
> [!NOTE]
264+
> Because the gang plugin is removed, a deadlock can occur when volcano schedules jobs.
265+
>
266+
> * To avoid this situation, you can **use the same instance type across the jobs**.
267+
>
268+
> Note that you need to disable `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**.
269+
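Before applying a custom scheduler config, it can help to confirm that the **gang** plugin is really absent and to generate the ConfigMap manifest from one source of truth. The following is a minimal sketch under assumptions: the helper names `plugin_names` and `render_configmap` and the ConfigMap name `my-volcano-config` are hypothetical, and the plugin check is a simple line-based scan rather than a full YAML parse.

```python
import re

# The no-gang scheduler config shown above, as plain text.
NO_GANG_CONF = """\
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: sla
    arguments:
      sla-waiting-time: 1m
- plugins:
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
"""


def plugin_names(conf_text):
    """Collect plugin names from lines of the form '- name: <plugin>'."""
    return re.findall(r"^[ \t]*-[ \t]*name:[ \t]*(\S+)[ \t]*$", conf_text, flags=re.MULTILINE)


def render_configmap(name, conf_text, namespace="azureml"):
    """Render a ConfigMap manifest that stores the config as a block scalar."""
    indented = "".join("    " + line + "\n" for line in conf_text.splitlines())
    return (
        "apiVersion: v1\n"
        "kind: ConfigMap\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "data:\n"
        "  volcano-scheduler.conf: |\n"
        f"{indented}"
    )


if __name__ == "__main__":
    # Refuse to render a manifest that would re-enable the gang plugin.
    if "gang" in plugin_names(NO_GANG_CONF):
        raise SystemExit("config still enables the gang plugin")
    print(render_configmap("my-volcano-config", NO_GANG_CONF))
```

The printed manifest can be saved to a file and applied with `kubectl apply -f <file>`; the name given to the ConfigMap is the value to use for `volcanoScheduler.schedulerConfigMap`.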
229270
230271
