articles/machine-learning/how-to-troubleshoot-kubernetes-extension.md
13 additions & 13 deletions
@@ -7,14 +7,14 @@ ms.author: larryfr
7
7
ms.reviewer: jinzhong
8
8
ms.service: azure-machine-learning
9
9
ms.subservice: core
10
-
ms.date: 03/10/2024
10
+
ms.date: 03/05/2025
11
11
ms.topic: how-to
12
12
ms.custom: build-spring-2022, cliv2, sdkv2
13
13
---
14
14
15
15
# Troubleshoot Azure Machine Learning extension
16
16
17
-
In this article, you learn how to troubleshoot common problems you may encounter with [Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md) deployment in your AKS or Arc-enabled Kubernetes.
17
+
In this article, you learn how to troubleshoot common problems you might encounter with [Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md) deployment in your Azure Kubernetes Service (AKS) or Arc-enabled Kubernetes.
18
18
19
19
## How is Azure Machine Learning extension installed
20
20
Azure Machine Learning extension is released as a Helm chart and installed by Helm V3. All components of Azure Machine Learning extension are installed in the `azureml` namespace. You can use the following commands to check the extension status.
@@ -62,13 +62,13 @@ Use the following steps to mitigate the issue.
62
62
* When the resource is also used by other components in your cluster and can't be modified, refer to [deploy Azure Machine Learning extension](./how-to-deploy-kubernetes-extension.md#review-azure-machine-learning-extension-configuration-settings) to see if there's a configuration setting to disable the conflicting resource.
63
63
64
64
## HealthCheck of extension
65
-
When the installation failed and didn't hit any of the above error messages, you can use the built-in health check job to make a comprehensive check on the extension. Azure machine learning extension contains a `HealthCheck` job to precheck your cluster readiness when you try to install, update or delete the extension. The HealthCheck job outputs a report, which is saved in a configmap named `arcml-healthcheck` in `azureml` namespace. The error codes and possible solutions for the report are listed in [Error Code of HealthCheck](#error-code-of-healthcheck).
65
+
When the installation fails without hitting any of the previous error messages, you can use the built-in health check job to run a comprehensive check on the extension. Azure Machine Learning extension contains a `HealthCheck` job to precheck your cluster readiness when you try to install, update, or delete the extension. The HealthCheck job outputs a report, which is saved in a configmap named `arcml-healthcheck` in the `azureml` namespace. The error codes and possible solutions for the report are listed in [Error Code of HealthCheck](#error-code-of-healthcheck).
71
-
The health check is triggered whenever you install, update or delete the extension. The health check report is structured with several parts `pre-install`, `pre-rollback`, `pre-upgrade` and `pre-delete`.
71
+
The health check is triggered whenever you install, update, or delete the extension. The health check report is structured in several parts: `pre-install`, `pre-rollback`, `pre-upgrade`, and `pre-delete`.
72
72
73
73
- If the extension installation failed, look into `pre-install` and `pre-delete`.
74
74
- If the extension update failed, look into `pre-upgrade` and `pre-rollback`.
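The report lookup described above can be sketched in shell. This is a minimal sketch, not part of the article: the function names and the operation labels `install` and `update` are hypothetical, and `show_healthcheck_report` assumes `kubectl` is already configured for the target cluster. Only the configmap name, the namespace, and the section names come from the text.

```shell
#!/bin/sh
# Print the HealthCheck report stored in the arcml-healthcheck configmap.
# Assumption: kubectl is installed and points at the cluster that hosts the extension.
show_healthcheck_report() {
  kubectl get configmap arcml-healthcheck --namespace azureml --output yaml
}

# Map a failed operation to the report sections worth reading first.
# "install" and "update" are labels chosen for this sketch only.
healthcheck_sections_for() {
  case "$1" in
    install) echo "pre-install pre-delete" ;;
    update)  echo "pre-upgrade pre-rollback" ;;
    *)       echo "unknown operation: $1" >&2; return 1 ;;
  esac
}
```

For example, `healthcheck_sections_for update` prints the two sections to inspect after a failed update, and `show_healthcheck_report` dumps the full report to search for them.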
@@ -91,7 +91,7 @@ This table shows how to troubleshoot the error codes returned by the HealthCheck
91
91
|E45002 | PROMETHEUS_CONFLICT | The installed Prometheus Operator conflicts with your existing Prometheus Operator. For more information, see [Prometheus operator](#prometheus-operator). |
92
92
|E45003 | BAD_NETWORK_CONNECTIVITY | You need to meet [network-requirements](./how-to-access-azureml-behind-firewall.md#scenario-use-kubernetes-compute).|
93
93
|E45004 | AZUREML_FE_ROLE_CONFLICT |Azure Machine Learning extension isn't supported in the [legacy AKS](./how-to-attach-kubernetes-anywhere.md#comparison-of-kubernetescompute-and-legacy-akscompute-targets). To install Azure Machine Learning extension, you need to [delete the legacy azureml-fe components](v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources).|
94
-
|E45005 | AZUREML_FE_DEPLOYMENT_CONFLICT | Azure Machine Learning extension isn't supported in the [legacy AKS](./how-to-attach-kubernetes-anywhere.md#comparison-of-kubernetescompute-and-legacy-akscompute-targets). To install Azure Machine Learning extension, you need to run the command below this form to delete the legacy azureml-fe components, more detail you can referto [here](v1/how-to-create-attach-kubernetes.md#update-the-cluster).|
94
+
|E45005 | AZUREML_FE_DEPLOYMENT_CONFLICT | Azure Machine Learning extension isn't supported in the [legacy AKS](./how-to-attach-kubernetes-anywhere.md#comparison-of-kubernetescompute-and-legacy-akscompute-targets). To install Azure Machine Learning extension, you need to run the commands below this table to delete the legacy azureml-fe components. For more details, see [here](v1/how-to-create-attach-kubernetes.md#update-the-cluster).|
95
95
96
96
Commands to delete the legacy azureml-fe components in the AKS cluster:
97
97
```shell
@@ -152,7 +152,7 @@ In this case, the existing prometheus operator manages all Prometheus instances.
152
152
```
153
153
154
154
### DCGM exporter
155
-
[Dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) is the official tool recommended by NVIDIA forcollecting GPU metrics. We've integrated it into Azure Machine Learning extension. But, by default, dcgm-exporter isn't enabled, and no GPU metrics are collected. You can specify ```installDcgmExporter``` flag to ```true``` to enable it. As it's NVIDIA's official tool, you may already have it installedin your GPU cluster. If so, you can set```installDcgmExporter``` to ```false``` and follow the steps to integrate your dcgm-exporter into Azure Machine Learning extension. Another thing to note is that dcgm-exporter allows user to config which metrics to expose. For Azure Machine Learning extension, make sure ```DCGM_FI_DEV_GPU_UTIL```, ```DCGM_FI_DEV_FB_FREE``` and ```DCGM_FI_DEV_FB_USED``` metrics are exposed.
155
+
[Dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) is the official tool recommended by NVIDIA for collecting GPU metrics. It's integrated into Azure Machine Learning extension. But, by default, dcgm-exporter isn't enabled, and no GPU metrics are collected. You can set the ```installDcgmExporter``` flag to ```true``` to enable it. As it's NVIDIA's official tool, you might already have it installed in your GPU cluster. If so, you can set ```installDcgmExporter``` to ```false``` and follow the steps to integrate your dcgm-exporter into Azure Machine Learning extension. Another thing to note is that dcgm-exporter lets users configure which metrics to expose. For Azure Machine Learning extension, make sure the ```DCGM_FI_DEV_GPU_UTIL```, ```DCGM_FI_DEV_FB_FREE```, and ```DCGM_FI_DEV_FB_USED``` metrics are exposed.
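One way to confirm that the three required metrics are exposed is to scan the exporter's metrics output for their names. The following is a minimal sketch, not part of the article: the function name `missing_dcgm_metrics` is hypothetical, and it assumes the metrics text arrives on standard input in the Prometheus text format, one sample per line.

```shell
#!/bin/sh
# Metric names that Azure Machine Learning extension needs, per the paragraph above.
REQUIRED_DCGM_METRICS="DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_FREE DCGM_FI_DEV_FB_USED"

# Read metrics text on stdin and print every required metric that is missing.
# Returns 0 only when all required metrics are present.
missing_dcgm_metrics() {
  metrics_text="$(cat)"
  missing=""
  for name in $REQUIRED_DCGM_METRICS; do
    # A sample line starts with the metric name followed by "{" or a space.
    if ! printf '%s\n' "$metrics_text" | grep -Eq "^${name}[{ ]"; then
      missing="${missing}${missing:+ }${name}"
    fi
  done
  if [ -z "$missing" ]; then
    return 0
  fi
  echo "$missing"
  return 1
}
```

A possible use is to pipe the exporter's metrics endpoint into the function, for example `curl -s http://<dcgm-exporter-address>/metrics | missing_dcgm_metrics`, where the address and path are assumptions that depend on how your dcgm-exporter service is exposed.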
156
156
157
157
1. Make sure you have Azure Machine Learning extension and dcgm-exporter installed successfully. Dcgm-exporter can be installed by [Dcgm-exporter helm chart](https://github.com/NVIDIA/dcgm-exporter) or [Gpu-operator helm chart](https://github.com/NVIDIA/gpu-operator).
158
158
@@ -237,10 +237,10 @@ volcano-scheduler.conf: |
237
237
- name: nodeorder
238
238
- name: binpack
239
239
```
240
-
You need to use this same config settings, and you need to disable `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**, so that Azure Machine Learning training workloads can perform properly.
240
+
You need to use this same config setting, and you need to disable `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**, so that Azure Machine Learning training workloads can perform properly.
243
-
As discussed in this [thread](https://github.com/volcano-sh/volcano/issues/2558) , the **gang plugin** is not working well with the cluster autoscaler(CA) and also the node autoscaler in AKS.
243
+
As discussed in this [thread](https://github.com/volcano-sh/volcano/issues/2558), the **gang plugin** doesn't work well with the cluster autoscaler (CA) or the node autoscaler in AKS.
244
244
245
245
If you use the volcano that comes with the Azure Machine Learning extension by setting `installVolcano=true`, the extension has a scheduler config by default, which configures the **gang** plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in an AKS cluster isn't supported with the volcano installed by the extension.
246
246
@@ -265,7 +265,7 @@ volcano-scheduler.conf: |
265
265
```
266
266
267
267
To use this config in your AKS cluster, follow these steps:
268
-
1. Create a configmap file with the above config in the `azureml` namespace. This namespace will generally be created when you install the Azure Machine Learning extension.
268
+
1. Create a configmap file with the previous config in the `azureml` namespace. This namespace will generally be created when you install the Azure Machine Learning extension.
269
269
1. Set `volcanoScheduler.schedulerConfigMap=<configmap name>` in the extension config to apply this configmap. You also need to skip the resource validation when installing the extension by configuring `amloperator.skipResourceValidation=true`. For example:
@@ -276,9 +276,9 @@ To use this config in your AKS cluster, you need to follow the following steps:
276
276
>
277
277
>* To avoid this situation, you can **use the same instance type across the jobs**.
278
278
>
279
-
> Using a scheduler configuration other than the default provided by the Azure Machine Learning extension may not be fully supported. Proceed with caution.
279
+
> Using a scheduler configuration other than the default provided by the Azure Machine Learning extension might not be fully supported. Proceed with caution.
280
280
>
281
-
> Note that you need to disable `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**.
281
+
> You need to disable the `job/validate` webhook in the volcano admission if your **volcano version is lower than 1.6**.
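
The version condition in the note can be checked with a small helper. This is a minimal sketch, not part of the article: the function name `volcano_needs_webhook_disabled` is hypothetical, and it assumes the version string looks like `1.5.1` or `v1.6.0` with numeric major and minor parts. How you obtain the installed volcano version is left to you.

```shell
#!/bin/sh
# Return 0 when the given volcano version is lower than 1.6,
# which is the case where the job/validate webhook must be disabled.
# Assumption: the version has numeric major and minor parts, for example 1.5.1 or v1.6.0.
volcano_needs_webhook_disabled() {
  version="${1#v}"        # drop an optional leading "v"
  major="${version%%.*}"  # text before the first dot
  rest="${version#*.}"    # text after the first dot
  minor="${rest%%.*}"     # text before the next dot
  [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 6 ]; }
}
```

For example, `volcano_needs_webhook_disabled 1.5.1` succeeds, while `volcano_needs_webhook_disabled v1.6.0` fails, so the helper can guard the step that disables the webhook. Comparing the parts numerically avoids treating `1.10` as lower than `1.6`.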
282
282
283
283
### Ingress Nginx controller
284
284
@@ -303,11 +303,11 @@ az ml extension update --config nginxIngress.controller="k8s.io/amlarc-ingress-n
303
303
304
304
**Symptom**
305
305
306
-
The nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors even when there is no workload. The controller logs do not show any useful information to diagnose the problem.
306
+
The nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors even when there's no workload. The controller logs don't show any useful information to diagnose the problem.
307
307
308
308
**Possible Cause**
309
309
310
-
This issue may occur if the nginx ingress controller runs on a node with many CPUs. By default, the nginx ingress controller spawns worker processes according to the number of CPUs, which may consume more resources and cause OOM errors on nodes with more CPUs. This is a known [issue](https://github.com/kubernetes/ingress-nginx/issues/8166) reported on GitHub
310
+
This issue might occur if the nginx ingress controller runs on a node with many CPUs. By default, the nginx ingress controller spawns worker processes according to the number of CPUs, which might consume more resources and cause OOM errors on nodes with more CPUs. This is a known [issue](https://github.com/kubernetes/ingress-nginx/issues/8166) reported on GitHub.