title: Troubleshoot AKS AI Toolchain Operator Add-on Errors
description: Learn how to resolve errors that occur when you try to enable the Azure Kubernetes Service (AKS) AI toolchain operator add-on.
-ms.date: 05/09/2025
+ms.date: 05/16/2025
ms.reviewer: sachidesai, v-weizhu
ms.service: azure-kubernetes-service
ms.custom: sap:Extensions, Policies and Add-Ons, references_regions
---
# Troubleshoot AKS AI toolchain operator add-on errors
-This article provides guidance on resolving errors that might occur when you enable the Microsoft Azure Kubernetes Service (AKS) AI Toolchain Operator (KAITO) add-on during a cluster creation or update.
+This article provides guidance on resolving errors that might occur when you enable the Microsoft Azure Kubernetes Service (AKS) AI Toolchain Operator (KAITO) add-on during cluster creation or update.
## Prerequisites
-Ensure the following tools are installed and configured. They'll be used in the following sections.
+Ensure the following tools are installed and configured. They're used in the following sections.
- [Azure CLI](/cli/azure/install-azure-cli)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), the Kubernetes command-line client
@@ -23,17 +23,17 @@ The KAITO add-on consists of two controllers: the `gpu-provisioner` controller a
| Error message | Cause |
| --- | --- |
-| Workspace was not created | [Cause 1: KAITO custom resource not configured properly](#cause-1-misconfiguration-in-kaito-custom-resource) |
+| Workspace was not created | [Cause 1: Incorrect KAITO custom resource configuration](#cause-1-misconfiguration-in-kaito-custom-resources) |
| GPU node was not created | [Cause 2: GPU quota limitations](#cause-2-gpu-quota-limitations) |
| Resource ready condition is not `True` | [Cause 3: Long pull time for model inference images](#cause-3-long-pull-time-for-model-inference-images) |
-## Cause 1: Misconfiguration in KAITO custom resource
+## Cause 1: Misconfiguration in KAITO custom resources
After you enable the add-on and deploy a preset or custom workspace custom resource (CR), the `workspace` controller includes a validation webhook. This webhook blocks common mistakes of setting wrong values in the CR specification.
To resolve this issue, follow these steps:
-1. Check your gpu-provisioner and workspace pod logs.
+1. Check your `gpu-provisioner` and `workspace` pod logs.
2. Ensure that any updates to the GPU virtual machine (VM) size meet the requirements of your model size.
3. Once the workspace CR is successfully created, track the deployment progress by running the following commands:
@@ -47,19 +47,19 @@ To resolve this issue, follow these steps:
## Cause 2: GPU quota limitations
-The gpu-provisioner controller might fail to create GPU nodes due to quota limitations in your subscription or region. In this case, you can check the machine CR status (internal CR created by the workspace controller) for error messages. The machine CR created by the workspace controller has a kaito.sh/workspace label key with the workspace's name as the value.
+The `gpu-provisioner` controller might fail to create GPU nodes due to quota limitations in your subscription or region. In this case, you can check the machine CR status (internal CR created by the `workspace` controller) for error messages. The machine CR created by the `workspace` controller has a `kaito.sh/workspace` label key whose value is the workspace's name.
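
The label lookup described in this paragraph can be sketched as a small shell helper. This is a minimal sketch and not part of the article: the function name `kaito_machine_status` and the example workspace name are hypothetical, and it assumes the machine CR can be queried through `kubectl` under the resource name `machine`.

```shell
# Hypothetical helper: print the machine CRs that belong to one workspace.
# Assumption: the CR is queryable as "machine"; adjust the resource name
# if your cluster exposes it differently.
kaito_machine_status() {
  workspace_name="$1"
  # Select machines by the kaito.sh/workspace label and dump their full
  # YAML, including the status section where error messages appear.
  kubectl get machine -l "kaito.sh/workspace=${workspace_name}" -o yaml
}

# Usage (hypothetical workspace name):
#   kaito_machine_status my-workspace
```

Quota-related failures, if any, would be expected to show up in the status section of the returned YAML.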
To resolve this issue, use one of the following methods:
- Request an [increase in the subscription quota](/azure/quotas/quickstart-increase-quota-portal?) for the required GPU VM family of your deployment.
-- Check [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table?msockid=182ea2d5e1ff6eb61ccbb1b8e5ff608a) in the specific region of your AKS cluster.
+- Check the [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table?msockid=182ea2d5e1ff6eb61ccbb1b8e5ff608a) in the specific region of your AKS cluster.
If the required GPU VM size is unavailable in your current region, consider switching to a different region or selecting an alternative GPU VM size.
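
Besides the portal pages linked above, quota usage and VM size availability can also be inspected from the Azure CLI. The following is a minimal sketch, assuming you're already signed in with `az login`. The function names are hypothetical, and the region and VM size prefix are placeholders that you supply.

```shell
# Hypothetical helper: show current usage and limits per VM family in a region.
gpu_quota_usage() {
  region="$1"
  az vm list-usage --location "$region" --output table
}

# Hypothetical helper: list VM sizes in a region whose name matches a prefix.
# --all also includes sizes that aren't available to the current subscription.
gpu_sku_availability() {
  region="$1"
  size_prefix="$2"
  az vm list-skus --location "$region" --size "$size_prefix" --all --output table
}

# Usage (placeholder values):
#   gpu_quota_usage eastus
#   gpu_sku_availability eastus Standard_NC
```

Comparing the usage against the limit for the GPU VM family indicates whether a quota increase is needed, and the size listing helps when choosing an alternative region or VM size.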
## Cause 3: Long pull time for model inference images
If the image access mode is set to private, the model inference image might not be pulled. This issue can occur for images with specified URLs and pull secrets.
-The inference images are typically large (30 GB -100 GB), so a longer image pull time is expected. Depending on your AKS cluster's networking setup, the pull process might take up to tens of minutes.
+Inference images are typically large (30 GB -100 GB), so a longer image pull time is expected. Depending on your AKS cluster's networking setup, the pull process might take up to tens of minutes.
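
To tell a slow pull from a failed one, it can help to look at the events recorded for the inference pod. The following is a minimal sketch and not part of the article: the function name `pod_pull_events` is hypothetical, the pod name is a placeholder, and the namespace is assumed to be `default` unless you pass a different one.

```shell
# Hypothetical helper: list events for one pod, oldest first.
pod_pull_events() {
  pod_name="$1"
  namespace="${2:-default}"  # assumption: the workspace runs in "default"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=${pod_name}" \
    --sort-by=.lastTimestamp
}

# Usage (placeholder pod name):
#   pod_pull_events <inference-pod-name>
```

A `Pulling` event without a later `Pulled` event usually means the pull is still in progress, whereas repeated pull failures point to a problem with the image URL or pull secret.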
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]