Commit 029841d

authored
Updates from editor
1 parent f957e8d commit 029841d

1 file changed: 9 additions and 9 deletions
@@ -1,18 +1,18 @@
 ---
 title: Troubleshoot AKS AI Toolchain Operator Add-on Errors
 description: Learn how to resolve errors that occur when you try to enable the Azure Kubernetes Service (AKS) AI toolchain operator add-on.
-ms.date: 05/09/2025
+ms.date: 05/16/2025
 ms.reviewer: sachidesai, v-weizhu
 ms.service: azure-kubernetes-service
 ms.custom: sap:Extensions, Policies and Add-Ons, references_regions
 ---
 # Troubleshoot AKS AI toolchain operator add-on errors
 
-This article provides guidance on resolving errors that might occur when you enable the Microsoft Azure Kubernetes Service (AKS) AI Toolchain Operator (KAITO) add-on during a cluster creation or update.
+This article provides guidance on resolving errors that might occur when you enable the Microsoft Azure Kubernetes Service (AKS) AI Toolchain Operator (KAITO) add-on during cluster creation or update.
 
 ## Prerequisites
 
-Ensure the following tools are installed and configured. They'll be used in the following sections.
+Ensure the following tools are installed and configured. They're used in the following sections.
 
 - [Azure CLI](/cli/azure/install-azure-cli)
 - [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), the Kubernetes command-line client
@@ -23,17 +23,17 @@ The KAITO add-on consists of two controllers: the `gpu-provisioner` controller a
 
 | Error message | Cause |
 | --- | --- |
-| Workspace was not created | [Cause 1: KAITO custom resource not configured properly](#cause-1-misconfiguration-in-kaito-custom-resource) |
+| Workspace was not created | [Cause 1: Incorrect KAITO custom resource configuration](#cause-1-misconfiguration-in-kaito-custom-resources) |
 | GPU node was not created | [Cause 2: GPU quota limitations](#cause-2-gpu-quota-limitations) |
 | Resource ready condition is not `True` | [Cause 3: Long pull time for model inference images](#cause-3-long-pull-time-for-model-inference-images)|
 
-## Cause 1: Misconfiguration in KAITO custom resource
+## Cause 1: Misconfiguration in KAITO custom resources
 
 After you enable the add-on and deploy a preset or custom workspace custom resource (CR), the `workspace` controller includes a validation webhook. This webhook blocks common mistakes of setting wrong values in the CR specification.
 
 To resolve this issue, follow these steps:
 
-1. Check yourgpu-provisioner and workspace pod logs.
+1. Check your `gpu-provisioner` and `workspace` pod logs.
 2. Ensure that any updates to the GPU virtual machine (VM) size meet the requirements of your model size.
 3. Once the workspace CR is successfully created, track the deployment progress by running the following commands:
 
@@ -47,19 +47,19 @@ To resolve this issue, follow these steps:
 
 ## Cause 2: GPU quota limitations
 
-The gpu-provisioner controller might fail to create GPU nodes due to quota limitations in your subscription or region. In this case, you can check the machine CR status (internal CR created by the workspace controller) for error messages. The machine CR created by the workspace controller has a kaito.sh/workspace label key with the workspace's name as the value.
+The `gpu-provisioner` controller might fail to create GPU nodes due to quota limitations in your subscription or region. In this case, you can check the machine CR status (internal CR created by the `workspace` controller) for error messages. The machine CR created by the `workspace` controller has a `kaito.sh/workspace` label key whose value is the workspace's name.
 
 To resolve this issue, use one of the following methods:
 
 - Request an [increase in the subscription quota](/azure/quotas/quickstart-increase-quota-portal?) for the required GPU VM family of your deployment.
-- Check [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table?msockid=182ea2d5e1ff6eb61ccbb1b8e5ff608a) in the specific region of your AKS cluster.
+- Check the [GPU instance availability](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/table?msockid=182ea2d5e1ff6eb61ccbb1b8e5ff608a) in the specific region of your AKS cluster.
 
 If the required GPU VM size is unavailable in your current region, consider switching to a different region or selecting an alternative GPU VM size.
 
 ## Cause 3: Long pull time for model inference images
 
 If the image access mode is set to private, the model inference image might not be pulled. This issue can occur for images with specified URLs and pull secrets.
 
-The inference images are typically large (30 GB -100 GB), so a longer image pull time is expected. Depending on your AKS cluster's networking setup, the pull process might take up to tens of minutes.
+Inference images are typically large (30 GB -100 GB), so a longer image pull time is expected. Depending on your AKS cluster's networking setup, the pull process might take up to tens of minutes.
 
 [!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
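
Reviewer note on the Cause 2 paragraph: the revised text says the machine CR carries a `kaito.sh/workspace` label whose value is the workspace's name. A minimal sketch of how a reader might use that label to find the CR follows. The workspace name `my-workspace` is a hypothetical placeholder, and the resource name `machine` passed to `kubectl get` is an assumption based on the article's "machine CR" wording; the actual resource name on a given cluster may differ.

```shell
#!/usr/bin/env bash
# Sketch only: WORKSPACE_NAME is a hypothetical placeholder, and the resource
# name "machine" is an assumption taken from the article's "machine CR" wording.
WORKSPACE_NAME="${WORKSPACE_NAME:-my-workspace}"

# Build the label selector described in the Cause 2 paragraph:
# key kaito.sh/workspace, value = the workspace's name.
machine_selector() {
  printf 'kaito.sh/workspace=%s' "$1"
}

selector="$(machine_selector "$WORKSPACE_NAME")"
echo "Label selector: ${selector}"

# Query the cluster only when kubectl is installed. The status section of the
# returned YAML is where the article says to look for error messages.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get machine -l "$selector" -o yaml || echo "kubectl query failed" >&2
fi
```

Keeping the selector in a small function makes it easy to reuse the same label with other read-only commands, such as `kubectl describe`, without retyping the key.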
