Commit d08119a

{AKS}: containerized aks-agent (#9451)
* cleanup evals and UTs
* add vendored_sdks
* containerized aks agent
* add UT
* update doc and release history
* fix style check errors
* update chart repo and version
* remove wrong file and update default cluster role rule
* address comments
* dont expose api key in configmap
* add init file for vendored_sdks/azure_mgmt_containerservice
* fix(windows): fix import fcntl failure on windows
* fix(windows): use tempfile library
* fix(windows): pick correct helm.exe in zip for windows
* remove unused import
* bump helm version
1 parent 49f5024 commit d08119a

464 files changed, +36071 -102219 lines changed

src/aks-agent/HISTORY.rst

Lines changed: 12 additions & 0 deletions
@@ -12,6 +12,18 @@ To release a new version, please select a new version number (usually plus 1 to
 Pending
 +++++++

+1.0.0b12
+++++++++
+* [BREAKING CHANGE]:
+  * aks-agent is now containerized and deployed per Kubernetes cluster along with a managed aks-mcp instance
+  * aks-agent is deployed on the AKS cluster as Helm charts during `az aks agent-init`
+  * aks agent commands now require --resource-group and --name parameters to specify the target AKS cluster
+  * Add `az aks agent-cleanup` to cleanup the AKS agent from the cluster
+* [SECURITY]:
+  * Kubernetes RBAC: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
+  * Azure Workload Identity: Supports Azure workload identity for secure, keyless access to Azure resources
+  * LLM credentials are stored securely in Kubernetes secrets with encryption at rest
+
 1.0.0b11
 ++++++++
 * Fix(agent-init): replace max_tokens with max_completion_tokens for connection check of Azure OpenAI service.

src/aks-agent/README.rst

Lines changed: 37 additions & 129 deletions
@@ -7,28 +7,34 @@ Introduction

 The AKS Agent extension provides the "az aks agent" command, an AI-powered assistant that helps analyze and troubleshoot Azure Kubernetes Service (AKS) clusters using Large Language Models (LLMs). The agent combines cluster context, configurable toolsets, and LLMs to answer natural-language questions about your cluster (for example, "Why are my pods not starting?") and can investigate issues in both interactive and non-interactive (batch) modes.

-New in this version: **az aks agent-init** command for easy LLM model configuration!
+New in this version: **az aks agent-init** command for containerized agent deployment!

-You can now use `az aks agent-init` to interactively add and configure LLM models before asking questions. This command guides you through the setup process, allowing you to add multiple models as needed. When asking questions with `az aks agent`, you can:
+The `az aks agent-init` command deploys the AKS agent as a Helm chart directly in your AKS cluster with enterprise-grade security:

-- Use `--config-file` to specify your own model configuration file
-- Use `--model` to select a previously configured model
-- If neither is provided, the last configured LLM will be used by default
+- **Kubernetes RBAC**: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
+- **Workload Identity**: Leverages Azure workload identity for secure, keyless access to Azure resources
+- **Interactive LLM Configuration**: Guides you through setting up LLM models with encrypted storage in Kubernetes secrets

-This makes it much easier to manage and switch between multiple models for your AKS troubleshooting workflows.
+When asking questions with `az aks agent`:
+
+- The agent automatically uses the last configured model
+- Use `--model` to select a specific model when you have multiple models configured
+
+This architecture provides better security, scalability, and manageability for production AKS troubleshooting workflows.

 Key capabilities
 ----------------


+- **Containerized Deployment**: Agent runs as a Helm chart in your AKS cluster with `az aks agent-init`.
+- **Secure Access**: Uses Kubernetes RBAC for cluster resources and Azure workload identity for Azure resources.
+- **LLM Configuration**: Interactively configure LLM models with credentials stored securely in Kubernetes secrets.
+- Support for multiple LLM providers (Azure OpenAI, OpenAI, Anthropic, Gemini, etc.).
+- Automatically uses the last configured model by default.
+- Optionally use --model to select a specific model when you have multiple models configured.
 - Interactive and non-interactive modes (use --no-interactive for batch runs).
-- Support for multiple LLM providers (Azure OpenAI, OpenAI, etc.) via interactive configuration.
-- **Easy model setup with `az aks agent-init`**: interactively add and configure LLM models, run multiple times to add more models.
-- Configurable via a JSON/YAML config file provided with --config-file, or select a model with --model.
-- If no config or model is specified, the last configured LLM is used automatically.
 - Control echo and tool output visibility with --no-echo-request and --show-tool-output.
 - Refresh the available toolsets with --refresh-toolsets.
-- Stay in traditional toolset mode by default, or opt in to aks-mcp integration with ``--aks-mcp`` when you need the enhanced capabilities.

 Prerequisites
 -------------
@@ -37,98 +43,6 @@ For more details about supported model providers and required
 variables, see: https://docs.litellm.ai/docs/providers


-LLM Configuration Explained
----------------------------
-
-The AKS Agent uses YAML configuration files to define LLM connections. Each configuration contains a provider specification and the required environment variables for that provider.
-
-Configuration Structure
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code-block:: yaml
-
-    llms:
-      - provider: azure
-        MODEL_NAME: gpt-4.1
-        AZURE_API_KEY: *******
-        AZURE_API_BASE: https://{azure-openai-service}.openai.azure.com/
-        AZURE_API_VERSION: 2025-04-01-preview
-
-Field Explanations
-^^^^^^^^^^^^^^^^^^
-
-**provider**
-The LiteLLM provider route that determines which LLM service to use. This follows the LiteLLM provider specification from https://docs.litellm.ai/docs/providers.
-
-Common values:
-
-* ``azure`` - Azure OpenAI Service
-* ``openai`` - OpenAI API and OpenAI-compatible APIs (e.g., local models, other services)
-* ``anthropic`` - Anthropic Claude
-* ``gemini`` - Google's Gemini
-* ``openai_compatible`` - OpenAI-compatible APIs (e.g., local models, other services)
-
-**MODEL_NAME**
-The specific model or deployment name to use. This varies by provider:
-
-* For Azure OpenAI: Your deployment name (e.g., ``gpt-4.1``, ``gpt-35-turbo``)
-* For OpenAI: Model name (e.g., ``gpt-4``, ``gpt-3.5-turbo``)
-* For other providers: Check the specific model names in LiteLLM documentation
-
-**Environment Variables by Provider**
-
-The remaining fields are environment variables required by each provider. These correspond to the authentication and configuration requirements of each LLM service:
-
-**Azure OpenAI (provider: azure)**
-* ``AZURE_API_KEY`` - Your Azure OpenAI API key
-* ``AZURE_API_BASE`` - Your Azure OpenAI endpoint URL (e.g., https://your-resource.openai.azure.com/)
-* ``AZURE_API_VERSION`` - API version (e.g., 2024-02-01, 2025-04-01-preview)
-
-**OpenAI (provider: openai)**
-* ``OPENAI_API_KEY`` - Your OpenAI API key (starts with sk-)
-
-**Gemini (provider: gemini)**
-* ``GOOGLE_API_KEY`` - Your Google Cloud API key
-* ``GOOGLE_API_ENDPOINT`` - Base URL for the Gemini API endpoint
-
-**Anthropic (provider: anthropic)**
-* ``ANTHROPIC_API_KEY`` - Your Anthropic API key
-
-**OpenAI Compatible (provider: openai_compatible)**
-* ``OPENAI_API_BASE`` - Base URL for the API endpoint
-* ``OPENAI_API_KEY`` - API key (if required by the service)
-
-Multiple Model Configuration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-You can configure multiple models in a single file:
-
-.. code-block:: yaml
-
-    llms:
-      - provider: azure
-        MODEL_NAME: gpt-4
-        AZURE_API_KEY: your-azure-key
-        AZURE_API_BASE: https://your-azure-endpoint.openai.azure.com/
-        AZURE_API_VERSION: 2024-02-01
-      - provider: openai
-        MODEL_NAME: gpt-4
-        OPENAI_API_KEY: your-openai-key
-      - provider: anthropic
-        MODEL_NAME: claude-3-sonnet-20240229
-        ANTHROPIC_API_KEY: your-anthropic-key
-
-When using ``--model``, specify the provider and model as ``provider/model_name`` (e.g., ``azure/gpt-4``, ``openai/gpt-4``).
-
-Security Note
-^^^^^^^^^^^^^
-
-API keys and credentials in configuration files should be kept secure. Consider using:
-
-* Restricted file permissions (``chmod 600 config.yaml``)
-* Environment variable substitution where supported
-* Separate configuration files for different environments (dev/prod)
-
 Quick start and examples
 =========================

@@ -139,14 +53,21 @@ Install the extension

     az extension add --name aks-agent

-Configure LLM models interactively
-----------------------------------
+Initialize and configure the AKS agent
+---------------------------------------

 .. code-block:: bash

-    az aks agent-init
+    az aks agent-init --resource-group MyResourceGroup --name MyManagedCluster
+
+This command will configure the LLM configuration and:

-This command will guide you through adding a new LLM model. You can run it multiple times to add more models or update existing models. All configured models are saved locally and can be selected when asking questions.
+1. Guide you through LLM model configuration with credentials stored securely in Kubernetes secrets
+2. Deploy the AKS agent Helm chart in your cluster
+3. Configure Kubernetes RBAC for secure cluster resource access
+4. Optionally configure Azure workload identity for Azure resource access
+
+You can run it multiple times to update configurations or add more models.

 Run the agent (Azure OpenAI example) :
 -----------------------------------
@@ -163,12 +84,6 @@ Run the agent (Azure OpenAI example) :

     az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment

-**3. Use a custom config file:**
-
-.. code-block:: bash
-
-    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml
-

 Run the agent (OpenAI example)
 ------------------------------
@@ -185,34 +100,27 @@ Run the agent (OpenAI example)

     az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model gpt-4o

-**3. Use a custom config file:**
-
-.. code-block:: bash
-
-    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml
-
 Run in non-interactive batch mode
 ---------------------------------

 .. code-block:: bash

-    az aks agent "Diagnose networking issues" --no-interactive --max-steps 15 --model azure/my-gpt4.1-deployment
+    az aks agent "Diagnose networking issues" --no-interactive --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment

-Opt in to MCP mode
-------------------
+Clean up the AKS agent
+-----------------------

-Traditional toolsets remain the default. Enable the aks-mcp integration when you want the enhanced toolsets by passing ``--aks-mcp``. You can return to traditional mode on a subsequent run with ``--no-aks-mcp``.
+To uninstall the AKS agent and clean up all Kubernetes resources:

 .. code-block:: bash

-    az aks agent --aks-mcp "Check node health with MCP" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
+    az aks agent-cleanup --resource-group MyResourceGroup --name MyManagedCluster

-Using a configuration file
---------------------------
+This command will:

-Pass a config file with --config-file to predefine model, credentials, and toolsets. See
-the example config and more detailed examples in the help definition at
-`src/aks-agent/azext_aks_agent/_help.py`.
+1. Uninstall the AKS agent Helm chart from your cluster
+2. Remove all associated Kubernetes resources (deployments, pods, secrets, RBAC configurations)
+3. Clean up the LLM configuration secrets

 More help
 ---------
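
The README changes above describe an init, ask, cleanup lifecycle in which every command targets a cluster through `--resource-group` and `--name`. A minimal Python sketch of that lifecycle as argument lists is given here. The helper names (`agent_init_cmd`, `agent_ask_cmd`, `agent_cleanup_cmd`) are hypothetical and not part of the extension; only flags that appear in the README diff are used, and nothing is executed, the commands are only printed.

```python
import shlex


def _target(resource_group, name):
    # Per the release notes, aks agent commands now need the target cluster.
    return ["--resource-group", resource_group, "--name", name]


def agent_init_cmd(resource_group, name):
    # Hypothetical helper: argv for deploying/configuring the agent.
    return ["az", "aks", "agent-init", *_target(resource_group, name)]


def agent_ask_cmd(prompt, resource_group, name, model=None, interactive=True):
    # Hypothetical helper: argv for asking a question.
    cmd = ["az", "aks", "agent", prompt, *_target(resource_group, name)]
    if model is not None:
        # Without --model, the README says the last configured model is used.
        cmd += ["--model", model]
    if not interactive:
        cmd.append("--no-interactive")
    return cmd


def agent_cleanup_cmd(resource_group, name):
    # Hypothetical helper: argv for removing the agent from the cluster.
    return ["az", "aks", "agent-cleanup", *_target(resource_group, name)]


rg, cluster = "MyResourceGroup", "MyManagedCluster"
for cmd in (
    agent_init_cmd(rg, cluster),
    agent_ask_cmd("Why are my pods not starting?", rg, cluster),
    agent_ask_cmd("Diagnose networking issues", rg, cluster,
                  model="azure/my-gpt4.1-deployment", interactive=False),
    agent_cleanup_cmd(rg, cluster),
):
    # shlex.join quotes the prompt so the line can be pasted into a POSIX shell.
    print(shlex.join(cmd))
```

Building the commands as lists rather than strings keeps the prompt as a single argument; the lists could be handed to `subprocess.run` unchanged if actual execution is wanted.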

src/aks-agent/azext_aks_agent/__init__.py

Lines changed: 11 additions & 23 deletions
@@ -3,27 +3,26 @@
 # Licensed under the MIT License. See License.txt in the project root for license information.
 # --------------------------------------------------------------------------------------------

-
-import os
+from azext_aks_agent._client_factory import CUSTOM_MGMT_AKS

 # pylint: disable=unused-import
-import azext_aks_agent._help
-from azext_aks_agent._consts import (
-    CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY,
-    CONST_AGENT_NAME,
-    CONST_AGENT_NAME_ENV_KEY,
-    CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY,
-    CONST_PRIVACY_NOTICE_BANNER,
-    CONST_PRIVACY_NOTICE_BANNER_ENV_KEY,
-)
 from azure.cli.core import AzCommandsLoader
-from azure.cli.core.api import get_config_dir
+from azure.cli.core.profiles import register_resource_type
+
+
+def register_aks_agent_resource_type():
+    register_resource_type(
+        "latest",
+        CUSTOM_MGMT_AKS,
+        None,
+    )


 class ContainerServiceCommandsLoader(AzCommandsLoader):

     def __init__(self, cli_ctx=None):
         from azure.cli.core.commands import CliCommandType
+        register_aks_agent_resource_type()

         aks_agent_custom = CliCommandType(operations_tmpl='azext_aks_agent.custom#{}')
         super().__init__(
@@ -44,14 +43,3 @@ def load_arguments(self, command):


 COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
-
-
-# NOTE(mainred): holmesgpt leverages the environment variables to customize its behavior.
-def customize_holmesgpt():
-    os.environ[CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY] = "true"
-    os.environ[CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY] = get_config_dir()
-    os.environ[CONST_AGENT_NAME_ENV_KEY] = CONST_AGENT_NAME
-    os.environ[CONST_PRIVACY_NOTICE_BANNER_ENV_KEY] = CONST_PRIVACY_NOTICE_BANNER
-
-
-customize_holmesgpt()

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# --------------------------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See License.txt in the project root for license information.
+# --------------------------------------------------------------------------------------------
+
+from azure.cli.core.commands.client_factory import get_mgmt_service_client
+from azure.cli.core.profiles import CustomResourceType
+
+CUSTOM_MGMT_AKS = CustomResourceType('azext_aks_agent.vendored_sdks.azure_mgmt_containerservice.2025_10_01',
+                                     'ContainerServiceClient')
+
+# Note: cf_xxx, as the client_factory option value of a command group at command declaration, it should ignore
+# parameters other than cli_ctx; get_xxx_client is used as the client of other services in the command implementation,
+# and usually accepts subscription_id as a parameter to reconfigure the subscription when sending the request
+
+
+# container service clients
+def get_container_service_client(cli_ctx, subscription_id=None):
+    return get_mgmt_service_client(cli_ctx, CUSTOM_MGMT_AKS, subscription_id=subscription_id)
+
+
+def cf_managed_clusters(cli_ctx, *_):
+    return get_container_service_client(cli_ctx).managed_clusters
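
The note in this new module distinguishes two calling conventions: `cf_xxx` factories ignore everything except `cli_ctx`, while `get_xxx_client` helpers accept an optional `subscription_id`. A self-contained sketch of that distinction is given below. It does not import azure-cli: `fake_get_mgmt_service_client` and the `SimpleNamespace` objects are stand-ins invented for illustration, so the sketch shows only the argument handling and says nothing about how the real `get_mgmt_service_client` behaves.

```python
from types import SimpleNamespace


def fake_get_mgmt_service_client(cli_ctx, resource_type, subscription_id=None):
    # Hypothetical stub standing in for azure-cli's get_mgmt_service_client.
    # It only records which subscription the "client" would be bound to.
    sub = subscription_id or cli_ctx.default_subscription
    return SimpleNamespace(
        subscription_id=sub,
        managed_clusters=f"managed_clusters@{sub}",
    )


def get_container_service_client(cli_ctx, subscription_id=None):
    # get_xxx_client style: callers may override the subscription.
    return fake_get_mgmt_service_client(
        cli_ctx, "CUSTOM_MGMT_AKS", subscription_id=subscription_id
    )


def cf_managed_clusters(cli_ctx, *_):
    # cf_xxx style: any extra positional arguments are swallowed by *_.
    return get_container_service_client(cli_ctx).managed_clusters


cli_ctx = SimpleNamespace(default_subscription="sub-default")

# Extra arguments do not change the result of a cf_ factory.
print(cf_managed_clusters(cli_ctx, "ignored", {"also": "ignored"}))

# The get_ helper can be pointed at another subscription explicitly.
print(get_container_service_client(cli_ctx, subscription_id="sub-other").subscription_id)
```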

src/aks-agent/azext_aks_agent/_consts.py

Lines changed: 17 additions & 3 deletions
@@ -30,6 +30,20 @@
 CONST_MCP_GITHUB_REPO = "Azure/aks-mcp"
 CONST_MCP_BINARY_DIR = "bin"

-# Color constants for terminal output
-HELP_COLOR = "cyan"  # same as AI_COLOR for now
-ERROR_COLOR = "red"
+# Kubernetes WebSocket exec protocol constants
+RESIZE_CHANNEL = 4  # WebSocket channel for terminal resize messages
+# WebSocket heartbeat configuration (matching kubectl client-go)
+# Based on kubernetes/client-go/tools/remotecommand/websocket.go#L59-L65
+# pingPeriod = 5 * time.Second
+# pingReadDeadline = (pingPeriod * 12) + (1 * time.Second)
+# The read deadline is calculated to allow up to 12 missed pings plus 1 second buffer
+# This provides tolerance for network delays while detecting actual connection failures
+HEARTBEAT_INTERVAL = 5.0  # pingPeriod: 5 seconds between pings
+HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # pingReadDeadline: 61 seconds total timeout
+
+AGENT_NAMESPACE = "kube-system"
+AGENT_LABEL_SELECTOR = "app.kubernetes.io/name=aks-agent"
+AKS_MCP_LABEL_SELECTOR = "app.kubernetes.io/name=aks-mcp"
+
+# Helm Configuration
+HELM_VERSION = "3.16.0"
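
The heartbeat constants added above encode a read deadline of 12 missed pings plus a one-second buffer. A short sketch of how such constants might be used is given here. The functions `heartbeat_expired` and `resize_frame` are hypothetical and do not come from the commit. The resize framing (one channel byte followed by a JSON object with `Width` and `Height`) is an assumption about the Kubernetes exec channel protocol, not something this diff confirms.

```python
import json
import time

# Values copied from the constants in the diff above.
RESIZE_CHANNEL = 4
HEARTBEAT_INTERVAL = 5.0
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # 61.0 seconds


def heartbeat_expired(last_pong_at, now=None):
    """Return True once no pong has arrived for longer than the read deadline.

    Hypothetical helper: timestamps are expected to come from time.monotonic().
    """
    now = time.monotonic() if now is None else now
    return (now - last_pong_at) > HEARTBEAT_TIMEOUT


def resize_frame(columns, rows):
    """Build a terminal-resize message for the resize channel.

    Assumption: the frame is the channel number as a single byte, followed by
    a JSON body carrying the terminal size under "Width" and "Height".
    """
    payload = json.dumps({"Width": columns, "Height": rows}).encode("utf-8")
    return bytes([RESIZE_CHANNEL]) + payload


print(HEARTBEAT_TIMEOUT)
print(heartbeat_expired(last_pong_at=0.0, now=30.0))
print(resize_frame(120, 40))
```

Using a monotonic clock for the deadline check avoids false timeouts when the wall clock is adjusted during a long-running session.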
