Commit ae8fe30

add aks-agent extension
1 parent 8cb6833 commit ae8fe30

22 files changed: +1790 -0 lines changed

src/aks-agent/HISTORY.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
.. :changelog:

Release History
===============

Guidance
++++++++
If there is no rush to release a new version, please just add a description of the modification under the *Pending* section.

To release a new version, select a new version number (usually the last patch version plus 1; X.Y.Z -> Major.Minor.Patch, more details in `\doc <https://semver.org/>`_), then add a new section to this file named after the new version number. The section should include the new modifications and everything from the *Pending* section. Finally, update the `VERSION` variable in `setup.py` with the new version number.

Pending
+++++++

0.1.0
+++++++
* Add interactive AI-powered debugging tool `az aks agent`.

src/aks-agent/README.rst

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
Azure CLI AKS Agent Extension
===============================

Introduction
============

The AKS Agent extension provides the "az aks agent" command, an AI-powered assistant that
helps analyze and troubleshoot Azure Kubernetes Service (AKS) clusters using Large Language
Models (LLMs). The agent combines cluster context, configurable toolsets, and LLMs to answer
natural-language questions about your cluster (for example, "Why are my pods not starting?")
and can investigate issues in both interactive and non-interactive (batch) modes.

Key capabilities
----------------

- Interactive and non-interactive modes (use --no-interactive for batch runs).
- Support for multiple LLM providers (Azure OpenAI, OpenAI, etc.) via environment variables.
- Configurable via a JSON/YAML config file provided with --config-file.
- Control echo and tool output visibility with --no-echo-request and --show-tool-output.
- Refresh the available toolsets with --refresh-toolsets.

Prerequisites
-------------

Before using the agent, make sure provider-specific environment variables are set. For
example, Azure OpenAI typically requires AZURE_API_BASE, AZURE_API_VERSION, and AZURE_API_KEY,
while OpenAI requires OPENAI_API_KEY. For more details about supported providers and required
variables, see: https://docs.litellm.ai/docs/providers

Quick start and examples
========================

Install the extension
---------------------

.. code-block:: bash

    az extension add --name aks-agent

Run the agent (Azure OpenAI example)
------------------------------------

.. code-block:: bash

    export AZURE_API_BASE="https://my-azureopenai-service.openai.azure.com/"
    export AZURE_API_VERSION="2025-01-01-preview"
    export AZURE_API_KEY="sk-xxx"

    az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment

Run the agent (OpenAI example)
------------------------------

.. code-block:: bash

    export OPENAI_API_KEY="sk-xxx"
    az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model gpt-4o

Run in non-interactive batch mode
---------------------------------

.. code-block:: bash

    az aks agent "Diagnose networking issues" --no-interactive --max-steps 15 --model azure/my-gpt4.1-deployment

Using a configuration file
--------------------------

Pass a config file with --config-file to predefine model, credentials, and toolsets. See
the example config and more detailed examples in the help definition at
`src/aks-agent/azext_aks_agent/_help.py`.

More help
---------

For a complete list of parameters, detailed examples and help text, run:

.. code-block:: bash

    az aks agent -h
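
The batch mode described in the README lends itself to scripting. The sketch below shows one way a Python script might assemble such an invocation, using only flags that appear in this README. `build_agent_command` is a hypothetical helper, not part of the extension, and the sketch only prints the command instead of running it.

```python
import shlex


def build_agent_command(question, model, name=None, resource_group=None, max_steps=None):
    """Assemble an `az aks agent` batch invocation as an argv list (hypothetical helper)."""
    argv = ["az", "aks", "agent", question, "--no-interactive", "--model", model]
    if name:
        argv += ["--name", name]
    if resource_group:
        argv += ["--resource-group", resource_group]
    if max_steps is not None:
        argv += ["--max-steps", str(max_steps)]
    return argv


argv = build_agent_command(
    "Diagnose networking issues",
    "azure/my-gpt4.1-deployment",
    max_steps=15,
)
# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(argv))
```

On a machine where the Azure CLI, the aks-agent extension, and the provider environment variables are set up, the same list could be handed to `subprocess.run(argv, check=True)`.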
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------


from azure.cli.core import AzCommandsLoader

# pylint: disable=unused-import
import azext_aks_agent._help


class ContainerServiceCommandsLoader(AzCommandsLoader):

    def __init__(self, cli_ctx=None):
        from azure.cli.core.commands import CliCommandType

        aks_agent_custom = CliCommandType(operations_tmpl='azext_aks_agent.custom#{}')
        super().__init__(
            cli_ctx=cli_ctx,
            custom_command_type=aks_agent_custom,
        )

    def load_command_table(self, args):
        super().load_command_table(args)
        from azext_aks_agent.commands import load_command_table
        load_command_table(self, args)
        return self.command_table

    def load_arguments(self, command):
        super().load_arguments(command)
        from azext_aks_agent._params import load_arguments
        load_arguments(self, command)


COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
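
The `operations_tmpl` string above follows a `module#function` convention: the `{}` placeholder is filled with a function name, and the part before `#` names the module that holds it. The sketch below illustrates that convention with a simplified resolver. `resolve_operation` is a hypothetical name, not an azure-cli API, and the real loader does more work. The standard-library module `os.path` stands in for `azext_aks_agent.custom`, which is not importable outside the extension.

```python
import importlib
import os.path


def resolve_operation(operations_tmpl, name):
    """Resolve a 'module#attribute' template to a callable (illustrative only)."""
    operation = operations_tmpl.format(name)        # e.g. 'os.path#join'
    module_path, attr_path = operation.split("#")   # module name, attribute path
    target = importlib.import_module(module_path)
    for part in attr_path.split("."):
        target = getattr(target, part)
    return target


# 'os.path#{}' plays the role of 'azext_aks_agent.custom#{}'.
join = resolve_operation("os.path#{}", "join")
print(join is os.path.join)
```

Deferring the import until a command is actually resolved is one reason for keeping the handler as a string: the module is only loaded when the command runs.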
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

# aks agent constants
CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY = "HOLMES_CONFIGPATH_DIR"
CONST_AGENT_NAME = "AKS AGENT"
CONST_AGENT_NAME_ENV_KEY = "AGENT_NAME"
CONST_AGENT_CONFIG_FILE_NAME = "aksAgent.yaml"
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# coding=utf-8
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

# pylint: disable=too-many-lines

from knack.help_files import helps


helps[
    "aks agent"
] = """
    type: command
    short-summary: Run AI assistant to analyze and troubleshoot Kubernetes clusters.
    long-summary: |-
      This command allows you to ask questions about your Azure Kubernetes cluster and get answers using AI models.
      Environment variables must be set to use the AI model. Refer to https://docs.litellm.ai/docs/providers to learn more about supported AI providers and models and the required environment variables.
    parameters:
        - name: --name -n
          type: string
          short-summary: Name of the managed cluster.
        - name: --resource-group -g
          type: string
          short-summary: Name of the resource group.
        - name: --model
          type: string
          short-summary: Model to use for the LLM.
        - name: --api-key
          type: string
          short-summary: API key to use for the LLM (if not given, uses environment variables AZURE_API_KEY, OPENAI_API_KEY).
        - name: --config-file
          type: string
          short-summary: Path to configuration file.
        - name: --max-steps
          type: int
          short-summary: Maximum number of steps the LLM can take to investigate the issue.
        - name: --no-interactive
          type: bool
          short-summary: Disable interactive mode. When set, the agent will not prompt for input and will run in batch mode.
        - name: --no-echo-request
          type: bool
          short-summary: Disable echoing back the question provided to AKS Agent in the output.
        - name: --show-tool-output
          type: bool
          short-summary: Show the output of each tool that was called during the analysis.
        - name: --refresh-toolsets
          type: bool
          short-summary: Refresh the toolsets status.

    examples:
        - name: Ask about pod issues in the cluster with Azure OpenAI
          text: |-
            export AZURE_API_BASE="https://my-azureopenai-service.openai.azure.com/"
            export AZURE_API_VERSION="2025-01-01-preview"
            export AZURE_API_KEY="sk-xxx"
            az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
        - name: Ask about pod issues in the cluster with OpenAI
          text: |-
            export OPENAI_API_KEY="sk-xxx"
            az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model gpt-4o
        - name: Run in interactive mode with an initial question and an explicit API key
          text: az aks agent "Check the pod status in my cluster" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment --api-key "sk-xxx"
        - name: Run in non-interactive batch mode
          text: az aks agent "Diagnose networking issues" --no-interactive --max-steps 15 --model azure/my-gpt4.1-deployment
        - name: Show detailed tool output during analysis
          text: az aks agent "Why is my service workload unavailable in namespace workload-ns?" --show-tool-output --model azure/my-gpt4.1-deployment
        - name: Use custom configuration file
          text: az aks agent "Check kubernetes pod resource usage" --config-file /path/to/custom.yaml --model azure/my-gpt4.1-deployment
        - name: Run agent with no echo of the original question
          text: az aks agent "What is the status of my cluster?" --no-echo-request --model azure/my-gpt4.1-deployment
        - name: Refresh toolsets to get the latest available tools
          text: az aks agent "What is the status of my cluster?" --refresh-toolsets --model azure/my-gpt4.1-deployment
        - name: Run agent with config file
          text: |
            az aks agent "Check kubernetes pod resource usage" --config-file /path/to/custom.yaml
            Here is an example config file:
            ```yaml
            model: "gpt-4o"
            api_key: "..."
            # define a list of MCP servers
            mcp_servers:
              aks_mcp:
                description: "The AKS-MCP is a Model Context Protocol (MCP) server that enables AI assistants to interact with Azure Kubernetes Service (AKS) clusters"
                url: "http://localhost:8003/sse"

            # try adding your own tools or toggle the built-in toolsets here
            # e.g. query company-specific data, fetch logs from your existing observability tools, etc
            # To check how to add a customized toolset, please refer to https://docs.robusta.dev/master/configuration/holmesgpt/custom_toolsets.html#custom-toolsets
            # To find all built-in toolsets, please refer to https://docs.robusta.dev/master/configuration/holmesgpt/builtin_toolsets.html
            toolsets:
              # add a new json processor toolset
              json_processor:
                description: "A toolset for processing JSON data using jq"
                prerequisites:
                  - command: "jq --version"  # Ensure jq is installed
                tools:
                  - name: "process_json"
                    description: "A tool that uses jq to process JSON input"
                    command: "echo '{{ json_input }}' | jq '.'"  # Example jq command to format JSON
              # disable a built-in toolset
              aks/core:
                enabled: false
            ```
"""
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

# pylint: disable=too-many-statements,too-many-lines
import os.path

from azure.cli.core.api import get_config_dir

from azext_aks_agent._consts import CONST_AGENT_CONFIG_FILE_NAME

from azext_aks_agent._validators import validate_agent_config_file


def load_arguments(self, _):
    with self.argument_context("aks agent") as c:
        c.positional(
            "prompt",
            help="Question to ask; the agent answers it using the available tools.",
        )
        c.argument(
            "resource_group_name",
            options_list=["--resource-group", "-g"],
            help="Name of resource group.",
            required=False,
        )
        c.argument(
            "name",
            options_list=["--name", "-n"],
            help="Name of the managed cluster.",
            required=False,
        )
        c.argument(
            "max_steps",
            type=int,
            default=10,
            required=False,
            help="Maximum number of steps the LLM can take to investigate the issue.",
        )
        # The default path points into the CLI config directory; the validator
        # tolerates a missing file at that default location.
        c.argument(
            "config_file",
            default=os.path.join(get_config_dir(), CONST_AGENT_CONFIG_FILE_NAME),
            validator=validate_agent_config_file,
            required=False,
            help="Path to the config file.",
        )
        c.argument(
            "model",
            help="The model to use for the LLM.",
            required=False,
            type=str,
        )
        # Use the Python parameter name (underscore) so the registration matches
        # the handler argument, consistent with the other arguments here.
        c.argument(
            "api_key",
            help="API key to use for the LLM (if not given, uses environment variables AZURE_API_KEY, OPENAI_API_KEY)",
            required=False,
            type=str,
        )
        c.argument(
            "no_interactive",
            help="Disable interactive mode. When set, the agent will not prompt for input and will run in batch mode.",
            action="store_true",
        )
        c.argument(
            "no_echo_request",
            help="Disable echoing back the question provided to AKS Agent in the output.",
            action="store_true",
        )
        c.argument(
            "show_tool_output",
            help="Show the output of each tool that was called.",
            action="store_true",
        )
        c.argument(
            "refresh_toolsets",
            help="Refresh the toolsets status.",
            action="store_true",
        )
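
For readers less familiar with knack's `argument_context`, the registrations above behave roughly like the plain `argparse` parser sketched below: one positional prompt, an integer `--max-steps` defaulting to 10, and four boolean switches that stay false unless passed. This is an illustrative analogue only. `build_parser` is a hypothetical name, the real command is wired through the Azure CLI loader, and whether the real positional is optional is not shown in this diff (it is required here). The default for `--config-file` is omitted because it depends on the CLI config directory.

```python
import argparse


def build_parser():
    """Plain-argparse stand-in for the `aks agent` argument registrations."""
    parser = argparse.ArgumentParser(prog="az aks agent")
    parser.add_argument("prompt")
    parser.add_argument("--resource-group", "-g", dest="resource_group_name")
    parser.add_argument("--name", "-n")
    parser.add_argument("--max-steps", type=int, default=10)
    parser.add_argument("--config-file")
    parser.add_argument("--model")
    parser.add_argument("--api-key")
    # Boolean switches: absent means False, present means True.
    for flag in ("--no-interactive", "--no-echo-request",
                 "--show-tool-output", "--refresh-toolsets"):
        parser.add_argument(flag, action="store_true")
    return parser


args = build_parser().parse_args(
    ["Diagnose networking issues", "--no-interactive", "--max-steps", "15"]
)
print(args.max_steps, args.no_interactive, args.show_tool_output)  # 15 True False
```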
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from __future__ import unicode_literals

import os
import os.path

import yaml
from azext_aks_agent._consts import CONST_AGENT_CONFIG_FILE_NAME
from azure.cli.core.api import get_config_dir
from azure.cli.core.azclierror import InvalidArgumentValueError
from knack.log import get_logger

logger = get_logger(__name__)


def _validate_param_yaml_file(yaml_path, param_name):
    """Check that yaml_path exists, is readable, and parses as YAML."""
    if not yaml_path:
        return
    if not os.path.exists(yaml_path):
        raise InvalidArgumentValueError(
            f"--{param_name}={yaml_path}: file is not found."
        )
    if not os.access(yaml_path, os.R_OK):
        raise InvalidArgumentValueError(
            f"--{param_name}={yaml_path}: file is not readable."
        )
    try:
        with open(yaml_path, "r", encoding="utf-8") as file:
            yaml.safe_load(file)
    except yaml.YAMLError as e:
        raise InvalidArgumentValueError(
            f"--{param_name}={yaml_path}: file is not a valid YAML file: {e}"
        ) from e
    except Exception as e:
        raise InvalidArgumentValueError(
            f"--{param_name}={yaml_path}: An error occurred while reading the config file: {e}"
        ) from e


def validate_agent_config_file(namespace):
    """Validate --config-file, tolerating a missing file at the default path."""
    config_file = namespace.config_file
    if not config_file:
        return
    # the config file at the default path is optional and may not exist
    default_config_path = os.path.join(get_config_dir(), CONST_AGENT_CONFIG_FILE_NAME)
    if config_file == default_config_path and not os.path.exists(config_file):
        return

    _validate_param_yaml_file(config_file, "config-file")
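
The path handling in `validate_agent_config_file` has one subtle rule: a missing file is tolerated only at the default location, while a missing file at an explicitly supplied path is an error. The standard-library sketch below isolates that decision. `check_config_path` is a hypothetical function, `ValueError` stands in for `InvalidArgumentValueError`, and the readability and YAML-syntax checks of the real validator are left out.

```python
import os
import tempfile


def check_config_path(config_file, default_path):
    """Return 'skip' or 'validate', or raise ValueError for a missing explicit file."""
    if not config_file:
        return "skip"
    if config_file == default_path and not os.path.exists(config_file):
        return "skip"  # default location, nothing has been written there yet
    if not os.path.exists(config_file):
        raise ValueError(f"--config-file={config_file}: file is not found.")
    return "validate"  # the real validator continues with further checks


with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, "aksAgent.yaml")
    custom_path = os.path.join(tmp, "custom.yaml")

    print(check_config_path(default_path, default_path))  # default file absent
    try:
        check_config_path(custom_path, default_path)       # explicit file absent
    except ValueError as err:
        print("rejected:", err)

    with open(custom_path, "w", encoding="utf-8") as handle:
        handle.write("model: gpt-4o\n")
    print(check_config_path(custom_path, default_path))    # explicit file present
```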

src/aks-agent/azext_aks_agent/agent/__init__.py

Whitespace-only changes.
