diff --git a/src/aks-agent/HISTORY.rst b/src/aks-agent/HISTORY.rst
Lines changed: 12 additions & 0 deletions
diff --git a/src/aks-agent/README.rst b/src/aks-agent/README.rst
Lines changed: 37 additions & 129 deletions
diff --git a/src/aks-agent/azext_aks_agent/__init__.py b/src/aks-agent/azext_aks_agent/__init__.py
Lines changed: 11 additions & 23 deletions
diff --git a/src/aks-agent/azext_aks_agent/_client_factory.py b/src/aks-agent/azext_aks_agent/_client_factory.py
Lines changed: 23 additions & 0 deletions
diff --git a/src/aks-agent/azext_aks_agent/_consts.py b/src/aks-agent/azext_aks_agent/_consts.py
Lines changed: 17 additions & 3 deletions
@@ -12,6 +12,18 @@ To release a new version, please select a new version number (usually plus 1 to
 Pending
 +++++++
 
+1.0.0b12
+++++++++
+* [BREAKING CHANGE]:
+  * aks-agent is now containerized and deployed per Kubernetes cluster along with a managed aks-mcp instance
+  * aks-agent is deployed on the AKS cluster as Helm charts during `az aks agent-init`
+  * `az aks agent` commands now require the --resource-group and --name parameters to specify the target AKS cluster
+  * Add `az aks agent-cleanup` to remove the AKS agent from the cluster
+* [SECURITY]:
+  * Kubernetes RBAC: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
+  * Azure Workload Identity: Supports Azure workload identity for secure, keyless access to Azure resources
+  * LLM credentials are stored securely in Kubernetes secrets with encryption at rest
+
 1.0.0b11
 ++++++++
 * Fix(agent-init): replace max_tokens with max_completion_tokens for connection check of Azure OpenAI service.
 
@@ -7,28 +7,34 @@ Introduction
 
 The AKS Agent extension provides the "az aks agent" command, an AI-powered assistant that helps analyze and troubleshoot Azure Kubernetes Service (AKS) clusters using Large Language Models (LLMs). The agent combines cluster context, configurable toolsets, and LLMs to answer natural-language questions about your cluster (for example, "Why are my pods not starting?") and can investigate issues in both interactive and non-interactive (batch) modes.
 
-New in this version: **az aks agent-init** command for easy LLM model configuration!
+New in this version: **az aks agent-init** command for containerized agent deployment!
 
-You can now use `az aks agent-init` to interactively add and configure LLM models before asking questions. This command guides you through the setup process, allowing you to add multiple models as needed. When asking questions with `az aks agent`, you can:
+The `az aks agent-init` command deploys the AKS agent as a Helm chart directly in your AKS cluster and sets up the following:
 
-- Use `--config-file` to specify your own model configuration file
-- Use `--model` to select a previously configured model
-- If neither is provided, the last configured LLM will be used by default
+- **Kubernetes RBAC**: Uses cluster roles to securely access Kubernetes resources with least-privilege principles
+- **Workload Identity**: Leverages Azure workload identity for secure, keyless access to Azure resources
+- **Interactive LLM Configuration**: Guides you through setting up LLM models with encrypted storage in Kubernetes secrets
 
-This makes it much easier to manage and switch between multiple models for your AKS troubleshooting workflows.
+When asking questions with `az aks agent`:
+
+- The agent automatically uses the last configured model
+- Use `--model` to select a specific model when you have multiple models configured
+
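+Running the agent inside the cluster keeps LLM credentials in Kubernetes secrets and scopes the agent's access through Kubernetes RBAC and Azure workload identity, which makes it better suited to production AKS troubleshooting workflows.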
 
 Key capabilities
 ----------------
 
 
+- **Containerized Deployment**: `az aks agent-init` deploys the agent into your AKS cluster as a Helm chart.
+- **Secure Access**: Uses Kubernetes RBAC for cluster resources and Azure workload identity for Azure resources.
+- **LLM Configuration**: Interactively configure LLM models with credentials stored securely in Kubernetes secrets.
+- Support for multiple LLM providers (Azure OpenAI, OpenAI, Anthropic, Gemini, etc.).
+- Automatically uses the last configured model by default.
+- Optionally use --model to select a specific model when you have multiple models configured.
 - Interactive and non-interactive modes (use --no-interactive for batch runs).
-- Support for multiple LLM providers (Azure OpenAI, OpenAI, etc.) via interactive configuration.
-- **Easy model setup with `az aks agent-init`**: interactively add and configure LLM models, run multiple times to add more models.
-- Configurable via a JSON/YAML config file provided with --config-file, or select a model with --model.
-- If no config or model is specified, the last configured LLM is used automatically.
 - Control echo and tool output visibility with --no-echo-request and --show-tool-output.
 - Refresh the available toolsets with --refresh-toolsets.
-- Stay in traditional toolset mode by default, or opt in to aks-mcp integration with ``--aks-mcp`` when you need the enhanced capabilities.
 
 Prerequisites
 -------------
@@ -37,98 +43,6 @@ For more details about supported model providers and required
 variables, see: https://docs.litellm.ai/docs/providers
 
 
-LLM Configuration Explained
----------------------------
-
-The AKS Agent uses YAML configuration files to define LLM connections. Each configuration contains a provider specification and the required environment variables for that provider.
-
-Configuration Structure
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code-block:: yaml
-
-    llms:
-    - provider: azure
-      MODEL_NAME: gpt-4.1
-      AZURE_API_KEY: *******
-      AZURE_API_BASE: https://{azure-openai-service}.openai.azure.com/
-      AZURE_API_VERSION: 2025-04-01-preview
-
-Field Explanations
-^^^^^^^^^^^^^^^^^^
-
-**provider**
-    The LiteLLM provider route that determines which LLM service to use. This follows the LiteLLM provider specification from https://docs.litellm.ai/docs/providers.
-
-    Common values:
-
-    * ``azure`` - Azure OpenAI Service
-    * ``openai`` - OpenAI API and OpenAI-compatible APIs (e.g., local models, other services)
-    * ``anthropic`` - Anthropic Claude
-    * ``gemini`` - Google's Gemini
-    * ``openai_compatible`` - OpenAI-compatible APIs (e.g., local models, other services)
-
-**MODEL_NAME**
-    The specific model or deployment name to use. This varies by provider:
-
-    * For Azure OpenAI: Your deployment name (e.g., ``gpt-4.1``, ``gpt-35-turbo``)
-    * For OpenAI: Model name (e.g., ``gpt-4``, ``gpt-3.5-turbo``)
-    * For other providers: Check the specific model names in LiteLLM documentation
-
-**Environment Variables by Provider**
-
-The remaining fields are environment variables required by each provider. These correspond to the authentication and configuration requirements of each LLM service:
-
-**Azure OpenAI (provider: azure)**
-    * ``AZURE_API_KEY`` - Your Azure OpenAI API key
-    * ``AZURE_API_BASE`` - Your Azure OpenAI endpoint URL (e.g., https://your-resource.openai.azure.com/)
-    * ``AZURE_API_VERSION`` - API version (e.g., 2024-02-01, 2025-04-01-preview)
-
-**OpenAI (provider: openai)**
-    * ``OPENAI_API_KEY`` - Your OpenAI API key (starts with sk-)
-
-**Gemini (provider: gemini)**
-    * ``GOOGLE_API_KEY`` - Your Google Cloud API key
-    * ``GOOGLE_API_ENDPOINT`` - Base URL for the Gemini API endpoint
-
-**Anthropic (provider: anthropic)**
-    * ``ANTHROPIC_API_KEY`` - Your Anthropic API key
-
-**OpenAI Compatible (provider: openai_compatible)**
-    * ``OPENAI_API_BASE`` - Base URL for the API endpoint
-    * ``OPENAI_API_KEY`` - API key (if required by the service)
-
-Multiple Model Configuration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-You can configure multiple models in a single file:
-
-.. code-block:: yaml
-
-    llms:
-    - provider: azure
-      MODEL_NAME: gpt-4
-      AZURE_API_KEY: your-azure-key
-      AZURE_API_BASE: https://your-azure-endpoint.openai.azure.com/
-      AZURE_API_VERSION: 2024-02-01
-    - provider: openai
-      MODEL_NAME: gpt-4
-      OPENAI_API_KEY: your-openai-key
-    - provider: anthropic
-      MODEL_NAME: claude-3-sonnet-20240229
-      ANTHROPIC_API_KEY: your-anthropic-key
-
-When using ``--model``, specify the provider and model as ``provider/model_name`` (e.g., ``azure/gpt-4``, ``openai/gpt-4``).
-
-Security Note
-^^^^^^^^^^^^^
-
-API keys and credentials in configuration files should be kept secure. Consider using:
-
-* Restricted file permissions (``chmod 600 config.yaml``)
-* Environment variable substitution where supported
-* Separate configuration files for different environments (dev/prod)
-
 Quick start and examples
 =========================
 
@@ -139,14 +53,21 @@ Install the extension
 
     az extension add --name aks-agent
 
-Configure LLM models interactively
-----------------------------------
+Initialize and configure the AKS agent
+---------------------------------------
 
 .. code-block:: bash
 
-    az aks agent-init
+    az aks agent-init --resource-group MyResourceGroup --name MyManagedCluster
+
+This command will:
 
-This command will guide you through adding a new LLM model. You can run it multiple times to add more models or update existing models. All configured models are saved locally and can be selected when asking questions.
+1. Guide you through LLM model configuration with credentials stored securely in Kubernetes secrets
+2. Deploy the AKS agent Helm chart in your cluster
+3. Configure Kubernetes RBAC for secure cluster resource access
+4. Optionally configure Azure workload identity for Azure resource access
+
+You can run it multiple times to update configurations or add more models.
 
 Run the agent (Azure OpenAI example) :
 -----------------------------------
@@ -163,12 +84,6 @@ Run the agent (Azure OpenAI example) :
 
     az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
 
-**3. Use a custom config file:**
-
-.. code-block:: bash
-
-    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml
-
 
 Run the agent (OpenAI example)
 ------------------------------
@@ -185,34 +100,27 @@ Run the agent (OpenAI example)
     
     az aks agent "Why are my pods not starting?" --name MyManagedCluster --resource-group MyResourceGroup --model gpt-4o
 
-**3. Use a custom config file:**
-
-.. code-block:: bash
-
-    az aks agent "Why are my pods not starting?" --config-file /path/to/your/model_config.yaml
-
 Run in non-interactive batch mode
 ---------------------------------
 
 .. code-block:: bash
 
-    az aks agent "Diagnose networking issues" --no-interactive --max-steps 15 --model azure/my-gpt4.1-deployment
+    az aks agent "Diagnose networking issues" --no-interactive --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
 
-Opt in to MCP mode
-------------------
+Clean up the AKS agent
+-----------------------
 
-Traditional toolsets remain the default. Enable the aks-mcp integration when you want the enhanced toolsets by passing ``--aks-mcp``. You can return to traditional mode on a subsequent run with ``--no-aks-mcp``.
+To uninstall the AKS agent and clean up all Kubernetes resources:
 
 .. code-block:: bash
 
-    az aks agent --aks-mcp "Check node health with MCP" --name MyManagedCluster --resource-group MyResourceGroup --model azure/my-gpt4.1-deployment
+    az aks agent-cleanup --resource-group MyResourceGroup --name MyManagedCluster
 
-Using a configuration file
---------------------------
+This command will:
 
-Pass a config file with --config-file to predefine model, credentials, and toolsets. See
-the example config and more detailed examples in the help definition at
-`src/aks-agent/azext_aks_agent/_help.py`.
+1. Uninstall the AKS agent Helm chart from your cluster
+2. Remove all associated Kubernetes resources (deployments, pods, secrets, RBAC configurations)
+3. Clean up the LLM configuration secrets
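The init, ask, and cleanup steps described in this README can be strung together from a script. The following is a minimal sketch, not part of the extension: the helper names (`agent_init_cmd`, `agent_ask_cmd`, `agent_cleanup_cmd`) are hypothetical, and only flags that appear in the README above are used. The helpers just build argument lists and do not execute anything.

```python
from typing import List, Optional


def agent_init_cmd(resource_group: str, name: str) -> List[str]:
    """Build the argv for deploying and configuring the agent on one cluster."""
    return ["az", "aks", "agent-init", "--resource-group", resource_group, "--name", name]


def agent_ask_cmd(
    prompt: str,
    resource_group: str,
    name: str,
    model: Optional[str] = None,
    interactive: bool = True,
) -> List[str]:
    """Build the argv for asking the agent a question about one cluster."""
    argv = ["az", "aks", "agent", prompt, "--resource-group", resource_group, "--name", name]
    if model is not None:
        # Without --model, the README says the last configured model is used.
        argv += ["--model", model]
    if not interactive:
        argv.append("--no-interactive")
    return argv


def agent_cleanup_cmd(resource_group: str, name: str) -> List[str]:
    """Build the argv for removing the agent from one cluster."""
    return ["az", "aks", "agent-cleanup", "--resource-group", resource_group, "--name", name]
```

Each list can be handed to `subprocess.run(argv, check=True)`. Keeping the cluster coordinates as required arguments mirrors the new requirement that every command targets a specific cluster.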
 
 More help
 ---------
 
@@ -3,27 +3,26 @@
 # Licensed under the MIT License. See License.txt in the project root for license information.
 # --------------------------------------------------------------------------------------------
 
-
-import os
+from azext_aks_agent._client_factory import CUSTOM_MGMT_AKS
 
 # pylint: disable=unused-import
-import azext_aks_agent._help
-from azext_aks_agent._consts import (
-    CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY,
-    CONST_AGENT_NAME,
-    CONST_AGENT_NAME_ENV_KEY,
-    CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY,
-    CONST_PRIVACY_NOTICE_BANNER,
-    CONST_PRIVACY_NOTICE_BANNER_ENV_KEY,
-)
 from azure.cli.core import AzCommandsLoader
-from azure.cli.core.api import get_config_dir
+from azure.cli.core.profiles import register_resource_type
+
+
+def register_aks_agent_resource_type():
+    register_resource_type(
+        "latest",
+        CUSTOM_MGMT_AKS,
+        None,
+    )
 
 
 class ContainerServiceCommandsLoader(AzCommandsLoader):
 
     def __init__(self, cli_ctx=None):
         from azure.cli.core.commands import CliCommandType
+        register_aks_agent_resource_type()
 
         aks_agent_custom = CliCommandType(operations_tmpl='azext_aks_agent.custom#{}')
         super().__init__(
@@ -44,14 +43,3 @@ def load_arguments(self, command):
 
 
 COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
-
-
-# NOTE(mainred): holmesgpt leverages the environment variables to customize its behavior.
-def customize_holmesgpt():
-    os.environ[CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY] = "true"
-    os.environ[CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY] = get_config_dir()
-    os.environ[CONST_AGENT_NAME_ENV_KEY] = CONST_AGENT_NAME
-    os.environ[CONST_PRIVACY_NOTICE_BANNER_ENV_KEY] = CONST_PRIVACY_NOTICE_BANNER
-
-
-customize_holmesgpt()
@@ -0,0 +1,23 @@
+# --------------------------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See License.txt in the project root for license information.
+# --------------------------------------------------------------------------------------------
+
+from azure.cli.core.commands.client_factory import get_mgmt_service_client
+from azure.cli.core.profiles import CustomResourceType
+
+CUSTOM_MGMT_AKS = CustomResourceType('azext_aks_agent.vendored_sdks.azure_mgmt_containerservice.2025_10_01',
+                                     'ContainerServiceClient')
+
+# Note: cf_xxx is passed as the client_factory option of a command group at command declaration, so it should
+# ignore parameters other than cli_ctx. get_xxx_client is used to build clients for other services inside command
+# implementations, and usually accepts subscription_id so that requests can target a different subscription.
+
+
+# container service clients
+def get_container_service_client(cli_ctx, subscription_id=None):
+    return get_mgmt_service_client(cli_ctx, CUSTOM_MGMT_AKS, subscription_id=subscription_id)
+
+
+def cf_managed_clusters(cli_ctx, *_):
+    return get_container_service_client(cli_ctx).managed_clusters
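The `cf_xxx` / `get_xxx_client` convention described in the note above can be illustrated without the Azure CLI installed. This is a sketch under stated assumptions: `FakeContainerServiceClient`, `get_fake_client`, and `cf_fake_managed_clusters` are hypothetical stand-ins, not the vendored SDK, and the extra positional argument simulates whatever the command loader may pass alongside `cli_ctx`.

```python
class FakeContainerServiceClient:
    """Hypothetical stand-in for a management client; not the real SDK class."""

    def __init__(self, subscription_id=None):
        self.subscription_id = subscription_id
        # The real client exposes an operations group; a string is enough here.
        self.managed_clusters = f"managed_clusters@{subscription_id or 'default'}"


def get_fake_client(cli_ctx, subscription_id=None):
    """get_xxx_client style: callers may override the subscription."""
    return FakeContainerServiceClient(subscription_id)


def cf_fake_managed_clusters(cli_ctx, *_):
    """cf_xxx style: accept and ignore everything except cli_ctx."""
    return get_fake_client(cli_ctx).managed_clusters


# The factory tolerates extra positional arguments from the caller.
operations = cf_fake_managed_clusters(object(), {"resource_group_name": "MyResourceGroup"})
```

The `*_` parameter is what lets a single factory be registered for a whole command group while staying usable from code paths that only have `cli_ctx`.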
@@ -30,6 +30,20 @@
 CONST_MCP_GITHUB_REPO = "Azure/aks-mcp"
 CONST_MCP_BINARY_DIR = "bin"
 
-# Color constants for terminal output
-HELP_COLOR = "cyan"  # same as AI_COLOR for now
-ERROR_COLOR = "red"
+# Kubernetes WebSocket exec protocol constants
+RESIZE_CHANNEL = 4  # WebSocket channel for terminal resize messages
+# WebSocket heartbeat configuration (matching kubectl client-go)
+# Based on kubernetes/client-go/tools/remotecommand/websocket.go#L59-L65
+# pingPeriod = 5 * time.Second
+# pingReadDeadline = (pingPeriod * 12) + (1 * time.Second)
+# The read deadline is calculated to allow up to 12 missed pings plus 1 second buffer
+# This provides tolerance for network delays while detecting actual connection failures
+HEARTBEAT_INTERVAL = 5.0                              # pingPeriod: 5 seconds between pings
+HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1    # pingReadDeadline: 61 seconds total timeout
+
+AGENT_NAMESPACE = "kube-system"
+AGENT_LABEL_SELECTOR = "app.kubernetes.io/name=aks-agent"
+AKS_MCP_LABEL_SELECTOR = "app.kubernetes.io/name=aks-mcp"
+
+# Helm Configuration
+HELM_VERSION = "3.16.0"
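The heartbeat constants above encode a simple rule: send a ping every 5 seconds and treat the connection as dead once no pong has arrived for 12 ping periods plus 1 second. A minimal sketch of that bookkeeping follows. The `HeartbeatTracker` class and its injectable clock are assumptions made for illustration, not the extension's actual implementation; only the two constants come from the diff.

```python
import time

HEARTBEAT_INTERVAL = 5.0                            # seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1   # 61.0 seconds without a pong


class HeartbeatTracker:
    """Hypothetical helper that tracks pong arrival against the read deadline."""

    def __init__(self, now=time.monotonic):
        self._now = now              # injectable clock, useful for testing
        self._last_pong = now()

    def record_pong(self):
        """Call whenever a pong frame is received from the server."""
        self._last_pong = self._now()

    def ping_due(self, last_ping_time):
        """True once a full ping period has elapsed since the last ping."""
        return self._now() - last_ping_time >= HEARTBEAT_INTERVAL

    def is_stale(self):
        """True once the silence exceeds the read deadline."""
        return self._now() - self._last_pong > HEARTBEAT_TIMEOUT
```

Using a monotonic clock avoids false timeouts when the wall clock is adjusted during a long-running session.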