Commit 92525e0

{AKS}: refine init experience (#9351)

1 parent 796b871 commit 92525e0

20 files changed: +1010 −395 lines changed

src/aks-agent/HISTORY.rst

Lines changed: 6 additions & 0 deletions

@@ -11,6 +11,12 @@ To release a new version, please select a new version number (usually plus 1 to
 
 Pending
 +++++++
+
+1.0.0b8
++++++++
+* Error handling: don't raise a traceback for the init prompt and holmesgpt interaction.
+* Improve the aks agent init user experience.
+* Improve error handling for the user's holmesgpt interaction.
 * Fix stdin reading hang in CI/CD pipelines by using select with timeout for non-interactive mode.
 * Update pytest marker registration and fix datetime.utcnow() deprecation warning in tests.
 * Improve test framework with real-time stderr output visibility and subprocess timeout.

src/aks-agent/README.rst

Lines changed: 92 additions & 0 deletions

@@ -37,6 +37,98 @@ For more details about supported model providers and required
 variables, see: https://docs.litellm.ai/docs/providers
 
 
+LLM Configuration Explained
+---------------------------
+
+The AKS Agent uses YAML configuration files to define LLM connections. Each configuration contains a provider specification and the required environment variables for that provider.
+
+Configuration Structure
+^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: yaml
+
+   llms:
+     - provider: azure
+       MODEL_NAME: gpt-4.1
+       AZURE_API_KEY: *******
+       AZURE_API_BASE: https://{azure-openai-service}.openai.azure.com/
+       AZURE_API_VERSION: 2025-04-01-preview
+
+Field Explanations
+^^^^^^^^^^^^^^^^^^
+
+**provider**
+  The LiteLLM provider route that determines which LLM service to use. This follows the LiteLLM provider specification from https://docs.litellm.ai/docs/providers.
+
+  Common values:
+
+  * ``azure`` - Azure OpenAI Service
+  * ``openai`` - OpenAI API
+  * ``anthropic`` - Anthropic Claude
+  * ``gemini`` - Google Gemini
+  * ``openai_compatible`` - OpenAI-compatible APIs (e.g., local models, other services)
+
+**MODEL_NAME**
+  The specific model or deployment name to use. This varies by provider:
+
+  * For Azure OpenAI: your deployment name (e.g., ``gpt-4.1``, ``gpt-35-turbo``)
+  * For OpenAI: the model name (e.g., ``gpt-4``, ``gpt-3.5-turbo``)
+  * For other providers: check the model names in the LiteLLM documentation
+
+**Environment Variables by Provider**
+
+The remaining fields are environment variables required by each provider. They correspond to the authentication and configuration requirements of each LLM service:
+
+**Azure OpenAI (provider: azure)**
+
+* ``AZURE_API_KEY`` - Your Azure OpenAI API key
+* ``AZURE_API_BASE`` - Your Azure OpenAI endpoint URL (e.g., https://your-resource.openai.azure.com/)
+* ``AZURE_API_VERSION`` - API version (e.g., 2024-02-01, 2025-04-01-preview)
+
+**OpenAI (provider: openai)**
+
+* ``OPENAI_API_KEY`` - Your OpenAI API key (starts with sk-)
+
+**Gemini (provider: gemini)**
+
+* ``GOOGLE_API_KEY`` - Your Google Cloud API key
+* ``GOOGLE_API_ENDPOINT`` - Base URL for the Gemini API endpoint
+
+**Anthropic (provider: anthropic)**
+
+* ``ANTHROPIC_API_KEY`` - Your Anthropic API key
+
+**OpenAI Compatible (provider: openai_compatible)**
+
+* ``OPENAI_API_BASE`` - Base URL for the API endpoint
+* ``OPENAI_API_KEY`` - API key (if required by the service)
+
+Multiple Model Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can configure multiple models in a single file:
+
+.. code-block:: yaml
+
+   llms:
+     - provider: azure
+       MODEL_NAME: gpt-4
+       AZURE_API_KEY: your-azure-key
+       AZURE_API_BASE: https://your-azure-endpoint.openai.azure.com/
+       AZURE_API_VERSION: 2024-02-01
+     - provider: openai
+       MODEL_NAME: gpt-4
+       OPENAI_API_KEY: your-openai-key
+     - provider: anthropic
+       MODEL_NAME: claude-3-sonnet-20240229
+       ANTHROPIC_API_KEY: your-anthropic-key
+
+When using ``--model``, specify the provider and model as ``provider/model_name`` (e.g., ``azure/gpt-4``, ``openai/gpt-4``).
+
+Security Note
+^^^^^^^^^^^^^
+
+API keys and credentials in configuration files should be kept secure. Consider using:
+
+* Restricted file permissions (``chmod 600 config.yaml``)
+* Environment variable substitution where supported
+* Separate configuration files for different environments (dev/prod)
 Quick start and examples
 =========================
 

src/aks-agent/azext_aks_agent/__init__.py

Lines changed: 22 additions & 1 deletion

@@ -4,10 +4,20 @@
 # --------------------------------------------------------------------------------------------
 
 
-from azure.cli.core import AzCommandsLoader
+import os
 
 # pylint: disable=unused-import
 import azext_aks_agent._help
+from azext_aks_agent._consts import (
+    CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY,
+    CONST_AGENT_NAME,
+    CONST_AGENT_NAME_ENV_KEY,
+    CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY,
+    CONST_PRIVACY_NOTICE_BANNER,
+    CONST_PRIVACY_NOTICE_BANNER_ENV_KEY,
+)
+from azure.cli.core import AzCommandsLoader
+from azure.cli.core.api import get_config_dir
 
 
 class ContainerServiceCommandsLoader(AzCommandsLoader):

@@ -34,3 +44,14 @@ def load_arguments(self, command):
 
 
 COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
+
+
+# NOTE(mainred): holmesgpt leverages the environment variables to customize its behavior.
+def customize_holmesgpt():
+    os.environ[CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY] = "true"
+    os.environ[CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY] = get_config_dir()
+    os.environ[CONST_AGENT_NAME_ENV_KEY] = CONST_AGENT_NAME
+    os.environ[CONST_PRIVACY_NOTICE_BANNER_ENV_KEY] = CONST_PRIVACY_NOTICE_BANNER
+
+
+customize_holmesgpt()
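Moving customize_holmesgpt() into __init__.py and calling it at module load matters because holmesgpt reads these variables when it is first imported; setting them afterwards has no effect. The pattern, reduced to a self-contained sketch (the variable name here is invented for the example):

```python
import os


def library_startup() -> bool:
    """Mimics a library that snapshots its configuration from the
    environment at import/startup time."""
    return os.environ.get("DEMO_DISABLE_TOOLSET", "false") == "true"


# Correct order: configure the environment first, then let the library
# read it. Reversing the two lines would leave the flag unset at the
# only moment the library looks at it.
os.environ["DEMO_DISABLE_TOOLSET"] = "true"
toolset_disabled = library_startup()
print(toolset_disabled)  # True
```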

src/aks-agent/azext_aks_agent/_help.py

Lines changed: 5 additions & 5 deletions

@@ -8,7 +8,6 @@
 
 from knack.help_files import helps
 
-
 helps[
     "aks agent"
 ] = """

@@ -78,10 +77,11 @@
        Here is an example of config file:
        ```json
        llms:
-         - provider: "azure"
-           MODEL_NAME: "gpt-4.1"
-           AZURE_API_BASE: "https://<your-base-url>"
-           AZURE_API_KEY: "<your-api-key>"
+         - provider: azure
+           MODEL_NAME: gpt-4.1
+           AZURE_API_KEY: *******
+           AZURE_API_BASE: https://{azure-openai-service-name}.openai.azure.com/
+           AZURE_API_VERSION: 2025-04-01-preview
        # define a list of mcp servers, mcp server can be defined
        mcp_servers:
          aks_mcp:

src/aks-agent/azext_aks_agent/agent/agent.py

Lines changed: 68 additions & 102 deletions

@@ -3,20 +3,13 @@
 # Licensed under the MIT License. See License.txt in the project root for license information.
 # --------------------------------------------------------------------------------------------
 
-import logging
 import os
 import select
 import sys
 
-from azext_aks_agent._consts import (
-    CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY,
-    CONST_AGENT_NAME,
-    CONST_AGENT_NAME_ENV_KEY,
-    CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY,
-    CONST_PRIVACY_NOTICE_BANNER,
-    CONST_PRIVACY_NOTICE_BANNER_ENV_KEY,
-)
+from azext_aks_agent.agent.logging import init_log
 from azure.cli.core.api import get_config_dir
+from azure.cli.core.azclierror import CLIInternalError
 from azure.cli.core.commands.client_factory import get_subscription_id
 from knack.util import CLIError
 
@@ -25,34 +18,6 @@
 from .telemetry import CLITelemetryClient
 
 
-# NOTE(mainred): environment variables to disable prometheus toolset loading should be set before importing holmes.
-def customize_holmesgpt():
-    os.environ[CONST_DISABLE_PROMETHEUS_TOOLSET_ENV_KEY] = "true"
-    os.environ[CONST_AGENT_CONFIG_PATH_DIR_ENV_KEY] = get_config_dir()
-    os.environ[CONST_AGENT_NAME_ENV_KEY] = CONST_AGENT_NAME
-    os.environ[CONST_PRIVACY_NOTICE_BANNER_ENV_KEY] = CONST_PRIVACY_NOTICE_BANNER
-
-
-# NOTE(mainred): holmes leverage the log handler RichHandler to provide colorful, readable and well-formatted logs
-# making the interactive mode more user-friendly.
-# And we removed exising log handlers to avoid duplicate logs.
-# Also make the console log consistent, we remove the telemetry and data logger to skip redundant logs.
-def init_log():
-    # NOTE(mainred): we need to disable INFO logs from LiteLLM before LiteLLM library is loaded, to avoid logging the
-    # debug logs from heading of LiteLLM.
-    logging.getLogger("LiteLLM").setLevel(logging.WARNING)
-    logging.getLogger("telemetry.main").setLevel(logging.WARNING)
-    logging.getLogger("telemetry.process").setLevel(logging.WARNING)
-    logging.getLogger("telemetry.save").setLevel(logging.WARNING)
-    logging.getLogger("telemetry.client").setLevel(logging.WARNING)
-    logging.getLogger("az_command_data_logger").setLevel(logging.WARNING)
-
-    from holmes.utils.console.logging import init_logging
-
-    # TODO: make log verbose configurable, currently disabled by [].
-    return init_logging([])
-
-
 def _get_mode_state_file() -> str:
     """Get the path to the mode state file."""
     config_dir = get_config_dir()

@@ -168,8 +133,6 @@ def aks_agent(
         raise CLIError(
             "Please upgrade the python version to 3.10 or above to use aks agent."
         )
-    # customizing holmesgpt should called before importing holmes
-    customize_holmesgpt()
 
     # Initialize variables
     interactive = not no_interactive

@@ -213,85 +176,88 @@ def aks_agent(
     # MCP Lifecycle Manager
     mcp_lifecycle = MCPLifecycleManager()
 
-    try:
-        config = None
+    config = None
 
-        if use_aks_mcp:
-            try:
-                config_params = {
-                    'config_file': config_file,
-                    'model': model,
-                    'api_key': api_key,
-                    'max_steps': max_steps,
-                    'verbose': show_tool_output
-                }
-                mcp_info = mcp_lifecycle.setup_mcp_sync(config_params)
-                config = mcp_info['config']
-
-                if show_tool_output:
-                    from .user_feedback import ProgressReporter
-                    ProgressReporter.show_status_message("MCP mode active - enhanced capabilities enabled", "info")
-
-            except Exception as e:  # pylint: disable=broad-exception-caught
-                # Fallback to traditional mode on any MCP setup failure
-                from .error_handler import AgentErrorHandler
-                mcp_error = AgentErrorHandler.handle_mcp_setup_error(e, "MCP initialization")
-                if show_tool_output:
-                    console.print(f"[yellow]MCP setup failed, using traditional mode: {mcp_error.message}[/yellow]")
-                    if mcp_error.suggestions:
-                        console.print("[dim]Suggestions for next time:[/dim]")
-                        for suggestion in mcp_error.suggestions[:3]:  # Show only first 3 suggestions
-                            console.print(f"[dim]  • {suggestion}[/dim]")
-                use_aks_mcp = False
-                current_mode = "traditional"
-
-        # Fallback to traditional mode if MCP setup failed or was disabled
-        if not config:
-            config = _setup_traditional_mode_sync(config_file, model, api_key, max_steps, show_tool_output)
-            if show_tool_output:
-                console.print("[yellow]Traditional mode active (MCP disabled)[/yellow]")
+    if use_aks_mcp:
+        try:
+            config_params = {
+                'config_file': config_file,
+                'model': model,
+                'api_key': api_key,
+                'max_steps': max_steps,
+                'verbose': show_tool_output
+            }
+            mcp_info = mcp_lifecycle.setup_mcp_sync(config_params)
+            config = mcp_info['config']
 
-        # Save the current mode to state file for next run
-        _save_current_mode(current_mode)
+            if show_tool_output:
+                from .user_feedback import ProgressReporter
+                ProgressReporter.show_status_message("MCP mode active - enhanced capabilities enabled", "info")
 
-        # Use smart refresh logic
-        effective_refresh_toolsets = smart_refresh
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            # Fallback to traditional mode on any MCP setup failure
+            from .error_handler import AgentErrorHandler
+            mcp_error = AgentErrorHandler.handle_mcp_setup_error(e, "MCP initialization")
+            if show_tool_output:
+                console.print(f"[yellow]MCP setup failed, using traditional mode: {mcp_error.message}[/yellow]")
+                if mcp_error.suggestions:
+                    console.print("[dim]Suggestions for next time:[/dim]")
+                    for suggestion in mcp_error.suggestions[:3]:  # Show only first 3 suggestions
+                        console.print(f"[dim]  • {suggestion}[/dim]")
+            use_aks_mcp = False
+            current_mode = "traditional"
+
+    # Fallback to traditional mode if MCP setup failed or was disabled
+    if not config:
+        config = _setup_traditional_mode_sync(config_file, model, api_key, max_steps, show_tool_output)
         if show_tool_output:
-            from .user_feedback import ProgressReporter
-            ProgressReporter.show_status_message(
-                f"Toolset refresh: {effective_refresh_toolsets} (Mode: {current_mode})", "info"
-            )
+            console.print("[yellow]Traditional mode active (MCP disabled)[/yellow]")
 
-        # Create AI client once with proper refresh settings
+    # Save the current mode to state file for next run
+    _save_current_mode(current_mode)
+
+    # Use smart refresh logic
+    effective_refresh_toolsets = smart_refresh
+    if show_tool_output:
+        from .user_feedback import ProgressReporter
+        ProgressReporter.show_status_message(
+            f"Toolset refresh: {effective_refresh_toolsets} (Mode: {current_mode})", "info"
+        )
+
+    # Validate inputs
+    if not prompt and not interactive and not piped_data:
+        raise CLIError(
+            "Either the 'prompt' argument must be provided (unless using --interactive mode)."
+        )
+    try:
+        # prepare the toolsets
         ai = config.create_console_toolcalling_llm(
             dal=None,
             refresh_toolsets=effective_refresh_toolsets,
         )
+    except Exception as e:
+        raise CLIError(f"Failed to create AI executor: {str(e)}")
+
+    # Handle piped data
+    if piped_data:
+        if prompt:
+            # User provided both piped data and a prompt
+            prompt = f"Here's some piped output:\n\n{piped_data}\n\n{prompt}"
+        else:
+            # Only piped data, no prompt - ask what to do with it
+            prompt = f"Here's some piped output:\n\n{piped_data}\n\nWhat can you tell me about this output?"
 
-        # Validate inputs
-        if not prompt and not interactive and not piped_data:
-            raise CLIError(
-                "Either the 'prompt' argument must be provided (unless using --interactive mode)."
-            )
-
-        # Handle piped data
-        if piped_data:
-            if prompt:
-                # User provided both piped data and a prompt
-                prompt = f"Here's some piped output:\n\n{piped_data}\n\n{prompt}"
-            else:
-                # Only piped data, no prompt - ask what to do with it
-                prompt = f"Here's some piped output:\n\n{piped_data}\n\nWhat can you tell me about this output?"
-
-        # Phase 2: Holmes Execution (synchronous - no event loop conflicts)
-        is_mcp_mode = current_mode == "mcp"
+    # Phase 2: Holmes Execution (synchronous - no event loop conflicts)
+    is_mcp_mode = current_mode == "mcp"
+    try:
         if interactive:
             _run_interactive_mode_sync(ai, cmd, resource_group_name, name,
                                        prompt, console, show_tool_output, is_mcp_mode, telemetry)
         else:
             _run_noninteractive_mode_sync(ai, config, cmd, resource_group_name, name,
                                           prompt, console, echo, show_tool_output, is_mcp_mode)
-
+    except Exception as e:  # pylint: disable=broad-exception-caught
+        raise CLIInternalError(f"Error occurred during execution: {str(e)}")
     finally:
         # Phase 3: MCP Cleanup (isolated async if needed)
         mcp_lifecycle.cleanup_mcp_sync()
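The restructuring above applies one error-handling pattern throughout: run each phase in its own try, convert unexpected failures into a user-facing CLI error instead of a raw traceback, and keep cleanup in finally. Distilled to a sketch (the exception class is a stand-in; the real code raises knack's CLIError and azure-cli's CLIInternalError):

```python
class CLIInternalError(Exception):
    """Stand-in for azure.cli.core.azclierror.CLIInternalError."""


def run_with_cleanup(execute, cleanup):
    """Run one phase; surface any failure as a CLI error; always clean up."""
    try:
        return execute()
    except Exception as e:
        raise CLIInternalError(f"Error occurred during execution: {e}") from e
    finally:
        cleanup()  # runs whether the phase succeeded or failed


# Cleanup still runs when the phase blows up, and the caller sees a
# concise message rather than a raw traceback.
events = []
try:
    run_with_cleanup(lambda: 1 / 0, lambda: events.append("cleanup"))
except CLIInternalError as err:
    events.append(str(err))
print(events)
```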
