
Commit 19351fb

rinormaloku and EItanya authored
Adds Kubernetes Agent Benchmark (#426)
* Squashed commits
* update workflow to use 'ubuntu-latest' instead of 'ubuntu-24.04-4core-amd64'
* add QDRANT_API_KEY to environment variables and update Docker build arguments
* remove redundant push argument from Docker build configuration
* update Docker build arguments to include '--load' option
* Squashed commits
* increase timeout for Run Test step and extend retention days for results
* rename workflow from 'Run PE Test' to 'Run Kubernetes Agent Benchmark'
* revert change
* log timeout errors
* add step to display Kagent output after tests

---------

Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
1 parent 5c74ada commit 19351fb

41 files changed, +1076 -685 lines changed
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
#!/bin/bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

# Make sure envsubst is available
if ! command -v envsubst &> /dev/null; then
  echo "Installing gettext package for envsubst..."

  # Detect the operating system for installing the right package
  if [ "$(uname)" == "Darwin" ]; then
    # macOS
    brew install gettext
    brew link --force gettext
  elif [ -f /etc/debian_version ]; then
    # Debian/Ubuntu
    sudo apt-get update
    sudo apt-get install -y gettext
  elif [ -f /etc/redhat-release ]; then
    # RHEL/CentOS/Fedora
    sudo yum install -y gettext
  else
    echo "Unsupported OS. Please install gettext package manually."
    exit 1
  fi
fi

# Check if required environment variables are set
if [ -z "${OPENAI_API_KEY}" ] || [ -z "${QDRANT_API_KEY}" ]; then
  echo "Error: Required environment variables are not set. Please set them before running this script."
  echo "Example:"
  echo "export OPENAI_API_KEY=\"your-openai-api-key\""
  echo "export QDRANT_API_KEY=\"your-qdrant-api-key\""
  exit 1
fi

make build-all
make create-kind-cluster

make build-cli-local
sudo mv go/bin/kagent-local /usr/local/bin/kagent
make kind-load-docker-images
make helm-install

kubectl apply -f "${SCRIPT_DIR}/resources/agent.yaml"
kubectl apply -f "${SCRIPT_DIR}/resources/tool-check.yaml"
kubectl apply -f "${SCRIPT_DIR}/resources/model.yaml"

# Use environment variable substitution to create the final YAML and apply it
envsubst < "${SCRIPT_DIR}/resources/tool-docs.template.yaml" | kubectl apply -f -
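The environment check in this script prints one generic error regardless of which variable is missing. A minimal sketch of a per-variable check is shown here; `require_env` is a hypothetical helper that is not part of this commit, and it assumes bash (it relies on indirect expansion).

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): name each missing variable.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable whose
    # name is stored in $name, or an empty string if that variable is unset.
    if [ -z "${!name:-}" ]; then
      echo "Error: ${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use in a setup script (commented out so the sketch has no side effects):
# require_env OPENAI_API_KEY QDRANT_API_KEY || exit 1
```

The function returns non-zero if any listed variable is empty or unset, so it can replace the combined `[ -z ... ] || [ -z ... ]` test while still listing every missing name.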
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
#!/bin/bash
set -euo pipefail

# Define a function to log messages with timestamp
log() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S')] $1"
}

export CLUSTER_CTX=kind-kagent
# Loop through each challenge defined in the .github/data/agent-framework directory
for scenario_dir in scenario*; do
  if [ ! -d "$scenario_dir" ]; then
    continue
  fi

  npm i || pnpm i
  echo "pwd=$(pwd)"
  for challenge_path in "${scenario_dir}"/*.yaml; do
    challenge_file=$(basename "$challenge_path")
    # reset environment
    bash "./${scenario_dir}/run.sh"
    bash ./run-challenge.sh "$scenario_dir" "$challenge_file"
    kubectl --context "${CLUSTER_CTX}" delete deploy --all -n default
  done
done
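If a scenario directory contains no `.yaml` files, bash leaves the `*.yaml` pattern unexpanded, so the inner loop in this script would run once with a file name that does not exist. A minimal sketch of guarding against that with `nullglob` follows; `list_challenges` is a hypothetical helper name, not something this commit defines.

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): print the challenge file names
# found in a scenario directory, one per line, or nothing when there are none.
# The function body is a subshell, so nullglob does not leak into the caller.
list_challenges() (
  shopt -s nullglob
  for path in "$1"/*.yaml; do
    basename "$path"
  done
)

# Possible use inside the scenario loop (commented out; needs the repo layout):
# list_challenges "$scenario_dir" | while IFS= read -r challenge_file; do
#   bash ./run-challenge.sh "$scenario_dir" "$challenge_file"
# done
```

Reading the names with `while IFS= read -r` keeps file names containing spaces intact, which a plain `for x in $(...)` would split.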
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
# Kubernetes Agent Benchmark
2+
3+
4+
1. From the root of the repository, run the command below. You can make it faster by setting your architecture to `amd64` or `arm64`:
5+
6+
```bash
7+
export BUILD_ARGS="--platform linux/amd64"
8+
bash .github/data/agent-framework/0.setup.sh
9+
```
10+
11+
Validate that the `kagent` cli is setup and the cluster is running:
12+
13+
```bash
14+
kagent version
15+
kubectl get pods -A
16+
```
17+
18+
2. **Run individual challenges** by navigating to the `.github/data/agent-framework` running the following command:
19+
20+
```bash
21+
export CLUSTER_CTX=kind-kagent
22+
cd .github/data/agent-framework
23+
scenario1/run.sh
24+
npm i
25+
npm i -g mocha
26+
27+
# ../run-challenge.sh scenario1 <challenge-name>
28+
./run-challenge.sh scenario1 deployment-probe-failures.yaml
29+
```
30+
31+
or
32+
33+
2. Run all challenges at once:
34+
35+
```bash
36+
./1.run-scenarios.sh
37+
```
File renamed without changes.
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: k8s-agent
  namespace: kagent
spec:
  description: A Kubernetes Expert AI Agent specializing in cluster operations, troubleshooting,
    and maintenance.
  modelConfig: default-model-config
  systemMessage: |
    # Kubernetes AI Agent System Prompt

    You are KubeAssist, an advanced AI agent specialized in Kubernetes troubleshooting and operations. You have deep expertise in Kubernetes architecture, container orchestration, networking, storage systems, and resource management.
    Your purpose is to **autonomously diagnose and resolve** Kubernetes-related issues while following best practices and security protocols. This version is designed for autonomous operation in a benchmark environment.
    DO NOT ASK FOR CONFIRMATION OR CLARIFICATION. **You are expected to operate independently and autonomously.**
    Your actions should be based on the information available and the guidelines provided below.

    ## Core Capabilities

    - **Expert Kubernetes Knowledge**: You understand Kubernetes components, architecture, orchestration principles, and resource management.
    - **Systematic Troubleshooting**: You follow a methodical approach to problem diagnosis, analyzing logs, metrics, and cluster state.
    - **Security-First Mindset**: You prioritize security awareness including RBAC, Pod Security Policies, and secure practices.
    - **Safety-Oriented**: You follow the principle of least privilege and **have internal checks and predefined risk thresholds before executing potentially destructive operations, always prioritizing system stability.**

    ## Operational Guidelines

    ### Investigation Protocol

    1. **Start Non-Intrusively**: Begin with read-only operations (get, describe) before more invasive actions.
    2. **Progressive Escalation**: Escalate to more detailed investigation only when necessary.
    3. **Document Everything**: Maintain a clear, detailed record of all investigative steps, analyses, decisions, and actions taken for benchmark review.
    4. **Verify Before Acting**: Internally consider potential impacts before executing any changes.

    ### Problem-Solving Framework

    1. **Initial Assessment**
       * Gather basic cluster information.
       * Verify Kubernetes version and configuration.
       * Check node status and resource capacity.
       * Review recent changes or deployments.
    2. **Problem Classification**
       * Application issues (crashes, scaling problems).
       * Infrastructure problems (node failures, networking).
       * Performance concerns (resource constraints, latency).
       * Security incidents (policy violations, unauthorized access).
       * Configuration errors (misconfigurations, invalid specs).
    3. **Resource Analysis**
       * Pod status and events.
       * Container logs.
       * Resource metrics.
       * Network connectivity.
       * Storage status.
    4. **Solution Implementation**
       * **Evaluate multiple potential solutions when appropriate, selecting the optimal one based on predefined criteria (e.g., safety, effectiveness, minimal impact).**
       * Assess risks for the chosen approach.
       * **Formulate a detailed implementation plan.**
       * **Incorporate testing/verification strategies into the plan.**
       * **Define rollback procedures for any changes made.**

    ## Available Tools

    You have access to the following tools to help diagnose and solve Kubernetes issues:

    ### Cluster State Validation

    We have provided you with the tool `checkKubernetesClusterFixed` that you can use to check the state of the cluster. This tool will help you identify if the cluster is in a healthy state or if there are any issues that need to be addressed.

    ### Informational Tools

    - `GetResources`: Retrieve information about Kubernetes resources. Always prefer "wide" output unless specified otherwise. Specify the exact resource type.
    - `DescribeResource`: Get detailed information about a specific Kubernetes resource.
    - `GetEvents`: View events in the Kubernetes cluster to identify recent issues.
    - `GetPodLogs`: Retrieve logs from specific pods for troubleshooting.
    - `GetResourceYAML`: Obtain the YAML representation of a Kubernetes resource.
    - `GetAvailableAPIResources`: View supported API resources in the cluster.
    - `GetClusterConfiguration`: Retrieve the Kubernetes cluster configuration.
    - `CheckServiceConnectivity`: Verify connectivity to a service.
    - `ExecuteCommand`: Run a command inside a pod (use cautiously based on safety protocols).

    ### Documentation Tool
    - `searchDocs`: Search official Kubernetes documentation. Use parameter 'collection=kubernetes'.

    ### Modification Tools
    - `CreateResource`: Create a new resource from a local file.
    - `CreateResourceFromUrl`: Create a resource from a URL.
    - `ApplyManifest`: Apply a YAML resource file to the cluster.
    - `PatchResource`: Make partial updates to a resource.
    - `DeleteResource`: Remove a resource from the cluster (use with extreme caution, see Safety Protocols).
    - `LabelResource`: Add labels to resources.
    - `RemoveLabel`: Remove labels from resources.
    - `AnnotateResource`: Add annotations to resources.
    - `RemoveAnnotation`: Remove annotations from resources.
    - `GenerateResourceTool`: Generate YAML configurations for Istio, Gateway API, or Argo resources.

    ## Safety Protocols

    1. **Read Before Write**: Always use informational tools first before modification tools.
    2. **Prioritize Dry-Runs**: **Utilize `--dry-run` flags (or equivalent non-impact checks) whenever available before applying changes.**
    3. **Backup Current State**: Before modifications, **always capture the current state of the affected resource(s) using `GetResourceYAML`.**
    4. **Limited Scope**: Apply changes to the minimum scope necessary to fix the issue.
    5. **Verify Changes**: After any modification, **verify the results with appropriate informational tools and log the verification process and outcome.**
    6. **Strict Destructive Command Protocol**: **Execute potentially destructive commands (e.g., `DeleteResource`, certain `ExecuteCommand` uses) only if they are deemed absolutely essential after thorough analysis and risk assessment, adhering to predefined safety thresholds and rollback plans.**

    ## Autonomous Operation Response Structure

    After your autonomous operation, provide complete transparency of your decision-making process and actions. Your response should follow this comprehensive structure:

    1. **Problem Detection/Trigger**: Clearly state the issue or trigger that initiated your autonomous operation.
    2. **Initial Assessment**: Describe your understanding of the situation, including any assumptions made based on available information.
    3. **Information Gathering**: Detail all information gathering steps taken, including specific tool calls and their results. If critical information cannot be obtained, explain this limitation and how it affects your approach.
    4. **Analysis**: Provide detailed technical analysis of the situation, including your reasoning process, hypotheses considered, and conclusions reached.
    5. **Solution Selection**: Present your chosen solution and explain why it was selected over alternatives. Include risk/benefit analysis when multiple approaches were considered.
    6. **Execution Plan**: Outline your step-by-step resolution plan with specific tool calls, parameters, and expected outcomes at each stage.
    7. **Action Execution**: Report on the execution of each planned step, including results of all tool calls. For modification operations, explicitly document safety protocol compliance (backup state capture, dry-run usage, etc.).
    8. **Solution Verification**: Detail verification steps taken to confirm solution effectiveness, including specific observations and tool outputs that validate the fix.
    9. **Rollback Actions**: If rollback was necessary, explain the trigger, procedure executed, and resulting system state.
    10. **Technical Summary**: Briefly identify key Kubernetes concepts that were central to the diagnosis and resolution for technical reference.

    ## Limitations

    1. You cannot directly connect to or diagnose external systems outside of the Kubernetes cluster.
    2. You must rely on the tools provided and cannot use kubectl commands directly.
    3. You cannot access or modify files on the host system outside of the agent's environment.
    4. **The agent's actions impact target environments; all operations must prioritize safety, stability, and adherence to the principle of least privilege above all else.**
    5. You CANNOT ask for confirmation or clarification or request any other user input. You are expected to operate independently and autonomously until the issues are fixed.
  tools:
  - mcpServer:
      toolNames:
      - checkKubernetesClusterFixed
      toolServer: check-kubernetes-cluster-fixed
    type: McpServer
  - mcpServer:
      toolNames:
      - searchDocs
      toolServer: search-documentation
    type: McpServer
  - builtin:
      name: kagent.tools.k8s.CheckServiceConnectivity
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.PatchResource
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.RemoveLabel
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.LabelResource
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.CreateResourceFromUrl
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.CreateResource
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetEvents
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetAvailableAPIResources
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetClusterConfiguration
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.DescribeResource
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.DeleteResource
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetResourceYAML
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.ExecuteCommand
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.ApplyManifest
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetResources
    type: Builtin
  - builtin:
      name: kagent.tools.k8s.GetPodLogs
    type: Builtin
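The Safety Protocols in this system prompt (back up the current state, dry-run, apply with limited scope, verify) have a close analogue for a human operator working with `kubectl`. The agent itself is restricted to the tools listed above and cannot call `kubectl`, so the sketch below only illustrates the same sequence. `safe_apply` and the `KUBECTL` override are hypothetical names, and the rollout check assumes the resource is a Deployment or a similar workload.

```bash
#!/bin/bash
# Illustration only: the agent uses its own tools, not kubectl.
# KUBECTL can be overridden (for example with "echo") to preview the commands.
KUBECTL=${KUBECTL:-kubectl}

safe_apply() {
  local kind=$1 name=$2 manifest=$3 backup
  backup=$(mktemp) || return 1

  # Backup Current State: save the live object before touching it.
  "$KUBECTL" get "$kind" "$name" -o yaml > "$backup" || return 1
  # Prioritize Dry-Runs: let the API server validate without persisting.
  "$KUBECTL" apply --dry-run=server -f "$manifest" || return 1
  # Limited Scope: apply only this one manifest.
  "$KUBECTL" apply -f "$manifest" || return 1
  # Verify Changes: wait for the rollout to complete.
  "$KUBECTL" rollout status "$kind/$name" --timeout=60s || return 1

  echo "backup of previous state: $backup"
}
```

Calling `KUBECTL=echo safe_apply deployment web fix.yaml` prints the commands instead of running them, which is a cheap way to review the plan before it touches a cluster.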
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
apiVersion: kagent.dev/v1alpha1
kind: ModelConfig
metadata:
  name: default-model-config
  namespace: kagent
spec:
  apiKeySecretKey: OPENAI_API_KEY
  apiKeySecretRef: kagent-openai
  model: o4-mini-2025-04-16
  provider: OpenAI
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
apiVersion: kagent.dev/v1alpha1
kind: ToolServer
metadata:
  name: check-kubernetes-cluster-fixed
  namespace: kagent
spec:
  config:
    stdio:
      args:
      - check-kubernetes-cluster-fixed@0.0.7
      command: npx
      env:
        CONTEXT: kind-kagent
  description: Check Kubernetes Cluster Fixed
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
apiVersion: kagent.dev/v1alpha1
kind: ToolServer
metadata:
  name: search-documentation
  namespace: kagent
spec:
  config:
    stdio:
      args:
      - qdrant-search-mcp-server
      - --collections="istio,gloo-mesh-enterprise,ambient,argo-rollouts,cilium,gateway-api,github-istio,github-solo-reference-architectures,gloo-edge,gloo-mesh-core,helm,kgateway,kubernetes,mcp,otel,prometheus,gloo-gateway"
      - --name=searchDocs
      - '--description="Search documentation for the following products: Istio, Gloo
        Mesh Enterprise, Ambient, Argo Rollouts, Cilium, Gateway API, GitHub Istio
        Issues, GitHub Solo Reference Architectures, Gloo Edge, Gloo Mesh Core, Helm,
        KGateway, Kubernetes, MCP, OpenTelemetry, Prometheus, Gloo Gateway"'
      command: npx
      env:
        OPENAI_API_KEY: ${OPENAI_API_KEY}
        QDRANT_API_KEY: ${QDRANT_API_KEY}
        QDRANT_URL: https://qdrant.is.solo.io
  description: Search products for Solo.io Products
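The `${OPENAI_API_KEY}` and `${QDRANT_API_KEY}` placeholders in this template are filled in by `envsubst` in the setup script. Two details are easy to miss: `envsubst` only sees exported variables (a plain `[ -z "$VAR" ]` test also passes for unexported shell variables), and when called without an argument it replaces every `$NAME` reference it finds, using an empty string for unset ones. A minimal sketch that limits substitution to the two expected variables follows; `render_template` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): render a template file to stdout.
render_template() {
  # The single-quoted argument is envsubst's SHELL-FORMAT: only the variables
  # named in it are substituted; any other $NAME in the file is left untouched.
  envsubst '${OPENAI_API_KEY} ${QDRANT_API_KEY}' < "$1"
}

# Possible use (commented out; needs the repo files and a cluster):
# export OPENAI_API_KEY=... QDRANT_API_KEY=...
# render_template resources/tool-docs.template.yaml | kubectl apply -f -
```

Restricting the variable list is a defensive choice: it keeps any future `$`-prefixed text in the template from being silently blanked out.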