Commit 10d3ae3

Add gevals action (#2)

* feat: add initial eval tasks Signed-off-by: Calum Murray <[email protected]>
* feat: add initial gevals workflow Signed-off-by: Calum Murray <[email protected]>

1 parent ac053ff

File tree: 119 files changed (+2636, -0)


.github/workflows/gevals.yaml (263 additions, 0 deletions)

```yaml
name: Gevals MCP Evaluation

on:
  # Weekly schedule - runs every Monday at 9 AM UTC
  schedule:
    - cron: '0 9 * * 1'

  # Manual trigger via PR comments
  issue_comment:
    types: [created]

  # Allow manual workflow dispatch for testing
  workflow_dispatch:
    inputs:
      task-filter:
        description: 'Regular expression to filter tasks (optional)'
        required: false
        default: ''
      verbose:
        description: 'Enable verbose output'
        required: false
        type: boolean
        default: false

concurrency:
  # Only run once for latest commit per ref and cancel other (previous) runs.
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  GO_VERSION: 1.25

defaults:
  run:
    shell: bash

jobs:
  # Check if workflow should run based on trigger
  check-trigger:
    name: Check if evaluation should run
    runs-on: ubuntu-latest
    # Only run on PR comments in the main repository
    if: |
      github.event_name == 'schedule' ||
      github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'issue_comment' &&
       github.event.issue.pull_request &&
       contains(github.event.comment.body, '/run-gevals'))
    outputs:
      should-run: ${{ steps.check.outputs.should-run }}
      pr-number: ${{ steps.check.outputs.pr-number }}
    steps:
      - name: Check trigger conditions
        id: check
        run: |
          if [[ "${{ github.event_name }}" == "issue_comment" ]]; then
            # Check if commenter is a maintainer (has write access)
            PERMISSION=$(curl -s -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
              "https://api.github.com/repos/${{ github.repository }}/collaborators/${{ github.event.comment.user.login }}/permission" \
              | jq -r '.permission')

            if [[ "$PERMISSION" == "admin" || "$PERMISSION" == "write" ]]; then
              echo "should-run=true" >> $GITHUB_OUTPUT
              echo "pr-number=${{ github.event.issue.number }}" >> $GITHUB_OUTPUT
            else
              echo "should-run=false" >> $GITHUB_OUTPUT
              echo "User ${{ github.event.comment.user.login }} does not have permission to trigger evaluations"
            fi
          else
            echo "should-run=true" >> $GITHUB_OUTPUT
          fi

  # Setup local Kind cluster for testing
  setup-cluster:
    name: Setup Kind cluster
    needs: check-trigger
    if: needs.check-trigger.outputs.should-run == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Checkout PR branch if triggered by comment
          ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', needs.check-trigger.outputs.pr-number) || github.ref }}

      - name: Setup Kind
        run: |
          # Install Kind if not already available
          if ! command -v kind &> /dev/null; then
            curl -Lo ./kind https://kind.sigs.k8s.io/dl/latest/kind-linux-amd64
            chmod +x ./kind
            sudo mv ./kind /usr/local/bin/kind
          fi

          # Create Kind cluster
          kind create cluster --name mcp-eval-cluster --wait 5m

      - name: Verify cluster
        run: |
          kubectl cluster-info
          kubectl get nodes

      # Keep cluster info for next job
      - name: Save kubeconfig
        run: |
          mkdir -p ${{ runner.temp }}/kubeconfig
          kind get kubeconfig --name mcp-eval-cluster > ${{ runner.temp }}/kubeconfig/config

      - name: Upload kubeconfig
        uses: actions/upload-artifact@v4
        with:
          name: kubeconfig
          path: ${{ runner.temp }}/kubeconfig/config
          retention-days: 1

  # Run gevals evaluation
  run-evaluation:
    name: Run MCP Evaluation
    needs: [check-trigger, setup-cluster]
    if: needs.check-trigger.outputs.should-run == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Checkout PR branch if triggered by comment
          ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', needs.check-trigger.outputs.pr-number) || github.ref }}

      - name: Download kubeconfig
        uses: actions/download-artifact@v4
        with:
          name: kubeconfig
          path: ${{ runner.temp }}/kubeconfig

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}

      - name: Build MCP server
        run: make build

      - name: Start MCP server
        run: |
          export KUBECONFIG=${{ runner.temp }}/kubeconfig/config
          # Start MCP server in background
          ./kubernetes-mcp-server --port 8080 &
          MCP_PID=$!
          echo "MCP_PID=$MCP_PID" >> $GITHUB_ENV

          # Wait for server to be ready
          for i in {1..30}; do
            if curl -s http://localhost:8080/health > /dev/null 2>&1; then
              echo "MCP server is ready"
              break
            fi
            echo "Waiting for MCP server to start... ($i/30)"
            sleep 2
          done

      - name: Run gevals evaluation
        uses: genmcp/gevals/.github/actions/gevals-action@main
        with:
          eval-config: 'evals/openai-agent/eval.yaml'
          gevals-version: 'latest'
          task-filter: ${{ github.event.inputs.task-filter || '' }}
          output-format: 'json'
          verbose: ${{ github.event.inputs.verbose || 'false' }}
          upload-artifacts: 'true'
          artifact-name: 'gevals-results'
          fail-on-error: 'false'
          task-pass-threshold: '0.8'
          assertion-pass-threshold: '0.8'
          working-directory: '.'
        env:
          KUBECONFIG: ${{ runner.temp }}/kubeconfig/config
          # OpenAI Agent configuration
          MODEL_BASE_URL: ${{ secrets.MODEL_BASE_URL }}
          MODEL_KEY: ${{ secrets.MODEL_KEY }}
          MODEL_NAME: ${{ secrets.MODEL_NAME }}
          # LLM Judge configuration
          JUDGE_BASE_URL: ${{ secrets.JUDGE_BASE_URL }}
          JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}
          JUDGE_MODEL_NAME: ${{ secrets.JUDGE_MODEL_NAME }}

      - name: Stop MCP server
        if: always()
        run: |
          if [ -n "$MCP_PID" ]; then
            kill $MCP_PID || true
          fi

      - name: Post results comment on PR
        if: github.event_name == 'issue_comment' && always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = require('path');

            // Find the results file
            const resultsPattern = /gevals-.*-out\.json/;
            const files = fs.readdirSync('.');
            const resultsFile = files.find(f => resultsPattern.test(f));

            if (!resultsFile) {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: ${{ needs.check-trigger.outputs.pr-number }},
                body: '❌ Gevals evaluation completed but no results file was found.'
              });
              return;
            }

            // Read and parse results
            const results = JSON.parse(fs.readFileSync(resultsFile, 'utf8'));

            // Calculate summary stats
            const totalTasks = results.length;
            const passedTasks = results.filter(r => r.taskPassed && r.allAssertionsPassed).length;
            const failedTasks = totalTasks - passedTasks;

            // Build comment body
            let comment = '## 🤖 Gevals MCP Evaluation Results\n\n';
            comment += `**Summary:** ${passedTasks}/${totalTasks} tasks passed\n\n`;

            if (failedTasks > 0) {
              comment += '### ❌ Failed Tasks\n\n';
              results.filter(r => !r.taskPassed || !r.allAssertionsPassed).forEach(task => {
                comment += `- **${task.taskName}**\n`;
                comment += `  - Task Passed: ${task.taskPassed ? '✅' : '❌'}\n`;
                comment += `  - Assertions Passed: ${task.allAssertionsPassed ? '✅' : '❌'}\n`;
              });
              comment += '\n';
            }

            comment += '### ✅ Passed Tasks\n\n';
            results.filter(r => r.taskPassed && r.allAssertionsPassed).forEach(task => {
              comment += `- ${task.taskName}\n`;
            });

            comment += `\n[View full results](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})`;

            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: ${{ needs.check-trigger.outputs.pr-number }},
              body: comment
            });

  # Cleanup
  cleanup:
    name: Cleanup Kind cluster
    needs: [setup-cluster, run-evaluation]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Delete Kind cluster
        run: |
          if command -v kind &> /dev/null; then
            kind delete cluster --name mcp-eval-cluster || true
          fi
```
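The results-comment step of the workflow reads a `gevals-*-out.json` array whose entries carry `taskName`, `taskPassed`, and `allAssertionsPassed` fields. As a rough local sketch of the same summary logic (field names are taken from the workflow script; the sample data here is invented), the pass count can be reproduced with `jq`:

```shell
# Create a sample results file shaped like the one the workflow script reads.
cat > gevals-sample-out.json <<'EOF'
[
  {"taskName": "create-pod", "taskPassed": true,  "allAssertionsPassed": true},
  {"taskName": "list-pods",  "taskPassed": true,  "allAssertionsPassed": false}
]
EOF

# A task counts as passed only if both the task check and all assertions passed,
# mirroring the filter in the workflow's github-script step.
total=$(jq 'length' gevals-sample-out.json)
passed=$(jq '[.[] | select(.taskPassed and .allAssertionsPassed)] | length' gevals-sample-out.json)
echo "Summary: ${passed}/${total} tasks passed"
```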

evals/README.md (110 additions, 0 deletions)

# Kubernetes MCP Server Test Examples

This directory contains examples for testing the **same Kubernetes MCP server** using different AI agents.

## Structure

```
kube-mcp-server/
├── README.md            # This file
├── mcp-config.yaml      # Shared MCP server configuration
├── tasks/               # Shared test tasks
│   ├── create-pod.yaml
│   ├── setup.sh
│   ├── verify.sh
│   └── cleanup.sh
├── claude-code/         # Claude Code agent configuration
│   ├── agent.yaml
│   ├── eval.yaml
│   └── eval-inline.yaml
└── openai-agent/        # OpenAI-compatible agent configuration
    ├── agent.yaml
    ├── eval.yaml
    └── eval-inline.yaml
```

## What This Tests

Both examples test the **same Kubernetes MCP server** using **shared task definitions**:

- Creates an nginx pod named `web-server` in the `create-pod-test` namespace
- Verifies the pod is running
- Validates that the agent called appropriate Kubernetes tools
- Cleans up resources

The tasks and MCP configuration are shared - only the agent configuration differs.
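For reference, the pod the task asks the agent to create corresponds to roughly this manifest (a sketch inferred from the bullets above; the actual task definition lives in `tasks/create-pod.yaml`, and the container name here is a guess):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  namespace: create-pod-test
spec:
  containers:
    - name: web-server   # hypothetical container name
      image: nginx
```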
## Prerequisites

- Kubernetes cluster (kind, minikube, or any cluster)
- kubectl configured
- Kubernetes MCP server running at `http://localhost:8080/mcp`
- Built binaries: `gevals` and `agent`

## Running Examples

### Option 1: Claude Code

```bash
./gevals eval examples/kube-mcp-server/claude-code/eval.yaml
```

**Requirements:**
- Claude Code installed and in PATH

**Tool Usage:**
- Claude typically uses pod-specific tools like `pods_run`, `pods_create`

---

### Option 2: OpenAI-Compatible Agent (Built-in)

```bash
# Set your model credentials
export MODEL_BASE_URL='https://your-api-endpoint.com/v1'
export MODEL_KEY='your-api-key'
export MODEL_NAME='your-model-name'

# Run the test
./gevals eval examples/kube-mcp-server/openai-agent/eval.yaml
```

**Note:** Different AI models may choose different tools from the MCP server (`pods_*` or `resources_*`) to accomplish the same task. Both approaches work correctly.

## Assertions

Both examples use flexible assertions that accept either tool approach:

```yaml
toolPattern: "(pods_.*|resources_.*)"  # Accepts both pod-specific and generic resource tools
```

This makes the tests robust across different AI models that may prefer different tools.
## Key Difference: Agent Configuration

### Claude Code (claude-code/agent.yaml)

```yaml
commands:
  argTemplateMcpServer: "--mcp-config {{ .File }}"
  argTemplateAllowedTools: "mcp__{{ .ServerName }}__{{ .ToolName }}"
  runPrompt: |-
    claude {{ .McpServerFileArgs }} --print "{{ .Prompt }}"
```

### OpenAI Agent (openai-agent/agent.yaml)

```yaml
builtin:
  type: "openai-agent"
  model: "gpt-4"
```

Uses the built-in OpenAI agent with model configuration.

## Expected Results

Both examples should produce:

- ✅ Task passed - pod created successfully
- ✅ Assertions passed - appropriate tools were called
- ✅ Verification passed - pod exists and is running

Results are saved to `gevals-<eval-name>-out.json`.

evals/claude-code/agent.yaml (5 additions, 0 deletions)

```yaml
kind: Agent
metadata:
  name: "claude-code"
builtin:
  type: "claude-code"
```
