
Commit 029ec37

vishalg0wda and claude authored
feat(eval): add self-contained agent evaluation framework (#5)
* feat(eval): add self-contained agent eval with convenience CLI

  Add docsDir to scenarios so each scenario specifies its own docs corpus. The runner auto-builds and caches indexes (keyed by file-listing hash), eliminating manual --server-command/--server-arg wiring.

  New CLI options:
  - --suite <name>: resolves to fixtures/agent-scenarios/<name>.json
  - --prompt <text>: ad-hoc single scenario (requires --docs-dir)
  - --docs-dir <path>: default docsDir fallback for scenarios
  - --server-command is now optional when scenarios have docsDir

  Also includes the full agent eval framework: runner, assertions, metrics, observer, types, test fixtures, and tests.

* refactor(eval): simplify runner config, pre-build indexes in CLI

  - Move index building from the runner to bin.ts (pre-build before the eval starts; deduplicate builds for shared docsDirs)
  - Set indexDir directly on scenario objects instead of a parallel Map
  - Remove resolvedDocsDirs, cliBinPath, serverBinPath, and cacheDir from AgentEvalConfig; make server optional (not needed when all scenarios have indexDir)
  - Revert the runAgentScenario signature to (scenario, config, observer)
  - Runner computes SERVER_BIN_PATH from import.meta.url; no config needed
  - Extract a loadScenarios() helper from the inline CLI action
  - Observer: respect stderr.isTTY for color detection (not just NO_COLOR)
  - Fixtures: remove per-scenario maxTurns/maxBudgetUsd so CLI --flags can override without being shadowed by scenario-level defaults

* feat(eval): improve observer output, add multi-lang scenarios, conditional assertions

  - Add when_env conditional assertions that skip when env vars are missing
  - Add scenario suites for Go, Python, and acmeauth fixtures
  - Replace flaky text assertions with file-based script assertions
  - Improve observer: syntax highlighting, pretty-formatted tool results, panel truncation, indented panels, minimal tool call/result display
  - Prevent the agent from executing scripts (eval assertions handle that)
  - Plumb env vars to the agent subprocess
  - Remove the redundant JSON stdout dump from the CLI

* docs: add agent-eval documentation

  Add docs/agent-eval.md with the full spec (scenario format, assertion types, CLI reference, docs sources, result format, environment variables). Update packages/eval/README.md and the root README.md to reference both search-quality and agent eval modes.

* feat(eval): add file assertions, k:v scenarios, --include filter, mise tasks

  - Add `file_contains` and `file_matches` assertion types that check workspace files directly instead of the agent's final text response
  - Convert scenario files from arrays to k:v objects keyed by scenario ID
  - Add an `--include <ids>` CLI option for filtering scenarios by ID
  - Fix false-positive script assertions: require the program to exit 0 before grepping its output (`output=$(cmd 2>&1) && echo "$output" | grep`)
  - Treat compilation/typecheck assertions as soft failures (`|| true`)
  - Add mise convenience tasks: agent-eval, agent-eval:debug, agent-eval:prompt, agent-eval:suites
  - Move .env.example to the repo root
  - Update docs with the new assertion types, scenario format, and CLI options

* address PR #5 review: configurable tool descriptions, realistic system prompt, docs improvements

  - Add optional tool_descriptions to CorpusMetadata; wire it through the core, server, CLI (--tool-description-search, --tool-description-get-doc), and eval packages
  - Change the default search_docs description from "library/SDK" to "documentation set"
  - Discourage context>0 on the first get_doc call in tool and parameter descriptions
  - Replace .env.example with mise.local.toml.example; remove the redundant package.json agent-eval script (already covered by a mise task)
  - Use a realistic system prompt in the agent eval runner: mention tools as available rather than mandating their use, reducing activation-metric bias
  - Expand the agent-eval docs: two use cases (own SDK vs OSS project), portable JSON output, a CI integration workflow section

* chore: updater

* chore: add changeset for agent eval feature

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
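The false-positive fix described in the message (only grep a script's output if the script exited 0, i.e. `output=$(cmd 2>&1) && echo "$output" | grep`) can be sketched in TypeScript roughly as follows. This is a hypothetical illustration of the pattern, not code from the eval package; `outputContains` is an invented name.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical sketch of the hardened script-assertion pattern: a program's
// output is only matched against the expected pattern when the program
// itself exited 0, so a crashing command can no longer produce a
// false-positive match on its own error text.
function outputContains(command: string, args: string[], pattern: string): boolean {
  const result = spawnSync(command, args, { encoding: "utf8" });
  if (result.status !== 0) return false; // non-zero exit: skip matching entirely
  const output = `${result.stdout}${result.stderr}`; // combined, like `2>&1`
  return output.includes(pattern);
}
```

Without the exit-code guard, a command that crashes while printing the expected string in a stack trace or error message would still "pass"; with it, the assertion fails fast.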
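The `file_contains` / `file_matches` assertion types could look roughly like the sketch below: check a workspace file directly rather than the agent's final text response. The type shape, field names (`path`, `value`, `pattern`), and checker function are assumptions for illustration, not the package's actual schema.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shapes; the eval package's real field names may differ.
type FileAssertion =
  | { type: "file_contains"; path: string; value: string }
  | { type: "file_matches"; path: string; pattern: string };

// Evaluate an assertion against a file in the agent's workspace.
function checkFileAssertion(workspaceDir: string, a: FileAssertion): boolean {
  const fullPath = join(workspaceDir, a.path);
  if (!existsSync(fullPath)) return false; // a missing file fails the assertion
  const content = readFileSync(fullPath, "utf8");
  return a.type === "file_contains"
    ? content.includes(a.value) // literal substring check
    : new RegExp(a.pattern).test(content); // regex check
}
```

Checking artifacts on disk sidesteps the flakiness of text assertions on the agent's free-form summary, which may paraphrase or omit the detail being asserted.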
1 parent 8d1fa0d commit 029ec37

36 files changed: +3580 −31 lines

.changeset/add-agent-eval.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+---
+"@speakeasy-api/docs-mcp-eval": minor
+"@speakeasy-api/docs-mcp-server": patch
+"@speakeasy-api/docs-mcp-core": patch
+---
+
+Add agent evaluation harness for end-to-end testing of MCP tool usage
+
+Introduces a self-contained agent eval framework that uses Claude Agent SDK to run realistic coding scenarios against docs-mcp, with assertion-based scoring, file content validation, and an interactive CLI (`docs-mcp-eval`). Includes multi-language scenarios for several SDKs, build caching, history tracking, and configurable tool descriptions in the server.

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -11,3 +11,8 @@ tests/fixtures/index/
 .agents/
 .claude/
 package-lock.json
+.eval-results/
+.env
+.env.local
+!.env.example
+.cache/

README.md

Lines changed: 5 additions & 2 deletions
@@ -176,7 +176,7 @@ Structured as a Turborepo with four packages:
 | `@speakeasy-api/docs-mcp-cli` | CLI for validation, manifest bootstrap (`fix`), and deterministic indexing (`build`) |
 | `@speakeasy-api/docs-mcp-core` | Core retrieval primitives, AST parsing, chunking, and LanceDB queries |
 | `@speakeasy-api/docs-mcp-server` | Lean runtime MCP server surface |
-| `@speakeasy-api/docs-mcp-eval` | Standalone evaluation and benchmarking harness |
+| `@speakeasy-api/docs-mcp-eval` | Standalone evaluation harness — search-quality benchmarks and end-to-end agent evaluation |
 
 ```text
 +---------------------------+
@@ -378,7 +378,10 @@ Open `http://localhost:3001`. Requires a running HTTP server (step 3 with `--tra
 
 ## Evaluation
 
-Docs MCP includes a standalone evaluation harness for measuring search quality with transparent, repeatable benchmarks. See the [Evaluation Framework](docs/eval.md) for how to build an eval suite, run benchmarks across embedding providers, and interpret results.
+Docs MCP includes a standalone evaluation harness with two modes:
+
+- **Search-quality eval** (`run`) — drives the MCP server directly via stdio JSON-RPC, measuring retrieval metrics (MRR, NDCG, precision, latency). See [docs/eval.md](docs/eval.md).
+- **Agent eval** (`agent-eval`) — spawns a Claude agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack end-to-end. See [docs/agent-eval.md](docs/agent-eval.md).
 
 ## License
 