docs/agent-eval.md
48 additions & 16 deletions
@@ -1,6 +1,20 @@
 # Agent Evaluation Framework (`agent-eval`)

-The `agent-eval` subcommand of `@speakeasy-api/docs-mcp-eval` runs end-to-end agent evaluations. It spawns a Claude agent with docs-mcp tools (`search_docs`, `get_doc`), runs it against a prompt, and evaluates assertions on the output. This validates the full stack, from search quality to how well a real model uses the tools to complete a task.
+The `agent-eval` subcommand of `@speakeasy-api/docs-mcp-eval` runs end-to-end agent evaluations. It spawns an AI coding agent with docs-mcp tools (`search_docs`, `get_doc`), runs it against a prompt, and evaluates assertions on the output. This validates the full stack, from search quality to how well a real model uses the tools to complete a task.
+
+## Providers
+
+The eval supports multiple agent providers via the `--provider` flag:
+
+| Provider | Flag | Backend | Prerequisites |
+|----------|------|---------|---------------|
+| Claude | `--provider claude` | `@anthropic-ai/claude-agent-sdk` | `ANTHROPIC_API_KEY` |
+| Auto (default) | `--provider auto` | Detected from environment | Whichever API key is set |
+
+Auto-detection priority: if only `OPENAI_API_KEY` is set, Codex is used; otherwise Claude is used (its CLI handles its own auth). If both keys are set, Claude is used with a warning.
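
The auto-detection priority described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: the `detectProvider` function, the `Detection` type, the `"codex"` provider name, and the warning text are all hypothetical.

```typescript
// Illustrative sketch only; every name here is hypothetical, not the package's API.
type Provider = "claude" | "codex";

interface Detection {
  provider: Provider;
  warning?: string; // set only when both keys are present
}

// Takes the environment as a plain record so the logic is easy to test;
// a real caller would presumably pass process.env.
function detectProvider(env: Record<string, string | undefined>): Detection {
  const hasAnthropic = Boolean(env["ANTHROPIC_API_KEY"]);
  const hasOpenAI = Boolean(env["OPENAI_API_KEY"]);

  // Only the OpenAI key is set: Codex is used.
  if (hasOpenAI && !hasAnthropic) {
    return { provider: "codex" };
  }
  // Both keys are set: Claude is used, with a warning.
  if (hasOpenAI && hasAnthropic) {
    return {
      provider: "claude",
      warning: "Both ANTHROPIC_API_KEY and OPENAI_API_KEY are set; using Claude.",
    };
  }
  // Otherwise (Anthropic key only, or no key at all): Claude.
  return { provider: "claude" };
}
```

Passing the environment in as an argument, rather than reading a global, keeps the priority rules easy to exercise in isolation.
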
+
+The Codex provider spawns `codex exec --json` as a child process and injects MCP server configuration via `-c` CLI flags. It performs a pre-flight check to verify the MCP server starts correctly before running the agent.
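
As a rough sketch of what assembling such an invocation might look like, the helper below builds an argument list for `codex exec --json` with `-c` overrides. The `buildCodexArgs` function and `McpServer` type are hypothetical, and the `mcp_servers.<name>.command` / `mcp_servers.<name>.args` key names are an assumption about the Codex configuration schema rather than something the diff confirms. The pre-flight check is not covered here.

```typescript
// Hypothetical shape for the MCP server the agent should connect to.
interface McpServer {
  name: string;
  command: string;
  args: string[];
}

// Builds the argv that would follow the `codex` binary name.
// Assumption: `-c key=value` overrides accept TOML values, and JSON string/array
// literals are valid TOML for these simple cases.
function buildCodexArgs(server: McpServer, prompt: string): string[] {
  const prefix = `mcp_servers.${server.name}`; // assumed config key prefix
  return [
    "exec",
    "--json",
    "-c",
    `${prefix}.command=${JSON.stringify(server.command)}`,
    "-c",
    `${prefix}.args=${JSON.stringify(server.args)}`,
    prompt,
  ];
}
```

The resulting array could then be handed to a child-process API such as Node's `spawn("codex", args)`.
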
+| `ANTHROPIC_API_KEY` | \* | API key for the Claude provider (used by `@anthropic-ai/claude-agent-sdk`) |
+| `OPENAI_API_KEY` | \* | API key for the OpenAI Codex provider (also used for embedding-based index builds) |
+| SDK-specific keys (e.g. `DUB_API_KEY`) | no | For `script` assertions guarded by `when_env`; skipped if absent |
+| `NO_COLOR` | no | Disables ANSI color output |
+
+\* At least one provider API key is required. With `--provider auto`, the eval detects which provider to use based on which key is set.
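
The "at least one provider API key" rule above could be enforced with a fail-fast check along these lines. This is a minimal sketch, assuming the two keys in the table are the only provider keys; `presentProviderKeys` and `assertProviderKey` are hypothetical names, not functions from the package.

```typescript
// The two provider keys named in the table above.
const PROVIDER_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"] as const;

// Returns the provider keys that are set to a non-empty value.
function presentProviderKeys(env: Record<string, string | undefined>): string[] {
  return PROVIDER_KEYS.filter((key) => Boolean(env[key]));
}

// Hypothetical guard: throws before any agent is spawned if no provider key is set.
function assertProviderKey(env: Record<string, string | undefined>): void {
  if (presentProviderKeys(env).length === 0) {
    throw new Error(
      `At least one provider API key is required (${PROVIDER_KEYS.join(" or ")}).`
    );
  }
}
```
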
+
+### OpenAI Codex Prerequisites
+
+The OpenAI Codex provider requires the [`codex`](https://github.com/openai/codex) CLI to be installed and available on PATH. Install it with:
+
+```bash
+npm install -g @openai/codex
+```
+
+The Codex CLI manages its own authentication. Run `codex` once interactively to authenticate, or set `OPENAI_API_KEY` in your environment.

 When running from the monorepo via mise, copy `mise.local.toml.example` to `mise.local.toml` and fill in your API keys. The `.env` file in the eval package directory is also loaded automatically via `dotenv`.
packages/eval/README.md
16 additions & 1 deletion
@@ -30,9 +30,24 @@ See [docs/eval.md](https://github.com/speakeasy-api/docs-mcp/blob/main/docs/eval
 ### Agent Eval (`agent-eval`)

-End-to-end evaluation that spawns a Claude agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack, from search quality to how well a real model uses the tools.
+End-to-end evaluation that spawns an AI coding agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack, from search quality to how well a real model uses the tools.
+
+Supports multiple agent providers:
+
+| Provider | Flag | Backend | Prerequisites |
+|----------|------|---------|---------------|
+| Claude | `--provider claude` | `@anthropic-ai/claude-agent-sdk` | `ANTHROPIC_API_KEY` |