Commit 7eef0e8

docs: update agent-eval docs for multi-provider support
Add provider table, Codex prerequisites, updated CLI reference, and environment variable docs reflecting auto-detection behavior.
1 parent 3855320 commit 7eef0e8

2 files changed: +64 -17 lines changed

docs/agent-eval.md

Lines changed: 48 additions & 16 deletions
@@ -1,6 +1,20 @@
11
# Agent Evaluation Framework (`agent-eval`)
22

3-
The `agent-eval` subcommand of `@speakeasy-api/docs-mcp-eval` runs end-to-end agent evaluations. It spawns a Claude agent with docs-mcp tools (`search_docs`, `get_doc`), runs it against a prompt, and evaluates assertions on the output. This validates the full stack — from search quality to how well a real model uses the tools to complete a task.
3+
The `agent-eval` subcommand of `@speakeasy-api/docs-mcp-eval` runs end-to-end agent evaluations. It spawns an AI coding agent with docs-mcp tools (`search_docs`, `get_doc`), runs it against a prompt, and evaluates assertions on the output. This validates the full stack — from search quality to how well a real model uses the tools to complete a task.
4+
5+
## Providers
6+
7+
The eval supports multiple agent providers via the `--provider` flag:
8+
9+
| Provider | Flag | Backend | Prerequisites |
10+
|----------|------|---------|---------------|
11+
| Claude | `--provider claude` | `@anthropic-ai/claude-agent-sdk` | `ANTHROPIC_API_KEY` |
12+
| OpenAI Codex | `--provider openai` | `codex exec --json` (CLI spawn) | `OPENAI_API_KEY` + [`codex`](https://github.com/openai/codex) CLI on PATH |
13+
| Auto (default) | `--provider auto` | Detected from environment | Whichever API key is set |
14+
15+
Auto-detection priority: if only `OPENAI_API_KEY` is set, Codex is used. In every other case Claude is used, including when neither key is set, because its CLI handles its own auth. If both keys are set, Claude is used with a warning.
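
The priority rule above can be read as a small decision function. The following is a minimal sketch only, not the package's actual implementation: `detectProvider`, the `Detection` type, and the warning text are hypothetical names introduced for illustration.

```typescript
// Hypothetical types and names; only the priority rule comes from the docs.
type Provider = "claude" | "openai";

interface Detection {
  provider: Provider;
  warning?: string;
}

function detectProvider(env: Record<string, string | undefined>): Detection {
  const hasAnthropic = Boolean(env.ANTHROPIC_API_KEY);
  const hasOpenAI = Boolean(env.OPENAI_API_KEY);

  // Only the OpenAI key is present: use Codex.
  if (hasOpenAI && !hasAnthropic) {
    return { provider: "openai" };
  }

  // Both keys are present: Claude wins, with a warning (wording assumed).
  if (hasOpenAI && hasAnthropic) {
    return {
      provider: "claude",
      warning: "Both provider keys are set; using Claude.",
    };
  }

  // Only the Anthropic key, or no key at all: fall back to Claude.
  return { provider: "claude" };
}
```

Passing the environment in as an argument, for example `detectProvider(process.env)`, keeps the rule easy to test without touching the real environment.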
16+
17+
The Codex provider spawns `codex exec --json` as a child process and injects MCP server configuration via `-c` CLI flags. It performs a pre-flight check to verify the MCP server starts correctly before running the agent.
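
As a rough illustration of injecting MCP server configuration via `-c` flags, the sketch below builds an argument list for `codex exec --json`. It assumes the Codex CLI accepts `-c key=value` overrides and that MCP servers live under the `mcp_servers.<name>.command` and `mcp_servers.<name>.args` config keys. The `McpServerSpec` type and the `buildCodexArgs` helper are hypothetical and are not taken from the eval package.

```typescript
// Hypothetical description of the MCP server to launch (illustration only).
interface McpServerSpec {
  name: string;
  command: string;
  args: string[];
}

// Build argv for `codex exec --json`, passing the MCP server as -c overrides.
// Assumption: the config keys are mcp_servers.<name>.command and .args.
function buildCodexArgs(prompt: string, server: McpServerSpec): string[] {
  const prefix = `mcp_servers.${server.name}`;
  const overrides = [
    // JSON string and array literals are also valid TOML for simple values.
    `${prefix}.command=${JSON.stringify(server.command)}`,
    `${prefix}.args=${JSON.stringify(server.args)}`,
  ];

  const argv: string[] = ["exec", "--json"];
  for (const override of overrides) {
    argv.push("-c", override);
  }
  argv.push(prompt);
  return argv;
}
```

The resulting array could then be handed to a process spawner such as Node's `child_process.spawn("codex", argv)`. The pre-flight check described above is not modeled here.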
418

519
## Scenario Format
620

@@ -226,14 +240,15 @@ docs-mcp-eval agent-eval [options]
226240

227241
### Agent
228242

229-
| Option | Default | Description |
230-
| ------------------------- | -------------------------- | ------------------------------------- |
231-
| `--model <value>` | `claude-sonnet-4-20250514` | Claude model to use |
232-
| `--max-turns <n>` | `15` | Default max turns per scenario |
233-
| `--max-budget-usd <n>` | `0.50` | Default max budget per scenario (USD) |
234-
| `--max-concurrency <n>` | `1` | Max concurrent scenarios |
235-
| `--system-prompt <value>` || Custom system prompt for the agent |
236-
| `--workspace-dir <path>` || Base directory for agent workspaces |
243+
| Option | Default | Description |
244+
| ------------------------- | -------------------------------- | ------------------------------------------------------ |
245+
| `--provider <value>` | `auto` | Agent provider: `claude`, `openai`, or `auto` |
246+
| `--model <value>` | _(per-provider default)_ | Model to use (e.g. `claude-sonnet-4-20250514`) |
247+
| `--max-turns <n>` | `15` | Default max turns per scenario |
248+
| `--max-budget-usd <n>` | `0.50` | Default max budget per scenario (USD) |
249+
| `--max-concurrency <n>` | `1` | Max concurrent scenarios |
250+
| `--system-prompt <value>` || Custom system prompt for the agent |
251+
| `--workspace-dir <path>` || Base directory for agent workspaces |
237252

238253
### Output
239254

@@ -253,9 +268,14 @@ The eval framework works in two main contexts:
253268
The only thing you need in a consumer repo is a scenario JSON file. Invoke the eval via npx:
254269

255270
```bash
271+
# With Claude (picked by auto-detection unless only OPENAI_API_KEY is set)
272+
npx @speakeasy-api/docs-mcp-eval agent-eval \
273+
--scenarios ./agent-scenarios.json
274+
275+
# With OpenAI Codex
256276
npx @speakeasy-api/docs-mcp-eval agent-eval \
257277
--scenarios ./agent-scenarios.json \
258-
--model claude-sonnet-4-20250514
278+
--provider openai
259279
```
260280

261281
### Pointing at an OSS project
@@ -389,11 +409,23 @@ When previous results exist in `.eval-results/`, the CLI automatically compares
389409

390410
## Environment Variables
391411

392-
| Variable | Required | Description |
393-
| -------------------------------------- | -------- | ----------------------------------------------------------------------- |
394-
| `ANTHROPIC_API_KEY` | **yes** | API key for the Claude agent (used by `@anthropic-ai/claude-agent-sdk`) |
395-
| `OPENAI_API_KEY` | no | For embedding-based index builds (when using OpenAI embeddings) |
396-
| SDK-specific keys (e.g. `DUB_API_KEY`) | no | For `script` assertions guarded by `when_env` — skipped if absent |
397-
| `NO_COLOR` | no | Disables ANSI color output |
412+
| Variable | Required | Description |
413+
| -------------------------------------- | -------- | ---------------------------------------------------------------------------------------------------- |
414+
| `ANTHROPIC_API_KEY` | \* | API key for the Claude provider (used by `@anthropic-ai/claude-agent-sdk`) |
415+
| `OPENAI_API_KEY` | \* | API key for the OpenAI Codex provider (also used for embedding-based index builds) |
416+
| SDK-specific keys (e.g. `DUB_API_KEY`) | no | For `script` assertions guarded by `when_env` — skipped if absent |
417+
| `NO_COLOR` | no | Disables ANSI color output |
418+
419+
\* At least one provider API key is required. With `--provider auto`, the eval detects which provider to use based on which key is set.
420+
421+
### OpenAI Codex Prerequisites
422+
423+
The OpenAI Codex provider requires the [`codex`](https://github.com/openai/codex) CLI to be installed and available on PATH. Install it with:
424+
425+
```bash
426+
npm install -g @openai/codex
427+
```
428+
429+
The Codex CLI manages its own authentication. Run `codex` once interactively to authenticate, or set `OPENAI_API_KEY` in your environment.
398430

399431
When running from the monorepo via mise, copy `mise.local.toml.example` to `mise.local.toml` and fill in your API keys. The `.env` file in the eval package directory is also loaded automatically via `dotenv`.

packages/eval/README.md

Lines changed: 16 additions & 1 deletion
@@ -30,9 +30,24 @@ See [docs/eval.md](https://github.com/speakeasy-api/docs-mcp/blob/main/docs/eval
3030

3131
### Agent Eval (`agent-eval`)
3232

33-
End-to-end evaluation that spawns a Claude agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack — from search quality to how well a real model uses the tools.
33+
End-to-end evaluation that spawns an AI coding agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack — from search quality to how well a real model uses the tools.
34+
35+
Supports multiple agent providers:
36+
37+
| Provider | Flag | Backend | Prerequisites |
38+
|----------|------|---------|---------------|
39+
| Claude | `--provider claude` | `@anthropic-ai/claude-agent-sdk` | `ANTHROPIC_API_KEY` |
40+
| OpenAI Codex | `--provider openai` | `codex exec --json` (CLI spawn) | `OPENAI_API_KEY` + [`codex`](https://github.com/openai/codex) on PATH |
41+
| Auto (default) | `--provider auto` | Detected from env | Whichever key is set |
3442

3543
```bash
44+
# Claude (default when ANTHROPIC_API_KEY is set)
45+
docs-mcp-eval agent-eval --suite acmeauth
46+
47+
# OpenAI Codex
48+
docs-mcp-eval agent-eval --suite dub-go --provider openai
49+
50+
# Custom scenario file
3651
docs-mcp-eval agent-eval --scenarios ./my-scenarios.json --docs-dir ./my-docs
3752
```
3853
