Commit 029ec37
feat(eval): add self-contained agent evaluation framework (#5)
* feat(eval): add self-contained agent eval with convenience CLI
Add docsDir to scenarios so each scenario specifies its own docs corpus.
The runner auto-builds and caches indexes (keyed by file listing hash),
eliminating manual --server-command/--server-arg wiring.
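A minimal sketch of how an index cache could be keyed by a hash of the file listing, as described above. The helper name `listingCacheKey`, the choice of SHA-256, and the 16-character truncation are assumptions for illustration; the commit message does not state how the runner computes the key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a cache key from a docs directory's file listing.
// The paths are sorted first so the key does not depend on traversal order.
function listingCacheKey(relativePaths: string[]): string {
  const hash = createHash("sha256");
  for (const path of [...relativePaths].sort()) {
    hash.update(path);
    hash.update("\0"); // separator, so ["ab", "c"] and ["a", "bc"] hash differently
  }
  // Truncated digest, short enough to use as a cache directory name (assumption).
  return hash.digest("hex").slice(0, 16);
}
```

A key built only from paths would not change when a file's contents are edited. Whether the real listing also covers sizes or modification times is not stated in the commit message.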
New CLI options:
- --suite <name>: resolves to fixtures/agent-scenarios/<name>.json
- --prompt <text>: ad-hoc single scenario (requires --docs-dir)
- --docs-dir <path>: default docsDir fallback for scenarios
- --server-command is now optional when scenarios have docsDir
Also includes the full agent eval framework: runner, assertions,
metrics, observer, types, test fixtures, and tests.
* refactor(eval): simplify runner config, pre-build indexes in CLI
- Move index building from runner to bin.ts (pre-build before eval
starts, deduplicate builds for shared docsDirs)
- Set indexDir directly on scenario objects instead of parallel Map
- Remove resolvedDocsDirs, cliBinPath, serverBinPath, cacheDir from
AgentEvalConfig; make server optional (not needed when all scenarios
have indexDir)
- Revert runAgentScenario signature to (scenario, config, observer)
- Runner computes SERVER_BIN_PATH from import.meta.url, no config needed
- Extract loadScenarios() helper from inline CLI action
- Observer: respect stderr.isTTY for color detection (not just NO_COLOR)
- Fixtures: remove per-scenario maxTurns/maxBudgetUsd so CLI --flags
can override without being shadowed by scenario-level defaults
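The pre-build step in the first bullet can be sketched as follows, assuming a synchronous `buildIndex` callback for brevity. `EvalScenario`, `prebuildIndexes`, and `buildIndex` are hypothetical names; only the `docsDir` and `indexDir` fields come from the commit message.

```typescript
// Hypothetical scenario shape; only docsDir and indexDir are named in the commit message.
interface EvalScenario {
  id: string;
  docsDir?: string;
  indexDir?: string;
}

// Build each distinct docsDir once, then write the resulting index path
// directly onto every scenario that uses it (no parallel Map kept around).
function prebuildIndexes(
  scenarios: EvalScenario[],
  buildIndex: (docsDir: string) => string,
): void {
  const built = new Map<string, string>(); // docsDir -> indexDir, local to this call
  for (const scenario of scenarios) {
    // Skip scenarios with no docs corpus or with an index already assigned.
    if (!scenario.docsDir || scenario.indexDir) continue;
    let indexDir = built.get(scenario.docsDir);
    if (indexDir === undefined) {
      indexDir = buildIndex(scenario.docsDir);
      built.set(scenario.docsDir, indexDir);
    }
    scenario.indexDir = indexDir;
  }
}
```

Because the map lives only inside the function, the runner sees nothing but scenario objects that already carry their `indexDir`.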
* feat(eval): improve observer output, add multi-lang scenarios, conditional assertions
- Add when_env conditional assertions that skip when env vars are missing
- Add scenario suites for Go, Python, and acmeauth fixtures
- Replace flaky text assertions with file-based script assertions
- Improve observer: syntax highlighting, pretty-formatted tool results,
panel truncation, indented panels, minimal tool call/result display
- Prevent agent from executing scripts (eval assertions handle that)
- Plumb env vars to agent subprocess
- Remove redundant JSON stdout dump from CLI
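As a rough illustration of the `when_env` behaviour in the first bullet, the sketch below skips an assertion when any required variable is missing. It assumes `when_env` is a list of variable names and that an empty value counts as missing; the actual field shape is not given in the commit message, and `ConditionalAssertion` and `gateAssertion` are hypothetical names.

```typescript
// Hypothetical assertion shape; when_env is assumed to be a list of env var names.
interface ConditionalAssertion {
  type: string;
  when_env?: string[];
}

type GateResult = { action: "run" } | { action: "skip"; missing: string[] };

// Decide whether an assertion should run, given the current environment.
// An unset or empty variable is treated as missing (assumption).
function gateAssertion(
  assertion: ConditionalAssertion,
  env: Record<string, string | undefined>,
): GateResult {
  const missing = (assertion.when_env ?? []).filter((name) => !env[name]);
  return missing.length > 0 ? { action: "skip", missing } : { action: "run" };
}
```

Returning the missing names lets a reporter explain why an assertion was skipped rather than silently dropping it.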
* docs: add agent-eval documentation
Add docs/agent-eval.md with full spec (scenario format, assertion types,
CLI reference, docs sources, result format, environment variables).
Update packages/eval/README.md and root README.md to reference both
search-quality and agent eval modes.
* feat(eval): add file assertions, k:v scenarios, --include filter, mise tasks
- Add `file_contains` and `file_matches` assertion types that check
workspace files directly instead of the agent's final text response
- Convert scenario files from arrays to k:v objects keyed by scenario ID
- Add `--include <ids>` CLI option for filtering scenarios by ID
- Fix false-positive script assertions: require program to exit 0 before
grepping output (`output=$(cmd 2>&1) && echo "$output" | grep`)
- Treat compilation/typecheck assertions as soft failures (|| true)
- Add mise convenience tasks: agent-eval, agent-eval:debug,
agent-eval:prompt, agent-eval:suites
- Move .env.example to repo root
- Update docs with new assertion types, scenario format, and CLI options
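The two file assertion types in the first bullet might be evaluated along these lines. This is a sketch under stated assumptions: the field names `path`, `value`, and `pattern` are hypothetical, and file access is injected through a `read` callback so the example stays self-contained rather than touching a real workspace.

```typescript
// Hypothetical assertion shapes; only the type names come from the commit message.
type FileAssertion =
  | { type: "file_contains"; path: string; value: string }
  | { type: "file_matches"; path: string; pattern: string };

// Check an assertion against a workspace file. `read` returns the file's
// text, or undefined when the file does not exist.
function checkFileAssertion(
  assertion: FileAssertion,
  read: (path: string) => string | undefined,
): boolean {
  const content = read(assertion.path);
  if (content === undefined) return false; // a missing file fails the assertion
  switch (assertion.type) {
    case "file_contains":
      return content.includes(assertion.value); // plain substring check
    case "file_matches":
      return new RegExp(assertion.pattern).test(content); // regular expression check
  }
}
```

Checking the file the agent wrote, rather than its final text response, avoids depending on how the agent happens to phrase its summary.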
* address PR #5 review: configurable tool descriptions, realistic system prompt, docs improvements
- Add optional tool_descriptions to CorpusMetadata; wire through core, server,
CLI (--tool-description-search, --tool-description-get-doc), and eval packages
- Change default search_docs description from "library/SDK" to "documentation set"
- Discourage context>0 on first get_doc call in tool and parameter descriptions
- Replace .env.example with mise.local.toml.example; remove redundant package.json
agent-eval script (already covered by mise task)
- Use realistic system prompt in agent eval runner — mention tools as available
rather than mandating use, reducing activation metric bias
- Expand agent-eval docs: two use cases (own SDK vs OSS project), portable JSON
output, CI integration workflow section
* chore: updater
* chore: add changeset for agent eval feature
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8d1fa0d commit 029ec37
File tree
36 files changed, +3580 -31 lines changed
- .changeset
- docs
- packages
- cli/src
- core/src
- eval
- fixtures/agent-scenarios
- src
- agent
- test/agent
- server/src