agentic-evals is a target-driven evaluation repo for testing one skill at a time.
Current release version: 0.1.0 (from the root VERSION file).
This repo defines:
- the evaluator contract in AGENT.md
- the target under test in targets/<target_id>/target.yaml
- the active suites and cases under targets/<target_id>/
- the run artifacts written to runs/<run_id>/
The repo contract is runtime-neutral: evaluators may execute it in Codex or OpenClaw as long as they honor the artifact contract and record the chosen runtime in manifest.json.evidence_mode.
The repo supports 2 evaluator-facing run modes:
- single-run: evaluate one local target-skill workspace
- ab-urls: evaluate 2 isolated target-skill variants prepared from GitHub URLs, then compare them
ab-urls is additive.
It does not change the meaning of targets, suites, cases, assertions, or per-case statuses.
agentic-evals/
├── AGENT.md
├── README.md
├── docs/
│   └── session-evidence.md
├── targets/
│   └── <target_id>/
│       ├── target.yaml
│       └── cases/
│           └── <suite_id>/
│               ├── suite.yaml
│               └── <case_id>.yaml
└── runs/
- README.md: human-facing repo guide for understanding and editing the test set.
- AGENT.md: canonical evaluator-facing repo contract.
- docs/session-evidence.md: canonical contract for dual-mode session evidence and child-session location.
- skill-eval/SKILL.md: operational instructions for the skill-eval evaluator skill, not the source of truth for repo assertions or statuses.
- VERSION: canonical project release version in plain SemVer format.
This repo uses a single root VERSION file as the source of truth for release versioning.
Release bump workflow:
- Update VERSION with the next SemVer value.
- Reference that version in the release branch name, PR title, and release notes.
- Tag the merge commit in main with the same version (for example v0.1.0).
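
A bump can be sanity-checked before tagging. The sketch below is not part of the repo: it assumes "plain SemVer" means a bare MAJOR.MINOR.PATCH string with no prefix, pre-release tag, or build metadata, and the helper names `plain_semver?` and `release_tag` are hypothetical.

```ruby
# Assumption: "plain SemVer" excludes a leading "v", pre-release tags,
# and build metadata, and numeric parts carry no leading zeros.
PLAIN_SEMVER = /\A(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)\z/

# Hypothetical helper: true when the text (surrounding whitespace ignored)
# is a bare MAJOR.MINOR.PATCH version.
def plain_semver?(text)
  PLAIN_SEMVER.match?(text.strip)
end

# Hypothetical helper: derive the tag name in the style of the example (v0.1.0).
def release_tag(version)
  "v#{version.strip}"
end

version = "0.1.0\n" # stand-in for File.read("VERSION")
puts release_tag(version) if plain_semver?(version)
```

In practice the `version` string would come from reading the root VERSION file rather than from a literal.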
- Read AGENT.md if you are changing evaluator behavior or validating repo contract rules.
- Read targets/<target_id>/target.yaml, suite files, and case files when editing the test set.
- Inspect runs/ when reviewing actual execution output.
Examples in this README use voice-ai-integration as the target id.
Recommended first-version URL shape:
https://github.com/<org>/<repo>/tree/<ref>/<subdir>
This is preferred because it encodes both the git ref and the exact skill root.
Supported URL families for variant acquisition:
- https://github.com/<org>/<repo>/tree/<ref>/<subdir>
- https://github.com/<org>/<repo>/archive/refs/heads/<branch>.tar.gz
- https://github.com/<org>/<repo>/archive/refs/tags/<tag>.tar.gz
- https://github.com/<org>/<repo>/archive/<commit>.tar.gz
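
To illustrate why the tree URL shape is convenient, here is a minimal sketch of splitting it into its parts. It is not the evaluator's actual acquisition code; `parse_tree_url` is a hypothetical helper, and it assumes the ref is a single path segment.

```ruby
require "uri"

# Hypothetical parser for the recommended URL shape:
#   https://github.com/<org>/<repo>/tree/<ref>/<subdir>
# Assumption: <ref> is one path segment. A branch name containing "/"
# cannot be separated from the subdir by string parsing alone, so it is
# not handled here.
def parse_tree_url(url)
  uri = URI.parse(url)
  return nil unless uri.scheme == "https" && uri.host == "github.com"

  org, repo, kind, ref, *subdir = uri.path.split("/").reject(&:empty?)
  return nil unless kind == "tree" && ref

  { org: org, repo: repo, ref: ref, subdir: subdir.join("/") }
end

p parse_tree_url("https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration")
```

The archive URL families would need their own handling, since they identify a ref but not a skill root.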
Example A/B prompts:
Use skill-eval to test target_id=voice-ai-integration in ab mode
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
suite_ids=workflow,routing
Use skill-eval to run an A/B test
target_id=voice-ai-integration
A=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
B=https://github.com/org/repo/tree/feature-x/.agents/skills/voice-ai-integration
case_id=auth-prefers-rtc-token
An ab-urls run writes a top-level comparison wrapper plus 2 variant-local single-run directories:
agentic-evals/runs/<ab_run_id>/
├── manifest.json
├── variants/
│   ├── A/
│   │   ├── source-manifest.json
│   │   └── run/
│   │       ├── manifest.json
│   │       ├── case-artifacts/
│   │       ├── transcript.md
│   │       ├── case-results/
│   │       └── report.md
│   └── B/
│       ├── source-manifest.json
│       └── run/
│           ├── manifest.json
│           ├── case-artifacts/
│           ├── transcript.md
│           ├── case-results/
│           └── report.md
├── comparison.json
└── report.md
Meaning:
- variants/<label>/run/ is still a normal single-version run shape
- source-manifest.json records how each URL was parsed and resolved
- comparison.json and the top-level report.md summarize case-by-case differences
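
When reviewing an ab-urls run by hand, a quick structural check can confirm the directory matches the layout above. This is a sketch under assumptions: `missing_ab_paths` is a hypothetical helper, and it treats every path in the tree as mandatory, which AGENT.md may or may not require.

```ruby
# Paths taken from the layout above, relative to runs/<ab_run_id>/.
VARIANT_LABELS = %w[A B].freeze
VARIANT_FILES = %w[source-manifest.json run/manifest.json run/transcript.md run/report.md].freeze
VARIANT_DIRS = %w[run/case-artifacts run/case-results].freeze
TOP_LEVEL_FILES = %w[manifest.json comparison.json report.md].freeze

# Hypothetical helper: list every expected path that is absent under run_dir.
# Assumption: all paths shown in the layout are required for a complete run.
def missing_ab_paths(run_dir)
  expected_files = TOP_LEVEL_FILES.dup
  expected_dirs = []
  VARIANT_LABELS.each do |label|
    expected_files.concat(VARIANT_FILES.map { |f| "variants/#{label}/#{f}" })
    expected_dirs.concat(VARIANT_DIRS.map { |d| "variants/#{label}/#{d}" })
  end
  missing = expected_files.reject { |f| File.file?(File.join(run_dir, f)) }
  missing + expected_dirs.reject { |d| File.directory?(File.join(run_dir, d)) }
end

# Print missing paths when a run directory is given on the command line.
puts missing_ab_paths(ARGV[0]) unless ARGV.empty?
```

Saved under any file name, the script takes the run directory as its only argument and prints nothing when the layout is complete.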
Most manual edits in this repo land in one of these places:
- targets/<target_id>/target.yaml when the default suites, entry skill, or allowed statuses change
- targets/<target_id>/cases/<suite_id>/suite.yaml when grouping active cases
- targets/<target_id>/cases/<suite_id>/<case_id>.yaml when adding or refining active behavior checks
The evaluator execution protocol lives in AGENT.md, not here.
Add active cases under targets/<target_id>/cases/<suite_id>/ alongside the suite.yaml that references them.
Keep cases focused and behavior-first:
- Lock down one specific rule per case.
- Keep input.user_prompt realistic and setup minimal.
- Write assertions as natural-language rubric entries.
- Use assert.summary for the case-level behavior being protected.
- Use evidence_scope to point the evaluator to artifact filenames such as accepted-session.json, final-answer.txt, or both.
- Write trace-facing assertions against accepted session evidence semantics such as consultation of a file, an observed command invocation, ordering, and the final answer.
- Make sure any setup can be reproduced inside an isolated case workspace.
Each case should include:
- case_id
- title
- input
- setup
- optional assert.summary
- assert.required
- assert.forbidden
- notes
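
The field list above can be checked mechanically when adding a case. The following is a rough sketch, not an official validator: `missing_case_fields` is a hypothetical helper, and it only checks that keys are present, not that their values are well formed.

```ruby
require "yaml"

# Field names come from the list above; assert.summary is optional, so it is not checked.
REQUIRED_TOP_LEVEL = %w[case_id title input setup assert notes].freeze
REQUIRED_ASSERT = %w[required forbidden].freeze

# Hypothetical helper: return the dotted names of required fields
# that are missing from a parsed case file.
def missing_case_fields(case_data)
  missing = REQUIRED_TOP_LEVEL.reject { |key| case_data.key?(key) }
  assert_block = case_data["assert"]
  if assert_block.is_a?(Hash)
    missing += REQUIRED_ASSERT.reject { |key| assert_block.key?(key) }.map { |key| "assert.#{key}" }
  end
  missing
end

# A deliberately incomplete case, used only to show the output.
sample = YAML.safe_load(<<~CASE)
  case_id: "example-case"
  title: "Short behavior-oriented title"
  input:
    user_prompt: "Realistic user request"
  assert:
    required: []
CASE

p missing_case_fields(sample) # prints ["setup", "notes", "assert.forbidden"]
```

For a real case, the parsed data would come from loading targets/<target_id>/cases/<suite_id>/<case_id>.yaml instead of the inline sample.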
Minimal case skeleton:
case_id: "example-case"
title: "Short behavior-oriented title"
input:
  user_prompt: "Realistic user request"
  locale: "en-US"
setup:
  docs_index_present: true
assert:
  summary: "The assistant should consult the skill and then follow the supported path."
  required:
    - description: "The accepted session evidence should show that the top-level skill instructions were consulted before answering."
      pass_criteria:
        - "The accepted session evidence shows consultation of `.agents/skills/<target_id>/SKILL.md` before the answer."
      fail_signals:
        - "The accepted session evidence never shows consultation of `.agents/skills/<target_id>/SKILL.md`"
      evidence_scope:
        - "accepted-session.json"
  forbidden: []
notes:
  why_it_matters: "Why this behavior is worth protecting."
  likely_fix_files:
    - ".agents/skills/<target_id>/SKILL.md"

- This command validates all case, suite, and target YAML files for the selected target by loading them with Ruby. If every file parses successfully, it prints yaml-ok.

TARGET_ID=voice-ai-integration ruby -e 'require "yaml"; Dir["targets/#{ENV.fetch("TARGET_ID")}/cases/*/suite.yaml"].sort.each { |f| YAML.load_file(f) }; Dir["targets/#{ENV.fetch("TARGET_ID")}/cases/**/*.yaml"].sort.reject { |f| f.end_with?("suite.yaml") }.each { |f| YAML.load_file(f) }; YAML.load_file("targets/#{ENV.fetch("TARGET_ID")}/target.yaml"); puts "yaml-ok"'

- This command checks suite-to-case coverage for the selected target. It reports how many case files exist, how many suite references were found, and whether any cases are missing or referenced more than once.

TARGET_ID=voice-ai-integration ruby -e 'require "yaml"; target_id = ENV.fetch("TARGET_ID"); refs = Dir["targets/#{target_id}/cases/*/suite.yaml"].sort.flat_map { |f| YAML.load_file(f)["cases"] }; counts = Hash.new(0); refs.each { |r| counts[r] += 1 }; cases = Dir["targets/#{target_id}/cases/**/*.yaml"].sort.reject { |f| f.end_with?("suite.yaml") }; missing = cases.reject { |c| counts.key?(c) }; dupes = counts.select { |_, v| v > 1 }; puts "cases=#{cases.size} suite_refs=#{refs.size} dupes=#{dupes.size} missing=#{missing.size}"'

- Keep suites thematic and readable.
- Prefer concrete session evidence over vague review criteria.
- Audit for YAML parse errors, duplicate suite references, and unreferenced cases whenever the test set changes.
- Refresh affected suites and cases when the target workflow changes.
- Remove or rewrite stale cases that no longer reflect the supported path.
- skill-eval/SKILL.md
- AGENT.md
- docs/session-evidence.md
- targets/<target_id>/target.yaml
- targets/<target_id>/cases/<suite_id>/suite.yaml
- targets/<target_id>/cases/<suite_id>/<case_id>.yaml
- runs/