agentic-evals

agentic-evals is a target-driven evaluation repo for testing one skill at a time.

Current release version: 0.1.0 (from the root VERSION file).

This repo defines:

the evaluator contract in AGENT.md
the target under test in targets/<target_id>/target.yaml
the active suites and cases under targets/<target_id>/
the run artifacts written to runs/<run_id>/

The repo contract is runtime-neutral: evaluators may execute it in Codex or OpenClaw as long as they honor the artifact contract and record the chosen runtime in manifest.json.evidence_mode.

Run Modes

The repo supports 2 evaluator-facing run modes:

single-run: evaluate one local target-skill workspace
ab-urls: evaluate 2 isolated target-skill variants prepared from GitHub URLs, then compare them

ab-urls is additive. It does not change the meaning of targets, suites, cases, assertions, or per-case statuses.

Layout

agentic-evals/
├── AGENT.md
├── README.md
├── docs/
│   └── session-evidence.md
├── targets/
│   └── <target_id>/
│       ├── target.yaml
│       └── cases/
│           └── <suite_id>/
│               ├── suite.yaml
│               └── <case_id>.yaml
└── runs/

Document Responsibilities

README.md: human-facing repo guide for understanding and editing the test set.
AGENT.md: canonical evaluator-facing repo contract.
docs/session-evidence.md: canonical contract for dual-mode session evidence and child-session location.
skill-eval/SKILL.md: operational instructions for the skill-eval evaluator skill, not the source of truth for repo assertions or statuses.
VERSION: canonical project release version in plain SemVer format.

Versioning

This repo uses a single root VERSION file as the source of truth for release versioning.

Release bump workflow:

Update VERSION with the next SemVer value.
Reference that version in release branch name, PR title, and release notes.
Tag the merge commit on main with the same version, prefixed with v (for example v0.1.0).
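
A minimal sketch of the tag derivation in the workflow above, assuming "plain SemVer" means MAJOR.MINOR.PATCH with no pre-release or build suffix. The release_tag helper and the SEMVER_CORE constant are hypothetical names used for illustration; they are not part of this repo.

```ruby
# Plain SemVer core: MAJOR.MINOR.PATCH with no leading zeros.
# Assumption: pre-release and build metadata are rejected on purpose.
SEMVER_CORE = /\A(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)\z/

# Returns the git tag name for the contents of a VERSION file,
# or raises if the contents are not plain SemVer.
def release_tag(version_text)
  version = version_text.strip
  unless SEMVER_CORE.match?(version)
    raise ArgumentError, "not plain SemVer: #{version.inspect}"
  end
  "v#{version}"
end

puts release_tag("0.1.0\n") # prints v0.1.0
```

In a checkout, the same helper could be fed from the root file with release_tag(File.read("VERSION")).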

How To Read This Repo

Read AGENT.md if you are changing evaluator behavior or validating repo contract rules.
Read targets/<target_id>/target.yaml, suite files, and case files when editing the test set.
Inspect runs/ when reviewing actual execution output.

Examples in this README use voice-ai-integration as the target id.

A/B Usage

Recommended first-version URL shape:

https://github.com/<org>/<repo>/tree/<ref>/<subdir>

This is preferred because it encodes both the git ref and the exact skill root.

Supported URL families for variant acquisition:

https://github.com/<org>/<repo>/tree/<ref>/<subdir>
https://github.com/<org>/<repo>/archive/refs/heads/<branch>.tar.gz
https://github.com/<org>/<repo>/archive/refs/tags/<tag>.tar.gz
https://github.com/<org>/<repo>/archive/<commit>.tar.gz
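
To illustrate why the tree URL shape encodes both the ref and the skill root, here is a rough parsing sketch. It assumes <ref> is a single path segment; a branch name containing "/" is ambiguous in this URL shape and would have to be resolved against the repository. parse_tree_url and TreeUrl are hypothetical names, not the evaluator's actual implementation.

```ruby
require "uri"

# Hypothetical value object for the four parts of a tree URL.
TreeUrl = Struct.new(:org, :repo, :ref, :subdir)

# Splits https://github.com/<org>/<repo>/tree/<ref>/<subdir> into its parts.
# Assumption: <ref> contains no "/" characters.
def parse_tree_url(url)
  uri = URI.parse(url)
  raise ArgumentError, "expected a github.com URL" unless uri.host == "github.com"

  org, repo, marker, ref, *rest = uri.path.split("/").reject(&:empty?)
  unless marker == "tree" && ref && !rest.empty?
    raise ArgumentError, "expected /<org>/<repo>/tree/<ref>/<subdir>"
  end

  TreeUrl.new(org, repo, ref, rest.join("/"))
end

parsed = parse_tree_url(
  "https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration"
)
puts parsed.ref    # prints main
puts parsed.subdir # prints .agents/skills/voice-ai-integration
```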

Example A/B prompts:

Use skill-eval to test target_id=voice-ai-integration in ab mode
variant_a_url=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
variant_b_url=https://github.com/org/repo/tree/rewrite/.agents/skills/voice-ai-integration
suite_ids=workflow,routing

Use skill-eval to run an A/B test
target_id=voice-ai-integration
A=https://github.com/org/repo/tree/main/.agents/skills/voice-ai-integration
B=https://github.com/org/repo/tree/feature-x/.agents/skills/voice-ai-integration
case_id=auth-prefers-rtc-token

A/B Artifact Layout

An ab-urls run writes a top-level comparison wrapper plus 2 variant-local single-run directories:

agentic-evals/runs/<ab_run_id>/
├── manifest.json
├── variants/
│   ├── A/
│   │   ├── source-manifest.json
│   │   └── run/
│   │       ├── manifest.json
│   │       ├── case-artifacts/
│   │       ├── transcript.md
│   │       ├── case-results/
│   │       └── report.md
│   └── B/
│       ├── source-manifest.json
│       └── run/
│           ├── manifest.json
│           ├── case-artifacts/
│           ├── transcript.md
│           ├── case-results/
│           └── report.md
├── comparison.json
└── report.md

Meaning:

variants/<label>/run/ keeps the normal single-run directory shape
source-manifest.json records how each URL was parsed and resolved
comparison.json and the top-level report.md summarize case-by-case differences
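
As a quick sanity check on the layout above, the following sketch lists which expected entries are absent from an ab-urls run directory. It assumes every entry shown in the tree is always written; missing_ab_artifacts and the constants are hypothetical names, and AGENT.md remains the authority on which artifacts are mandatory.

```ruby
require "tmpdir"
require "fileutils"

# Relative paths taken from the layout tree above.
AB_RUN_FILES  = %w[manifest.json comparison.json report.md].freeze
VARIANT_FILES = %w[source-manifest.json run/manifest.json run/transcript.md run/report.md].freeze
VARIANT_DIRS  = %w[run/case-artifacts run/case-results].freeze

# Returns the expected paths (relative to run_dir) that do not exist.
def missing_ab_artifacts(run_dir)
  expected = AB_RUN_FILES.dup
  %w[A B].each do |label|
    (VARIANT_FILES + VARIANT_DIRS).each do |rel|
      expected << File.join("variants", label, rel)
    end
  end
  expected.reject { |rel| File.exist?(File.join(run_dir, rel)) }
end

# Demo on a throwaway directory holding only one of the 15 expected entries.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "variants", "A", "run", "case-results"))
  puts missing_ab_artifacts(dir).size # prints 14
end
```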

What Usually Changes

Most manual edits in this repo land in one of these places:

targets/<target_id>/target.yaml when the default suites, entry skill, or allowed statuses change
targets/<target_id>/cases/<suite_id>/suite.yaml when grouping active cases
targets/<target_id>/cases/<suite_id>/<case_id>.yaml when adding or refining active behavior checks

The evaluator execution protocol lives in AGENT.md, not here.

Case Authoring

Add active cases under targets/<target_id>/cases/<suite_id>/ alongside the suite.yaml that references them.

Keep cases focused and behavior-first:

Lock down one specific rule per case.
Keep input.user_prompt realistic and setup minimal.
Write assertions as natural-language rubric entries.
Use assert.summary for the case-level behavior being protected.
Use evidence_scope to point the evaluator to artifact filenames such as accepted-session.json, final-answer.txt, or both.
Write trace-facing assertions against accepted session evidence semantics such as consultation of a file, an observed command invocation, ordering, and the final answer.
Make sure any setup can be reproduced inside an isolated case workspace.

Each case should include:

case_id
title
input
setup
optional assert.summary
assert.required
assert.forbidden
notes

Minimal case skeleton:

case_id: "example-case"
title: "Short behavior-oriented title"

input:
  user_prompt: "Realistic user request"
  locale: "en-US"

setup:
  docs_index_present: true

assert:
  summary: "The assistant should consult the skill and then follow the supported path."
  required:
    - description: "The accepted session evidence should show that the top-level skill instructions were consulted before answering."
      pass_criteria:
        - "The accepted session evidence shows consultation of `.agents/skills/<target_id>/SKILL.md` before the answer."
      fail_signals:
        - "The accepted session evidence never shows consultation of `.agents/skills/<target_id>/SKILL.md`"
      evidence_scope:
        - "accepted-session.json"
  forbidden: []

notes:
  why_it_matters: "Why this behavior is worth protecting."
  likely_fix_files:
    - ".agents/skills/<target_id>/SKILL.md"
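
The field list above can be checked mechanically. This is a small sketch, not the evaluator's own validation: missing_case_fields and both constants are hypothetical names, and the check only looks at key presence, not at the shape of each value.

```ruby
require "yaml"

# Top-level keys from the "Each case should include" list.
REQUIRED_CASE_KEYS = %w[case_id title input setup assert notes].freeze
# assert.summary is optional, so only these two nested keys are required.
REQUIRED_ASSERT_KEYS = %w[required forbidden].freeze

# Returns the names of required fields that are absent from a parsed case.
def missing_case_fields(case_doc)
  missing = REQUIRED_CASE_KEYS.reject { |key| case_doc.key?(key) }
  assert_doc = case_doc["assert"]
  if assert_doc.is_a?(Hash)
    missing += REQUIRED_ASSERT_KEYS
               .reject { |key| assert_doc.key?(key) }
               .map { |key| "assert.#{key}" }
  end
  missing
end

# A deliberately incomplete case: no setup, no notes, no assert.forbidden.
doc = YAML.safe_load(<<~CASE)
  case_id: "example-case"
  title: "Short behavior-oriented title"
  input:
    user_prompt: "Realistic user request"
  assert:
    required: []
CASE

puts missing_case_fields(doc).inspect # prints ["setup", "notes", "assert.forbidden"]
```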

Validation

This command validates all case, suite, and target YAML files for the selected target by loading them with Ruby. If every file parses successfully, it prints yaml-ok.

TARGET_ID=voice-ai-integration ruby -e 'require "yaml"; Dir["targets/#{ENV.fetch("TARGET_ID")}/cases/*/suite.yaml"].sort.each { |f| YAML.load_file(f) }; Dir["targets/#{ENV.fetch("TARGET_ID")}/cases/**/*.yaml"].sort.reject { |f| f.end_with?("suite.yaml") }.each { |f| YAML.load_file(f) }; YAML.load_file("targets/#{ENV.fetch("TARGET_ID")}/target.yaml"); puts "yaml-ok"'

This command checks suite-to-case coverage for the selected target. It reports how many case files exist, how many suite references were found, and whether any cases are missing or referenced more than once.

TARGET_ID=voice-ai-integration ruby -e 'require "yaml"; target_id = ENV.fetch("TARGET_ID"); refs = Dir["targets/#{target_id}/cases/*/suite.yaml"].sort.flat_map { |f| YAML.load_file(f)["cases"] }; counts = Hash.new(0); refs.each { |r| counts[r] += 1 }; cases = Dir["targets/#{target_id}/cases/**/*.yaml"].sort.reject { |f| f.end_with?("suite.yaml") }; missing = cases.reject { |c| counts.key?(c) }; dupes = counts.select { |_, v| v > 1 }; puts "cases=#{cases.size} suite_refs=#{refs.size} dupes=#{dupes.size} missing=#{missing.size}"'
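
The one-liner above compares suite references directly against case file paths, so its missing count is only meaningful when suite.yaml lists cases in the same form as those paths. If a suite lists bare case ids instead, both sides need normalizing first. The sketch below shows one way to do that, assuming a case is identified by its file name without the .yaml extension; case_key and coverage_report are hypothetical helpers, orphan.yaml is a made-up file name, and Array#tally needs Ruby 2.7 or newer.

```ruby
# Normalizes a suite reference or a case file path to a bare case id.
# Assumption: the id is the file name without the .yaml extension.
def case_key(ref_or_path)
  File.basename(ref_or_path.to_s, ".yaml")
end

# Returns ids referenced more than once and case files no suite references.
def coverage_report(suite_refs, case_paths)
  counts = suite_refs.map { |ref| case_key(ref) }.tally
  {
    dupes: counts.select { |_, n| n > 1 }.keys,
    missing: case_paths.map { |path| case_key(path) }.reject { |id| counts.key?(id) }
  }
end

report = coverage_report(
  ["auth-prefers-rtc-token", "example-case", "example-case"],
  [
    "targets/voice-ai-integration/cases/workflow/auth-prefers-rtc-token.yaml",
    "targets/voice-ai-integration/cases/workflow/orphan.yaml"
  ]
)
puts "dupes=#{report[:dupes].join(',')} missing=#{report[:missing].join(',')}"
# prints dupes=example-case missing=orphan
```

Keying on the base name treats two cases with the same file name in different suites as one case, so this only fits a test set where case ids are unique per target.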

Maintenance

Keep suites thematic and readable.
Prefer concrete session evidence over vague review criteria.
Audit for YAML parse errors, duplicate suite references, and unreferenced cases whenever the test set changes.
Refresh affected suites and cases when the target workflow changes.
Remove or rewrite stale cases that no longer reflect the supported path.

Key Paths

skill-eval/SKILL.md
AGENT.md
docs/session-evidence.md
targets/<target_id>/target.yaml
targets/<target_id>/cases/<suite_id>/suite.yaml
targets/<target_id>/cases/<suite_id>/<case_id>.yaml
runs/
