feat: multi-agent eval support (Codex + multi-trial + progressive disclosure) #60

Open — Railly wants to merge 11 commits into `main` from
`railly/aie-672-explore-multi-agent-eval-support-next-evals-oss-pattern`

Conversation

Railly (Contributor) commented on Mar 24, 2026:

Summary

  • Codex agent harness: bun agent:codex spawns codex exec --json --full-auto, parses JSONL events, supports skills via AGENTS.md
  • Multi-trial support: --runs N flag runs each eval N times, computes pass@k / pass^k metrics
  • Filesystem graders: fileExists(), fileContains(), packageHasDep() for checking actual files agents create
  • Failure classification: model vs infra vs timeout — filters noise from leaderboard
  • Progressive disclosure for skills: CLAUDE.md now contains only a catalog (~450 bytes) instead of full skill dump (21KB). Agent reads SKILL.md on-demand per agentskills.io spec.
  • Cross-agent leaderboard export: bun export:leaderboard produces comparison JSON
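
The JSONL event handling described above might look roughly like the sketch below. The event names (`agent_message`, `command_execution`, `file_edit`) follow this PR's commit messages; the payload shape and function name are illustrative assumptions, not the PR's actual code.

```typescript
// Hypothetical sketch of parsing `codex exec --json` output line by line.
interface CodexEvent {
  type: string
  [key: string]: unknown
}

function parseJsonlEvents(stdout: string): CodexEvent[] {
  const events: CodexEvent[] = []
  for (const line of stdout.split('\n')) {
    const trimmed = line.trim()
    if (!trimmed) continue
    try {
      // Each non-empty line is expected to be one JSON event.
      events.push(JSON.parse(trimmed) as CodexEvent)
    } catch {
      // Skip non-JSON lines (progress noise, partial writes).
    }
  }
  return events
}
```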

Test results

add-auth eval (4 variants: Next.js, React, Android, iOS)

| Agent | Skills | Next.js | React | Android | iOS | Average |
|---|---|---|---|---|---|---|
| Claude Code (Opus 4.6) | None | 8/11 | 8/8 | 7/7 | 6/6 | 93% |
| Claude Code (Opus 4.6) | Progressive | 10/11 | 8/8 | 6/7 | 6/6 | 94% |
| Codex (GPT-5.4) | None | 11/11 | 8/8 | 5/7 | 6/6 | 93% |
| Codex (GPT-5.4) | Progressive | 11/11 | 8/8 | 6/7 | 6/6 | 96% |

Skills impact: +1% (Claude Code), +3% (Codex) with progressive disclosure.
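
Given the catalog numbers above (~450 bytes vs a 21KB full dump), a catalog-only CLAUDE.md could be assembled along these lines. All names here (`SkillMeta`, `buildSkillCatalog`) are illustrative assumptions, not the PR's actual implementation.

```typescript
interface SkillMeta {
  name: string
  description: string
  dir: string // e.g. '.skills/add-auth'
}

// Emit a small catalog instead of inlining every SKILL.md body.
// The agent is pointed at `${dir}/SKILL.md` to read on demand when
// the task matches the description (progressive disclosure).
function buildSkillCatalog(skills: SkillMeta[]): string {
  const lines = skills.map(
    (s) => `- ${s.name}: ${s.description} (read ${s.dir}/SKILL.md when relevant)`,
  )
  return ['# Skills', ...lines].join('\n')
}
```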

Linear

Closes AIE-672

Test plan

  • bun agent:claude --eval add-auth --debug → 93%
  • bun agent:claude --skills --eval add-auth --debug → 94%
  • bun agent:codex --eval add-auth --debug → 93%
  • bun agent:codex --skills --eval add-auth --debug → 96%
  • --runs flag shows trial labels and pass@k summary
  • Skills load as catalog-only in CLAUDE.md (446 bytes vs 21KB)
  • Biome lint clean

Railly added 11 commits March 24, 2026 16:39
- Add 'codex' to AgentType union and AGENTS record
- Create codex.ts runner using `codex exec --json --full-auto`
- Parse Codex JSONL events (agent_message, command_execution, file_edit)
- Init git repo in workDir (Codex requirement)
- Copy CLAUDE.md skills content to AGENTS.md for Codex discovery
- Add `agent:codex` npm script
- Add AGENT_CONTEXT_FILES mapping (claude-code → CLAUDE.md, codex → AGENTS.md, etc.)
- Add setupAgentContext() to copy skills to agent-specific context files
- Replace deprecated symlinkSkills import with createSkillsClaudeMd
- fileExists(), fileContains(), packageHasDep(), commandSucceeds()
- Factory pattern: grader(workDir) returns standard Grader function
- bindFilesystemGraders() to bind factories to a specific workDir
- passAtK(): unbiased estimator for capability (at least 1 success in k)
- passToTheK(): reliability metric (all k succeed)
- summarizeTrials(): aggregate trial results with both metrics
- Classify failures as model (real), infra (crash/network), or timeout
- Pattern matching for known infra errors (ECONNREFUSED, API key, etc.)
- isLeaderboardRelevant() to filter noise from results
- Migration adds trial (INTEGER) and failure_type (TEXT) to errors
- Update saveError() to accept trial number and failure classification
- New --runs N CLI flag (default: 1) for multi-trial execution
- Each eval runs N times with per-trial logging and scoring
- Failure classification on errors (model/infra/timeout)
- pass@k summary logged after trials complete
- Trial-aware debug artifact naming
- leaderboard.ts: reads DB results, groups by agent, exports JSON
- skills-impact.ts: compares base vs skills-enhanced scores (delta, newlyPassed)
- New export:leaderboard script in package.json
- Replace full skill content dump (21KB) with catalog-only CLAUDE.md (446 bytes)
- Copy skill dirs to .skills/ for on-demand reading by agent
- Only copy essential files (SKILL.md, scripts/, references/, assets/)
- Agent reads SKILL.md when task matches description (progressive disclosure)
- Remove deprecated symlinkSkills wrapper
- Add evals/add-auth to skill mapping
- Result: 94% score (up from 88% with full dump)
- Document agent:codex, --runs flag, export:leaderboard commands
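
The `passAtK()` / `passToTheK()` metrics named in the commits above could be sketched as follows. `passAtK` uses the standard unbiased estimator `1 - C(n-c, k) / C(n, k)` for n trials with c successes; `passToTheK` is the naive reliability estimate `(c/n)^k`. The signatures are assumptions, not the PR's actual API.

```typescript
// pass@k: probability that at least 1 of k sampled trials succeeds,
// estimated without bias from n trials containing c successes.
function passAtK(n: number, c: number, k: number): number {
  // Fewer than k failures total: every k-subset contains a success.
  if (n - c < k) return 1
  // C(n-c, k) / C(n, k) as a stable running product.
  let ratio = 1
  for (let i = 0; i < k; i++) {
    ratio *= (n - c - i) / (n - i)
  }
  return 1 - ratio
}

// pass^k: probability that all k sampled trials succeed.
function passToTheK(n: number, c: number, k: number): number {
  return (c / n) ** k
}
```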
Railly force-pushed the railly/aie-672-explore-multi-agent-eval-support-next-evals-oss-pattern branch from 5d886c7 to 6f74d5e on March 24, 2026 21:39
```typescript
} catch {
  return ''
}

function parseFrontmatter(content: string): { name: string; description: string } | null {
```
A Member commented:
💭 Reuse graymatter

Comment on lines +61 to +72
```typescript
export function packageHasDep(dep: string) {
  return (workDir: string) =>
    async (_input: string): Promise<boolean> => {
      try {
        const raw = await readFile(path.join(workDir, 'package.json'), 'utf8')
        const pkg = JSON.parse(raw)
        return dep in (pkg.dependencies ?? {}) || dep in (pkg.devDependencies ?? {})
      } catch {
        return false
      }
    }
}
```
A comment on packageHasDep

  1. It seems valuable to split deps/devDeps into two explicit graders, since that's an important distinction to grade LLMs/agents on.
  2. Relatedly: we could instead have a generic JSON-path-checker util (something like lodash/get).
  3. Thinking forward, I can see us eventually introducing additional "dep checkers" for Swift/iOS and Kotlin/Android.
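
A generic JSON-path grader along the lines of suggestion 2 might look like the sketch below. The factory shape mirrors `packageHasDep`; the names `jsonFileHasPath` and `getPath` are illustrative assumptions.

```typescript
import { readFile } from 'node:fs/promises'
import path from 'node:path'

// lodash/get-style lookup: walk a dot-path through nested objects.
function getPath(obj: unknown, keys: string[]): unknown {
  return keys.reduce<unknown>(
    (cur, key) =>
      cur != null && typeof cur === 'object'
        ? (cur as Record<string, unknown>)[key]
        : undefined,
    obj,
  )
}

// Grader factory: checks that a JSON file in workDir has a value at dotPath,
// e.g. jsonFileHasPath('package.json', 'devDependencies.typescript').
export function jsonFileHasPath(file: string, dotPath: string) {
  return (workDir: string) =>
    async (_input: string): Promise<boolean> => {
      try {
        const raw = await readFile(path.join(workDir, file), 'utf8')
        return getPath(JSON.parse(raw), dotPath.split('.')) !== undefined
      } catch {
        return false
      }
    }
}
```

This would also cover deps vs devDeps as two explicit call sites rather than two grader implementations.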

@thiskevinwang (Member) left a comment:
LGTM

  • Left a few comments.
  • Can we also get the README.md/AGENTS.md updated to mention installing the additional claude and codex dependencies for agent-evals?
