feat: multi-agent eval support (Codex + multi-trial + progressive disclosure) #60

Open — Railly wants to merge 11 commits into `main` from
`railly/aie-672-explore-multi-agent-eval-support-next-evals-oss-pattern`

Conversation

Railly (Contributor) commented on Mar 24, 2026:

Summary

  • Codex agent harness: bun agent:codex spawns codex exec --json --full-auto, parses JSONL events, supports skills via AGENTS.md
  • Multi-trial support: --runs N flag runs each eval N times, computes pass@k / pass^k metrics
  • Filesystem graders: fileExists(), fileContains(), packageHasDep() for checking actual files agents create
  • Failure classification: model vs infra vs timeout — filters noise from leaderboard
  • Progressive disclosure for skills: CLAUDE.md now contains only a catalog (~450 bytes) instead of full skill dump (21KB). Agent reads SKILL.md on-demand per agentskills.io spec.
  • Cross-agent leaderboard export: bun export:leaderboard produces comparison JSON
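
The JSONL event handling described above might look roughly like the sketch below. The event names (`agent_message`, `command_execution`, `file_edit`) follow this PR's commit messages; the payload shape and function name are illustrative assumptions, not the PR's actual code.

```typescript
// Hypothetical sketch of parsing `codex exec --json` output line by line.
interface CodexEvent {
  type: string
  [key: string]: unknown
}

function parseJsonlEvents(stdout: string): CodexEvent[] {
  const events: CodexEvent[] = []
  for (const line of stdout.split('\n')) {
    const trimmed = line.trim()
    if (!trimmed) continue
    try {
      // Each non-empty line is expected to be one JSON event.
      events.push(JSON.parse(trimmed) as CodexEvent)
    } catch {
      // Skip non-JSON lines (progress noise, partial writes).
    }
  }
  return events
}
```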

Test results

add-auth eval (4 variants: Next.js, React, Android, iOS)

| Agent | Skills | Next.js | React | Android | iOS | Average |
|---|---|---|---|---|---|---|
| Claude Code (Opus 4.6) | None | 8/11 | 8/8 | 7/7 | 6/6 | 93% |
| Claude Code (Opus 4.6) | Progressive | 10/11 | 8/8 | 6/7 | 6/6 | 94% |
| Codex (GPT-5.4) | None | 11/11 | 8/8 | 5/7 | 6/6 | 93% |
| Codex (GPT-5.4) | Progressive | 11/11 | 8/8 | 6/7 | 6/6 | 96% |

Skills impact: +1% (Claude Code), +3% (Codex) with progressive disclosure.
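
Given the catalog numbers above (~450 bytes vs a 21KB full dump), a catalog-only CLAUDE.md could be assembled along these lines. All names here (`SkillMeta`, `buildSkillCatalog`) are illustrative assumptions, not the PR's actual implementation.

```typescript
interface SkillMeta {
  name: string
  description: string
  dir: string // e.g. '.skills/add-auth'
}

// Emit a small catalog instead of inlining every SKILL.md body.
// The agent is pointed at `${dir}/SKILL.md` to read on demand when
// the task matches the description (progressive disclosure).
function buildSkillCatalog(skills: SkillMeta[]): string {
  const lines = skills.map(
    (s) => `- ${s.name}: ${s.description} (read ${s.dir}/SKILL.md when relevant)`,
  )
  return ['# Skills', ...lines].join('\n')
}
```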

Linear

Closes AIE-672

Test plan

  • bun agent:claude --eval add-auth --debug → 93%
  • bun agent:claude --skills --eval add-auth --debug → 94%
  • bun agent:codex --eval add-auth --debug → 93%
  • bun agent:codex --skills --eval add-auth --debug → 96%
  • --runs flag shows trial labels and pass@k summary
  • Skills load as catalog-only in CLAUDE.md (446 bytes vs 21KB)
  • Biome lint clean

Railly added 11 commits March 24, 2026 16:39
- Add 'codex' to AgentType union and AGENTS record
- Create codex.ts runner using `codex exec --json --full-auto`
- Parse Codex JSONL events (agent_message, command_execution, file_edit)
- Init git repo in workDir (Codex requirement)
- Copy CLAUDE.md skills content to AGENTS.md for Codex discovery
- Add `agent:codex` npm script
- Add AGENT_CONTEXT_FILES mapping (claude-code → CLAUDE.md, codex → AGENTS.md, etc.)
- Add setupAgentContext() to copy skills to agent-specific context files
- Replace deprecated symlinkSkills import with createSkillsClaudeMd
- fileExists(), fileContains(), packageHasDep(), commandSucceeds()
- Factory pattern: grader(workDir) returns standard Grader function
- bindFilesystemGraders() to bind factories to a specific workDir
- passAtK(): unbiased estimator for capability (at least 1 success in k)
- passToTheK(): reliability metric (all k succeed)
- summarizeTrials(): aggregate trial results with both metrics
- Classify failures as model (real), infra (crash/network), or timeout
- Pattern matching for known infra errors (ECONNREFUSED, API key, etc.)
- isLeaderboardRelevant() to filter noise from results
- Migration adds trial (INTEGER) and failure_type (TEXT) to errors
- Update saveError() to accept trial number and failure classification
- New --runs N CLI flag (default: 1) for multi-trial execution
- Each eval runs N times with per-trial logging and scoring
- Failure classification on errors (model/infra/timeout)
- pass@k summary logged after trials complete
- Trial-aware debug artifact naming
- leaderboard.ts: reads DB results, groups by agent, exports JSON
- skills-impact.ts: compares base vs skills-enhanced scores (delta, newlyPassed)
- New export:leaderboard script in package.json
- Replace full skill content dump (21KB) with catalog-only CLAUDE.md (446 bytes)
- Copy skill dirs to .skills/ for on-demand reading by agent
- Only copy essential files (SKILL.md, scripts/, references/, assets/)
- Agent reads SKILL.md when task matches description (progressive disclosure)
- Remove deprecated symlinkSkills wrapper
- Add evals/add-auth to skill mapping
- Result: 94% score (up from 88% with full dump)
- Document agent:codex, --runs flag, export:leaderboard commands
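
The `passAtK()` / `passToTheK()` metrics named in the commits above could be sketched as follows. `passAtK` uses the standard unbiased estimator `1 - C(n-c, k) / C(n, k)` for n trials with c successes; `passToTheK` is the naive reliability estimate `(c/n)^k`. The signatures are assumptions, not the PR's actual API.

```typescript
// pass@k: probability that at least 1 of k sampled trials succeeds,
// estimated without bias from n trials containing c successes.
function passAtK(n: number, c: number, k: number): number {
  // Fewer than k failures total: every k-subset contains a success.
  if (n - c < k) return 1
  // C(n-c, k) / C(n, k) as a stable running product.
  let ratio = 1
  for (let i = 0; i < k; i++) {
    ratio *= (n - c - i) / (n - i)
  }
  return 1 - ratio
}

// pass^k: probability that all k sampled trials succeed.
function passToTheK(n: number, c: number, k: number): number {
  return (c / n) ** k
}
```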
Railly force-pushed the railly/aie-672-explore-multi-agent-eval-support-next-evals-oss-pattern branch from 5d886c7 to 6f74d5e on March 24, 2026 21:39
```typescript
} catch {
  return ''
}

function parseFrontmatter(content: string): { name: string; description: string } | null {
```
A Member commented:
💭 Reuse graymatter

Comment on lines +61 to +72
```typescript
export function packageHasDep(dep: string) {
  return (workDir: string) =>
    async (_input: string): Promise<boolean> => {
      try {
        const raw = await readFile(path.join(workDir, 'package.json'), 'utf8')
        const pkg = JSON.parse(raw)
        return dep in (pkg.dependencies ?? {}) || dep in (pkg.devDependencies ?? {})
      } catch {
        return false
      }
    }
}
```
A comment on packageHasDep

  1. It seems valuable to split deps/devDeps into two explicit graders, since that's an important distinction to grade LLMs/agents on.
  2. Relatedly: we could instead have a generic JSON-path-checker util (something like lodash/get).
  3. Thinking forward, I can see us eventually introducing additional "dep checkers" for Swift/iOS and Kotlin/Android.
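
A generic JSON-path grader along the lines of suggestion 2 might look like the sketch below. The factory shape mirrors `packageHasDep`; the names `jsonFileHasPath` and `getPath` are illustrative assumptions.

```typescript
import { readFile } from 'node:fs/promises'
import path from 'node:path'

// lodash/get-style lookup: walk a dot-path through nested objects.
function getPath(obj: unknown, keys: string[]): unknown {
  return keys.reduce<unknown>(
    (cur, key) =>
      cur != null && typeof cur === 'object'
        ? (cur as Record<string, unknown>)[key]
        : undefined,
    obj,
  )
}

// Grader factory: checks that a JSON file in workDir has a value at dotPath,
// e.g. jsonFileHasPath('package.json', 'devDependencies.typescript').
export function jsonFileHasPath(file: string, dotPath: string) {
  return (workDir: string) =>
    async (_input: string): Promise<boolean> => {
      try {
        const raw = await readFile(path.join(workDir, file), 'utf8')
        return getPath(JSON.parse(raw), dotPath.split('.')) !== undefined
      } catch {
        return false
      }
    }
}
```

This would also cover deps vs devDeps as two explicit call sites rather than two grader implementations.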

@thiskevinwang (Member) left a comment:
LGTM

  • Left a few comments.
  • Can we also get the README.md/AGENTS.md updated to mention installing the additional claude and codex dependencies for agent-evals?
