- Add 'codex' to AgentType union and AGENTS record
- Create codex.ts runner using `codex exec --json --full-auto`
- Parse Codex JSONL events (agent_message, command_execution, file_edit)
- Init git repo in workDir (Codex requirement)
- Copy CLAUDE.md skills content to AGENTS.md for Codex discovery
- Add `agent:codex` npm script

- Add AGENT_CONTEXT_FILES mapping (claude-code → CLAUDE.md, codex → AGENTS.md, etc.)
- Add setupAgentContext() to copy skills to agent-specific context files
- Replace deprecated symlinkSkills import with createSkillsClaudeMd

- fileExists(), fileContains(), packageHasDep(), commandSucceeds()
- Factory pattern: grader(workDir) returns standard Grader function
- bindFilesystemGraders() to bind factories to a specific workDir

- passAtK(): unbiased estimator for capability (at least 1 success in k)
- passToTheK(): reliability metric (all k succeed)
- summarizeTrials(): aggregate trial results with both metrics

- Classify failures as model (real), infra (crash/network), or timeout
- Pattern matching for known infra errors (ECONNREFUSED, API key, etc.)
- isLeaderboardRelevant() to filter noise from results

- Migration adds trial (INTEGER) and failure_type (TEXT) to errors
- Update saveError() to accept trial number and failure classification

- New --runs N CLI flag (default: 1) for multi-trial execution
- Each eval runs N times with per-trial logging and scoring
- Failure classification on errors (model/infra/timeout)
- pass@k summary logged after trials complete
- Trial-aware debug artifact naming

- leaderboard.ts: reads DB results, groups by agent, exports JSON
- skills-impact.ts: compares base vs skills-enhanced scores (delta, newlyPassed)
- New export:leaderboard script in package.json

- Replace full skill content dump (21KB) with catalog-only CLAUDE.md (446 bytes)
- Copy skill dirs to .skills/ for on-demand reading by agent
- Only copy essential files (SKILL.md, scripts/, references/, assets/)
- Agent reads SKILL.md when task matches description (progressive disclosure)
- Remove deprecated symlinkSkills wrapper
- Add evals/add-auth to skill mapping
- Result: 94% score (up from 88% with full dump)

- Document agent:codex, --runs flag, export:leaderboard commands
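
For readers skimming the commit notes, the failure classification step might look roughly like the sketch below. Only the three categories, the `isLeaderboardRelevant()` name, and the ECONNREFUSED / API key examples come from the notes above; `classifyFailure`, the regex lists, and the choice to count only model failures as leaderboard-relevant are assumptions made for illustration.

```typescript
// Hypothetical sketch; not the PR's actual implementation.
type FailureType = 'model' | 'infra' | 'timeout'

// Illustrative patterns. ECONNREFUSED and "API key" are named in the commit
// notes; the other entries are assumed examples of infra noise.
const INFRA_PATTERNS: RegExp[] = [/ECONNREFUSED/, /ECONNRESET/, /ENOTFOUND/, /api key/i]

// Matches "timeout", "timed out", and ETIMEDOUT (case-insensitive).
const TIMEOUT_PATTERNS: RegExp[] = [/timed? ?out/i]

function classifyFailure(message: string): FailureType {
  // Check timeouts first so a timed-out network call is not reported as infra.
  if (TIMEOUT_PATTERNS.some((p) => p.test(message))) return 'timeout'
  if (INFRA_PATTERNS.some((p) => p.test(message))) return 'infra'
  // Anything unrecognized is treated as a real model failure.
  return 'model'
}

// Assumption: only model failures should influence leaderboard scores.
function isLeaderboardRelevant(type: FailureType): boolean {
  return type === 'model'
}
```

Defaulting to `model` keeps unknown errors visible in results, so a new kind of infra noise shows up as a score drop until a pattern is added for it.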
5d886c7 to 6f74d5e
```ts
  } catch {
    return ''
  }
function parseFrontmatter(content: string): { name: string; description: string } | null {
```
Comment on lines +61 to +72
```ts
export function packageHasDep(dep: string) {
  return (workDir: string) =>
    async (_input: string): Promise<boolean> => {
      try {
        const raw = await readFile(path.join(workDir, 'package.json'), 'utf8')
        const pkg = JSON.parse(raw)
        return dep in (pkg.dependencies ?? {}) || dep in (pkg.devDependencies ?? {})
      } catch {
        return false
      }
    }
}
```
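
The hunk above returns a factory that still needs a `workDir`. One way the `bindFilesystemGraders()` step from the commit notes could apply such factories to a single directory is sketched here. The `Grader` and `GraderFactory` types and the helper's signature are inferred from `packageHasDep` and are assumptions, not the PR's code.

```typescript
// Assumed shapes, inferred from the packageHasDep factory in the diff.
type Grader = (input: string) => Promise<boolean>
type GraderFactory = (workDir: string) => Grader

// Hypothetical binding helper: calls every factory with the same workDir
// and returns ready-to-use graders under the same keys.
function bindFilesystemGraders<K extends string>(
  factories: Record<K, GraderFactory>,
  workDir: string,
): Record<K, Grader> {
  const bound = {} as Record<K, Grader>
  for (const name of Object.keys(factories) as K[]) {
    bound[name] = factories[name](workDir)
  }
  return bound
}
```

Under these assumptions, a call such as `bindFilesystemGraders({ hasDep: packageHasDep('some-dep') }, workDir)` would yield `{ hasDep: Grader }`, so eval definitions can declare graders before the working directory exists.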
A comment on `packageHasDep`
- Seems valuable to split deps / devDeps into two explicit graders, since that's an important distinction to grade LLMs/agents on
- Relatedly: Feels like we could instead have a generic JSON-path-checker util (something like `lodash/get`)
- Thinking forward, I can see us eventually introducing additional "dep checkers" for Swift/iOS and Kotlin/Android.
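
To make the suggestion concrete, here is a minimal sketch of a generic path checker with separate deps and devDeps checks built on top of it. All names (`getPath`, `hasPath`, `hasDependency`, `hasDevDependency`) are hypothetical. The path is an array of segments, not a dotted string, because package names can themselves contain dots.

```typescript
// Hypothetical helper in the spirit of lodash/get, restricted to own properties.
function getPath(obj: unknown, segments: string[]): unknown {
  let current: unknown = obj
  for (const key of segments) {
    if (current === null || typeof current !== 'object') return undefined
    // Ignore inherited keys such as "constructor" or "toString".
    if (!Object.prototype.hasOwnProperty.call(current, key)) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

function hasPath(obj: unknown, segments: string[]): boolean {
  return getPath(obj, segments) !== undefined
}

// Two explicit checks over an already-parsed package.json object.
function hasDependency(pkg: unknown, dep: string): boolean {
  return hasPath(pkg, ['dependencies', dep])
}

function hasDevDependency(pkg: unknown, dep: string): boolean {
  return hasPath(pkg, ['devDependencies', dep])
}
```

These operate on parsed JSON only. Wrapping them in the `(workDir) => async (_input) => ...` factory shape would follow the same pattern as `packageHasDep`, and the same `hasPath` could serve other JSON manifests.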
thiskevinwang approved these changes on Mar 25, 2026

LGTM
- Left a few comments.
- Can we also get the README.md/AGENTS.md updated to mention installing the additional `claude` and `codex` dependencies for agent-evals?
Summary
- `bun agent:codex` spawns `codex exec --json --full-auto`, parses JSONL events, supports skills via AGENTS.md
- `--runs N` flag runs each eval N times, computes pass@k / pass^k metrics
- `fileExists()`, `fileContains()`, `packageHasDep()` for checking actual files agents create
- `bun export:leaderboard` produces comparison JSON

Test results
`add-auth` eval (4 variants: Next.js, React, Android, iOS)
Skills impact: +1% (Claude Code), +3% (Codex) with progressive disclosure.
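
As a rough illustration of how a skills-impact figure like the one above could be derived, the sketch below compares a base run with a skills-enhanced run. The `delta` and `newlyPassed` field names come from the commit notes; the `EvalScore` shape, the use of mean scores, and the `skillsImpact` function name are assumptions.

```typescript
// Hypothetical shapes; the real skills-impact.ts may differ.
interface EvalScore {
  evalName: string
  score: number // percentage, 0-100
  passed: boolean
}

interface SkillsImpact {
  delta: number // mean score with skills minus mean base score
  newlyPassed: string[] // evals that failed in base but passed with skills
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length
}

function skillsImpact(base: EvalScore[], withSkills: EvalScore[]): SkillsImpact {
  const basePassed = new Map<string, boolean>()
  for (const r of base) basePassed.set(r.evalName, r.passed)
  const newlyPassed = withSkills
    .filter((r) => r.passed && basePassed.get(r.evalName) === false)
    .map((r) => r.evalName)
  const delta = mean(withSkills.map((r) => r.score)) - mean(base.map((r) => r.score))
  return { delta, newlyPassed }
}
```

An eval that appears only in the skills run is not counted as newly passed here, since there is no base result to compare against.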
Linear
Closes AIE-672
Test plan
- `bun agent:claude --eval add-auth --debug` → 93%
- `bun agent:claude --skills --eval add-auth --debug` → 94%
- `bun agent:codex --eval add-auth --debug` → 93%
- `bun agent:codex --skills --eval add-auth --debug` → 96%
- `--runs` flag shows trial labels and pass@k summary
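
For context on the pass@k summary, this is a minimal sketch of the two metrics, assuming `n` trials of which `c` succeeded. pass@k uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k). For pass^k the sketch assumes the analogous form C(c, k) / C(n, k); the PR may instead use something simpler such as (c/n)^k. The function names come from the commit notes, but the signatures and the `TrialSummary` shape are assumptions.

```typescript
function assertValidK(n: number, k: number): void {
  if (!Number.isInteger(k) || k < 1 || k > n) {
    throw new RangeError('k must be an integer with 1 <= k <= n')
  }
}

// Probability that at least one of k sampled trials succeeds.
function passAtK(n: number, c: number, k: number): number {
  assertValidK(n, k)
  if (n - c < k) return 1 // fewer than k failures: every sample of k has a success
  let allFail = 1
  // C(n-c, k) / C(n, k) written as a product to avoid large factorials.
  for (let i = 0; i < k; i++) allFail *= (n - c - i) / (n - i)
  return 1 - allFail
}

// Probability that all k sampled trials succeed (assumed estimator, see above).
function passToTheK(n: number, c: number, k: number): number {
  assertValidK(n, k)
  if (c < k) return 0
  let allPass = 1
  for (let i = 0; i < k; i++) allPass *= (c - i) / (n - i)
  return allPass
}

interface TrialSummary {
  n: number
  c: number
  passAtK: number
  passToTheK: number
}

// Aggregates per-trial pass/fail results; k defaults to the number of trials.
function summarizeTrials(results: boolean[], k: number = results.length): TrialSummary {
  const n = results.length
  const c = results.filter(Boolean).length
  return { n, c, passAtK: passAtK(n, c, k), passToTheK: passToTheK(n, c, k) }
}
```

With k equal to the number of runs, pass@k reduces to "did any trial pass" and pass^k to "did every trial pass", which is one reason to run more trials than the k being reported.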