docs(evals): lock /recce-verify rubric + v1 eval gap report #42
Open
even-wei wants to merge 2 commits into
Run the v1 /recce-verify eval chain end-to-end and lock the scoring rubric.
DRC-3585 (rubric lock):
- Hand-grade 6 fixtures x Claude Code x {Tier-0,Tier-1} on claude-opus-4-8.
- Resolve 4 rubric ambiguities via lens clarifications + an 8-row waffle log:
delta-only-on-Tier-1 (W1), decisive-issue rule for multi-issue PRs (W3),
evidence-tier-actually-used + cited-query for 1c (W5), and a Recce run-variance
attribution guard (W7).
- Freeze 6 Tier-0 baselines (5 catch, pr3 partial); commit 12 traces; document the
inter-rater proxy (judge self-consistency catch 100% / tier 75% / delta 83%;
judge-human catch 5/6) + an Andy second-grader packet for the human pass.
DRC-3405 (gap report):
- Ranked gap report (5 entries) + 6 per-fixture case studies + status-update draft.
- Headline finding: the eval base does not yet exercise Recce (no materialized dev
env; the driver prompt never invokes /recce-verify; Recce MCP not reachable), so
measured Tier-1 == Tier-0; and Opus 4.8 alone catches 5.5/6 fixtures from static
artifacts. Ranked actions: fix the eval base first, then the base-env (v2) wedge.
Rubric is provisionally locked pending a second grader's independent pass.
Ephemeral spike-driver runtime (_settings/) added to .gitignore.
Refs DRC-3585, DRC-3405
Signed-off-by: even-wei <evenwei@infuseai.io>
Agent review transcripts are markdown (headers, tables, code fences), so .md
renders them on GitHub for the second grader instead of a raw text blob.
- Rename the 12 committed traces runs/2026-06-09/traces/*.txt -> *.md.
- Flip spike-driver/driver.py transcript output (run_claude, run_codex, and the
  --no-run re-judge discovery path) from .txt to .md so future runs match, and
  update spike-driver/README.md's output-layout examples.
- Update all references (6 Tier-0 baselines, gap-report, Andy packet, cells.json).
Refs DRC-3585, DRC-3405
Signed-off-by: even-wei <evenwei@infuseai.io>
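The trace rename described in this commit can be sketched as a small filesystem helper. This is only an illustration under stated assumptions: `rename_traces` is a hypothetical name, not code from this PR, and in a git checkout the rename would more likely be done with `git mv` so history follows the files.

```python
from pathlib import Path


def rename_traces(trace_dir: Path, old_suffix: str = ".txt",
                  new_suffix: str = ".md") -> list[Path]:
    """Rename every file ending in old_suffix inside trace_dir.

    Hypothetical helper for illustration; returns the new paths, sorted.
    """
    renamed = []
    for path in sorted(trace_dir.glob(f"*{old_suffix}")):
        target = path.with_suffix(new_suffix)
        if target.exists():
            # Refuse to overwrite a transcript that already exists.
            raise FileExistsError(target)
        path.rename(target)
        renamed.append(target)
    return renamed
```

Files with other extensions (for example a `cells.json` next to the traces) are left untouched, because the glob only matches the old suffix.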
Handing this off to @wcchang1115; should be safe to merge, as it's evals/docs only.
What

Runs the v1 /recce-verify eval chain end-to-end and locks the scoring rubric. Closes out the two gated issues in the "Agent-blind spots: /recce-verify v1" project: rubric lock (DRC-3585) → run eval + ranked gap report (DRC-3405). All changes are docs + eval artifacts; no product/driver code is modified.

DRC-3585 (rubric lock)

- Hand-graded 6 fixtures x Claude Code x {Tier-0, Tier-1} on claude-opus-4-8 (Claude-only for v1; Codex deferred, see gap report).
- Resolved 4 rubric ambiguities in RUBRIC.md: delta is scored only on Tier-1 (n/a for Tier-0); a decisive-issue rule applies to multi-issue PRs; 1c requires a cited query result; and a guard covers Recce run-variance attribution.
- Froze 6 Tier-0 baselines (5 catch, pr3 partial), each with verbatim reasoning, the exact prompt, and the sandbox-profile Notes block.
- Committed 12 traces (runs/2026-06-09/traces/) and the judge outputs (spike-driver/{summary,verdicts,cells}).
- Documented the inter-rater proxy and added an Andy second-grader packet (runs/2026-06-09/andy-second-grader-packet.md) for the human ≥90% inter-rater pass.

DRC-3405 (ranked gap report)

- Ranked gap report runs/2026-06-09/gap-report.md (5 entries) + 6 per-fixture case studies + a project status-update draft.
- Headline conclusion (stands alone as the v1 conclusion / handoff): the eval base does not yet exercise Recce. The driver prompt never invokes /recce-verify, the Recce MCP server isn't configured in the cells, and the fixtures have no materialized dev env, so measured Tier-1 ≡ Tier-0. Independently, Opus 4.8 alone catches the decisive issue in 5.5/6 fixtures (5 catches plus the pr3 partial) from static artifacts.
- Ranked actions: fix the eval base first (materialize a dev env + invoke the skill + reachable MCP; add large-scale fixtures), then pursue the base-env data-diff as the v2 wedge (route to the Recce MCP App / backlog).

Status

Rubric is provisionally locked; full DRC-3585 closure waits on Andy's independent second-grader pass. DRC-3405's committed gap report is the v1 deliverable.
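The pending second-grader pass can be sketched as a plain percent-agreement check against the ≥90% bar named under DRC-3585. This is a minimal sketch, not the rubric's definition: it assumes agreement means identical labels per cell, the function names are hypothetical, and the labels in the usage note are placeholders rather than graded results. The rubric may instead call for a chance-corrected statistic.

```python
from typing import Sequence


def agreement_rate(grader_a: Sequence[str], grader_b: Sequence[str]) -> float:
    """Fraction of cells on which two graders assign the same label."""
    if len(grader_a) != len(grader_b):
        raise ValueError("graders must label the same cells")
    if not grader_a:
        raise ValueError("no cells to compare")
    matches = sum(a == b for a, b in zip(grader_a, grader_b))
    return matches / len(grader_a)


def passes_inter_rater(grader_a: Sequence[str], grader_b: Sequence[str],
                       threshold: float = 0.90) -> bool:
    """True when percent agreement meets the threshold (assumed to be 90%)."""
    return agreement_rate(grader_a, grader_b) >= threshold
```

With six cells, a single disagreement gives 5/6 (about 83%), which falls below a 90% bar, so a six-fixture pass under this assumption requires agreement on every cell.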
Refs DRC-3585, DRC-3405
🤖 Generated with Claude Code