docs(evals): lock /recce-verify rubric + v1 eval gap report by even-wei · Pull Request #42 · DataRecce/recce-claude-plugin

even-wei · 2026-06-10T00:46:10Z

What

Runs the v1 /recce-verify eval chain end-to-end and locks the scoring rubric. Closes out the two gated issues in the "Agent-blind spots: /recce-verify v1" project: rubric lock (DRC-3585) → run eval + ranked gap report (DRC-3405). All changes are docs and eval artifacts; no product or driver code is modified.

DRC-3585 — rubric lock

- Hand-graded 6 fixtures × Claude Code × {Tier-0, Tier-1} on claude-opus-4-8 (Claude-only for v1; Codex deferred, see gap report).
- Resolved 4 rubric ambiguities as lens clarifications + an 8-row waffle log in RUBRIC.md:
  - W1 delta is recorded only on Tier-1+ cells (n/a for Tier-0).
  - W3 decisive-issue rule for multi-issue PRs (score against the one blocking issue, named per fixture).
  - W5 evidence tier = what was actually used; 1c requires a cited query result.
  - W7 Recce-attribution / run-variance guard: a delta counts only if the with-Recce run cites a Recce tool result.
- Froze 6 Tier-0 baselines (5 catch, pr3 partial), each with verbatim reasoning, the exact prompt, and the sandbox-profile Notes block.
- Committed 12 traces (runs/2026-06-09/traces/) and the judge outputs (spike-driver/{summary,verdicts,cells}).
- Documented the inter-rater proxy: judge self-consistency catch 100% / tier 75% / delta 83%; judge–human catch 5/6 (diverges only at the catch/partial boundary).
- Andy second-grader packet ready (runs/2026-06-09/andy-second-grader-packet.md) for the human ≥90% inter-rater pass.
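The agreement figures above (for example, judge–human catch 5/6) read as simple percent agreement over per-fixture labels. A minimal sketch of that calculation follows, assuming plain label matching; the function name `agreement_rate`, the fixture ids `fx1`..`fx6`, and the label values are hypothetical and are not taken from RUBRIC.md or the spike-driver outputs.

```python
def agreement_rate(grader_a: dict[str, str], grader_b: dict[str, str]) -> float:
    """Fraction of shared fixtures on which two graders gave the same label."""
    shared = sorted(grader_a.keys() & grader_b.keys())
    if not shared:
        raise ValueError("no fixtures were graded by both graders")
    matches = sum(grader_a[f] == grader_b[f] for f in shared)
    return matches / len(shared)


# Hypothetical labels: six fixtures, with one disagreement that sits
# at the catch/partial boundary.
human = {
    "fx1": "catch", "fx2": "catch", "fx3": "partial",
    "fx4": "catch", "fx5": "catch", "fx6": "catch",
}
judge = dict(human, fx3="catch")

rate = agreement_rate(human, judge)   # 5 of 6 labels match
meets_bar = rate >= 0.90              # the bar named for the human second-grade pass
print(f"agreement={rate:.2f} meets_90_percent_bar={meets_bar}")
```

The same function could be applied to the second grader's labels once that pass is complete; with only six fixtures, a single disagreement already drops agreement below 90%.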

DRC-3405 — ranked gap report

runs/2026-06-09/gap-report.md (5 entries) + 6 per-fixture case studies + a project status-update draft.

Headline conclusion (stands alone as the v1 conclusion / handoff): the eval base does not yet exercise Recce — the driver prompt never invokes /recce-verify, the Recce MCP server isn't configured in the cells, and the fixtures have no materialized dev env — so measured Tier-1 ≡ Tier-0. Independently, Opus 4.8 alone catches the decisive issue in 5.5/6 fixtures from static artifacts. Ranked actions: fix the eval base first (materialize a dev env + invoke the skill + reachable MCP; add large-scale fixtures), then the base-env data-diff = the v2 wedge (route to the Recce MCP App / backlog).
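One way to see why the measured Tier-1 result collapses onto Tier-0 is to apply the W1 and W7 rules mechanically. The sketch below is illustrative only: the `Cell` fields, the `counted_delta` helper, and the scoring weights (catch = 1, partial = 0.5, miss = 0, inferred from "5 catch, pr3 partial" giving 5.5/6) are assumptions and do not reflect the actual schema of `cells.json` or the judge.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed weights, inferred from "5 catch + 1 partial = 5.5/6".
SCORE = {"catch": 1.0, "partial": 0.5, "miss": 0.0}


@dataclass
class Cell:
    fixture: str
    tier: int                         # 0 = no Recce, 1+ = with Recce
    verdict: str                      # "catch", "partial", or "miss"
    cites_recce_result: bool = False  # did the run cite a Recce tool result?


def counted_delta(baseline: Cell, with_recce: Cell) -> Optional[float]:
    """Delta credited to Recce under the W1 and W7 rules (hypothetical helper)."""
    if with_recce.tier < 1:
        return None   # W1: delta is n/a on Tier-0 cells
    if not with_recce.cites_recce_result:
        return 0.0    # W7: without a cited Recce result, no delta counts
    return SCORE[with_recce.verdict] - SCORE[baseline.verdict]


# Tier-0 tally from the frozen baselines: 5 catch + 1 partial.
tier0_total = sum(SCORE[v] for v in ["catch"] * 5 + ["partial"])

# A Tier-1 run that happens to score higher but never cites a Recce
# tool result is treated as run variance, not as a Recce effect.
base = Cell("pr3", tier=0, verdict="partial")
tier1 = Cell("pr3", tier=1, verdict="catch", cites_recce_result=False)
print(tier0_total, counted_delta(base, tier1))
```

Because the cells have no reachable Recce MCP server, no Tier-1 run can cite a Recce tool result, so under this reading every counted delta is zero regardless of verdict.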

Status

Rubric is provisionally locked — full DRC-3585 closure waits on Andy's independent second-grade. DRC-3405's committed gap report is the v1 deliverable.

Refs DRC-3585, DRC-3405

🤖 Generated with Claude Code

Run the v1 /recce-verify eval chain end-to-end and lock the scoring rubric.

DRC-3585 (rubric lock):
- Hand-grade 6 fixtures x Claude Code x {Tier-0,Tier-1} on claude-opus-4-8.
- Resolve 4 rubric ambiguities via lens clarifications + an 8-row waffle log: delta-only-on-Tier-1 (W1), decisive-issue rule for multi-issue PRs (W3), evidence-tier-actually-used + cited-query for 1c (W5), and a Recce run-variance attribution guard (W7).
- Freeze 6 Tier-0 baselines (5 catch, pr3 partial); commit 12 traces; document the inter-rater proxy (judge self-consistency catch 100% / tier 75% / delta 83%; judge-human catch 5/6) + an Andy second-grader packet for the human pass.

DRC-3405 (gap report):
- Ranked gap report (5 entries) + 6 per-fixture case studies + status-update draft.
- Headline finding: the eval base does not yet exercise Recce (no materialized dev env; the driver prompt never invokes /recce-verify; Recce MCP not reachable), so measured Tier-1 == Tier-0; and Opus 4.8 alone catches 5.5/6 fixtures from static artifacts. Ranked actions: fix the eval base first, then the base-env (v2) wedge.

Rubric is provisionally locked pending a second grader's independent pass.

Ephemeral spike-driver runtime (_settings/) added to .gitignore.

Refs DRC-3585, DRC-3405

Signed-off-by: even-wei <evenwei@infuseai.io>

Agent review transcripts are markdown (headers, tables, code fences), so .md renders them on GitHub for the second grader instead of a raw text blob.

- Rename the 12 committed traces runs/2026-06-09/traces/*.txt -> *.md.
- Flip spike-driver/driver.py transcript output (run_claude, run_codex, and the --no-run re-judge discovery path) from .txt to .md so future runs match, and update spike-driver/README.md's output-layout examples.
- Update all references (6 Tier-0 baselines, gap-report, Andy packet, cells.json).

Refs DRC-3585, DRC-3405

Signed-off-by: even-wei <evenwei@infuseai.io>
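For readers who need to repeat the trace rename on an older run directory, a minimal sketch follows. It is not the code from this PR: `rename_traces` is a hypothetical helper built on the standard library's `pathlib`, and in a tracked repository `git mv` would usually be preferable so the rename is recorded as such.

```python
from pathlib import Path


def rename_traces(trace_dir: Path, old_suffix: str = ".txt",
                  new_suffix: str = ".md") -> list[Path]:
    """Rename every *old_suffix file in trace_dir to new_suffix; return new paths."""
    renamed = []
    for path in sorted(trace_dir.glob(f"*{old_suffix}")):
        target = path.with_suffix(new_suffix)
        if target.exists():
            # Refuse to overwrite an existing transcript.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    # Directory taken from the commit message; adjust for other run dates.
    for new_path in rename_traces(Path("runs/2026-06-09/traces")):
        print(new_path)
```

Any file that references the old names (baselines, gap report, `cells.json`) still needs a separate update, as the commit above notes.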

even-wei · 2026-06-12T03:21:03Z

Handing this off to @wcchang1115 — should be safe to merge; it's evals/docs only.

even-wei self-assigned this Jun 10, 2026

even-wei requested a review from wcchang1115 June 12, 2026 03:21

even-wei wants to merge 2 commits into main from feature/drc-3585-recce-verify-v1-eval
