docs(evals): lock /recce-verify rubric + v1 eval gap report#42

Open
even-wei wants to merge 2 commits into main from feature/drc-3585-recce-verify-v1-eval

@even-wei

Contributor

What

Runs the v1 /recce-verify eval chain end-to-end and locks the scoring rubric. Closes out the two gated issues in the "Agent-blind spots: /recce-verify v1" project, in order: the rubric lock (DRC-3585), then the eval run and ranked gap report (DRC-3405). All changes are docs and eval artifacts; no product/driver code is modified.

DRC-3585 — rubric lock

  • Hand-graded 6 fixtures × Claude Code × {Tier-0, Tier-1} on claude-opus-4-8 (Claude-only for v1; Codex deferred — see gap report).
  • Resolved 4 rubric ambiguities as lens clarifications + an 8-row waffle log in RUBRIC.md:
    • W1 delta is recorded only on Tier-1+ cells (n/a for Tier-0).
    • W3 decisive-issue rule for multi-issue PRs (score against the one blocking issue, named per fixture).
    • W5 evidence tier = what was actually used; 1c requires a cited query result.
    • W7 Recce-attribution / run-variance guard — a delta counts only if the with-Recce run cites a Recce tool result.
  • Froze 6 Tier-0 baselines (5 catch, pr3 partial), each with verbatim reasoning, the exact prompt, and the sandbox-profile Notes block.
  • Committed 12 traces (runs/2026-06-09/traces/) and the judge outputs (spike-driver/{summary,verdicts,cells}).
  • Documented the inter-rater proxy: judge self-consistency catch 100% / tier 75% / delta 83%; judge–human catch 5/6 (diverges only at the catch/partial boundary).
  • Andy's second-grader packet is ready (runs/2026-06-09/andy-second-grader-packet.md) for the human ≥90% inter-rater pass.
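
As a rough illustration of how the W1 and W7 clarifications above interact, here is a minimal Python sketch. It is not taken from RUBRIC.md or spike-driver: the `Cell` class, the `RANK` ordering, and the `recorded_delta` helper are hypothetical names, and treating an unattributed difference as a delta of 0 is one assumed reading of "a delta counts only if the with-Recce run cites a Recce tool result".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ordering of verdicts, used only to express a delta as a number.
RANK = {"miss": 0, "partial": 1, "catch": 2}


@dataclass
class Cell:
    tier: int                         # 0 = Tier-0 baseline, 1+ = run with Recce
    verdict: str                      # "miss", "partial", or "catch"
    cites_recce_result: bool = False  # True if the run cites a Recce tool result


def recorded_delta(baseline: Cell, cell: Cell) -> Optional[int]:
    """Return the delta to record for `cell` against its Tier-0 baseline."""
    if cell.tier == 0:
        # W1: delta is n/a on Tier-0 cells.
        return None
    raw = RANK[cell.verdict] - RANK[baseline.verdict]
    if raw != 0 and not cell.cites_recce_result:
        # W7 (assumed reading): without a cited Recce tool result, the
        # difference is treated as run variance and does not count.
        return 0
    return raw


baseline = Cell(tier=0, verdict="partial")
print(recorded_delta(baseline, baseline))                        # None
print(recorded_delta(baseline, Cell(tier=1, verdict="catch")))   # 0
print(recorded_delta(baseline, Cell(1, "catch", True)))          # 1
```

Under this reading, a Tier-1 cell that improves on its baseline without citing a Recce tool result records no delta, which keeps ordinary run-to-run variance from being credited to Recce.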

DRC-3405 — ranked gap report

runs/2026-06-09/gap-report.md (5 entries) + 6 per-fixture case studies + a project status-update draft.

Headline conclusion (stands alone as the v1 conclusion / handoff): the eval base does not yet exercise Recce. The driver prompt never invokes /recce-verify, the Recce MCP server isn't configured in the cells, and the fixtures have no materialized dev env, so measured Tier-1 ≡ Tier-0. Independently, Opus 4.8 alone catches the decisive issue in 5.5/6 fixtures from static artifacts (5 catches plus the pr3 partial). Ranked actions: first fix the eval base (materialize a dev env, invoke the skill, make the MCP server reachable, and add large-scale fixtures); then treat the base-env data-diff as the v2 wedge (route to the Recce MCP App / backlog).
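
The 5.5/6 figure is consistent with the frozen Tier-0 baselines (5 catch, pr3 partial) if a partial is weighted as half a catch. The sketch below shows that arithmetic; the `WEIGHTS` table and `catch_score` helper are hypothetical, and the half-credit weighting is an assumption rather than something stated in the rubric.

```python
# Assumed weighting: a partial counts as half a catch.
WEIGHTS = {"catch": 1.0, "partial": 0.5, "miss": 0.0}


def catch_score(verdicts):
    """Sum the weighted verdicts across fixtures."""
    return sum(WEIGHTS[v] for v in verdicts)


# Six Tier-0 baselines: five catches plus one partial (pr3).
tier0_verdicts = ["catch"] * 5 + ["partial"]

score = catch_score(tier0_verdicts)
print(f"{score}/{len(tier0_verdicts)}")  # prints 5.5/6
```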

Status

Rubric is provisionally locked — full DRC-3585 closure waits on Andy's independent second-grade. DRC-3405's committed gap report is the v1 deliverable.

Refs DRC-3585, DRC-3405

🤖 Generated with Claude Code

Run the v1 /recce-verify eval chain end-to-end and lock the scoring rubric.

DRC-3585 (rubric lock):
- Hand-grade 6 fixtures x Claude Code x {Tier-0,Tier-1} on claude-opus-4-8.
- Resolve 4 rubric ambiguities via lens clarifications + an 8-row waffle log:
  delta-only-on-Tier-1 (W1), decisive-issue rule for multi-issue PRs (W3),
  evidence-tier-actually-used + cited-query for 1c (W5), and a Recce run-variance
  attribution guard (W7).
- Freeze 6 Tier-0 baselines (5 catch, pr3 partial); commit 12 traces; document the
  inter-rater proxy (judge self-consistency catch 100% / tier 75% / delta 83%;
  judge-human catch 5/6) + an Andy second-grader packet for the human pass.

DRC-3405 (gap report):
- Ranked gap report (5 entries) + 6 per-fixture case studies + status-update draft.
- Headline finding: the eval base does not yet exercise Recce (no materialized dev
  env; the driver prompt never invokes /recce-verify; Recce MCP not reachable), so
  measured Tier-1 == Tier-0; and Opus 4.8 alone catches 5.5/6 fixtures from static
  artifacts. Ranked actions: fix the eval base first, then the base-env (v2) wedge.

Rubric is provisionally locked pending a second grader's independent pass.
Ephemeral spike-driver runtime (_settings/) added to .gitignore.

Refs DRC-3585, DRC-3405

Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei even-wei self-assigned this Jun 10, 2026
Agent review transcripts are markdown (headers, tables, code fences), so .md
renders them on GitHub for the second grader instead of a raw text blob.

- Rename the 12 committed traces runs/2026-06-09/traces/*.txt -> *.md.
- Flip spike-driver/driver.py transcript output (run_claude, run_codex, and the
  --no-run re-judge discovery path) from .txt to .md so future runs match, and
  update spike-driver/README.md's output-layout examples.
- Update all references (6 Tier-0 baselines, gap-report, Andy packet, cells.json).

Refs DRC-3585, DRC-3405

Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei

even-wei commented Jun 12, 2026

Contributor Author

Handing this off to @wcchang1115 — should be safe to merge; it's evals/docs only.

@even-wei even-wei requested a review from wcchang1115 June 12, 2026 03:21