Add harness methodology: sprint contracts, calibrated evaluation, stripping principle #998
Open
aweider wants to merge 1 commit into danielmiessler:main from
…ipping principle

Integrates three patterns from Anthropic's harness design research for long-running application development into the Algorithm and PRD format:

1. Sprint Contracts (PLAN phase, Advanced+ only): Before BUILD, builder and evaluator agree on what "done" means for each ISC criterion: the implementation approach and the verification method. Prevents misalignment between building and judging.
2. Calibrated Evaluation (VERIFY phase): Four-dimension scoring rubric (Correctness, Completeness, Craft, User Experience) with 1-10 scales, anchor descriptions, and a threshold gate (below 6 = targeted redo). Supplements binary pass/fail testing to catch the gap between "technically works" and "actually good."
3. Stripping Principle (LEARN phase): Reflection question asking which scaffolding steps were load-bearing vs. added latency. Every component encodes an assumption about what the model can't do alone; as models improve, those assumptions should be stress-tested.

Zero overhead for Standard/Extended tiers. Sprint contracts gate to Advanced+. Calibrated scoring activates only for Advanced+ with QA agent or user-facing output.

Reference: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
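As an illustration of the calibrated evaluation gate described in the commit message, here is a minimal sketch in Python. It is not taken from the PR's files: the names `RubricScore`, `dimensions_needing_redo`, and `calibrated_scoring_active` are hypothetical, and the activation rule assumes the reading "Advanced+ and (QA agent or user-facing output)".

```python
from dataclasses import dataclass

# The four rubric dimensions named in the commit message.
DIMENSIONS = ("correctness", "completeness", "craft", "user_experience")
# "below 6 = targeted redo", per the commit message.
REDO_THRESHOLD = 6


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical container for one evaluation, each dimension on a 1-10 scale."""
    correctness: int
    completeness: int
    craft: int
    user_experience: int

    def __post_init__(self) -> None:
        # Reject scores outside the 1-10 scale.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")


def dimensions_needing_redo(score: RubricScore,
                            threshold: int = REDO_THRESHOLD) -> list[str]:
    """Return the dimensions that scored below the threshold."""
    return [name for name in DIMENSIONS if getattr(score, name) < threshold]


def calibrated_scoring_active(advanced_plus: bool,
                              has_qa_agent: bool,
                              user_facing: bool) -> bool:
    """Assumed gating rule: Advanced+ and (QA agent or user-facing output)."""
    return advanced_plus and (has_qa_agent or user_facing)
```

With this sketch, a score of `RubricScore(8, 5, 7, 6)` would flag only `completeness` for a targeted redo, since 6 is not below the threshold.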
Summary
Integrates three patterns from Anthropic's harness design research into the Algorithm (v3.7.0) and PRD format: sprint contracts, calibrated evaluation, and the stripping principle.
Changes
Algorithm/v3.7.0.md
PRDFORMAT.md
Design Decisions
Context
The Anthropic article's key insight: separating the agent doing the work from the agent judging it is a strong lever against self-evaluation bias. Their GAN-inspired generator-evaluator loop with sprint contracts and calibrated scoring produced dramatically better output than solo agents, at the cost of a more structured workflow. These additions bring that pattern into PAI's existing Algorithm phases without adding new agents or systems.
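To make the sprint contract idea concrete, here is a minimal sketch, assuming a contract simply records the agreed implementation approach and verification method for each ISC criterion. The `SprintContract` type and the `missing_contracts` helper are hypothetical and do not come from the Algorithm or PRD files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintContract:
    """Hypothetical record of what builder and evaluator agreed on for one criterion."""
    criterion: str      # the ISC criterion this contract covers
    approach: str       # agreed implementation approach
    verification: str   # agreed verification method


def missing_contracts(criteria: list[str],
                      contracts: list[SprintContract]) -> list[str]:
    """Return criteria that lack a complete contract, in their original order.

    A contract counts as complete only if both the approach and the
    verification method are non-empty.
    """
    covered = {
        c.criterion
        for c in contracts
        if c.approach.strip() and c.verification.strip()
    }
    return [criterion for criterion in criteria if criterion not in covered]


def ready_for_build(criteria: list[str],
                    contracts: list[SprintContract]) -> bool:
    """BUILD may start only once every criterion has a complete contract."""
    return not missing_contracts(criteria, contracts)
```

A check like `ready_for_build` could run at the end of PLAN, so that building and judging start from the same definition of "done".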
Test Plan
🤖 Generated with Claude Code