feat: add GAN-style generator-evaluator harness #1029
Implements Anthropic's March 2026 harness design pattern — a multi-agent architecture that separates generation from evaluation, creating an adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps
📝 Walkthrough

This PR introduces a comprehensive "GAN-style harness" system for iterative application development and evaluation. It includes three agent specifications (Planner, Generator, Evaluator), command documentation, a Bash orchestration script, supporting skill documentation, and examples that coordinate a multi-iteration feedback loop where an Evaluator tests running applications and provides structured feedback to a Generator for continuous improvement.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Planner as Planner Agent
    participant Harness as GAN Harness Script
    participant Generator as Generator Agent
    participant DevServer as Dev Server
    participant Evaluator as Evaluator Agent
    participant Filesystem as Filesystem<br/>(spec/feedback/state)
    User->>Harness: run gan-build brief
    Harness->>Planner: generate spec & rubric
    Planner->>Filesystem: write spec.md, eval-rubric.md
    loop Iteration Loop (1 to max-iterations)
        Harness->>Filesystem: read feedback (if iter > 1)
        Harness->>Generator: generate/update app
        Generator->>DevServer: start/update running app
        Generator->>Filesystem: write generator-state.md, commit
        Harness->>Evaluator: evaluate running app
        Evaluator->>DevServer: Playwright testing
        Evaluator->>Filesystem: write feedback-NNN.md w/ score
        Harness->>Harness: extract score, check threshold
        alt Score >= pass-threshold or plateau detected
            Harness->>Harness: early stop
        end
    end
    Harness->>Filesystem: write build-report.md
    Harness->>User: final summary, score progression
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
5 issues found across 8 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/gan-style-harness/SKILL.md">
<violation number="1" location="skills/gan-style-harness/SKILL.md:230">
P2: SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.</violation>
</file>
<file name="agents/gan-evaluator.md">
<violation number="1" location="agents/gan-evaluator.md:4">
P2: Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.</violation>
</file>
<file name="scripts/gan-harness.sh">
<violation number="1" location="scripts/gan-harness.sh:105">
P2: config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.</violation>
<violation number="2" location="scripts/gan-harness.sh:244">
P2: Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under `set -euo pipefail`. Use an explicit last index with an empty-array guard for compatibility.</violation>
</file>
<file name="commands/gan-build.md">
<violation number="1" location="commands/gan-build.md:94">
P2: Output section lists zero‑padded feedback filenames (`feedback-001.md`), which conflicts with the earlier unpadded `feedback-{iteration}.md` paths. This inconsistency can mislead users and tooling about expected file names.</violation>
</file>
| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
P2: SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/gan-style-harness/SKILL.md, line 230:
<comment>SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.</comment>
<file context>
@@ -0,0 +1,278 @@
+| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
+| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
+| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
+| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
+| `GAN_DEV_SERVER_PORT` | `3000` | Port for the live app |
+| `GAN_DEV_SERVER_CMD` | `npm run dev` | Command to start dev server |
</file context>
```yaml
---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
tools: ["Read", "Write", "Bash", "Grep", "Glob"]
```
P2: Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agents/gan-evaluator.md, line 4:
<comment>Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.</comment>
<file context>
@@ -0,0 +1,209 @@
+---
+name: gan-evaluator
+description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
+tools: ["Read", "Write", "Bash", "Grep", "Glob"]
+model: opus
+color: red
</file context>
```shell
phase "PHASE 3: Build Report"

FINAL_SCORE="${SCORES[-1]:-0.0}"
```
P2: Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under set -euo pipefail. Use an explicit last index with an empty-array guard for compatibility.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/gan-harness.sh, line 244:
<comment>Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under `set -euo pipefail`. Use an explicit last index with an empty-array guard for compatibility.</comment>
<file context>
@@ -0,0 +1,299 @@
+
+phase "PHASE 3: Build Report"
+
+FINAL_SCORE="${SCORES[-1]:-0.0}"
+NUM_ITERATIONS=${#SCORES[@]}
+ELAPSED=$(elapsed)
</file context>
```shell
# Write config
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": "$BRIEF",
```
P2: config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/gan-harness.sh, line 105:
<comment>config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.</comment>
<file context>
@@ -0,0 +1,299 @@
+# Write config
+cat > "${HARNESS_DIR}/config.json" << EOF
+{
+ "brief": "$BRIEF",
+ "maxIterations": $MAX_ITERATIONS,
+ "passThreshold": $PASS_THRESHOLD,
</file context>
### Files Created
- gan-harness/spec.md
- gan-harness/eval-rubric.md
- gan-harness/feedback/feedback-001.md through feedback-NNN.md
P2: Output section lists zero‑padded feedback filenames (feedback-001.md), which conflicts with the earlier unpadded feedback-{iteration}.md paths. This inconsistency can mislead users and tooling about expected file names.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At commands/gan-build.md, line 94:
<comment>Output section lists zero‑padded feedback filenames (`feedback-001.md`), which conflicts with the earlier unpadded `feedback-{iteration}.md` paths. This inconsistency can mislead users and tooling about expected file names.</comment>
<file context>
@@ -0,0 +1,99 @@
+### Files Created
+- gan-harness/spec.md
+- gan-harness/eval-rubric.md
+- gan-harness/feedback/feedback-001.md through feedback-NNN.md
+- gan-harness/generator-state.md
+- gan-harness/build-report.md
</file context>
Greptile Summary

This PR adds a GAN-inspired three-agent harness (Planner → Generator → Evaluator feedback loop) for autonomously building production-quality applications, comprising 8 new files.

Key issues found:
Confidence Score: 4/5

The documentation and agent prompts are high quality, but two P1 bugs (broken score extraction, missing Playwright tools) prevent the harness from working correctly in its default configuration and should be addressed before merging. Four P1 findings sit on the primary execution path: score extraction silently returns 0.0, causing all runs to exhaust max iterations; the default playwright eval mode is non-functional because MCP tools are not declared; the evaluator hardcodes weights that conflict with the design-mode rubric; and the unescaped brief allows JSON injection. Once those are resolved, the harness is otherwise well-designed.

Files needing attention: scripts/gan-harness.sh (score extraction regex, JSON injection, Bash 4.1 compat, SKIP_PLANNER logic) and agents/gan-evaluator.md (missing Playwright tools, hardcoded weights).

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User
    participant S as gan-harness.sh
    participant P as Planner Agent
    participant G as Generator Agent
    participant E as Evaluator Agent
    participant FS as File System
    U->>S: ./gan-harness.sh "brief"
    S->>FS: mkdir gan-harness/{feedback,screenshots}
    alt SKIP_PLANNER=false
        S->>P: claude -p (planner prompt + brief)
        P->>FS: Write gan-harness/spec.md
        P->>FS: Write gan-harness/eval-rubric.md
        P-->>S: planner-output.log
    end
    loop iteration 1..MAX_ITERATIONS
        S->>G: claude -p (generator prompt + feedback ref)
        G->>FS: Read spec.md
        G->>FS: Read feedback-{N-1}.md (if i>1)
        G->>FS: Write/update app files
        G->>FS: Write generator-state.md
        G-->>S: generator-{i}.log
        S->>E: claude -p --allowedTools Read,Write,Bash,Grep,Glob
        Note over E: mcp__playwright__* tools NOT included
        E->>FS: Read eval-rubric.md + spec.md
        E->>FS: Write feedback/feedback-{i}.md
        E-->>S: evaluator-{i}.log
        S->>FS: extract_score(feedback-{i}.md)
        Note over S: variable-length lookbehind may fail - returns 0.0
        alt score >= PASS_THRESHOLD
            S-->>U: PASS
        else plateau detected
            S-->>U: PLATEAU - stopping early
        end
    end
    S->>FS: Write build-report.md
    S-->>U: Final report + score progression
```
Reviews (1): Last reviewed commit: "feat: add GAN-style generator-evaluator ..."
```shell
extract_score() {
  # Extract the TOTAL weighted score from a feedback file
  local file="$1"
  # Look for **TOTAL** or **X.X/10** pattern
  grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
    || echo "0.0"
}
```
Score extraction regex is unreliable in practice
The extract_score function has two compounding problems that will cause it to silently return 0.0 in many real runs, causing the harness to exhaust all MAX_ITERATIONS even when the app has already passed.
Problem 1 — variable-length lookbehind. PCRE lookbehinds must be fixed-length in most distributions. (?<=\*\*TOTAL\*\*.*\*\*) and (?<=TOTAL.*\|.*\| \*\*) are variable-length and will produce a "lookbehind assertion is not fixed length" error on PCRE1-based systems (grep on macOS and many Linux distributions).
Problem 2 — template format mismatch. The evaluator agent prompt (line 129 of agents/gan-evaluator.md) instructs the evaluator to write **X.X/10**, but the shell script's evaluator prompt (line 202) says the format is **X.X** (no /10). Neither regex reliably matches either format against the actual table row | **TOTAL** | | | **X.X** |.
A simpler, portable pattern that matches the intended format:
Suggested replacement:

```shell
extract_score() {
  # Extract the TOTAL weighted score from a feedback file
  local file="$1"
  # Match: | **TOTAL** | | | **7.5** | or **7.5/10**
  grep -oP '\|\s*\*\*TOTAL\*\*.*\|\s*\*\*\K[0-9]+\.[0-9]+(?=(/10)?\*\*)' "$file" 2>/dev/null \
    || grep -oP 'Verdict:[^0-9]*\K[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || echo "0.0"
}
```
Also, the evaluator agent prompt template (agents/gan-evaluator.md line 129) and the shell script evaluator instruction (line 202) should be aligned to the same format (**X.X**) so the regex has only one target to match.
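As a sanity check, the intended table row can also be matched without `grep -P` at all (BSD grep lacks `-P` entirely). A minimal sketch using POSIX `sed`, against a hypothetical sample feedback file whose row layout is assumed from the comments above:

```shell
# Build a sample feedback file mirroring the evaluator's table format
# (the row layout is an assumption based on the review comments).
cat > /tmp/feedback-sample.md <<'MD'
| Criterion | Weight | Score | Weighted |
| **TOTAL** | | | **7.5/10** |
MD

# POSIX BRE: capture the number between ** and an optional /10**
score=$(sed -n 's/.*\*\*TOTAL\*\*.*\*\*\([0-9][0-9]*\.[0-9][0-9]*\)\(\/10\)\{0,1\}\*\*.*/\1/p' /tmp/feedback-sample.md)
echo "$score"
```

The same pattern matches both `**7.5**` and `**7.5/10**`, sidestepping the lookbehind portability problem entirely.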
```shell
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": "$BRIEF",
  "maxIterations": $MAX_ITERATIONS,
  "passThreshold": $PASS_THRESHOLD,
  "models": {
    "planner": "$PLANNER_MODEL",
    "generator": "$GENERATOR_MODEL",
    "evaluator": "$EVALUATOR_MODEL"
  },
  "evalMode": "$EVAL_MODE",
  "devServerPort": $DEV_PORT,
  "startedAt": "$(date -Iseconds)"
}
EOF
```
$BRIEF in JSON heredoc causes invalid JSON on common inputs
$BRIEF is written directly into config.json without escaping JSON special characters. Any input containing a double-quote, backslash, or newline will produce malformed JSON. For example, a brief like Build a "real-time" app produces:
```
{
  "brief": "Build a "real-time" app",   ← invalid JSON
  ...
}
```

Replace the raw expansion with a jq-based write (or use `printf '%s' "$BRIEF" | python3 -c 'import json,sys; print(json.dumps(sys.stdin.read()))'` as a fallback):
```shell
BRIEF_JSON=$(printf '%s' "$BRIEF" | python3 -c 'import json,sys; print(json.dumps(sys.stdin.read().strip()))')
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": $BRIEF_JSON,
  ...
}
EOF
```

The same unescaped $BRIEF is also interpolated into the planner prompt string (line 138), but since that's passed as a CLI argument to claude, the risk there is limited to aesthetic formatting issues rather than structural breakage.
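To illustrate, here is an end-to-end run of the python3 fallback with a quote-containing brief; `/tmp/gan-demo` stands in for `HARNESS_DIR` and only two keys are written:

```shell
BRIEF='Build a "real-time" app'
HARNESS_DIR=/tmp/gan-demo
mkdir -p "$HARNESS_DIR"

# json.dumps produces a fully escaped JSON string literal
BRIEF_JSON=$(printf '%s' "$BRIEF" | python3 -c 'import json,sys; print(json.dumps(sys.stdin.read().strip()))')
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": $BRIEF_JSON,
  "maxIterations": 15
}
EOF

# Round-trip: the file parses and the embedded quotes survive
python3 -c 'import json; print(json.load(open("/tmp/gan-demo/config.json"))["brief"])'
```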
```yaml
---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
tools: ["Read", "Write", "Bash", "Grep", "Glob"]
model: opus
color: red
```
Playwright MCP tools missing from agent definition — default eval mode non-functional
The evaluator's tools frontmatter only lists ["Read", "Write", "Bash", "Grep", "Glob"]. In playwright mode (the default), the agent's primary mechanism is Playwright MCP (mcp__playwright__navigate, mcp__playwright__click, etc.), but those tools are absent from the definition. The agent will immediately fall through to the Bash fallback path (npx playwright test) instead.
The shell script reinforces this: --allowedTools "Read,Write,Bash,Grep,Glob" on line 187 also omits all Playwright MCP tools. This means running gan-harness.sh with the default GAN_EVAL_MODE=playwright silently degrades to code-only evaluation — the core value proposition of live browser testing is never exercised.
Update the agent frontmatter to declare the Playwright MCP tools:
Suggested frontmatter change:

```yaml
tools: ["Read", "Write", "Bash", "Grep", "Glob", "mcp__playwright__navigate", "mcp__playwright__click", "mcp__playwright__fill", "mcp__playwright__screenshot", "mcp__playwright__evaluate"]
```
And in scripts/gan-harness.sh line 187, update --allowedTools to include the same Playwright tools. Users who don't have Playwright MCP configured should be warned at startup to set GAN_EVAL_MODE=code-only.
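One way to wire this up is to branch on the eval mode when assembling the flag value; `EVAL_MODE` and the MCP tool names follow the comment above, and the exact variable names are assumptions about the script's internals:

```shell
EVAL_MODE="${GAN_EVAL_MODE:-playwright}"

# Base tools always granted to the evaluator
ALLOWED_TOOLS="Read,Write,Bash,Grep,Glob"

# Grant Playwright MCP tools only when browser testing is requested
if [ "$EVAL_MODE" = "playwright" ]; then
  ALLOWED_TOOLS="${ALLOWED_TOOLS},mcp__playwright__navigate,mcp__playwright__click,mcp__playwright__fill,mcp__playwright__screenshot,mcp__playwright__evaluate"
fi

echo "$ALLOWED_TOOLS"
```

The resulting string would then be passed as `--allowedTools "$ALLOWED_TOOLS"` in the evaluator invocation.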
```
weighted = (design * 0.3) + (originality * 0.2) + (craft * 0.3) + (functionality * 0.2)
```
Hardcoded weights in evaluator ignore the design-mode rubric
The evaluator's scoring formula is baked in:
```
weighted = (design * 0.3) + (originality * 0.2) + (craft * 0.3) + (functionality * 0.2)
```
commands/gan-design.md explicitly sets a different weight profile for design mode (originality: 0.30, design: 0.35, craft: 0.25, functionality: 0.10) and instructs users to write that profile into gan-harness/eval-rubric.md. But the evaluator agent never reads weights from eval-rubric.md — it always applies the default formula regardless of mode. The design-specific rubric weights are silently discarded.
The evaluator should read and apply weights from eval-rubric.md rather than hardcoding them:
**Weighted score formula:**
Read the weights for each criterion from `gan-harness/eval-rubric.md` and compute:
`weighted = sum(criterion_score * criterion_weight)` for all criteria listed in the rubric.
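A rough sketch of what "read weights from the rubric" could look like, assuming the rubric stores weights in a two-column markdown table; the table layout and the sample criterion scores are hypothetical:

```shell
# Hypothetical rubric table; real eval-rubric.md layout may differ
cat > /tmp/eval-rubric.md <<'MD'
| Criterion | Weight |
| --- | --- |
| design | 0.35 |
| originality | 0.30 |
| craft | 0.25 |
| functionality | 0.10 |
MD

total=$(LC_ALL=C awk -F'|' '
  NR > 2 {                        # skip the header and separator rows
    gsub(/ /, "", $2); gsub(/ /, "", $3)
    weight[$2] = $3
  }
  END {
    # sample criterion scores standing in for evaluator output
    score["design"] = 8; score["originality"] = 7
    score["craft"] = 8;  score["functionality"] = 6
    sum = 0
    for (c in weight) sum += score[c] * weight[c]
    printf "%.2f\n", sum
  }
' /tmp/eval-rubric.md)
echo "$total"
```

With the design-mode weights shown, the sample scores come out to 7.50 rather than the 7.40 the hardcoded default formula would give, which is exactly the divergence the comment describes.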
```shell
phase "PHASE 3: Build Report"

FINAL_SCORE="${SCORES[-1]:-0.0}"
```
Negative array index requires Bash 4.1+; breaks on macOS
${SCORES[-1]} uses negative array indexing introduced in Bash 4.1. macOS ships Bash 3.2 and will error with bad array subscript.
Suggested change:

```shell
FINAL_SCORE="${SCORES[${#SCORES[@]}-1]:-0.0}"
```
`${#SCORES[@]}-1` computes the last index and is compatible with Bash 3.2+ for non-empty arrays; an empty `SCORES` still needs an explicit guard, since an index of -1 is a bad subscript on Bash 3.2 as well.
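A minimal standalone demo of the Bash 3.2-safe "last element" idiom, including the empty-array guard:

```shell
# Sample score progression standing in for the harness's SCORES array
SCORES=(6.2 6.8 7.5)

if [ "${#SCORES[@]}" -eq 0 ]; then
  FINAL_SCORE="0.0"
else
  # ${#SCORES[@]} - 1 is the last valid index on Bash 3.2+
  FINAL_SCORE="${SCORES[$((${#SCORES[@]} - 1))]}"
fi

echo "$FINAL_SCORE"
```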
```shell
# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
```
GAN_EVAL_CRITERIA env var is documented but never read by the script
SKILL.md lists GAN_EVAL_CRITERIA as a supported configuration variable, but scripts/gan-harness.sh never reads it — there is no GAN_EVAL_CRITERIA="${GAN_EVAL_CRITERIA:-...}" line and no code that uses it. Users who set this variable will see no effect. Either implement it (parse the comma-separated list and inject it into the evaluator prompt) or remove it from the documented env-var table to avoid confusing consumers.
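If the variable is to be implemented rather than removed from the docs, the missing plumbing might look like the sketch below; the default list mirrors the documented table, while `CRITERIA_BULLETS` and the prompt-fragment shape are assumptions:

```shell
# Read GAN_EVAL_CRITERIA the way the other GAN_* variables are read
EVAL_CRITERIA="${GAN_EVAL_CRITERIA:-design,originality,craft,functionality}"

# Expand the comma-separated list into a bulleted fragment that could be
# injected into the evaluator prompt
CRITERIA_BULLETS=""
IFS=',' read -ra CRITERIA <<< "$EVAL_CRITERIA"
for c in "${CRITERIA[@]}"; do
  CRITERIA_BULLETS="${CRITERIA_BULLETS}- ${c}"$'\n'
done

printf '%s' "$CRITERIA_BULLETS"
```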
```shell
if [ "$SKIP_PLANNER" = "true" ] && [ -f "${HARNESS_DIR}/spec.md" ]; then
  phase "PHASE 1: Planning — SKIPPED (spec.md exists)"
else
  phase "PHASE 1: Planning"
  log "Launching Planner agent (model: $PLANNER_MODEL)..."
```
GAN_SKIP_PLANNER=true silently runs the planner when no spec.md exists
The condition [ "$SKIP_PLANNER" = "true" ] && [ -f "${HARNESS_DIR}/spec.md" ] means that even with SKIP_PLANNER=true, the planner WILL run if spec.md doesn't already exist. This contradicts the documented example in examples/gan-harness/README.md where users are shown using SKIP_PLANNER=true to provide their own spec.
A cleaner fix that fails fast:
```shell
if [ "$SKIP_PLANNER" = "true" ]; then
  if [ ! -f "${HARNESS_DIR}/spec.md" ]; then
    fail "GAN_SKIP_PLANNER=true but ${HARNESS_DIR}/spec.md does not exist. Create it first."
    exit 1
  fi
  phase "PHASE 1: Planning — SKIPPED (spec.md exists)"
else
  phase "PHASE 1: Planning"
  # ... planner invocation
fi
```
Actionable comments posted: 6
🧹 Nitpick comments (2)
scripts/gan-harness.sh (2)
77-81: Declare and assign separately to avoid masking return values. Per shellcheck SC2155, combining `local` with command substitution can mask the command's exit status.

Proposed fix:

```diff
 elapsed() {
-  local now=$(date +%s)
+  local now
+  now=$(date +%s)
   local diff=$((now - START_TIME))
   printf '%dh %dm %ds' $((diff/3600)) $((diff%3600/60)) $((diff%60))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/gan-harness.sh` around lines 77 - 81, In the elapsed function, avoid combining local with command substitution (SC2155) because it masks exit status: declare the local variables first (e.g., local now diff) and then assign now=$(date +%s) and diff=$((now - START_TIME)), leaving the printf unchanged; reference the elapsed function and variables now, diff, and START_TIME when making the change.
244-253: Consider Bash 3.2 compatibility for array indexing. The `${SCORES[-1]}` negative index syntax requires Bash 4.3+. Systems with older Bash versions (e.g., macOS with default Bash 3.2) won't support this syntax. If broader compatibility is needed, use `${SCORES[${#SCORES[@]}-1]}` instead.

Proposed alternative for compatibility:

```diff
-FINAL_SCORE="${SCORES[-1]:-0.0}"
+if [ ${#SCORES[@]} -eq 0 ]; then
+  FINAL_SCORE="0.0"
+else
+  FINAL_SCORE="${SCORES[$((${#SCORES[@]}-1))]}"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/gan-harness.sh` around lines 244 - 253, the FINAL_SCORE assignment uses Bash negative indexing `${SCORES[-1]:-0.0}` which breaks on Bash 3.2; change it to compute the last index using the array length: if `${#SCORES[@]}` is zero set FINAL_SCORE to 0.0, otherwise set FINAL_SCORE to the element at index `${SCORES[$((${#SCORES[@]}-1))]}`; update the place where FINAL_SCORE is set (symbol FINAL_SCORE and array SCORES) and keep NUM_ITERATIONS and SCORE_TABLE generation unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commands/gan-build.md`:
- Around line 1-7: Add the required YAML frontmatter with a top-level
description field to this command Markdown so it meets the commands/**/*.md
guideline; specifically, insert a YAML block (---) at the top that includes a
short description string summarizing the command (e.g. "Builds a GAN project
from a one-line brief and optional flags"), then ensure the existing argument
list (`brief`, `--max-iterations`, `--pass-threshold`, `--skip-planner`,
`--eval-mode`) remains after the frontmatter and that the YAML block is closed
(---) so tools that parse frontmatter can read the description.
In `@commands/gan-design.md`:
- Around line 1-6: This command markdown is missing the required YAML
frontmatter; add a YAML block at the top of commands/gan-design.md containing at
minimum a description field (e.g., description: "Create a GAN-style design
harness that parses brief, --max-iterations, and --pass-threshold"), and ensure
any other required frontmatter keys used by the repo (such as title or tags if
applicable) are included so the parser that reads command files can find the
description.
In `@scripts/gan-harness.sh`:
- Around line 200-202: The TOTAL line in the gan-harness.sh output is
inconsistent with the agents/gan-evaluator.md template; update the literal
evaluation instruction currently instructing `| **TOTAL** | | | **X.X** |` so it
matches the evaluator template by changing it to `| **TOTAL** | | | **X.X/10**
|` (or alternatively change the agents/gan-evaluator.md template to remove
`/10`), and ensure any code that emits or parses that string (the printf/echo
that prints the TOTAL line) is updated to use the new format so score extraction
remains reliable.
- Line 34: The variable DEV_CMD is declared but never used; either remove the
DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}" line to eliminate the dead
variable, or use it where the dev server command should be invoked (e.g.,
replace any hardcoded npm run dev invocation or pass DEV_CMD into the
generator/start logic) or add a short comment explaining it documents the
default command; update references to use DEV_CMD if you intend it to control
the dev server invocation.
- Around line 61-69: The extract_score function's lookbehind-based regexes don't
match the documented table row `| **TOTAL** | | | **X.X/10** |`, causing
fallback to 0.0; update the regex alternatives in extract_score to use simpler
direct matches such as `\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+` and
`TOTAL.*\|.*\*\*[0-9]+\.[0-9]+` (to catch table variations) and add a final
alternative like `Final Score:\s*[0-9]+\.[0-9]+` to support that format—replace
the three grep -oP lines with these patterns (preserving the 2>/dev/null and
final || echo "0.0") so the function correctly extracts scores like 7.0/10 or
7.0.
- Around line 186-203: The evaluator's claude invocation currently sets
--allowedTools without the Playwright MCP tools, so when EVAL_MODE is
"playwright" the evaluator cannot drive the browser; update the script around
the claude -p --model ... --allowedTools invocation to conditionally append the
Playwright MCP tool names (e.g. mcp__playwright__navigate,
mcp__playwright__click, mcp__playwright__fill, mcp__playwright__screenshot,
etc.) to the allowedTools list when EVAL_MODE contains or equals "playwright"
(use the EVAL_MODE variable to branch and build the --allowedTools string before
running claude, keeping the rest of the command and piping to tee
"${HARNESS_DIR}/evaluator-${i}.log").
---
Nitpick comments:
In `@scripts/gan-harness.sh`:
- Around line 77-81: In the elapsed function, avoid combining local with command
substitution (SC2155) because it masks exit status: declare the local variables
first (e.g., local now diff) and then assign now=$(date +%s) and diff=$((now -
START_TIME)), leaving the printf unchanged; reference the elapsed function and
variables now, diff, and START_TIME when making the change.
- Around line 244-253: The FINAL_SCORE assignment uses Bash negative indexing
`${SCORES[-1]:-0.0}` which breaks on Bash 3.2; change it to compute the last
index using the array length: if `${#SCORES[@]}` is zero set FINAL_SCORE to 0.0,
otherwise set FINAL_SCORE to the element at index
`${SCORES[$((${#SCORES[@]}-1))]}`; update the place where FINAL_SCORE is set
(symbol FINAL_SCORE and array SCORES) and keep NUM_ITERATIONS and SCORE_TABLE
generation unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: de343e56-73a4-4718-b121-5c6b7990ea5a
📒 Files selected for processing (8)
- agents/gan-evaluator.md
- agents/gan-generator.md
- agents/gan-planner.md
- commands/gan-build.md
- commands/gan-design.md
- examples/gan-harness/README.md
- scripts/gan-harness.sh
- skills/gan-style-harness/SKILL.md
```markdown
Parse the following from $ARGUMENTS:
1. `brief` — the user's one-line description of what to build
2. `--max-iterations N` — (optional, default 15) maximum generator-evaluator cycles
3. `--pass-threshold N` — (optional, default 7.0) weighted score to pass
4. `--skip-planner` — (optional) skip planner, assume spec.md already exists
5. `--eval-mode MODE` — (optional, default "playwright") one of: playwright, screenshot, code-only
```
Missing required description frontmatter.
Per coding guidelines, command files must include YAML frontmatter with a description field.
Proposed fix to add frontmatter:

```diff
+---
+description: "GAN-Style Harness Build — Three-agent orchestration loop for building production-quality applications"
+---
+
 Parse the following from $ARGUMENTS:
 1. `brief` — the user's one-line description of what to build
```

As per coding guidelines: commands/**/*.md: Command files must be Markdown with description frontmatter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commands/gan-build.md` around lines 1 - 7, Add the required YAML frontmatter
with a top-level description field to this command Markdown so it meets the
commands/**/*.md guideline; specifically, insert a YAML block (---) at the top
that includes a short description string summarizing the command (e.g. "Builds a
GAN project from a one-line brief and optional flags"), then ensure the existing
argument list (`brief`, `--max-iterations`, `--pass-threshold`,
`--skip-planner`, `--eval-mode`) remains after the frontmatter and that the YAML
block is closed (---) so tools that parse frontmatter can read the description.
```markdown
Parse the following from $ARGUMENTS:
1. `brief` — the user's description of the design to create
2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles
3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design)

## GAN-Style Design Harness
```
Missing required description frontmatter.
Per coding guidelines, command files must include YAML frontmatter with a description field. This command starts directly with argument parsing.
Proposed fix to add frontmatter:

```diff
+---
+description: "GAN-Style Design Harness — Two-agent loop (Generator + Evaluator) focused on frontend design quality"
+---
+
 Parse the following from $ARGUMENTS:
 1. `brief` — the user's description of the design to create
```

As per coding guidelines: commands/**/*.md: Command files must be Markdown with description frontmatter.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Parse the following from $ARGUMENTS: | |
| 1. `brief` — the user's description of the design to create | |
| 2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles | |
| 3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design) | |
| ## GAN-Style Design Harness | |
| --- | |
| description: "GAN-Style Design Harness — Two-agent loop (Generator + Evaluator) focused on frontend design quality" | |
| --- | |
| Parse the following from $ARGUMENTS: | |
| 1. `brief` — the user's description of the design to create | |
| 2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles | |
| 3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design) | |
| ## GAN-Style Design Harness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commands/gan-design.md` around lines 1-6: this command Markdown is missing
the required YAML frontmatter; add a YAML block at the top of
commands/gan-design.md containing at minimum a description field (e.g.,
description: "Create a GAN-style design harness that parses brief,
--max-iterations, and --pass-threshold"), and ensure any other required
frontmatter keys used by the repo (such as title or tags if applicable) are
included so the parser that reads command files can find the description.
| GENERATOR_MODEL="${GAN_GENERATOR_MODEL:-opus}" | ||
| EVALUATOR_MODEL="${GAN_EVALUATOR_MODEL:-opus}" | ||
| DEV_PORT="${GAN_DEV_SERVER_PORT:-3000}" | ||
| DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}" |
Unused variable DEV_CMD.
The DEV_CMD variable is defined but never referenced elsewhere in the script. Either use it (e.g., to start the dev server) or remove it.
Option 1: Remove unused variable
-DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"

Option 2: Use it to document expected server command
If the intent is for users to reference this or for future use, consider adding a comment or actually using it in the generator invocation.
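If the variable is kept, a minimal sketch of putting it to work — assuming the harness itself (rather than the generator) should own the dev server; `start_dev_server` is a hypothetical helper, not in the PR:

```shell
DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"
DEV_PORT="${GAN_DEV_SERVER_PORT:-3000}"

start_dev_server() {
  # Launch the configured command in the background; keep the PID for cleanup.
  $DEV_CMD &
  DEV_PID=$!
  trap 'kill "$DEV_PID" 2>/dev/null' EXIT
  # Poll until the server answers (up to ~30s) so the evaluator sees a live app.
  for _ in $(seq 1 30); do
    curl -sf "http://localhost:${DEV_PORT}/" >/dev/null && return 0
    sleep 1
  done
  echo "dev server did not come up on port ${DEV_PORT}" >&2
  return 1
}
```

This keeps `GAN_DEV_SERVER_CMD` meaningful as a user-facing override instead of a dead variable.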
📝 Committable suggestion
| DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}" |
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 34-34: DEV_CMD appears unused. Verify use (or export if used externally).
(SC2034)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/gan-harness.sh` at line 34: the variable DEV_CMD is declared but
never used; either remove the DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}" line
to eliminate the dead variable, or use it where the dev server command should be
invoked (e.g., replace any hardcoded npm run dev invocation or pass DEV_CMD into
the generator/start logic) or add a short comment explaining it documents the
default command; update references to use DEV_CMD if you intend it to control
the dev server invocation.
| extract_score() { | ||
| # Extract the TOTAL weighted score from a feedback file | ||
| local file="$1" | ||
| # Look for **TOTAL** or **X.X/10** pattern | ||
| grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \ | ||
| || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \ | ||
| || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \ | ||
| || echo "0.0" | ||
| } |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Test extract_score patterns against expected feedback format
# Create sample feedback matching the documented format
cat > /tmp/test-feedback.md << 'EOF'
# Evaluation — Iteration 001
## Scores
| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Design Quality | 7/10 | 0.3 | 2.1 |
| Originality | 6/10 | 0.2 | 1.2 |
| Craft | 7/10 | 0.3 | 2.1 |
| Functionality | 8/10 | 0.2 | 1.6 |
| **TOTAL** | | | **7.0/10** |
## Verdict: PASS (threshold: 7.0)
EOF
echo "Testing original patterns:"
grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null && echo "Pattern 1 matched" || echo "Pattern 1 failed"
grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null && echo "Pattern 2 matched" || echo "Pattern 2 failed"
echo ""
echo "Testing proposed patterns:"
grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' && echo "New pattern 1 matched" || echo "New pattern 1 failed"

Repository: affaan-m/everything-claude-code
Length of output: 189
🏁 Script executed:
# Find the evaluator agent and any feedback files
find . -type f -name "*.py" -o -name "*.sh" -o -name "*.md" | grep -E "(eval|feedback|gan)" | head -20
Length of output: 727
🏁 Script executed:
# Search for extract_score usage and context
rg "extract_score" -A 5 -B 5
Length of output: 1608
🏁 Script executed:
# Look for actual feedback examples in the repo
find . -type f \( -name "*.md" -o -name "*.txt" \) -exec grep -l "TOTAL" {} \; | head -10
Length of output: 142
🏁 Script executed:
# Find the evaluator agent implementation
fd -e py | xargs rg "TOTAL" -l | head -10
Length of output: 57
🏁 Script executed:
# Read the evaluator agent documentation to see the documented format
cat agents/gan-evaluator.md
Length of output: 7006
🏁 Script executed:
# Look for actual feedback files in the repo
find . -path ./node_modules -prune -o -type f -name "feedback*.md" -print | head -5
Length of output: 57
🏁 Script executed:
# Check if there's a feedback directory with examples
find . -type d -name "*feedback*" | head -5
Length of output: 57
🏁 Script executed:
# Run comprehensive test of both original and proposed patterns against documented format
cat > /tmp/comprehensive-test.sh << 'EOF'
#!/bin/bash
# Test file with documented format
cat > /tmp/test-feedback.md << 'DOC'
# Evaluation — Iteration 001
## Scores
| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Design Quality | 7/10 | 0.3 | 2.1 |
| Originality | 6/10 | 0.2 | 1.2 |
| Craft | 7/10 | 0.3 | 2.1 |
| Functionality | 8/10 | 0.2 | 1.6 |
| **TOTAL** | | | **7.0/10** |
## Verdict: PASS (threshold: 7.0)
DOC
# Test variations
cat > /tmp/test-variations.md << 'DOC'
# Variation 1: TOTAL without /10
| **TOTAL** | | | **8.5** |
# Variation 2: TOTAL with different spacing
| **TOTAL** | | | **9.2/10** |
# Variation 3: With Verdict format only
## Verdict: PASS at 6.5/10
# Variation 4: Final Score format
Final Score: 7.8/10
# Variation 5: Malformed (should fail gracefully)
TOTAL score is 5.5
DOC
echo "=== ORIGINAL PATTERNS ==="
echo "Pattern 1: grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+'"
result=$(grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null)
echo "Result: '${result:-FAILED}'"
echo ""
echo "Pattern 2: grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+'"
result=$(grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null)
echo "Result: '${result:-FAILED}'"
echo ""
echo "=== PROPOSED PATTERNS ==="
echo "Pattern 1: grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' | grep -oP '[0-9]+\.[0-9]+'"
result=$(grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"
echo ""
echo "=== EDGE CASES ==="
echo "Testing variations file..."
echo ""
echo "Proposed Pattern 1 on variations:"
result=$(grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"
echo ""
echo "Proposed Pattern 2 on variations:"
result=$(grep -oP 'TOTAL.*\|.*\*\*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"
echo ""
echo "Proposed Pattern 3 (Final Score) on variations:"
result=$(grep -oP 'Final Score:\s*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"
EOF
bash /tmp/comprehensive-test.sh
Length of output: 597
🏁 Script executed:
# Search for real feedback files in the repo to test against actual data
find . -type f -name "feedback*.md" -o -name "*feedback*.md" 2>/dev/null | head -20
Length of output: 57
🏁 Script executed:
# Check if there are test fixtures or examples in the repo
find . -type d -name "test*" -o -name "example*" -o -name "fixture*" 2>/dev/null | grep -i "gan\|eval"
Length of output: 57
Fix score extraction regex to match documented feedback format.
The original regex patterns fail to extract the TOTAL score from the documented table format `| **TOTAL** | | | **X.X/10** |`. Both patterns use variable-length lookbehind assertions that do not match this structure, so extraction falls back to 0.0, which breaks the evaluation feedback loop.
The proposed fix addresses this by using simpler, more direct patterns:
- `\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+` matches the TOTAL row regardless of the `/10` suffix
- `TOTAL.*\|.*\*\*[0-9]+\.[0-9]+` handles table variations
- `Final Score:\s*[0-9]+\.[0-9]+` supports alternative formats
Testing confirms the original patterns fail on the documented format while the proposed patterns correctly extract 7.0 from | **TOTAL** | | | **7.0/10** | and handle variations without the /10 suffix.
Proposed fix
extract_score() {
# Extract the TOTAL weighted score from a feedback file
local file="$1"
- # Look for **TOTAL** or **X.X/10** pattern
- grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
- || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
- || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+ # Look for TOTAL row in table format: | **TOTAL** | | | **X.X/10** |
+ # or: | **TOTAL** | | | **X.X** |
+ grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+ || grep -oP 'TOTAL.*\|.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+ || grep -oP 'Final Score:\s*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
|| echo "0.0"
}

📝 Committable suggestion
| extract_score() { | |
| # Extract the TOTAL weighted score from a feedback file | |
| local file="$1" | |
| # Look for **TOTAL** or **X.X/10** pattern | |
| grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \ | |
| || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \ | |
| || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \ | |
| || echo "0.0" | |
| } | |
| extract_score() { | |
| # Extract the TOTAL weighted score from a feedback file | |
| local file="$1" | |
| # Look for TOTAL row in table format: | **TOTAL** | | | **X.X/10** | | |
| # or: | **TOTAL** | | | **X.X** | | |
| grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \ | |
| || grep -oP 'TOTAL.*\|.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \ | |
| || grep -oP 'Final Score:\s*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \ | |
| || echo "0.0" | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/gan-harness.sh` around lines 61-69: the extract_score function's
lookbehind-based regexes don't match the documented table row `| **TOTAL** | | |
**X.X/10** |`, causing fallback to 0.0; update the regex alternatives in
extract_score to use simpler direct matches such as
`\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+` and `TOTAL.*\|.*\*\*[0-9]+\.[0-9]+` (to
catch table variations) and add a final alternative like `Final
Score:\s*[0-9]+\.[0-9]+` to support that format—replace the three grep -oP lines
with these patterns (preserving the 2>/dev/null and final || echo "0.0") so the
function correctly extracts scores like 7.0/10 or 7.0.
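An alternative that sidesteps PCRE entirely — an awk-based sketch (not part of the PR) that takes the last `X.X` number on the TOTAL row and falls back to `0.0`:

```shell
extract_score() {
  local file="$1"
  # Split each line on runs of non-numeric characters; on the first TOTAL row,
  # print the rightmost field shaped like X.X (so 7.0 is picked, not the /10 suffix).
  awk -F'[^0-9.]+' '/TOTAL/ {
    for (i = NF; i >= 1; i--)
      if ($i ~ /^[0-9]+\.[0-9]+$/) { print $i; found = 1; exit }
  }
  END { if (!found) print "0.0" }' "$file"
}
```

This avoids `grep -P` (unavailable on some platforms, e.g. BSD grep) while handling both the `**7.0/10**` and `**7.0**` table variants.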
| claude -p --model "$EVALUATOR_MODEL" \ | ||
| --allowedTools "Read,Write,Bash,Grep,Glob" \ | ||
| "You are the Evaluator in a GAN-style harness. Read agents/gan-evaluator.md for full instructions. | ||
|
|
||
| Iteration: $i | ||
| Eval mode: $EVAL_MODE | ||
| Dev server: http://localhost:$DEV_PORT | ||
|
|
||
| 1. Read gan-harness/eval-rubric.md for scoring criteria | ||
| 2. Read gan-harness/spec.md for feature requirements | ||
| 3. Read gan-harness/generator-state.md for what was built | ||
| 4. Test the live application (mode: $EVAL_MODE) | ||
| 5. Score against the rubric (1-10 per criterion) | ||
| 6. Write detailed feedback to gan-harness/feedback/feedback-$(printf '%03d' $i).md | ||
|
|
||
| Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.' | ||
| Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \ | ||
| 2>&1 | tee "${HARNESS_DIR}/evaluator-${i}.log" |
🧩 Analysis chain
🌐 Web query:
What is the correct allowedTools syntax for Claude Code CLI to enable Playwright MCP tools?
💡 Result:
The correct syntax for the --allowedTools flag in Claude Code CLI to enable Playwright MCP tools (after adding the MCP server with `claude mcp add playwright npx '@playwright/mcp@latest'`) uses pattern matching for MCP tools with the `mcp__<server>__<tool>` prefix. Examples: `claude --allowedTools "mcp__playwright__browser_navigate" "mcp__playwright__browser_click" "mcp__playwright__browser_snapshot"`. Or to allow all tools from the playwright server: `claude --allowedTools "mcp__playwright__*"`. First add the MCP server, then use /mcp in the session to list exact tool names like browser_click, browser_snapshot, etc., from the playwright server. Use those in the full `mcp__playwright__<tool>` format for --allowedTools to auto-approve without prompts. MCP tools require permission by default unless allowed. Note: Playwright CLI is an alternative (npm install -g @playwright/cli), used via the Bash tool as "playwright-cli open https://example.com", but the query specifies Playwright MCP tools.
Citations:
- 1: https://til.simonwillison.net/claude-code/playwright-mcp-claude-code
- 2: Allow MCP tools to be available only to subagent anthropics/claude-code#6915
- 3: https://docs.claude.com/en/docs/claude-code/cli-usage
- 4: https://github.com/microsoft/playwright-mcp
- 5: https://code.claude.com/docs/en/mcp
- 6: http://code.claude.com/docs/en/permissions
- 7: https://claude-code-playbook.pages.dev/en/docs/level-2/permission-modes/
- 8: https://docs.claude.com/en/docs/agent-sdk/custom-tools
- 9: http://getdecipher.com/blog/how-to-use-claude-code-to-write-playwright-tests-(with-playwright-mcp)
- 10: Codex is unable to use playwright mcp with extension? microsoft/playwright-mcp#990
- 11: https://skillsplayground.com/guides/claude-code-permissions/
- 12: https://blog.perfectra1n.com/using-playwright-mcp-with-claude-code/
🏁 Script executed:
# Search for references to mcp, playwright, and allowedTools in the repo
rg -i "mcp.*playwright|allowedtools" --type md --type sh -A 2 -B 2
Length of output: 7172
🏁 Script executed:
# Find and examine the gan-evaluator.md file mentioned in the script
find . -name "*gan-evaluator*" -type f
Length of output: 100
🏁 Script executed:
# Check for any existing Playwright MCP tool references in scripts
rg "mcp" --type sh -A 2 -B 2
Length of output: 2679
🏁 Script executed:
# Check what EVAL_MODE values are used and documented in the script
rg "EVAL_MODE" scripts/gan-harness.sh -B 3 -A 3
Length of output: 1794
🏁 Script executed:
# Check the gan-evaluator.md file to see what tools are documented as needed
cat -n agents/gan-evaluator.md | head -80
Length of output: 3539
Add Playwright MCP tools to evaluator's --allowedTools for playwright eval mode.
The evaluator documentation (agents/gan-evaluator.md) describes using Playwright MCP tools (mcp__playwright__navigate, mcp__playwright__click, etc.) for browser interaction, but the current --allowedTools parameter omits them. In playwright mode, the evaluator will lack access to these tools.
Conditionally include Playwright MCP tools based on eval mode:
Proposed fix
# ── EVALUATE ──
echo -e "${RED}▶ EVALUATOR (iteration $i)${NC}"
+ # Build allowed tools based on eval mode
+ EVAL_TOOLS="Read,Write,Bash,Grep,Glob"
+ if [ "$EVAL_MODE" = "playwright" ]; then
+ EVAL_TOOLS="Read,Write,Bash,Grep,Glob,mcp__playwright__*"
+ fi
+
claude -p --model "$EVALUATOR_MODEL" \
- --allowedTools "Read,Write,Bash,Grep,Glob" \
+ --allowedTools "$EVAL_TOOLS" \
"You are the Evaluator in a GAN-style harness. Read agents/gan-evaluator.md for full instructions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/gan-harness.sh` around lines 186-203: the evaluator's claude
invocation currently sets --allowedTools without the Playwright MCP tools, so
when EVAL_MODE is "playwright" the evaluator cannot drive the browser; update
the script around the claude -p --model ... --allowedTools invocation to
conditionally append the Playwright MCP tool names (e.g.
mcp__playwright__navigate, mcp__playwright__click, mcp__playwright__fill,
mcp__playwright__screenshot, etc.) to the allowedTools list when EVAL_MODE
contains or equals "playwright" (use the EVAL_MODE variable to branch and build
the --allowedTools string before running claude, keeping the rest of the command
and piping to tee "${HARNESS_DIR}/evaluator-${i}.log").
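The branching logic from the proposed fix can be factored into a small helper and exercised on its own (a sketch; the `mcp__playwright__*` wildcard follows the review's suggestion and should be checked against the installed MCP server's actual tool names):

```shell
# Build the --allowedTools value for the evaluator from the eval mode.
build_eval_tools() {
  local mode="$1"
  local tools="Read,Write,Bash,Grep,Glob"
  # Playwright mode additionally needs the browser-driving MCP tools.
  if [ "$mode" = "playwright" ]; then
    tools="${tools},mcp__playwright__*"
  fi
  printf '%s\n' "$tools"
}
```

The harness would then invoke `claude -p --allowedTools "$(build_eval_tools "$EVAL_MODE")" …`, leaving the `screenshot` and `code-only` modes unchanged.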
|
|
||
| Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.' | ||
| Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \ |
Inconsistent TOTAL format instruction vs. agent template.
Line 202 instructs the evaluator to use format | **TOTAL** | | | **X.X** | (without /10), but the agent template in agents/gan-evaluator.md shows **X.X/10**. This inconsistency could cause score extraction failures.
Proposed fix to align with agent template
-Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \
+Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X/10** |" \

📝 Committable suggestion
| Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.' | |
| Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \ | |
| Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.' | |
| Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X/10** |" \ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/gan-harness.sh` around lines 200-202: the TOTAL line in the
gan-harness.sh output is inconsistent with the agents/gan-evaluator.md template;
update the literal evaluation instruction currently instructing `| **TOTAL** | |
| **X.X** |` so it matches the evaluator template by changing it to `| **TOTAL**
| | | **X.X/10** |` (or alternatively change the agents/gan-evaluator.md
template to remove `/10`), and ensure any code that emits or parses that string
(the printf/echo that prints the TOTAL line) is updated to use the new format so
score extraction remains reliable.
Summary
Implements the GAN-inspired multi-agent harness pattern from Anthropic's March 2026 engineering blog post — separating generation from evaluation to create an adversarial feedback loop that produces production-quality applications.
The Core Idea
This PR adds a complete three-agent harness:
Files Added (8 files, 1276 lines)
- `agents/gan-planner.md` — Planner agent definition
- `agents/gan-generator.md` — Generator agent definition
- `agents/gan-evaluator.md` — Evaluator agent definition (with Playwright MCP integration)
- `skills/gan-style-harness/SKILL.md` — Full skill documentation with architecture, config, anti-patterns
- `commands/gan-build.md` — Three-agent build command
- `commands/gan-design.md` — Two-agent frontend design command
- `scripts/gan-harness.sh` — Standalone shell orchestrator
- `examples/gan-harness/README.md` — Usage examples for different project types

Key Features
- Eval modes: `playwright` (live browser), `screenshot` (visual), `code-only` (APIs/CLIs)

Usage
References
Test plan
- `scripts/gan-harness.sh` with a simple project

🤖 Generated with Claude Code
Summary by cubic
Adds a GAN-style generator–evaluator harness that separates building from strict QA to drive higher-quality apps through iterative scoring. Includes planner, loop orchestration, weighted scoring, and
playwright-based testing.

- `gan-build` (full apps) and `gan-design` (frontend-focused)
- `scripts/gan-harness.sh` with env-based config and reports
- Eval modes: `playwright`, `screenshot`, `code-only`
- `skills/gan-style-harness/`, `examples/gan-harness/`, feedback and report outputs under `gan-harness/`

Written for commit a3942b4. Summary will update on new commits.
Summary by CodeRabbit
Release Notes
New Features
- `gan-build` command for full-cycle development with automated feedback loops
- `gan-design` command for design-focused iterations

Documentation