
feat: add GAN-style generator-evaluator harness#1029

Merged
affaan-m merged 1 commit into affaan-m:main from haochen806:feature/gan-style-harness
Mar 31, 2026

Conversation

Contributor

@haochen806 haochen806 commented Mar 30, 2026

Summary

Implements the GAN-inspired multi-agent harness pattern from Anthropic's March 2026 engineering blog post — separating generation from evaluation to create an adversarial feedback loop that produces production-quality applications.

The Core Idea

When asked to evaluate their own work, agents are pathological optimists. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.

This PR adds a complete three-agent harness:

| Agent | Role | What It Does |
| --- | --- | --- |
| Planner | Product Manager | Expands a one-line prompt into a 12-16 feature product spec |
| Generator | Developer | Builds the app, reads evaluator feedback, iterates |
| Evaluator | QA Engineer | Tests the live running app via Playwright, scores against the rubric |

Files Added (8 files, 1276 lines)

  • agents/gan-planner.md — Planner agent definition
  • agents/gan-generator.md — Generator agent definition
  • agents/gan-evaluator.md — Evaluator agent definition (with Playwright MCP integration)
  • skills/gan-style-harness/SKILL.md — Full skill documentation with architecture, config, anti-patterns
  • commands/gan-build.md — Three-agent build command
  • commands/gan-design.md — Two-agent frontend design command
  • scripts/gan-harness.sh — Standalone shell orchestrator
  • examples/gan-harness/README.md — Usage examples for different project types

Key Features

  • 4-dimension scoring: Design Quality, Originality, Craft, Functionality (weighted)
  • Pass threshold: Configurable (default 7.0/10), loop until met or max iterations
  • Plateau detection: Auto-stops if score stagnates for 2+ iterations
  • 3 eval modes: playwright (live browser), screenshot (visual), code-only (APIs/CLIs)
  • Model evolution support: Documents how the harness should simplify as models improve (following Anthropic's Stage 1→2→3 evolution)
  • Anti-AI-slop directives: Evaluator explicitly penalizes generic gradients, stock layouts, template aesthetics
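The plateau rule above can be sketched as a small Bash helper. This is illustrative only: `is_plateaued`, `EPSILON`, and `PLATEAU_ROUNDS` are assumed names, not the identifiers `scripts/gan-harness.sh` actually uses.

```shell
# Stop early when the score improves by no more than EPSILON for
# PLATEAU_ROUNDS consecutive iterations. Names here are illustrative.
EPSILON="0.1"
PLATEAU_ROUNDS=2

is_plateaued() {
  # Usage: is_plateaued score1 score2 ... scoreN (oldest first)
  local scores=("$@")
  local n=${#scores[@]}
  if [ "$n" -le "$PLATEAU_ROUNDS" ]; then
    return 1  # not enough history to judge yet
  fi
  local stagnant=0 i
  for ((i = n - PLATEAU_ROUNDS; i < n; i++)); do
    # awk does the floating-point comparison portably
    if awk -v p="${scores[i-1]}" -v c="${scores[i]}" -v e="$EPSILON" \
         'BEGIN { exit !(c - p <= e) }'; then
      stagnant=$((stagnant + 1))
    fi
  done
  [ "$stagnant" -eq "$PLATEAU_ROUNDS" ]
}
```

With a score history of 5.0 → 6.1 → 6.15 → 6.2, the last two deltas fall within EPSILON, so the loop would stop early rather than burn the remaining iterations.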

Usage

```bash
# Via shell script
./scripts/gan-harness.sh "Build a project management app with Kanban boards"

# Via Claude Code command
/project:gan-build "Build a music streaming dashboard"

# Frontend design only
/project:gan-design "Create a landing page for a crypto portfolio tracker"
```

References

Test plan

  • Review agent prompts for completeness and clarity
  • Test scripts/gan-harness.sh with a simple project
  • Verify Playwright MCP integration in evaluator
  • Validate score extraction regex in shell script
  • Test plateau detection logic

🤖 Generated with Claude Code


Summary by cubic

Adds a GAN-style generator–evaluator harness that separates building from strict QA to drive higher-quality apps through iterative scoring. Includes planner, loop orchestration, weighted scoring, and Playwright-based testing.

  • New Features
    • Three agents: planner, generator, evaluator with live testing via Playwright
    • Commands: gan-build (full apps) and gan-design (frontend-focused)
    • Orchestrator script: scripts/gan-harness.sh with env-based config and reports
    • Scoring: 4 criteria (design, originality, craft, functionality), weighted total, pass threshold, plateau detection
    • Eval modes: playwright, screenshot, code-only
    • Docs and examples: skills/gan-style-harness/, examples/gan-harness/, feedback and report outputs under gan-harness/

Written for commit a3942b4. Summary will update on new commits.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced "GAN-style harness" for iterative multi-agent application development and evaluation
    • New gan-build command for full-cycle development with automated feedback loops
    • New gan-design command for design-focused iterations
    • Configurable evaluation thresholds, iteration limits, and scoring criteria
    • Automated testing with structured feedback generation across iterations
    • Early-stopping when pass thresholds are met or progress plateaus
  • Documentation

    • Added comprehensive guides for running multi-agent workflows
    • Provided example configurations and quick-start instructions

Implements Anthropic's March 2026 harness design pattern — a multi-agent
architecture that separates generation from evaluation, creating an
adversarial feedback loop that produces production-quality applications.

Components:
- 3 agent definitions (planner, generator, evaluator)
- 1 skill with full documentation (skills/gan-style-harness/)
- 2 commands (gan-build for full apps, gan-design for frontend)
- 1 shell orchestrator (scripts/gan-harness.sh)
- Examples and configuration reference

Based on: https://www.anthropic.com/engineering/harness-design-long-running-apps
@ecc-tools
Contributor

ecc-tools bot commented Mar 30, 2026

Analyzing 5000 commits...

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough


This PR introduces a comprehensive "GAN-style harness" system for iterative application development and evaluation. It includes three agent specifications (Planner, Generator, Evaluator), command documentation, a Bash orchestration script, supporting skill documentation, and examples that coordinate a multi-iteration feedback loop where an Evaluator tests running applications and provides structured feedback to a Generator for continuous improvement.

Changes

  • **Agent Specifications** (`agents/gan-planner.md`, `agents/gan-generator.md`, `agents/gan-evaluator.md`): Define the three-agent architecture: the Planner transforms the user brief into a comprehensive spec/rubric, the Generator implements features per feedback, and the Evaluator performs systematic testing with Playwright and design audits using weighted scoring (Design/Originality/Craft/Functionality).
  • **Command Documentation** (`commands/gan-build.md`, `commands/gan-design.md`): Document CLI workflows: gan-build runs a three-phase loop (setup, optional planning, iteration with generator/evaluator); gan-design streamlines for design-focused projects with brief-as-spec and a visual-priority rubric.
  • **Orchestration & Skill** (`scripts/gan-harness.sh`, `skills/gan-style-harness/SKILL.md`): The Bash script implements a multi-phase harness with phase management, score extraction, early-stop plateau detection, and report generation; the skill documentation defines configuration variables, evaluation modes (playwright/screenshot/code-only), and usage patterns.
  • **Examples & Documentation** (`examples/gan-harness/README.md`): Provides Quick Start, step-by-step workflows, an environment variable reference, a project-type recommendations table, and operational tips for running GAN-style harnesses.

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Planner as Planner Agent
    participant Harness as GAN Harness Script
    participant Generator as Generator Agent
    participant DevServer as Dev Server
    participant Evaluator as Evaluator Agent
    participant Filesystem as Filesystem<br/>(spec/feedback/state)

    User->>Harness: run gan-build brief
    Harness->>Planner: generate spec & rubric
    Planner->>Filesystem: write spec.md, eval-rubric.md

    loop Iteration Loop (1 to max-iterations)
        Harness->>Filesystem: read feedback (if iter > 1)
        Harness->>Generator: generate/update app
        Generator->>DevServer: start/update running app
        Generator->>Filesystem: write generator-state.md, commit

        Harness->>Evaluator: evaluate running app
        Evaluator->>DevServer: Playwright testing
        Evaluator->>Filesystem: write feedback-NNN.md w/ score

        Harness->>Harness: extract score, check threshold
        alt Score >= pass-threshold or plateau detected
            Harness->>Harness: early stop
        end
    end

    Harness->>Filesystem: write build-report.md
    Harness->>User: final summary, score progression
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • affaan-m

Poem

🐰 A harness of agents, they plan and they build,
Feedback loops spinning, with purpose fulfilled,
Planner designs the dream, Generator makes it real,
Evaluator judges with rubric and zeal,
Round and round they iterate—chef's kiss—until the score takes flight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'feat: add GAN-style generator-evaluator harness' directly and clearly summarizes the main change. It is concise, specific, and accurately represents the core contribution. |



@ecc-tools
Contributor

ecc-tools bot commented Mar 30, 2026

Analysis Failed

Not Found - https://docs.github.com/rest/git/refs#get-a-reference

Troubleshooting
| Cause | Resolution |
| --- | --- |
| Large repository | Analysis may time out on repos with extensive history |
| API rate limits | Wait 15 minutes before retrying |
| Network issues | Queue timeout is 15 minutes; a retry may succeed |
| Permissions | Verify the app has Contents: Read access |

Retry: /ecc-tools analyze



Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 8 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/gan-style-harness/SKILL.md">

<violation number="1" location="skills/gan-style-harness/SKILL.md:230">
P2: SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.</violation>
</file>

<file name="agents/gan-evaluator.md">

<violation number="1" location="agents/gan-evaluator.md:4">
P2: Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.</violation>
</file>

<file name="scripts/gan-harness.sh">

<violation number="1" location="scripts/gan-harness.sh:105">
P2: config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.</violation>

<violation number="2" location="scripts/gan-harness.sh:244">
P2: Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under `set -euo pipefail`. Use an explicit last index with an empty-array guard for compatibility.</violation>
</file>

<file name="commands/gan-build.md">

<violation number="1" location="commands/gan-build.md:94">
P2: Output section lists zero‑padded feedback filenames (`feedback-001.md`), which conflicts with the earlier unpadded `feedback-{iteration}.md` paths. This inconsistency can mislead users and tooling about expected file names.</violation>
</file>


| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 30, 2026


P2: SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/gan-style-harness/SKILL.md, line 230:

<comment>SKILL.md documents GAN_EVAL_CRITERIA as a supported env var, but the harness script never reads or applies it, so users setting it will get no effect.</comment>

<file context>
@@ -0,0 +1,278 @@
+| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
+| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
+| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
+| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
+| `GAN_DEV_SERVER_PORT` | `3000` | Port for the live app |
+| `GAN_DEV_SERVER_CMD` | `npm run dev` | Command to start dev server |
</file context>

```yaml
---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
tools: ["Read", "Write", "Bash", "Grep", "Glob"]
```
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 30, 2026


P2: Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agents/gan-evaluator.md, line 4:

<comment>Evaluator workflow depends on Playwright/MCP, but the agent’s tool list doesn’t include any Playwright/MCP tools, so the documented browser-testing steps can’t run.</comment>

<file context>
@@ -0,0 +1,209 @@
+---
+name: gan-evaluator
+description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
+tools: ["Read", "Write", "Bash", "Grep", "Glob"]
+model: opus
+color: red
</file context>


```bash
phase "PHASE 3: Build Report"

FINAL_SCORE="${SCORES[-1]:-0.0}"
```
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 30, 2026


P2: Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under set -euo pipefail. Use an explicit last index with an empty-array guard for compatibility.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/gan-harness.sh, line 244:

<comment>Negative array index is not supported in older Bash (e.g., macOS 3.2) and can terminate the script under `set -euo pipefail`. Use an explicit last index with an empty-array guard for compatibility.</comment>

<file context>
@@ -0,0 +1,299 @@
+
+phase "PHASE 3: Build Report"
+
+FINAL_SCORE="${SCORES[-1]:-0.0}"
+NUM_ITERATIONS=${#SCORES[@]}
+ELAPSED=$(elapsed)
</file context>

```bash
# Write config
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": "$BRIEF",
```
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 30, 2026


P2: config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/gan-harness.sh, line 105:

<comment>config.json embeds the raw brief without JSON escaping, so quotes/newlines in user input will produce invalid JSON.</comment>

<file context>
@@ -0,0 +1,299 @@
+# Write config
+cat > "${HARNESS_DIR}/config.json" << EOF
+{
+  "brief": "$BRIEF",
+  "maxIterations": $MAX_ITERATIONS,
+  "passThreshold": $PASS_THRESHOLD,
</file context>

### Files Created
- gan-harness/spec.md
- gan-harness/eval-rubric.md
- gan-harness/feedback/feedback-001.md through feedback-NNN.md
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 30, 2026


P2: Output section lists zero‑padded feedback filenames (feedback-001.md), which conflicts with the earlier unpadded feedback-{iteration}.md paths. This inconsistency can mislead users and tooling about expected file names.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At commands/gan-build.md, line 94:

<comment>Output section lists zero‑padded feedback filenames (`feedback-001.md`), which conflicts with the earlier unpadded `feedback-{iteration}.md` paths. This inconsistency can mislead users and tooling about expected file names.</comment>

<file context>
@@ -0,0 +1,99 @@
+### Files Created
+- gan-harness/spec.md
+- gan-harness/eval-rubric.md
+- gan-harness/feedback/feedback-001.md through feedback-NNN.md
+- gan-harness/generator-state.md
+- gan-harness/build-report.md
</file context>

@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR adds a GAN-inspired three-agent harness (Planner → Generator → Evaluator feedback loop) for autonomously building production-quality applications, comprising 8 new files across agents/, commands/, skills/, scripts/, and examples/. The architecture is well-conceived and the agent prompts are thoughtfully engineered, but several concrete bugs in the shell orchestrator and a critical mismatch between the advertised Playwright evaluation and its actual implementation need to be resolved before the harness works reliably end-to-end.

Key issues found:

  • **Score extraction is broken**: `extract_score()` uses variable-length PCRE lookbehinds that fail on most platforms, and the **TOTAL** format expected by the regex differs between the evaluator agent prompt (`**X.X/10**`) and the shell script instruction (`**X.X**`). This causes the function to silently return 0.0, making every run exhaust MAX_ITERATIONS.
  • **Playwright mode non-functional by default**: `agents/gan-evaluator.md` and the shell script's `--allowedTools` both omit all `mcp__playwright__*` tools, so the default playwright eval mode silently degrades to a Bash-based fallback. Users must configure Playwright MCP externally with no prompting from the harness.
  • **JSON injection in config.json**: `$BRIEF` is written unescaped into the JSON heredoc; double quotes or backslashes in the input produce invalid JSON.
  • **Evaluator ignores design-mode weights**: the evaluator hardcodes 0.3/0.2/0.3/0.2 weights, silently discarding the higher originality weight (0.30) that gan-design.md writes into eval-rubric.md.
  • The `GAN_EVAL_CRITERIA` env var is documented but never read by the script.
  • `${SCORES[-1]}` requires Bash 4.1+, breaking on macOS (which ships Bash 3.2).
  • `GAN_SKIP_PLANNER=true` with no spec.md silently falls through and runs the planner anyway, contradicting the documented example usage.

Confidence Score: 4/5

The documentation and agent prompts are high quality, but two P1 bugs (broken score extraction, missing Playwright tools) prevent the harness from working correctly in its default configuration and should be addressed before merging.

Four P1 findings on the primary execution path: score extraction silently returns 0.0 causing all runs to exhaust max iterations; default playwright eval mode is non-functional because MCP tools are not declared; evaluator hardcodes weights that conflict with design-mode rubric; JSON injection from unescaped brief. Once those are resolved the harness is otherwise well-designed.

Files needing attention: scripts/gan-harness.sh (score extraction regex, JSON injection, Bash 4.1 compat, SKIP_PLANNER logic) and agents/gan-evaluator.md (missing Playwright tools, hardcoded weights).

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/gan-harness.sh | Core shell orchestrator; P1 issues: unreliable score extraction (variable-length lookbehind regex), JSON injection from unescaped `$BRIEF`, and missing Playwright tools in `--allowedTools`; also a Bash 4.1+ compatibility issue with `${SCORES[-1]}` |
| agents/gan-evaluator.md | Evaluator agent definition; P1: Playwright MCP tools absent from tools array, so default playwright mode silently falls back; P1: hardcoded scoring weights (0.3/0.2/0.3/0.2) override the design-mode rubric weights |
| agents/gan-generator.md | Generator agent definition; well-structured workflow instructions, clear iteration protocol, useful anti-AI-slop directives; no significant issues found |
| agents/gan-planner.md | Planner agent definition; thorough spec output format, good anti-slop directives; no issues found |
| commands/gan-design.md | Design-focused two-agent command; specifies custom weights (originality: 0.30, design: 0.35) that the evaluator agent silently ignores due to its hardcoded formula |
| skills/gan-style-harness/SKILL.md | Full skill documentation; well-written architecture overview; documents the `GAN_EVAL_CRITERIA` env var that the script never implements |
| examples/gan-harness/README.md | Usage examples; clear and helpful; the `GAN_SKIP_PLANNER=true` example may mislead users since the script silently runs the planner if spec.md doesn't already exist |

Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User
    participant S as gan-harness.sh
    participant P as Planner Agent
    participant G as Generator Agent
    participant E as Evaluator Agent
    participant FS as File System

    U->>S: ./gan-harness.sh "brief"
    S->>FS: mkdir gan-harness/{feedback,screenshots}

    alt SKIP_PLANNER=false
        S->>P: claude -p (planner prompt + brief)
        P->>FS: Write gan-harness/spec.md
        P->>FS: Write gan-harness/eval-rubric.md
        P-->>S: planner-output.log
    end

    loop iteration 1..MAX_ITERATIONS
        S->>G: claude -p (generator prompt + feedback ref)
        G->>FS: Read spec.md
        G->>FS: Read feedback-{N-1}.md (if i>1)
        G->>FS: Write/update app files
        G->>FS: Write generator-state.md
        G-->>S: generator-{i}.log

        S->>E: claude -p --allowedTools Read,Write,Bash,Grep,Glob
        Note over E: mcp__playwright__* tools NOT included
        E->>FS: Read eval-rubric.md + spec.md
        E->>FS: Write feedback/feedback-{i}.md
        E-->>S: evaluator-{i}.log

        S->>FS: extract_score(feedback-{i}.md)
        Note over S: variable-length lookbehind may fail - returns 0.0
        alt score >= PASS_THRESHOLD
            S-->>U: PASS
        else plateau detected
            S-->>U: PLATEAU - stopping early
        end
    end

    S->>FS: Write build-report.md
    S-->>U: Final report + score progression
```

Reviews (1): Last reviewed commit: "feat: add GAN-style generator-evaluator ..." | Re-trigger Greptile

Comment on lines +61 to +69
```bash
extract_score() {
  # Extract the TOTAL weighted score from a feedback file
  local file="$1"
  # Look for **TOTAL** or **X.X/10** pattern
  grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
    || echo "0.0"
}
```
Contributor


P1 Score extraction regex is unreliable in practice

The extract_score function has two compounding problems that will cause it to silently return 0.0 in many real runs, causing the harness to exhaust all MAX_ITERATIONS even when the app has already passed.

Problem 1 — variable-length lookbehind. PCRE lookbehinds must be fixed-length in most distributions. (?<=\*\*TOTAL\*\*.*\*\*) and (?<=TOTAL.*\|.*\| \*\*) are variable-length and will produce a "lookbehind assertion is not fixed length" error on PCRE1-based systems (grep on macOS and many Linux distributions).

Problem 2 — template format mismatch. The evaluator agent prompt (line 129 of agents/gan-evaluator.md) instructs the evaluator to write **X.X/10**, but the shell script's evaluator prompt (line 202) says the format is **X.X** (no /10). Neither regex reliably matches either format against the actual table row | **TOTAL** | | | **X.X** |.

A simpler, portable pattern that matches the intended format:

Suggested change:

```bash
extract_score() {
  # Extract the TOTAL weighted score from a feedback file
  local file="$1"
  # Match: | **TOTAL** | | | **7.5** |  or  **7.5/10**
  grep -oP '\|\s*\*\*TOTAL\*\*.*\|\s*\*\*\K[0-9]+\.[0-9]+(?=(/10)?\*\*)' "$file" 2>/dev/null \
    || grep -oP 'Verdict:[^0-9]*\K[0-9]+\.[0-9]+' "$file" 2>/dev/null \
    || echo "0.0"
}
```

Also, the evaluator agent prompt template (agents/gan-evaluator.md line 129) and the shell script evaluator instruction (line 202) should be aligned to the same format (**X.X**) so the regex has only one target to match.
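If PCRE `grep -oP` remains a portability concern even after the regex fix, an awk-based variant avoids it entirely. A sketch, under the assumption that the score is the last `N.N`-style number on the TOTAL line (which holds for both formats discussed above):

```shell
# Portable extraction: no PCRE, works with POSIX awk.
# Assumes the TOTAL line carries the score as its last N.N number.
extract_score() {
  local file="$1"
  awk '
    /\*\*TOTAL\*\*/ {
      # keep the last decimal number found on any TOTAL line
      while (match($0, /[0-9]+\.[0-9]+/)) {
        score = substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
      }
    }
    END { print (score == "" ? "0.0" : score) }
  ' "$file"
}
```

Because awk's `match`/`RSTART`/`RLENGTH` are POSIX, this behaves the same on macOS and Linux, and it naturally skips the `/10` suffix since `10` has no decimal point.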

Comment on lines +103 to +117
```bash
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": "$BRIEF",
  "maxIterations": $MAX_ITERATIONS,
  "passThreshold": $PASS_THRESHOLD,
  "models": {
    "planner": "$PLANNER_MODEL",
    "generator": "$GENERATOR_MODEL",
    "evaluator": "$EVALUATOR_MODEL"
  },
  "evalMode": "$EVAL_MODE",
  "devServerPort": $DEV_PORT,
  "startedAt": "$(date -Iseconds)"
}
EOF
```
Contributor


P1 $BRIEF in JSON heredoc causes invalid JSON on common inputs

$BRIEF is written directly into config.json without escaping JSON special characters. Any input containing a double-quote, backslash, or newline will produce malformed JSON. For example, a brief like Build a "real-time" app produces:

```text
{
  "brief": "Build a "real-time" app",   ← invalid JSON
  ...
}
```

Replace the raw expansion with a jq-based write (or use printf '%s' "$BRIEF" | python3 -c 'import json,sys; print(json.dumps(sys.stdin.read()))' as a fallback):

```bash
BRIEF_JSON=$(printf '%s' "$BRIEF" | python3 -c 'import json,sys; print(json.dumps(sys.stdin.read().strip()))')
cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": $BRIEF_JSON,
  ...
}
EOF
```

The same unescaped $BRIEF is also interpolated into the planner prompt string (line 138), but since that's passed as a CLI argument to claude, the risk there is limited to aesthetic formatting issues rather than structural breakage.
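Filled out end to end with illustrative values, the escaped write looks like the sketch below. The brief value, the `/tmp` directory, and the reduced key set are stand-ins, not the script's real configuration.

```shell
# Illustrative values — the real script takes these from CLI args / env.
BRIEF='Build a "real-time" app'
MAX_ITERATIONS=5
PASS_THRESHOLD=7.0
HARNESS_DIR="${HARNESS_DIR:-/tmp/gan-harness-demo}"
mkdir -p "$HARNESS_DIR"

# json.dumps escapes quotes, backslashes, and newlines in the brief,
# so the heredoc always emits valid JSON.
BRIEF_JSON=$(printf '%s' "$BRIEF" \
  | python3 -c 'import json, sys; print(json.dumps(sys.stdin.read()))')

cat > "${HARNESS_DIR}/config.json" << EOF
{
  "brief": $BRIEF_JSON,
  "maxIterations": $MAX_ITERATIONS,
  "passThreshold": $PASS_THRESHOLD
}
EOF
```

Running `python3 -m json.tool` on the result confirms it still parses despite the embedded double quotes.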

Comment on lines +1 to +6
```yaml
---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Tests the live running application via Playwright, scores against rubric, and provides actionable feedback to the Generator."
tools: ["Read", "Write", "Bash", "Grep", "Glob"]
model: opus
color: red
```
Contributor


P1 Playwright MCP tools missing from agent definition — default eval mode non-functional

The evaluator's tools frontmatter only lists ["Read", "Write", "Bash", "Grep", "Glob"]. In playwright mode (the default), the agent's primary mechanism is Playwright MCP (mcp__playwright__navigate, mcp__playwright__click, etc.), but those tools are absent from the definition. The agent will immediately fall through to the Bash fallback path (npx playwright test) instead.

The shell script reinforces this: --allowedTools "Read,Write,Bash,Grep,Glob" on line 187 also omits all Playwright MCP tools. This means running gan-harness.sh with the default GAN_EVAL_MODE=playwright silently degrades to code-only evaluation — the core value proposition of live browser testing is never exercised.

Update the agent frontmatter to declare the Playwright MCP tools:

Suggested change:

```diff
-tools: ["Read", "Write", "Bash", "Grep", "Glob"]
+tools: ["Read", "Write", "Bash", "Grep", "Glob", "mcp__playwright__navigate", "mcp__playwright__click", "mcp__playwright__fill", "mcp__playwright__screenshot", "mcp__playwright__evaluate"]
```

And in scripts/gan-harness.sh line 187, update --allowedTools to include the same Playwright tools. Users who don't have Playwright MCP configured should be warned at startup to set GAN_EVAL_MODE=code-only.

Comment on lines +110 to +112
```
weighted = (design * 0.3) + (originality * 0.2) + (craft * 0.3) + (functionality * 0.2)
```
Contributor


P1 Hardcoded weights in evaluator ignore the design-mode rubric

The evaluator's scoring formula is baked in:

```text
weighted = (design * 0.3) + (originality * 0.2) + (craft * 0.3) + (functionality * 0.2)
```

commands/gan-design.md explicitly sets a different weight profile for design mode (originality: 0.30, design: 0.35, craft: 0.25, functionality: 0.10) and instructs users to write that profile into gan-harness/eval-rubric.md. But the evaluator agent never reads weights from eval-rubric.md — it always applies the default formula regardless of mode. The design-specific rubric weights are silently discarded.

The evaluator should read and apply weights from eval-rubric.md rather than hardcoding them:

```markdown
**Weighted score formula:**
Read the weights for each criterion from `gan-harness/eval-rubric.md` and compute:
`weighted = sum(criterion_score * criterion_weight)` for all criteria listed in the rubric.
```
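One way to make that concrete is a tiny parser for weight lines in the rubric. This is a sketch: `read_weight` is a hypothetical helper, and the `- criterion: 0.35` line format is an assumption about how the rubric lists its weights.

```shell
# read_weight CRITERION RUBRIC_FILE DEFAULT
# Prints the weight from a "- criterion: 0.35"-style line, or DEFAULT
# if the rubric has no such line.
read_weight() {
  local criterion="$1" rubric="$2" default="$3"
  local w
  w=$(awk -v c="$criterion" \
    '$0 ~ "^[-*] *" c ":" { print $NF; exit }' "$rubric" 2>/dev/null)
  printf '%s\n' "${w:-$default}"
}
```

The orchestrator (or the evaluator prompt) could then compute the weighted total from the parsed values instead of the baked-in constants, so a design-mode rubric's weights actually take effect.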


```bash
phase "PHASE 3: Build Report"

FINAL_SCORE="${SCORES[-1]:-0.0}"
```
Contributor


P2 Negative array index requires Bash 4.1+; breaks on macOS

${SCORES[-1]} uses negative array indexing introduced in Bash 4.1. macOS ships Bash 3.2 and will error with bad array subscript.

Suggested change:

```diff
-FINAL_SCORE="${SCORES[-1]:-0.0}"
+FINAL_SCORE="${SCORES[${#SCORES[@]}-1]:-0.0}"
```

${#SCORES[@]}-1 computes the last index and is compatible with Bash 3.2+.
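Note the suggested subscript still evaluates to index -1 when the array is empty, so a guard is needed for that edge case too. One Bash 3.2-safe sketch routes through the positional parameters (`last_score` is an illustrative helper, not part of the script):

```shell
# Print the last argument, or "0.0" if none were given.
# ${!#} indirectly expands the last positional parameter and works on
# Bash 3.2, which has no negative array indexes.
last_score() {
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "${!#}"
  else
    printf '0.0\n'
  fi
}
```

Called as `FINAL_SCORE=$(last_score "${SCORES[@]+"${SCORES[@]}"}")`; the `+` alternate expansion keeps `set -u` happy when `SCORES` is empty.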

Comment on lines +170 to +175
```bash
# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
```
Contributor


P2 GAN_EVAL_CRITERIA env var is documented but never read by the script

SKILL.md lists GAN_EVAL_CRITERIA as a supported configuration variable, but scripts/gan-harness.sh never reads it — there is no GAN_EVAL_CRITERIA="${GAN_EVAL_CRITERIA:-...}" line and no code that uses it. Users who set this variable will see no effect. Either implement it (parse the comma-separated list and inject it into the evaluator prompt) or remove it from the documented env-var table to avoid confusing consumers.

Comment on lines +123 to +127
if [ "$SKIP_PLANNER" = "true" ] && [ -f "${HARNESS_DIR}/spec.md" ]; then
phase "PHASE 1: Planning — SKIPPED (spec.md exists)"
else
phase "PHASE 1: Planning"
log "Launching Planner agent (model: $PLANNER_MODEL)..."

P2 GAN_SKIP_PLANNER=true silently runs the planner when no spec.md exists

The condition [ "$SKIP_PLANNER" = "true" ] && [ -f "${HARNESS_DIR}/spec.md" ] means that even with SKIP_PLANNER=true, the planner WILL run if spec.md doesn't already exist. This contradicts the documented example in examples/gan-harness/README.md where users are shown using SKIP_PLANNER=true to provide their own spec.

A cleaner fix that fails fast:

if [ "$SKIP_PLANNER" = "true" ]; then
  if [ ! -f "${HARNESS_DIR}/spec.md" ]; then
    fail "GAN_SKIP_PLANNER=true but ${HARNESS_DIR}/spec.md does not exist. Create it first."
    exit 1
  fi
  phase "PHASE 1: Planning — SKIPPED (spec.md exists)"
else
  phase "PHASE 1: Planning"
  # ... planner invocation
fi


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
scripts/gan-harness.sh (2)

77-81: Declare and assign separately to avoid masking return values.

Per shellcheck SC2155, combining local with command substitution can mask the command's exit status.

Proposed fix
 elapsed() {
-  local now=$(date +%s)
+  local now
+  now=$(date +%s)
   local diff=$((now - START_TIME))
   printf '%dh %dm %ds' $((diff/3600)) $((diff%3600/60)) $((diff%60))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` around lines 77 - 81, In the elapsed function, avoid
combining local with command substitution (SC2155) because it masks exit status:
declare the local variables first (e.g., local now diff) and then assign
now=$(date +%s) and diff=$((now - START_TIME)), leaving the printf unchanged;
reference the elapsed function and variables now, diff, and START_TIME when
making the change.

244-253: Consider Bash 3.2 compatibility for array indexing.

The ${SCORES[-1]} negative index syntax requires Bash 4.3+. Systems with older Bash versions (e.g., macOS with default Bash 3.2) won't support this syntax. If broader compatibility is needed, use ${SCORES[${#SCORES[@]}-1]} instead.

Proposed alternative for compatibility
-FINAL_SCORE="${SCORES[-1]:-0.0}"
+if [ ${#SCORES[@]} -eq 0 ]; then
+  FINAL_SCORE="0.0"
+else
+  FINAL_SCORE="${SCORES[${#SCORES[@]}-1]}"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` around lines 244 - 253, The FINAL_SCORE assignment
uses Bash negative indexing `${SCORES[-1]:-0.0}` which breaks on Bash 3.2;
change it to compute the last index using the array length: if ${#SCORES[@]} is
zero set FINAL_SCORE to 0.0, otherwise set FINAL_SCORE to the element at index
`${SCORES[$((${#SCORES[@]}-1))]}`; update the place where FINAL_SCORE is set
(symbol FINAL_SCORE and array SCORES) and keep NUM_ITERATIONS and SCORE_TABLE
generation unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de343e56-73a4-4718-b121-5c6b7990ea5a

📥 Commits

Reviewing files that changed from the base of the PR and between e68233c and a3942b4.

📒 Files selected for processing (8)
  • agents/gan-evaluator.md
  • agents/gan-generator.md
  • agents/gan-planner.md
  • commands/gan-build.md
  • commands/gan-design.md
  • examples/gan-harness/README.md
  • scripts/gan-harness.sh
  • skills/gan-style-harness/SKILL.md

Comment on lines +1 to +7
Parse the following from $ARGUMENTS:
1. `brief` — the user's one-line description of what to build
2. `--max-iterations N` — (optional, default 15) maximum generator-evaluator cycles
3. `--pass-threshold N` — (optional, default 7.0) weighted score to pass
4. `--skip-planner` — (optional) skip planner, assume spec.md already exists
5. `--eval-mode MODE` — (optional, default "playwright") one of: playwright, screenshot, code-only


⚠️ Potential issue | 🟡 Minor

Missing required description frontmatter.

Per coding guidelines, command files must include YAML frontmatter with a description field.

Proposed fix to add frontmatter
+---
+description: "GAN-Style Harness Build — Three-agent orchestration loop for building production-quality applications"
+---
+
 Parse the following from $ARGUMENTS:
 1. `brief` — the user's one-line description of what to build

As per coding guidelines: commands/**/*.md: Command files must be Markdown with description frontmatter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commands/gan-build.md` around lines 1 - 7, Add the required YAML frontmatter
with a top-level description field to this command Markdown so it meets the
commands/**/*.md guideline; specifically, insert a YAML block (---) at the top
that includes a short description string summarizing the command (e.g. "Builds a
GAN project from a one-line brief and optional flags"), then ensure the existing
argument list (`brief`, `--max-iterations`, `--pass-threshold`,
`--skip-planner`, `--eval-mode`) remains after the frontmatter and that the YAML
block is closed (---) so tools that parse frontmatter can read the description.

Comment on lines +1 to +6
Parse the following from $ARGUMENTS:
1. `brief` — the user's description of the design to create
2. `--max-iterations N` — (optional, default 10) maximum design-evaluate cycles
3. `--pass-threshold N` — (optional, default 7.5) weighted score to pass (higher default for design)

## GAN-Style Design Harness

⚠️ Potential issue | 🟡 Minor

Missing required description frontmatter.

Per coding guidelines, command files must include YAML frontmatter with a description field. This command starts directly with argument parsing.

Proposed fix to add frontmatter
+---
+description: "GAN-Style Design Harness — Two-agent loop (Generator + Evaluator) focused on frontend design quality"
+---
+
 Parse the following from $ARGUMENTS:
 1. `brief` — the user's description of the design to create

As per coding guidelines: commands/**/*.md: Command files must be Markdown with description frontmatter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commands/gan-design.md` around lines 1 - 6, This command markdown is missing
the required YAML frontmatter; add a YAML block at the top of
commands/gan-design.md containing at minimum a description field (e.g.,
description: "Create a GAN-style design harness that parses brief,
--max-iterations, and --pass-threshold"), and ensure any other required
frontmatter keys used by the repo (such as title or tags if applicable) are
included so the parser that reads command files can find the description.

GENERATOR_MODEL="${GAN_GENERATOR_MODEL:-opus}"
EVALUATOR_MODEL="${GAN_EVALUATOR_MODEL:-opus}"
DEV_PORT="${GAN_DEV_SERVER_PORT:-3000}"
DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"

⚠️ Potential issue | 🟡 Minor

Unused variable DEV_CMD.

The DEV_CMD variable is defined but never referenced elsewhere in the script. Either use it (e.g., to start the dev server) or remove it.

Option 1: Remove unused variable
-DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"
Option 2: Use it to document expected server command

If the intent is for users to reference this or for future use, consider adding a comment or actually using it in the generator invocation.
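A sketch of the second option, wiring DEV_CMD into an actual dev-server launch; the helper name and log path are assumptions, not code from this PR:

```shell
DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}"

# start_dev_server — launch the configured dev server in the background and
# remember its PID so the harness can kill it after evaluation.
start_dev_server() {
  # Intentional word splitting: DEV_CMD is a full command line, not one word.
  # shellcheck disable=SC2086
  $DEV_CMD >"${HARNESS_DIR:-.}/dev-server.log" 2>&1 &
  DEV_PID=$!
}
```

The harness could then call `start_dev_server` before the evaluator phase and `kill "$DEV_PID"` afterwards, which would also resolve the SC2034 warning.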

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 34-34: DEV_CMD appears unused. Verify use (or export if used externally).

(SC2034)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` at line 34, The variable DEV_CMD is declared but
never used; either remove the DEV_CMD="${GAN_DEV_SERVER_CMD:-npm run dev}" line
to eliminate the dead variable, or use it where the dev server command should be
invoked (e.g., replace any hardcoded npm run dev invocation or pass DEV_CMD into
the generator/start logic) or add a short comment explaining it documents the
default command; update references to use DEV_CMD if you intend it to control
the dev server invocation.

Comment on lines +61 to +69
extract_score() {
# Extract the TOTAL weighted score from a feedback file
local file="$1"
# Look for **TOTAL** or **X.X/10** pattern
grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
|| grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
|| grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
|| echo "0.0"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Test extract_score patterns against expected feedback format

# Create sample feedback matching the documented format
cat > /tmp/test-feedback.md << 'EOF'
# Evaluation — Iteration 001

## Scores

| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Design Quality | 7/10 | 0.3 | 2.1 |
| Originality | 6/10 | 0.2 | 1.2 |
| Craft | 7/10 | 0.3 | 2.1 |
| Functionality | 8/10 | 0.2 | 1.6 |
| **TOTAL** | | | **7.0/10** |

## Verdict: PASS (threshold: 7.0)
EOF

echo "Testing original patterns:"
grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null && echo "Pattern 1 matched" || echo "Pattern 1 failed"
grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null && echo "Pattern 2 matched" || echo "Pattern 2 failed"

echo ""
echo "Testing proposed patterns:"
grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' && echo "New pattern 1 matched" || echo "New pattern 1 failed"

Repository: affaan-m/everything-claude-code

Length of output: 189


🏁 Script executed:

# Find the evaluator agent and any feedback files
find . -type f -name "*.py" -o -name "*.sh" -o -name "*.md" | grep -E "(eval|feedback|gan)" | head -20

Repository: affaan-m/everything-claude-code

Length of output: 727


🏁 Script executed:

# Search for extract_score usage and context
rg "extract_score" -A 5 -B 5

Repository: affaan-m/everything-claude-code

Length of output: 1608


🏁 Script executed:

# Look for actual feedback examples in the repo
find . -type f \( -name "*.md" -o -name "*.txt" \) -exec grep -l "TOTAL" {} \; | head -10

Repository: affaan-m/everything-claude-code

Length of output: 142


🏁 Script executed:

# Find the evaluator agent implementation
fd -e py | xargs rg "TOTAL" -l | head -10

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

# Read the evaluator agent documentation to see the documented format
cat agents/gan-evaluator.md

Repository: affaan-m/everything-claude-code

Length of output: 7006


🏁 Script executed:

# Look for actual feedback files in the repo
find . -path ./node_modules -prune -o -type f -name "feedback*.md" -print | head -5

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

# Check if there's a feedback directory with examples
find . -type d -name "*feedback*" | head -5

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

# Run comprehensive test of both original and proposed patterns against documented format
cat > /tmp/comprehensive-test.sh << 'EOF'
#!/bin/bash

# Test file with documented format
cat > /tmp/test-feedback.md << 'DOC'
# Evaluation — Iteration 001

## Scores

| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Design Quality | 7/10 | 0.3 | 2.1 |
| Originality | 6/10 | 0.2 | 1.2 |
| Craft | 7/10 | 0.3 | 2.1 |
| Functionality | 8/10 | 0.2 | 1.6 |
| **TOTAL** | | | **7.0/10** |

## Verdict: PASS (threshold: 7.0)
DOC

# Test variations
cat > /tmp/test-variations.md << 'DOC'
# Variation 1: TOTAL without /10

| **TOTAL** | | | **8.5** |

# Variation 2: TOTAL with different spacing

| **TOTAL** | | | **9.2/10** |

# Variation 3: With Verdict format only

## Verdict: PASS at 6.5/10

# Variation 4: Final Score format

Final Score: 7.8/10

# Variation 5: Malformed (should fail gracefully)

TOTAL score is 5.5
DOC

echo "=== ORIGINAL PATTERNS ==="
echo "Pattern 1: grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+'"
result=$(grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null)
echo "Result: '${result:-FAILED}'"

echo ""
echo "Pattern 2: grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+'"
result=$(grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null)
echo "Result: '${result:-FAILED}'"

echo ""
echo "=== PROPOSED PATTERNS ==="
echo "Pattern 1: grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' | grep -oP '[0-9]+\.[0-9]+'"
result=$(grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-feedback.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"

echo ""
echo "=== EDGE CASES ==="
echo "Testing variations file..."
echo ""
echo "Proposed Pattern 1 on variations:"
result=$(grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"

echo ""
echo "Proposed Pattern 2 on variations:"
result=$(grep -oP 'TOTAL.*\|.*\*\*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"

echo ""
echo "Proposed Pattern 3 (Final Score) on variations:"
result=$(grep -oP 'Final Score:\s*[0-9]+\.[0-9]+' /tmp/test-variations.md 2>/dev/null | grep -oP '[0-9]+\.[0-9]+')
echo "Result: '${result:-FAILED}'"
EOF
bash /tmp/comprehensive-test.sh

Repository: affaan-m/everything-claude-code

Length of output: 597


🏁 Script executed:

# Search for real feedback files in the repo to test against actual data
find . -type f -name "feedback*.md" -o -name "*feedback*.md" 2>/dev/null | head -20

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

# Check if there are test fixtures or examples in the repo
find . -type d -name "test*" -o -name "example*" -o -name "fixture*" 2>/dev/null | grep -i "gan\|eval"

Repository: affaan-m/everything-claude-code

Length of output: 57


Fix score extraction regex to match documented feedback format.

The original regex patterns fail to extract the TOTAL score from the documented table format | **TOTAL** | | | **X.X/10** |. Both patterns use lookbehind assertions that don't correctly match this structure, causing the extraction to fail and fall back to 0.0, which breaks the evaluation feedback loop.

The proposed fix addresses this by using simpler, more direct patterns:

  • \*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+ matches the TOTAL row regardless of /10 suffix
  • TOTAL.*\|.*\*\*[0-9]+\.[0-9]+ handles table variations
  • Final Score:\s*[0-9]+\.[0-9]+ supports alternative formats

Testing confirms the original patterns fail on the documented format while the proposed patterns correctly extract 7.0 from | **TOTAL** | | | **7.0/10** | and handle variations without the /10 suffix.

Proposed fix
 extract_score() {
   # Extract the TOTAL weighted score from a feedback file
   local file="$1"
-  # Look for **TOTAL** or **X.X/10** pattern
-  grep -oP '(?<=\*\*TOTAL\*\*.*\*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
-    || grep -oP '(?<=TOTAL.*\|.*\| \*\*)[0-9]+\.[0-9]+' "$file" 2>/dev/null \
-    || grep -oP 'Verdict:.*([0-9]+\.[0-9]+)' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+  # Look for TOTAL row in table format: | **TOTAL** | | | **X.X/10** |
+  # or: | **TOTAL** | | | **X.X** |
+  grep -oP '\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+    || grep -oP 'TOTAL.*\|.*\*\*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
+    || grep -oP 'Final Score:\s*[0-9]+\.[0-9]+' "$file" 2>/dev/null | grep -oP '[0-9]+\.[0-9]+' \
     || echo "0.0"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` around lines 61 - 69, The extract_score function's
lookbehind-based regexes don't match the documented table row `| **TOTAL** | | |
**X.X/10** |`, causing fallback to 0.0; update the regex alternatives in
extract_score to use simpler direct matches such as
`\*\*TOTAL\*\*.*\*\*[0-9]+\.[0-9]+` and `TOTAL.*\|.*\*\*[0-9]+\.[0-9]+` (to
catch table variations) and add a final alternative like `Final
Score:\s*[0-9]+\.[0-9]+` to support that format—replace the three grep -oP lines
with these patterns (preserving the 2>/dev/null and final || echo "0.0") so the
function correctly extracts scores like 7.0/10 or 7.0.

Comment on lines +186 to +203
claude -p --model "$EVALUATOR_MODEL" \
--allowedTools "Read,Write,Bash,Grep,Glob" \
"You are the Evaluator in a GAN-style harness. Read agents/gan-evaluator.md for full instructions.

Iteration: $i
Eval mode: $EVAL_MODE
Dev server: http://localhost:$DEV_PORT

1. Read gan-harness/eval-rubric.md for scoring criteria
2. Read gan-harness/spec.md for feature requirements
3. Read gan-harness/generator-state.md for what was built
4. Test the live application (mode: $EVAL_MODE)
5. Score against the rubric (1-10 per criterion)
6. Write detailed feedback to gan-harness/feedback/feedback-$(printf '%03d' $i).md

Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.'
Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \
2>&1 | tee "${HARNESS_DIR}/evaluator-${i}.log"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the correct allowedTools syntax for Claude Code CLI to enable Playwright MCP tools?

💡 Result:

The correct syntax for the --allowedTools flag in Claude Code CLI to enable Playwright MCP tools (after adding the MCP server with `claude mcp add playwright npx '@playwright/mcp@latest'`) uses pattern matching for MCP tools prefixed with `mcp__`. Examples: claude --allowedTools "mcp__playwright__browser_navigate" "mcp__playwright__browser_click" "mcp__playwright__browser_snapshot". Or, to allow all tools from the playwright server: claude --allowedTools "mcp__playwright__*". First add the MCP server, then use /mcp in the session to list the exact tool names (browser_click, browser_snapshot, etc.) exposed by the playwright server. Use those in the full mcp__playwright__<tool> format for --allowedTools to auto-approve them without prompts; MCP tools require permission by default unless allowed. Note: the Playwright CLI is an alternative (npm install -g @playwright/cli), used via the Bash tool as "playwright-cli open https://example.com", but the query specifies Playwright MCP tools.



🏁 Script executed:

# Search for references to mcp, playwright, and allowedTools in the repo
rg -i "mcp.*playwright|allowedtools" --type md --type sh -A 2 -B 2

Repository: affaan-m/everything-claude-code

Length of output: 7172


🏁 Script executed:

# Find and examine the gan-evaluator.md file mentioned in the script
find . -name "*gan-evaluator*" -type f

Repository: affaan-m/everything-claude-code

Length of output: 100


🏁 Script executed:

# Check for any existing Playwright MCP tool references in scripts
rg "mcp" --type sh -A 2 -B 2

Repository: affaan-m/everything-claude-code

Length of output: 2679


🏁 Script executed:

# Check what EVAL_MODE values are used and documented in the script
rg "EVAL_MODE" scripts/gan-harness.sh -B 3 -A 3

Repository: affaan-m/everything-claude-code

Length of output: 1794


🏁 Script executed:

# Check the gan-evaluator.md file to see what tools are documented as needed
cat -n agents/gan-evaluator.md | head -80

Repository: affaan-m/everything-claude-code

Length of output: 3539


Add Playwright MCP tools to evaluator's --allowedTools for playwright eval mode.

The evaluator documentation (agents/gan-evaluator.md) describes using Playwright MCP tools (mcp__playwright__navigate, mcp__playwright__click, etc.) for browser interaction, but the current --allowedTools parameter omits them. In playwright mode, the evaluator will lack access to these tools.

Conditionally include Playwright MCP tools based on eval mode:

Proposed fix
   # ── EVALUATE ──
   echo -e "${RED}▶ EVALUATOR (iteration $i)${NC}"

+  # Build allowed tools based on eval mode
+  EVAL_TOOLS="Read,Write,Bash,Grep,Glob"
+  if [ "$EVAL_MODE" = "playwright" ]; then
+    EVAL_TOOLS="Read,Write,Bash,Grep,Glob,mcp__playwright__*"
+  fi
+
   claude -p --model "$EVALUATOR_MODEL" \
-    --allowedTools "Read,Write,Bash,Grep,Glob" \
+    --allowedTools "$EVAL_TOOLS" \
     "You are the Evaluator in a GAN-style harness. Read agents/gan-evaluator.md for full instructions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` around lines 186 - 203, The evaluator's claude
invocation currently sets --allowedTools without the Playwright MCP tools, so
when EVAL_MODE is "playwright" the evaluator cannot drive the browser; update
the script around the claude -p --model ... --allowedTools invocation to
conditionally append the Playwright MCP tool names (e.g.
mcp__playwright__navigate, mcp__playwright__click, mcp__playwright__fill,
mcp__playwright__screenshot, etc.) to the allowedTools list when EVAL_MODE
contains or equals "playwright" (use the EVAL_MODE variable to branch and build
the --allowedTools string before running claude, keeping the rest of the command
and piping to tee "${HARNESS_DIR}/evaluator-${i}.log").

Comment on lines +200 to +202

Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.'
Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \

⚠️ Potential issue | 🟡 Minor

Inconsistent TOTAL format instruction vs. agent template.

Line 202 instructs the evaluator to use format | **TOTAL** | | | **X.X** | (without /10), but the agent template in agents/gan-evaluator.md shows **X.X/10**. This inconsistency could cause score extraction failures.

Proposed fix to align with agent template
-Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \
+Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X/10** |" \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.'
Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X** |" \
Be RUTHLESSLY strict. A 7 means genuinely good, not 'good for AI.'
Include the weighted TOTAL score in the format: | **TOTAL** | | | **X.X/10** |" \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gan-harness.sh` around lines 200 - 202, The TOTAL line in the
gan-harness.sh output is inconsistent with the agents/gan-evaluator.md template;
update the literal evaluation instruction currently instructing `| **TOTAL** | |
| **X.X** |` so it matches the evaluator template by changing it to `| **TOTAL**
| | | **X.X/10** |` (or alternatively change the agents/gan-evaluator.md
template to remove `/10`), and ensure any code that emits or parses that string
(the printf/echo that prints the TOTAL line) is updated to use the new format so
score extraction remains reliable.

@affaan-m affaan-m merged commit 4cdfe70 into affaan-m:main Mar 31, 2026
4 checks passed