**`pilot/agents/spec-reviewer-compliance.md`** — 4 additions, 0 deletions

```diff
@@ -21,6 +21,10 @@ The orchestrator provides:
 - `test_framework_constraints` (optional): What the test framework can/cannot test
 - `plan_risks`: Risks and mitigations from the plan
+
+## ⛔ Adversarial Posture
+
+Plan tasks may be marked done without fully meeting their Definition of Done criteria. Do not trust self-reported completion status — verify every DoD criterion independently against the actual code. The implementation may be incomplete or optimistic.
 
 ## Verification Workflow (FOLLOW THIS ORDER EXACTLY)
 
 1. **Read the plan file completely** - This is your source of truth
```
**`pilot/agents/spec-reviewer-goal.md`** — 4 additions, 0 deletions

```diff
@@ -39,6 +39,10 @@ Detect the project language from:
 - If changed files include `.sh`, `.bash`, `.py` hook scripts → apply Shell/Python patterns
 - Mixed projects: apply patterns per-file
+
+#### ⛔ Adversarial Posture
+
+For code changes: artifacts may exist but be stubs, disconnected, or only partially wired together. Verify that components are substantive and actually connected — not just present. For Markdown-only changes: verify content is substantive and placed correctly — wiring checks are not applicable.
 
 ### Step 1: Goal-Backward Truth Derivation
 
 Read the plan file completely. Then follow this preference order:
```
**`pilot/agents/spec-reviewer-quality.md`** — 4 additions, 0 deletions

```diff
@@ -11,6 +11,10 @@ permissionMode: plan
 
 You review implementation code for quality, security, testing, performance, and error handling. Your job is to find real issues that would cause problems in production: bugs, security vulnerabilities, poor error handling, missing tests, and performance problems.
 
+## ⛔ Adversarial Posture
+
+Code that passes tests can still have quality, security, or performance issues. Do not trust passing tests as proof of quality — verify independently that error handling, edge cases, and security boundaries are sound. The implementation may look correct while hiding real problems.
+
 ## ⛔ MANDATORY FIRST STEP: Read ALL Rules
 
 **Before reviewing ANY code, you MUST read all rule files. This is NON-NEGOTIABLE.**
```
**`pilot/commands/learn.md`** — 10 additions, 8 deletions

````diff
@@ -1,5 +1,5 @@
 ---
-description: Extract reusable knowledge into skills - online learning system
+description: Use after significant debugging, workarounds, or multi-step workflows worth standardizing for future sessions
 model: sonnet
 ---
@@ -71,9 +71,9 @@ Skills must be:
 name: descriptive-kebab-case-name
 description: |
   [CRITICAL: This determines when the skill triggers. Include:]
-  - What the skill does
+  - When to use it (trigger conditions, scenarios — NOT what the skill does)
   - Specific trigger conditions (exact error messages, symptoms)
-  - When to use it (contexts, scenarios)
+  - Contexts where it applies
 author: Claude Code
 version: 1.0.0
 ---
@@ -106,11 +106,13 @@ version: 1.0.0
 ```
 
 <details>
-<summary>Writing Effective Descriptions</summary>
+<summary>Writing Effective Descriptions (⚠️ The Description Trap)</summary>
 
-The description field is CRITICAL for skill discovery:
+The description field is CRITICAL for skill discovery. **Describe WHEN to use the skill, NEVER summarize HOW it works.**
 
-**Good:**
+**Why this matters:** If the description summarizes the workflow/process, Claude follows the short description as a shortcut instead of reading the full SKILL.md. The skill's detailed steps, examples, and nuances get ignored.
+
+**Good** (trigger conditions):
 
 ```yaml
 description: |
@@ -119,10 +121,10 @@ description: |
   not in packages, (3) symlinked dependencies cause failures.
 ```
 
-**Bad:**
+**Bad** (workflow summary — Claude will follow this instead of reading the skill):
 
 ```yaml
-description: Helps with npm problems in monorepos.
+description: Extract and organize npm monorepo fixes by analyzing symlinks and paths.
````
````diff
 6. **Run actual program** - Use the plan's Runtime Environment section to start the service/program. Show real output with sample data. Check port availability first: `lsof -i :<port>`. **If using `playwright-cli`, use session isolation:** `playwright-cli -s="${PILOT_SESSION_ID:-default}" open <url>` (see `playwright-cli.md`). Always close the session after verification.
 7. **Check diagnostics** - Must be zero errors
 8. **Validate Definition of Done** - Check all criteria from plan
-9. **Per-task commit (worktree mode only)** - If `Worktree: Yes` in the plan, commit task changes immediately:
+9. **Self-Review (Fresh Eyes)** - Before committing, re-examine your changes as if seeing them for the first time:
+   - **Completeness:** Did I miss any requirements from the task objective?
+   - **Quality:** Are names clear? Is the code readable without comments explaining what it does?
+   - **Discipline:** Did I add anything not in the task scope? (YAGNI — remove it)
+   - **Testing:** Do tests verify behavior (inputs → outputs), not implementation details (mock calls)?
+10. **Per-task commit (worktree mode only)** - If `Worktree: Yes` in the plan, commit task changes immediately:
 ```bash
 git add <task-specific-files>  # Stage only files related to this task
 git commit -m "{type}(spec): {task-name}"
 ```
 Use `feat(spec):` for new features, `fix(spec):` for bug fixes, `test(spec):` for test-only tasks, `refactor(spec):` for refactoring. Skip this step when `Worktree: No` (normal git rules apply).
-10. **Mark task as `completed`** - `TaskUpdate(taskId="<id>", status="completed")`
-11. **UPDATE PLAN FILE IMMEDIATELY** (see Step 2.4)
+11. **Mark task as `completed`** - `TaskUpdate(taskId="<id>", status="completed")`
+12. **UPDATE PLAN FILE IMMEDIATELY** (see Step 2.4)
````
````diff
 **Only do this step when exploration revealed multiple possible directions or the scope is ambiguous.** Skip for straightforward tasks where the scope is clear from the user's request (e.g., "fix this bug", "add a logout button", "refactor the auth module").
+
+**Triggers for this step:**
+- Task is exploratory or open-ended ("analyze X and see what we can improve", "find gaps in our system")
+- Exploration uncovered significantly more opportunities than the user likely expected
+- Multiple valid approaches exist and user preference matters for scoping
+
+**Process:**
+
+1. **Send notification** so the user knows their input is needed:
+
+   ```bash
+   ~/.pilot/bin/pilot notify plan_approval "Findings Ready" "<plan_name> — review discovered items and select what to include" --plan-path "<plan_path>" 2>/dev/null || true
+   ```
+
+2. **List ALL discovered gaps/opportunities** with a brief assessment for each (1-2 sentences: what it is, why it matters or doesn't, effort level).
+
+3. **Use AskUserQuestion with `multiSelect: true`** to let the user pick which items to include:
+
+   ```
+   Question: "Which of these items should we include in the plan?"
+   Header: "Scope"
+   multiSelect: true
+   Options: [each discovered item as an option with description]
+   ```
+
+4. **Only proceed to Step 1.4 with the user's selected scope.** Items not selected go into the plan's "Out of Scope" or "Deferred Ideas" section.
 
 ### Step 1.4: Design Decisions
 
 **Present findings and gather all design decisions (Question Batch 2).**
````
```diff
 When receiving code review feedback — from users, spec-reviewer agents, or external tools like CodeRabbit — apply these guidelines.
+
+### Response Sequence
+
+1. **Read** — Complete feedback without reacting
+2. **Understand** — Restate the requirement in your own words (or ask)
+3. **Verify** — Check against codebase reality
+4. **Evaluate** — Technically sound for THIS codebase?
+5. **Respond** — Technical acknowledgment or reasoned pushback
+6. **Implement** — One item at a time, test each
+
+If any item is unclear: **STOP** — do not implement anything yet. Ask for clarification on all unclear items first. Partial understanding = wrong implementation.
+
+### Source-Specific Handling
+
+| Source | Approach |
+|--------|----------|
+| **User feedback** | Trusted — implement after understanding. Still ask if scope unclear. Skip to action or technical acknowledgment. |
+| **External reviewers** | Verify first: (1) technically correct for THIS codebase? (2) breaks existing functionality? (3) reason for current implementation? (4) conflicts with user's prior decisions? If conflicts → stop and discuss with user first. |
+| **Spec-reviewer agents** | `must_fix` and `should_fix` → fix immediately. `suggestion` → implement if quick. No discussion needed. |
+
+### YAGNI Check
+
+When a reviewer suggests adding or "properly implementing" a feature:
+
+1. Search the codebase for actual usage (`vexor`, `Grep`, or LSP `findReferences`)
+2. If unused → push back: "This isn't called anywhere. Remove it (YAGNI)?"
+3. If used → implement properly
+
+### Implementation Order (Multi-Item Feedback)
+
+1. Clarify anything unclear **first**
+2. Blocking issues (breaks, security)
+3. Simple fixes (typos, imports, naming)
+4. Complex fixes (refactoring, logic changes)
+5. Test each fix individually, verify no regressions
+
+### Forbidden Responses
+
+| Never Say | Instead |
+|-----------|---------|
+| "You're absolutely right!" | State the technical requirement |
+| "Great point!" / "Excellent feedback!" | Just start working — actions > words |
+| "Let me implement that now" (before verification) | Verify against codebase first |
+| "Thanks for catching that!" | "Fixed. [Brief description of what changed]" |
+
+### When to Push Back
+
+Push back with technical reasoning when: the suggestion breaks existing functionality, the reviewer lacks full context, it violates YAGNI, it is technically incorrect for this stack, or it conflicts with the user's architectural decisions.
+
+If you pushed back and were wrong: state the correction factually and move on. No apologies or over-explaining.
```
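The YAGNI check above can be sketched as a small reference scan. This is a minimal illustration, assuming a Python codebase; `find_callers` and the file layout are hypothetical, and the actual workflow would use `vexor`, `Grep`, or LSP `findReferences` rather than this helper:

```python
from pathlib import Path

def find_callers(symbol: str, root: str) -> list[tuple[str, int]]:
    """Return (file, line) pairs that mention `symbol`, excluding its definition."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            # skip the definition line itself; count every other mention
            if symbol in line and not line.lstrip().startswith(f"def {symbol}"):
                hits.append((str(path), lineno))
    return hits
```

An empty result is the cue to push back with the YAGNI question rather than implement the suggestion.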
**`pilot/rules/development-practices.md`** — 37 additions, 0 deletions

```diff
@@ -30,6 +30,43 @@
 **Meta-Debugging:** Treat your own code as foreign. Your mental model is a guess — the code's behavior is truth.
 
+#### Defense-in-Depth & Root-Cause Tracing
+
+**After fixing a bug, make it structurally impossible — not just patched.**
+
+When a bug is caused by invalid data flowing through multiple layers:
+
+1. **Trace backward** from symptom through the call chain to the original trigger. Use LSP `incomingCalls` or add `new Error().stack` instrumentation before the failing operation. Fix at the source — never fix just where the error appears.
+2. **Then add validation at every layer** the data passes through:
+
+| Layer | Purpose | Example |
+|-------|---------|---------|
+| Entry point | Reject invalid input at API boundary | Validate non-empty, exists, correct type |
+| Business logic | Ensure data makes sense for this operation | Validate required fields for specific context |
+| Environment guards | Prevent dangerous operations in specific contexts | Refuse destructive ops outside temp dirs in tests |
+| Debug instrumentation | Capture context for forensics | Log directory, cwd, stack trace before risky ops |
+
+**Single validation = "we fixed the bug". All four layers = "we made the bug impossible."**
```
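Rendered as code, the four validation layers might look like the sketch below. This is a hedged illustration: `delete_workspace`, its rules, and the temp-dir guard are hypothetical, not an API from this repository:

```python
import os
import tempfile

def delete_workspace(path, logger=print):
    # Layer 1 - entry point: reject invalid input at the boundary
    if not isinstance(path, str) or not path.strip():
        raise ValueError("path must be a non-empty string")
    # Layer 2 - business logic: the target must exist and be a directory
    if not os.path.isdir(path):
        raise FileNotFoundError(f"not a directory: {path}")
    # Layer 3 - environment guard: refuse destructive ops outside the temp dir
    if not os.path.realpath(path).startswith(os.path.realpath(tempfile.gettempdir())):
        raise RuntimeError(f"refusing to delete outside temp dir: {path}")
    # Layer 4 - debug instrumentation: capture context before the risky op
    logger(f"deleting {path} (cwd={os.getcwd()})")
    os.rmdir(path)
```

Each layer fails loudly on its own, so even if one check is bypassed, the next one still stops the invalid operation.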
````diff
+#### Condition-Based Waiting (Test Flakiness)
+
+**Replace arbitrary `sleep`/`setTimeout` with polling for the actual condition.**
+
+```
+# ❌ Guessing at timing (flaky)
+await sleep(500)
+result = get_result()
+
+# ✅ Wait for the condition (reliable)
+result = await wait_for(lambda: get_result() is not None, timeout=5.0)
+```
+
+**When to use:** Tests with arbitrary delays, flaky tests, waiting for async operations.
+
+**When NOT to use:** Testing actual timing behavior (debounce, throttle) — document WHY the timeout is needed.
+
+**Rules:** Poll every 10ms (not 1ms — that wastes CPU). Always include a timeout with a clear error message. Call the getter inside the loop for fresh data (no stale cache).
 
 ### Git Operations
 
 **Read git state freely. NEVER execute write commands without EXPLICIT user permission.**
````
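The `wait_for` helper used in the Condition-Based Waiting pseudocode above can be written concretely. A synchronous Python sketch following the stated rules (10ms poll, explicit timeout with a clear message, fresh getter call each loop); an async variant would await an async condition instead of sleeping:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.01, message="condition not met"):
    """Poll `condition` until it returns a truthy value; raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = condition()  # fresh read every iteration, no cached result
        if value:
            return value
        time.sleep(interval)  # 10ms default: responsive without burning CPU
    raise TimeoutError(f"timed out after {timeout}s: {message}")
```

`time.monotonic()` is used for the deadline so wall-clock adjustments cannot shorten or extend the wait.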
**`pilot/rules/standards-frontend.md`** — 15 additions, 6 deletions

```diff
@@ -56,18 +56,27 @@ paths:
 **Commit to a clear aesthetic. Every visual choice must be intentional.**
 
-1. **Purpose** — What does this communicate? 2. **Audience** — Developers? Consumers? 3. **Tone** — Minimalist, editorial, playful, etc. 4. **Differentiator** — One memorable element.
-- **Typography:** Display font for personality, body font for readability. Max 2 fonts. Avoid defaults (Inter, Roboto, Arial).
-- **Color:** Clear hierarchy: Primary (CTAs) → Accent → Neutral → Semantic. Dark mode: design separately, not just invert.
-- **Motion:** Every animation has purpose. Max 500ms. Always respect `prefers-reduced-motion`.
-- **Avoid AI aesthetic:** Purple gradients on white, symmetric 3-column grids, rounded cards with shadow, gradient text everywhere.
+**Before coding, answer these four questions:**
+
+1. **Purpose** — What does this interface communicate? Who uses it?
+2. **Audience** — Developers? Consumers?
+3. **Tone** — Minimalist, editorial, playful, industrial, luxury, retro-futuristic, etc.
+4. **Differentiator** — What's the one thing someone will remember?
+
+Then execute that direction with precision across every detail below.
+
+- **Typography:** Display font for personality, body font for readability. Max 2 fonts. Avoid defaults (Inter, Roboto, Arial, system fonts). Distinctive, characterful choices elevate the entire interface.
+- **Color:** Clear hierarchy: Primary (CTAs) → Accent → Neutral → Semantic. Dominant colors with sharp accents outperform timid, evenly-distributed palettes. Dark mode: design separately, not just invert.
+- **Spatial Composition:** Break out of predictable grid layouts. Use asymmetry, overlapping elements, diagonal flow, or grid-breaking accents to create visual interest. Controlled density and unexpected placement make interfaces feel designed rather than templated.
+- **Visual Depth:** Create atmosphere beyond solid backgrounds. Use gradient meshes, subtle noise textures, layered transparencies, dramatic shadows, decorative borders, or grain overlays — matched to the overall aesthetic. Flat and empty ≠ minimal; depth creates polish.
+- **Motion:** Every animation has purpose. Max 500ms. Always respect `prefers-reduced-motion`. Prefer one well-orchestrated page load with staggered reveals (`animation-delay`) over scattered micro-interactions. Use scroll-triggered animations and surprising hover states for high-impact moments. CSS-only for HTML; Motion library for React when available.
+- **Avoid AI aesthetic:** Purple gradients on white, symmetric 3-column grids, rounded cards with shadow, gradient text everywhere, overused font families (Inter, Space Grotesk), cookie-cutter component patterns.
 
 ## Checklist
 
 - [ ] Components: single responsibility, typed props, local state
 - [ ] CSS: project methodology, design tokens, no `!important`
 - [ ] Accessible: keyboard, labels, contrast 4.5:1, alt text
```
**`pilot/rules/testing.md`** — 4 additions, 2 deletions

```diff
@@ -66,9 +66,11 @@
 ### Anti-Patterns
 
 - **Dependent tests** — each test must work independently
-- **Testing implementation, not behavior** — test outputs, not internals
-- **Incomplete mock data** — must match real API structure
+- **Testing implementation, not behavior** — assert outputs and state changes, not that specific mocks were called. `assert result == expected`, not `mock.assert_called_with(...)`. If the implementation changes but behavior stays the same, tests should still pass.
+- **Incomplete mocks hiding structural assumptions** — mocks must mirror the complete real API structure, not just the fields you think you need. Partial mocks hide coupling to downstream fields and break when the real API returns additional or different data.
 - **Unnecessary mocks** — only for external deps
+- **Test-only methods in production** — never add methods, properties, or flags to production classes purely for test access. If you need internal state for testing, refactor the design so the behavior is observable through public interfaces.
+- **Mocking without understanding** — before mocking a dependency, understand what it actually does. A mock that doesn't reflect real behavior is a lie — tests pass against the lie, then fail against reality.
```
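The first anti-pattern can be made concrete with a short contrast. A sketch with a hypothetical `discounted_total` function under test; the point is what each style of assertion actually protects against:

```python
from unittest.mock import Mock

def discounted_total(get_price, quantity):
    """Hypothetical unit under test: 10% off for orders of 10 or more."""
    total = get_price() * quantity
    return total * 0.9 if quantity >= 10 else total

# ❌ Implementation-coupled: verifies the mock was called, not the math.
# This assertion still passes even if the discount logic is completely wrong.
price_source = Mock(return_value=4.0)
discounted_total(price_source, 10)
price_source.assert_called_once_with()

# ✅ Behavior-focused: inputs → outputs. Survives refactors, catches wrong math.
assert discounted_total(lambda: 4.0, 10) == 36.0
assert discounted_total(lambda: 4.0, 2) == 8.0
```

If the pricing logic later moves into a helper, the behavior assertions keep passing while the mock-call assertion would need rewriting, which is exactly the coupling the rule warns about.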