Commit 9b6f7a0

fix: harden debate prompts and fix consult model/flag issues (#226)
* fix: remove auto-generated header from adapter files

  The `<!-- AUTO-GENERATED ... -->` HTML comment before frontmatter prevented tools like agnix from parsing YAML frontmatter on line 1. Removed the header entirely; the adapters/ directory is self-explanatory. Drops agnix errors from 83 to 8.

* fix: apply debate findings and enhance analysis to consult and debate plugins

  Debate skill hardened based on its own first debate's findings:
  - Universal evidence standard for both proposer AND challenger
  - Proposer prompts now require cited evidence (was challenger-only)
  - Challenger follow-up reordered: anti-convergence guard first
  - Minimum-disagreement requirement per round added
  - Context summarization criteria specified (500-800 tokens)
  - Rigor indicator and Debate Quality rating in synthesis output

  Consult skill fixes from enhance analysis:
  - Gemini section: added missing Session ID extraction line
  - Codex: removed invalid `-a suggest` flag (`codex exec` doesn't support it)
  - Codex: added `-c model_reasoning_effort` to safe command patterns
  - Gemini models: replaced all `-preview` suffixes with stable names

  Also updates README, test strategy doc, and all adapters.
1 parent 0f0bcf6 commit 9b6f7a0

File tree

11 files changed (+100 −77 lines)


README.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -651,7 +651,7 @@ agent-knowledge/
 | Tool | Default Model (high) | Reasoning Control |
 |------|---------------------|-------------------|
 | Claude | opus | max-turns |
-| Gemini | gemini-3-pro-preview | built-in |
+| Gemini | gemini-3-pro | built-in |
 | Codex | gpt-5.3-codex | model_reasoning_effort |
 | OpenCode | github-copilot/claude-opus-4-6 | --variant |
 | Copilot | (default) | none |
````

adapters/codex/skills/consult/SKILL.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -11,7 +11,7 @@ You are executing the /consult command. Your job is to parse the user's request
 
 - NEVER expose API keys in commands or output
 - NEVER run with permission-bypassing flags (`--dangerously-skip-permissions`, `bypassPermissions`)
-- MUST use safe-mode defaults (`-a suggest` for Codex, `--allowedTools "Read,Glob,Grep"` for Claude)
+- MUST use safe-mode defaults (`--allowedTools "Read,Glob,Grep"` for Claude, `-c model_reasoning_effort` for Codex)
 - MUST enforce 120s timeout on all tool executions
 - MUST validate tool names against allow-list: gemini, codex, claude, opencode, copilot (reject all others)
 - MUST validate `--context=file=PATH` is within the project directory (reject absolute paths outside cwd)
````
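The allow-list rule above can be sketched as a small guard. This is a hypothetical helper under an assumed name (`validateTool`); it is not part of the skill's actual code.

```javascript
// Hypothetical sketch of the tool allow-list check required above.
// The five tool names come from the skill's MUST rule; anything else is rejected.
const ALLOWED_TOOLS = ['gemini', 'codex', 'claude', 'opencode', 'copilot'];

function validateTool(tool) {
  if (!ALLOWED_TOOLS.includes(tool)) {
    throw new Error(`Tool not in allow-list: ${tool}`);
  }
  return tool;
}

console.log(validateTool('gemini')); // gemini
```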

adapters/opencode/agents/consult-agent.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -120,9 +120,9 @@ Run N Bash commands **in parallel** (multiple Bash tool calls in a single messag
 
 Example for 3 parallel Codex calls:
 ```
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-1.tmp")" --json -m "gpt-5.3-codex" -a suggest
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-2.tmp")" --json -m "gpt-5.3-codex" -a suggest
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-3.tmp")" --json -m "gpt-5.3-codex" -a suggest
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-1.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-2.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-3.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
 ```
 
 #### 4d. Parse and Format Results
````
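The corrected invocation pattern can be sanity-checked by assembling (not running) one of the command strings. This is a sketch only: the `AI_STATE_DIR` value and question text below are illustrative stand-ins, not the platform-substituted values.

```shell
# Hypothetical sketch: assemble one of the parallel Codex command strings
# without executing it, so the flag layout can be inspected.
# AI_STATE_DIR here is an illustrative stand-in, not the real platform value.
AI_STATE_DIR="$(mktemp -d)"
mkdir -p "${AI_STATE_DIR}/consult"
QUESTION_FILE="${AI_STATE_DIR}/consult/question-1.tmp"
printf 'What tradeoffs does this design have?' > "${QUESTION_FILE}"

# Note: the question is read from a platform-controlled temp file, never
# interpolated directly from user input.
CMD="codex exec \"\$(cat \"${QUESTION_FILE}\")\" --json -m \"gpt-5.3-codex\" -c model_reasoning_effort=\"high\""
echo "${CMD}"
```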

adapters/opencode/commands/consult.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -17,7 +17,7 @@ You are executing the /consult command. Your job is to parse the user's request
 
 - NEVER expose API keys in commands or output
 - NEVER run with permission-bypassing flags (`--dangerously-skip-permissions`, `bypassPermissions`)
-- MUST use safe-mode defaults (`-a suggest` for Codex, `--allowedTools "Read,Glob,Grep"` for Claude)
+- MUST use safe-mode defaults (`--allowedTools "Read,Glob,Grep"` for Claude, `-c model_reasoning_effort` for Codex)
 - MUST enforce 120s timeout on all tool executions
 - MUST validate tool names against allow-list: gemini, codex, claude, opencode, copilot (reject all others)
 - MUST validate `--context=file=PATH` is within the project directory (reject absolute paths outside cwd)
````

adapters/opencode/skills/consult/SKILL.md

Lines changed: 8 additions & 7 deletions

````diff
@@ -66,22 +66,23 @@ Command: gemini -p "QUESTION" --output-format json -m "MODEL"
 Session resume: --resume "SESSION_ID"
 ```
 
-Models: gemini-2.5-flash, gemini-2.5-pro, gemini-3-flash-preview, gemini-3-pro-preview
+Models: gemini-2.5-flash, gemini-2.5-pro, gemini-3-flash, gemini-3-pro
 
 | Effort | Model |
 |--------|-------|
 | low | gemini-2.5-flash |
-| medium | gemini-2.5-pro |
-| high | gemini-3-flash-preview |
-| max | gemini-3-pro-preview |
+| medium | gemini-3-flash |
+| high | gemini-3-pro |
+| max | gemini-3-pro |
 
 **Parse output**: `JSON.parse(stdout).response`
+**Session ID**: `JSON.parse(stdout).session_id`
 **Continuable**: Yes (via `--resume`)
 
 ### Codex
 
 ```
-Command: codex exec "QUESTION" --json -m "MODEL" -a suggest -c model_reasoning_effort="LEVEL"
+Command: codex exec "QUESTION" --json -m "MODEL" -c model_reasoning_effort="LEVEL"
 Session resume: codex exec resume SESSION_ID "QUESTION" --json
 Session resume (latest): codex exec resume --last "QUESTION" --json
 ```
@@ -193,7 +194,7 @@ User-provided question text MUST NOT be interpolated into shell command strings.
 | Claude (resume) | `claude -p - --output-format json --model "MODEL" --max-turns TURNS --allowedTools "Read,Glob,Grep" --resume "SESSION_ID" < "{AI_STATE_DIR}/consult/question.tmp"` |
 | Gemini | `gemini -p - --output-format json -m "MODEL" < "{AI_STATE_DIR}/consult/question.tmp"` |
 | Gemini (resume) | `gemini -p - --output-format json -m "MODEL" --resume "SESSION_ID" < "{AI_STATE_DIR}/consult/question.tmp"` |
-| Codex | `codex exec "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL" -a suggest` (Codex exec lacks stdin mode -- cat reads from platform-controlled path, not user input) |
+| Codex | `codex exec "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL" -c model_reasoning_effort="LEVEL"` (Codex exec lacks stdin mode -- cat reads from platform-controlled path, not user input) |
 | Codex (resume) | `codex exec resume SESSION_ID "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL"` |
 | Codex (resume latest) | `codex exec resume --last "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL"` |
 | OpenCode | `opencode run - --format json --model "MODEL" --variant "VARIANT" < "{AI_STATE_DIR}/consult/question.tmp"` |
@@ -266,7 +267,7 @@ Return a plain JSON object to stdout (no markers or wrappers):
 ```json
 {
   "tool": "gemini",
-  "model": "gemini-3-pro-preview",
+  "model": "gemini-3-pro",
   "effort": "high",
   "duration_ms": 12300,
   "response": "The AI's response text here...",
````
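The Gemini output contract in this diff (`response` plus the newly documented `session_id`) can be exercised with a small parser. `parseGeminiOutput` is an illustrative name, not a confirmed export of the plugin.

```javascript
// Hypothetical sketch of the parse step documented above: Gemini's
// --output-format json emits a JSON object on stdout; the skill reads
// .response for the answer and .session_id for later --resume calls.
function parseGeminiOutput(stdout) {
  const parsed = JSON.parse(stdout);
  return { response: parsed.response, sessionId: parsed.session_id };
}

// stdout shaped like the JSON contract shown in this skill file
const stdout = JSON.stringify({
  response: "The AI's response text here...",
  session_id: 'session-xyz-789',
});

const result = parseGeminiOutput(stdout);
console.log(result.sessionId); // session-xyz-789
```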

adapters/opencode/skills/debate/SKILL.md

Lines changed: 29 additions & 31 deletions

````diff
@@ -26,6 +26,10 @@ Parse from `$ARGUMENTS`:
 - **--model-proposer**: Specific model for proposer (optional)
 - **--model-challenger**: Specific model for challenger (optional)
 
+## Universal Rules
+
+ALL participants (proposer AND challenger) MUST support claims with specific evidence (file path, code pattern, benchmark, or documented behavior). Unsupported claims from either side will be flagged by the other participant and noted in the verdict. This applies to every round.
+
 ## Prompt Templates
 
 ### Round 1: Proposer Opening
@@ -35,7 +39,9 @@ You are participating in a structured debate as the PROPOSER.
 
 Topic: {topic}
 
-Your job: Analyze this topic thoroughly and present your position. Be specific, cite concrete reasons, and consider tradeoffs. Do not hedge excessively - take a clear stance.
+Your job: Analyze this topic thoroughly and present your position. Take a clear stance. Do not hedge excessively.
+
+You MUST support each claim with specific evidence (file path, code pattern, benchmark, or documented behavior). Unsupported claims will be challenged. "I think" or "generally speaking" without evidence is not acceptable.
 
 Provide your analysis:
 ```
@@ -60,6 +66,9 @@ Rules:
 - Lead with what's WRONG or MISSING, then acknowledge what's right
 - If you genuinely agree on a point, explain what RISK remains despite the agreement
 - Propose at least one concrete alternative approach
+- You MUST address at least these categories: correctness, security implications, and developer experience
+- Do NOT agree with ANY claim unless you can cite specific evidence (file path, code pattern, or documented behavior) that supports the agreement. Unsupported agreement is not allowed.
+- If the proposer makes a claim without evidence, call it out: "This claim is unsupported."
 
 Provide your challenge:
 ```
@@ -81,8 +90,10 @@ The CHALLENGER ({challenger_tool}) raised these points in round {round-1}:
 
 Your job: Address each challenge directly. For each point:
 - If they're right, concede explicitly and explain how your position evolves
-- If they're wrong, explain why with specific reasoning
-- If it's a tradeoff, acknowledge the tradeoff and explain why you still favor your approach
+- If they're wrong, explain why with specific evidence (file path, code pattern, benchmark, or documented behavior)
+- If it's a tradeoff, acknowledge the tradeoff and explain why you still favor your approach with evidence
+
+Every claim you make -- whether concession, rebuttal, or new argument -- MUST cite specific evidence. The challenger will reject unsupported claims.
 
 Do NOT simply restate your original position. Your response must show you engaged with the specific challenges raised.
 
@@ -91,29 +102,7 @@ Provide your defense:
 
 ### Round 2+: Challenger Follow-up
 
-```
-You are the CHALLENGER in round {round} of a structured debate.
-
-Topic: {topic}
-
-{context_summary}
-
-The PROPOSER ({proposer_tool}) responded to your challenges:
-
----
-{proposer_previous_response}
----
-
-Your job: Evaluate the proposer's defense. For each point they addressed:
-- Did they adequately address your concern? If so, acknowledge it
-- Did they dodge or superficially address it? Call it out specifically
-- Are there NEW weaknesses in their revised position?
-
-If you're genuinely convinced on a point, say so - but explain what convinced you.
-If you see new problems, raise them.
-
-Provide your follow-up:
-```
+*(JavaScript reference - not executable in OpenCode)*
 
 ## Context Assembly
 
@@ -148,11 +137,12 @@ Round {N-1} - Challenger ({challenger_tool}):
 {full response}
 ```
 
-The orchestrator agent (opus) generates the summary. It should preserve:
+The orchestrator agent (opus) generates the summary. Target: 500-800 tokens. MUST preserve:
 - Each side's core position
-- Points of agreement (resolved)
+- All concessions (verbatim quotes, not paraphrased)
+- All evidence citations that support agreements
 - Points of disagreement (unresolved)
-- Any concessions made
+- Any contradictions between rounds (e.g., proposer concedes in round 1 but walks it back in round 2 -- note both explicitly)
 
 ## Synthesis Format
 
@@ -165,14 +155,22 @@ After all rounds complete, the orchestrator produces this structured output:
 **Proposer**: {proposer_tool} ({proposer_model})
 **Challenger**: {challenger_tool} ({challenger_model})
 **Rounds**: {rounds_completed}
+**Rigor**: Structured perspective comparison (prompt-enforced adversarial rules, no deterministic verification)
 
 ### Verdict
 
 {winner_tool} had the stronger argument because: {specific reasoning citing debate evidence}
 
+### Debate Quality
+
+Rate the debate on these dimensions:
+- **Genuine disagreement**: Did the challenger maintain independent positions, or converge toward the proposer? (high/medium/low)
+- **Evidence quality**: Did both sides cite specific examples, or argue from generalities? (high/medium/low)
+- **Challenge depth**: Were the challenges substantive, or surface-level? (high/medium/low)
+
 ### Key Agreements
-- {agreed point 1}
-- {agreed point 2}
+- {agreed point 1} (evidence: {what supports this agreement})
+- {agreed point 2} (evidence: {what supports this agreement})
 
 ### Key Disagreements
 - {point}: {proposer_tool} argues {X}, {challenger_tool} argues {Y}
````
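The 500-800 token target for the context summary can be approximated with a rough character heuristic. This is a sketch only: the ~4-characters-per-token ratio is an assumption, not how the orchestrator actually counts tokens, and `summaryWithinBudget` is a hypothetical name.

```javascript
// Hypothetical budget check for the context summary described above.
// Assumes ~4 characters per token, which is a rough heuristic only.
function summaryWithinBudget(summary, minTokens = 500, maxTokens = 800) {
  const approxTokens = Math.ceil(summary.length / 4);
  return approxTokens >= minTokens && approxTokens <= maxTokens;
}

const tooShort = 'Proposer favors X; challenger disputes Y.';
console.log(summaryWithinBudget(tooShort)); // false

const okLength = 'x'.repeat(2600); // ~650 tokens under the heuristic
console.log(summaryWithinBudget(okLength)); // true
```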

docs/consult-command-test-strategy.md

Lines changed: 11 additions & 11 deletions

````diff
@@ -172,9 +172,9 @@ describe('Model Selection', () => {
   describe('Gemini models', () => {
     it('should map effort levels correctly', () => {
       expect(getGeminiModel('low')).toBe('gemini-2.5-flash');
-      expect(getGeminiModel('medium')).toBe('gemini-3-flash-preview');
-      expect(getGeminiModel('high')).toBe('gemini-3-pro-preview');
-      expect(getGeminiModel('max')).toBe('gemini-3-pro-preview');
+      expect(getGeminiModel('medium')).toBe('gemini-3-flash');
+      expect(getGeminiModel('high')).toBe('gemini-3-pro');
+      expect(getGeminiModel('max')).toBe('gemini-3-pro');
     });
   });
 
@@ -244,7 +244,7 @@ describe('Session Management', () => {
   it('should include question in saved session', () => {
     const session = {
       tool: 'gemini',
-      model: 'gemini-3-pro-preview',
+      model: 'gemini-3-pro',
       effort: 'medium',
       session_id: 'xyz-789',
       timestamp: new Date().toISOString(),
@@ -458,7 +458,7 @@ describe('Session Continuation', () => {
   it('should restore tool from saved session', () => {
     const session = {
       tool: 'gemini',
-      model: 'gemini-3-pro-preview',
+      model: 'gemini-3-pro',
       effort: 'medium',
       session_id: 'session-456',
       timestamp: new Date().toISOString(),
@@ -672,18 +672,18 @@ describe('Command Building', () => {
 
   describe('Gemini Command', () => {
     it('should build basic command', () => {
-      const { command, flags } = buildGeminiCommand('question', 'gemini-3-pro-preview');
+      const { command, flags } = buildGeminiCommand('question', 'gemini-3-pro');
       expect(command).toBe('gemini');
       expect(flags).toContain('-p');
       expect(flags).toContain('"question"');
       expect(flags).toContain('--output-format');
       expect(flags).toContain('json');
       expect(flags).toContain('-m');
-      expect(flags).toContain('gemini-3-pro-preview');
+      expect(flags).toContain('gemini-3-pro');
     });
 
     it('should append session resume for continuation', () => {
-      const { flags } = buildGeminiCommand('question', 'gemini-3-pro-preview', 'session-456', true);
+      const { flags } = buildGeminiCommand('question', 'gemini-3-pro', 'session-456', true);
       expect(flags).toContain('--resume');
       expect(flags).toContain('session-456');
     });
@@ -939,7 +939,7 @@ describe('Full Consultation Flow', () => {
   jest.spyOn(fs, 'readFileSync').mockReturnValueOnce(JSON.stringify({
     tool: 'gemini',
     session_id: 'session-456',
-    model: 'gemini-3-pro-preview',
+    model: 'gemini-3-pro',
     effort: 'medium',
     timestamp: new Date().toISOString(),
     question: 'continue',
@@ -1139,7 +1139,7 @@ describe('Mocked Tool Outputs', () => {
   const mockGeminiOutput = `=== CONSULT_RESULT ===
 {
   "tool": "gemini",
-  "model": "gemini-3-pro-preview",
+  "model": "gemini-3-pro",
   "effort": "medium",
   "duration_ms": 23400,
   "response": "Based on my analysis, the approach seems sound but could benefit from error handling for edge cases.",
@@ -1175,7 +1175,7 @@ describe('Mocked Tool Outputs', () => {
   it('should parse structured output correctly', () => {
     const result = parseMockOutput(mockGeminiOutput, 'gemini');
     expect(result.tool).toBe('gemini');
-    expect(result.model).toBe('gemini-3-pro-preview');
+    expect(result.model).toBe('gemini-3-pro');
     expect(result.duration_ms).toBe(23400);
     expect(result.session_id).toBe('session-xyz-789');
   });
````
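A minimal implementation consistent with the updated `getGeminiModel` expectations would be a plain lookup table. This is a sketch only; the real mapping lives in the consult plugin, and the error behavior for unknown effort levels is an assumption.

```javascript
// Hypothetical implementation consistent with the updated test expectations
// above: effort levels map onto the stable (non -preview) Gemini model names.
const GEMINI_MODELS = {
  low: 'gemini-2.5-flash',
  medium: 'gemini-3-flash',
  high: 'gemini-3-pro',
  max: 'gemini-3-pro',
};

function getGeminiModel(effort) {
  const model = GEMINI_MODELS[effort];
  if (!model) {
    throw new Error(`Unknown effort level: ${effort}`);
  }
  return model;
}

console.log(getGeminiModel('high')); // gemini-3-pro
```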

plugins/consult/agents/consult-agent.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -126,9 +126,9 @@ Run N Bash commands **in parallel** (multiple Bash tool calls in a single messag
 
 Example for 3 parallel Codex calls:
 ```
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-1.tmp")" --json -m "gpt-5.3-codex" -a suggest
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-2.tmp")" --json -m "gpt-5.3-codex" -a suggest
-Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-3.tmp")" --json -m "gpt-5.3-codex" -a suggest
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-1.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-2.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
+Bash: codex exec "$(cat "{AI_STATE_DIR}/consult/question-3.tmp")" --json -m "gpt-5.3-codex" -c model_reasoning_effort="high"
 ```
 
 #### 4d. Parse and Format Results
````

plugins/consult/commands/consult.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -14,7 +14,7 @@ You are executing the /consult command. Your job is to parse the user's request
 
 - NEVER expose API keys in commands or output
 - NEVER run with permission-bypassing flags (`--dangerously-skip-permissions`, `bypassPermissions`)
-- MUST use safe-mode defaults (`-a suggest` for Codex, `--allowedTools "Read,Glob,Grep"` for Claude)
+- MUST use safe-mode defaults (`--allowedTools "Read,Glob,Grep"` for Claude, `-c model_reasoning_effort` for Codex)
 - MUST enforce 120s timeout on all tool executions
 - MUST validate tool names against allow-list: gemini, codex, claude, opencode, copilot (reject all others)
 - MUST validate `--context=file=PATH` is within the project directory (reject absolute paths outside cwd)
````

plugins/consult/skills/consult/SKILL.md

Lines changed: 8 additions & 7 deletions

````diff
@@ -60,22 +60,23 @@ Command: gemini -p "QUESTION" --output-format json -m "MODEL"
 Session resume: --resume "SESSION_ID"
 ```
 
-Models: gemini-2.5-flash, gemini-2.5-pro, gemini-3-flash-preview, gemini-3-pro-preview
+Models: gemini-2.5-flash, gemini-2.5-pro, gemini-3-flash, gemini-3-pro
 
 | Effort | Model |
 |--------|-------|
 | low | gemini-2.5-flash |
-| medium | gemini-2.5-pro |
-| high | gemini-3-flash-preview |
-| max | gemini-3-pro-preview |
+| medium | gemini-3-flash |
+| high | gemini-3-pro |
+| max | gemini-3-pro |
 
 **Parse output**: `JSON.parse(stdout).response`
+**Session ID**: `JSON.parse(stdout).session_id`
 **Continuable**: Yes (via `--resume`)
 
 ### Codex
 
 ```
-Command: codex exec "QUESTION" --json -m "MODEL" -a suggest -c model_reasoning_effort="LEVEL"
+Command: codex exec "QUESTION" --json -m "MODEL" -c model_reasoning_effort="LEVEL"
 Session resume: codex exec resume SESSION_ID "QUESTION" --json
 Session resume (latest): codex exec resume --last "QUESTION" --json
 ```
@@ -187,7 +188,7 @@ User-provided question text MUST NOT be interpolated into shell command strings.
 | Claude (resume) | `claude -p - --output-format json --model "MODEL" --max-turns TURNS --allowedTools "Read,Glob,Grep" --resume "SESSION_ID" < "{AI_STATE_DIR}/consult/question.tmp"` |
 | Gemini | `gemini -p - --output-format json -m "MODEL" < "{AI_STATE_DIR}/consult/question.tmp"` |
 | Gemini (resume) | `gemini -p - --output-format json -m "MODEL" --resume "SESSION_ID" < "{AI_STATE_DIR}/consult/question.tmp"` |
-| Codex | `codex exec "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL" -a suggest` (Codex exec lacks stdin mode -- cat reads from platform-controlled path, not user input) |
+| Codex | `codex exec "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL" -c model_reasoning_effort="LEVEL"` (Codex exec lacks stdin mode -- cat reads from platform-controlled path, not user input) |
 | Codex (resume) | `codex exec resume SESSION_ID "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL"` |
 | Codex (resume latest) | `codex exec resume --last "$(cat "{AI_STATE_DIR}/consult/question.tmp")" --json -m "MODEL"` |
 | OpenCode | `opencode run - --format json --model "MODEL" --variant "VARIANT" < "{AI_STATE_DIR}/consult/question.tmp"` |
@@ -260,7 +261,7 @@ Return a plain JSON object to stdout (no markers or wrappers):
 ```json
 {
   "tool": "gemini",
-  "model": "gemini-3-pro-preview",
+  "model": "gemini-3-pro",
   "effort": "high",
   "duration_ms": 12300,
   "response": "The AI's response text here...",
````
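Gemini flag assembly can be sketched to match the model and `--resume` behavior shown in this file. This mirrors the `buildGeminiCommand` name used in the project's test-strategy doc but is not a confirmed export; note the skill's safe pattern pipes the question via stdin (`-p -`), while this sketch follows the simpler interpolated shape the tests use.

```javascript
// Hypothetical sketch of Gemini command assembly: base flags plus an
// optional --resume pair when continuing a saved session.
function buildGeminiCommand(question, model, sessionId, resume = false) {
  const flags = ['-p', `"${question}"`, '--output-format', 'json', '-m', model];
  if (resume && sessionId) {
    flags.push('--resume', sessionId);
  }
  return { command: 'gemini', flags };
}

const { command, flags } = buildGeminiCommand('question', 'gemini-3-pro', 'session-456', true);
console.log(command, flags.join(' '));
```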
