fix: resolve triple-shot second pass caused by merge recommendations (#662)

Iron-Ham · web-flow · commit 8979c194d419 · 2026-02-15T16:22:03.000-05:00

When the judge recommends the "merge" strategy, it populates the
suggested_changes field. LLMs frequently write this field as a plain
string instead of []string, causing json.Unmarshal to fail. The parse
failure made VerifyWork return false, the bridge called gate.Fail(),
and with defaultMaxRetries=2 the task was retried, spawning a
duplicate judge.

Add FlexibleStringSlice type (mirrors existing FlexibleString) to
tolerate string/array mismatches in all LLM-parsed sentinel file
structs: Evaluation, AttemptEvaluationItem, AdversarialReviewFile.

Also log SetMaxRetries errors instead of silently discarding them, and
consolidate the redundant Team("judge") lookup in startJudge.
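
The failure and the fix described above can be reproduced in isolation. The sketch below copies FlexibleStringSlice from the types.go hunk in this commit; strictEval, flexibleEval, parseStrict, and parseFlexible are hypothetical, trimmed-down stand-ins written only for this illustration and do not exist in the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlexibleStringSlice is copied from the types.go change in this commit.
type FlexibleStringSlice []string

// UnmarshalJSON accepts either a JSON array of strings or a single JSON string.
func (f *FlexibleStringSlice) UnmarshalJSON(data []byte) error {
	// Try the schema-conformant shape first: an array of strings.
	var arr []string
	if err := json.Unmarshal(data, &arr); err == nil {
		*f = arr
		return nil
	}

	// Fall back to a single string and wrap it in a one-element slice.
	var s string
	if err := json.Unmarshal(data, &s); err == nil {
		*f = []string{s}
		return nil
	}

	return fmt.Errorf("FlexibleStringSlice: expected string or []string, got %s", string(data))
}

// strictEval is a hypothetical stand-in for Evaluation before this commit.
type strictEval struct {
	SuggestedChanges []string `json:"suggested_changes"`
}

// flexibleEval is a hypothetical stand-in for Evaluation after this commit.
type flexibleEval struct {
	SuggestedChanges FlexibleStringSlice `json:"suggested_changes"`
}

// parseStrict decodes with a bare []string field (illustration only).
func parseStrict(data string) ([]string, error) {
	var e strictEval
	if err := json.Unmarshal([]byte(data), &e); err != nil {
		return nil, err
	}
	return e.SuggestedChanges, nil
}

// parseFlexible decodes with a FlexibleStringSlice field (illustration only).
func parseFlexible(data string) ([]string, error) {
	var e flexibleEval
	if err := json.Unmarshal([]byte(data), &e); err != nil {
		return nil, err
	}
	return []string(e.SuggestedChanges), nil
}

func main() {
	// The shape an LLM judge often writes for a "merge" recommendation.
	input := `{"suggested_changes": "Combine A's error handling with B's test structure"}`

	_, err := parseStrict(input)
	fmt.Println("strict parse failed:", err != nil)

	got, err := parseFlexible(input)
	fmt.Printf("flexible parse: %q, err=%v\n", got, err)
}
```

With the bare []string field the string-valued input is rejected, which is the parse failure that led to the retry. With FlexibleStringSlice the same input decodes to a one-element slice, while values that are neither a string nor an array of strings still produce an error.
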
diff --git a/AGENTS.md b/AGENTS.md
@@ -309,6 +309,7 @@ This is not exhaustive — update it when you add or discover undocumented packa
 - `internal/team/` — Multi-team orchestration with dependency ordering, budget tracking, and inter-team routing *(has `AGENTS.md`)*
 - `internal/bridge/` — Connects team Hubs to real Claude Code instances (worktree + tmux) *(has `AGENTS.md`)*
 - `internal/orchestrator/bridgewire/` — Adapter types that wire orchestrator infrastructure to bridge interfaces *(has `AGENTS.md`)*
+- `internal/orchestrator/workflows/tripleshot/` — Triple-shot workflow: 3 parallel attempts + judge evaluation. Defines sentinel file types (`CompletionFile`, `Evaluation`, `AdversarialReviewFile`) with flexible JSON unmarshaling *(has `AGENTS.md`)*
 - `internal/orchestrator/workflows/tripleshot/teamwire/` — Adapts TripleShot to Orchestration 2.0 teams via `TeamCoordinator` + bridge adapters *(has `AGENTS.md`)*
 - `internal/pipeline/` — Plan decomposer and multi-phase team pipeline *(has `AGENTS.md`)*
 - `internal/tui/` — Bubble Tea terminal UI components *(has `AGENTS.md`)*
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -19,6 +19,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Triple-Shot Spurious Second Pass** - Fixed duplicate instance creation in triple-shot workflows. Two root causes: (1) TaskQueue's `defaultMaxRetries=2` caused failed attempt/judge tasks to retry, spawning new instances. Fixed by calling `SetMaxRetries(taskID, 0)` after team creation. (2) Judge "merge" recommendations caused `json.Unmarshal` to fail when the LLM wrote `suggested_changes` as a string instead of `[]string`. The evaluation file parse failure triggered a retry, creating a second judge. Fixed by adding `FlexibleStringSlice` type (mirrors existing `FlexibleString`) to tolerate string/array mismatches in all LLM-parsed sentinel file structs (`Evaluation`, `AttemptEvaluationItem`, `AdversarialReviewFile`).
+
 - **Teamwire TUI Freeze** - Fixed TUI freeze when starting a triple-shot in teamwire mode. `coordinator.Start()` was called synchronously in the Bubble Tea `Update()` handler, blocking the event loop while bridges created git worktrees. Moved startup to an async `tea.Cmd` so the UI remains responsive during initialization.
 
 - **Teamwire Channel Safety** - Fixed potential panic from closing `teamwireEventCh` while callbacks may still write to it (nil-guard before close), goroutine leak from re-subscribing after triple-shot completion, and channel overwrite leak when starting multiple sessions. Surfaced session error details in `PhaseFailed` handler instead of generic "Triple-shot failed" message.
diff --git a/internal/orchestrator/workflows/tripleshot/AGENTS.md b/internal/orchestrator/workflows/tripleshot/AGENTS.md
@@ -0,0 +1,9 @@
+# tripleshot — Agent Guidelines
+
+> **Living document.** Update this file when you learn something specific to this package.
+> Same rules as the root `AGENTS.md` — see its Self-Improvement Protocol.
+
+## Pitfalls
+
+- **LLM output type mismatches in sentinel files** — LLMs frequently write a plain string where the JSON schema expects `[]string` (e.g., `"suggested_changes": "fix the bug"` instead of `"suggested_changes": ["fix the bug"]`). The `Evaluation`, `AttemptEvaluationItem`, and `AdversarialReviewFile` structs use `FlexibleStringSlice` for all `[]string` fields and `FlexibleString` for `Reasoning` to tolerate this. When adding new LLM-parsed fields of type `string` or `[]string`, use these flexible types instead of bare Go types. Without this, `json.Unmarshal` fails, `VerifyWork` returns false, and the bridge retries the task — spawning a duplicate instance.
+- **Sentinel file search in subdirectories** — `FindCompletionFile`, `FindEvaluationFile`, and `FindAdversarialReviewFile` all search the worktree root *and* immediate subdirectories. LLM instances sometimes write files relative to their CWD rather than the worktree root. Don't bypass `Find*File` with a direct `filepath.Join(worktree, filename)`.
diff --git a/internal/orchestrator/workflows/tripleshot/CLAUDE.md b/internal/orchestrator/workflows/tripleshot/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
diff --git a/internal/orchestrator/workflows/tripleshot/session.go b/internal/orchestrator/workflows/tripleshot/session.go
@@ -180,7 +180,7 @@ func (m *Manager) SetEvaluation(eval *Evaluation) {
 
 	m.emitEvent(Event{
 		Type:    EventEvaluationReady,
-		Message: eval.Reasoning,
+		Message: eval.Reasoning.String(),
 	})
 }
 
diff --git a/internal/orchestrator/workflows/tripleshot/session_test.go b/internal/orchestrator/workflows/tripleshot/session_test.go
@@ -326,6 +326,129 @@ func TestParseEvaluationFile_InvalidJSON(t *testing.T) {
 	}
 }
 
+func TestParseEvaluationFile_FlexibleFields(t *testing.T) {
+	tests := []struct {
+		name              string
+		json              string
+		wantStrategy      MergeStrategy
+		wantReasoning     string
+		wantChangesLen    int
+		wantStrengthsLen  int
+		wantWeaknessesLen int
+		wantFirstChange   string
+		wantFirstStrength string
+		wantFirstWeakness string
+	}{
+		{
+			name: "merge with suggested_changes as string",
+			json: `{
+				"winner_index": -1,
+				"merge_strategy": "merge",
+				"reasoning": "Combined approach is best",
+				"attempt_evaluations": [
+					{"attempt_index": 0, "score": 7, "strengths": ["Good"], "weaknesses": ["Missing tests"]}
+				],
+				"suggested_changes": "Combine A's error handling with B's test structure"
+			}`,
+			wantStrategy:      MergeStrategyMerge,
+			wantReasoning:     "Combined approach is best",
+			wantChangesLen:    1,
+			wantFirstChange:   "Combine A's error handling with B's test structure",
+			wantStrengthsLen:  1,
+			wantFirstStrength: "Good",
+			wantWeaknessesLen: 1,
+			wantFirstWeakness: "Missing tests",
+		},
+		{
+			name: "merge with suggested_changes as array",
+			json: `{
+				"winner_index": -1,
+				"merge_strategy": "merge",
+				"reasoning": "Merging is ideal",
+				"attempt_evaluations": [],
+				"suggested_changes": ["Change A", "Change B"]
+			}`,
+			wantStrategy:    MergeStrategyMerge,
+			wantReasoning:   "Merging is ideal",
+			wantChangesLen:  2,
+			wantFirstChange: "Change A",
+		},
+		{
+			name: "reasoning as array of strings",
+			json: `{
+				"winner_index": 0,
+				"merge_strategy": "select",
+				"reasoning": ["First point.", "Second point."],
+				"attempt_evaluations": []
+			}`,
+			wantStrategy:  MergeStrategySelect,
+			wantReasoning: "First point.\nSecond point.",
+		},
+		{
+			name: "strengths and weaknesses as strings",
+			json: `{
+				"winner_index": 0,
+				"merge_strategy": "select",
+				"reasoning": "Best one",
+				"attempt_evaluations": [
+					{"attempt_index": 0, "score": 8, "strengths": "Clean implementation", "weaknesses": "No tests"}
+				]
+			}`,
+			wantStrategy:      MergeStrategySelect,
+			wantStrengthsLen:  1,
+			wantFirstStrength: "Clean implementation",
+			wantWeaknessesLen: 1,
+			wantFirstWeakness: "No tests",
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			tmpDir := t.TempDir()
+			evalPath := filepath.Join(tmpDir, EvaluationFileName)
+			if err := os.WriteFile(evalPath, []byte(tt.json), 0644); err != nil {
+				t.Fatalf("failed to write evaluation file: %v", err)
+			}
+
+			parsed, err := ParseEvaluationFile(tmpDir)
+			if err != nil {
+				t.Fatalf("ParseEvaluationFile() error = %v", err)
+			}
+
+			if parsed.MergeStrategy != tt.wantStrategy {
+				t.Errorf("MergeStrategy = %q, want %q", parsed.MergeStrategy, tt.wantStrategy)
+			}
+			if tt.wantReasoning != "" && parsed.Reasoning.String() != tt.wantReasoning {
+				t.Errorf("Reasoning = %q, want %q", parsed.Reasoning, tt.wantReasoning)
+			}
+			if tt.wantChangesLen > 0 {
+				if len(parsed.SuggestedChanges) != tt.wantChangesLen {
+					t.Errorf("len(SuggestedChanges) = %d, want %d", len(parsed.SuggestedChanges), tt.wantChangesLen)
+				}
+				if len(parsed.SuggestedChanges) > 0 && parsed.SuggestedChanges[0] != tt.wantFirstChange {
+					t.Errorf("SuggestedChanges[0] = %q, want %q", parsed.SuggestedChanges[0], tt.wantFirstChange)
+				}
+			}
+			if tt.wantStrengthsLen > 0 && len(parsed.AttemptEvaluation) > 0 {
+				if len(parsed.AttemptEvaluation[0].Strengths) != tt.wantStrengthsLen {
+					t.Errorf("Strengths len = %d, want %d", len(parsed.AttemptEvaluation[0].Strengths), tt.wantStrengthsLen)
+				}
+				if len(parsed.AttemptEvaluation[0].Strengths) > 0 && parsed.AttemptEvaluation[0].Strengths[0] != tt.wantFirstStrength {
+					t.Errorf("Strengths[0] = %q, want %q", parsed.AttemptEvaluation[0].Strengths[0], tt.wantFirstStrength)
+				}
+			}
+			if tt.wantWeaknessesLen > 0 && len(parsed.AttemptEvaluation) > 0 {
+				if len(parsed.AttemptEvaluation[0].Weaknesses) != tt.wantWeaknessesLen {
+					t.Errorf("Weaknesses len = %d, want %d", len(parsed.AttemptEvaluation[0].Weaknesses), tt.wantWeaknessesLen)
+				}
+				if len(parsed.AttemptEvaluation[0].Weaknesses) > 0 && parsed.AttemptEvaluation[0].Weaknesses[0] != tt.wantFirstWeakness {
+					t.Errorf("Weaknesses[0] = %q, want %q", parsed.AttemptEvaluation[0].Weaknesses[0], tt.wantFirstWeakness)
+				}
+			}
+		})
+	}
+}
+
 func TestParseEvaluationFromOutput(t *testing.T) {
 	tests := []struct {
 		name    string
diff --git a/internal/orchestrator/workflows/tripleshot/teamwire/AGENTS.md b/internal/orchestrator/workflows/tripleshot/teamwire/AGENTS.md
@@ -32,7 +32,7 @@ TeamCoordinator
 - **Two-phase Start** — `Start()` must not hold `tc.mu` when calling `Bridge.Start()`. The bridge's claim loop publishes `BridgeTaskStartedEvent` synchronously, and the handler `onBridgeTaskStarted` acquires `tc.mu`. Holding the lock through `Start()` → bridge claim → event publish → handler → lock = deadlock. The fix: `registerStart()` holds/releases the lock, then `Start()` creates bridges outside it.
 - **Event subscription timing** — Subscriptions must happen before `Bridge.Start()` launches the claim loop. Currently done in `registerStart()` (Phase 1, under lock, before Phase 2 bridge creation) — this is the safe window. Don't move subscriptions after Phase 2 begins. For test assertions where you need events, subscribe before calling `Start()`. For production callbacks, use `SetCallbacks` before `Start`.
 - **`onTeamCompleted` dispatches to goroutine** — The handler for `team.completed` dispatches `startJudge()` via `go` to avoid deadlock. The synchronous event bus would block if `startJudge` tried to publish events while the bus's `Publish` goroutine holds a lock.
-- **Bridge retry vs. completion file status** — When `VerifyWork` returns `success=false` (e.g., completion file has `"failed"` status), the bridge calls `gate.Fail()`. Due to TaskQueue retry logic (`defaultMaxRetries=2`), the task returns to Pending and gets re-claimed by the bridge. Each re-claim creates a new instance with a new empty worktree. Tests that depend on failure being final must account for this retry cycle or test handler methods directly.
+- **Retries disabled for tripleshot tasks** — `registerStart()` and `startJudge()` call `SetMaxRetries(taskID, 0)` to disable TaskQueue's default retry logic (`defaultMaxRetries=2`). Without this, failed attempt/judge tasks would return to Pending and spawn duplicate instances, appearing as a spurious "second pass." The triple-shot workflow has its own redundancy (3 independent attempts), so retrying individual tasks is counterproductive.
 - **Every `onJudgeCompleted` failure path must publish `TripleShotJudgeCompletedEvent`** — Use the `failJudge()` helper, which sets session error, transitions to `PhaseFailed`, fires callbacks, and publishes the event. Forgetting the event on one path breaks downstream listeners.
 - **Session mutation lock discipline** — `tsManager.Session()` returns a raw `*Session` pointer; the `tsManager.mu` RLock only protects the pointer swap, not field access. All session field reads *and* mutations (`JudgeID`, `CompletedAt`, `Error`, `Attempts[i].*`) must hold `tc.mu`. `GetWinningBranch()` also holds `tc.mu` for reads. The lock order `tc.mu → tsManager.mu` is safe (no reverse path exists). Functions like `failJudge` and `startJudge` error paths acquire `tc.mu` for mutations, then release before `notifyCallbacks`/`bus.Publish` to avoid deadlock.
 - **`startJudge` snapshot-then-I/O pattern** — `startJudge()` must snapshot attempt data (Status, InstanceID) under `tc.mu` before releasing the lock for I/O (GetInstance, ParseCompletionFile). After I/O completes, it re-acquires `tc.mu` to write results back (WorktreePath, Branch) and build the judge prompt. Without the snapshot, `onBridgeTaskCompleted` can write `Attempts[i].Status` concurrently, causing a data race.
diff --git a/internal/orchestrator/workflows/tripleshot/teamwire/teamcoordinator.go b/internal/orchestrator/workflows/tripleshot/teamwire/teamcoordinator.go
@@ -247,6 +247,20 @@ func (tc *TeamCoordinator) registerStart(ctx context.Context) (*team.Manager, er
 		}
 	}
 
+	// Disable retries for attempt tasks. The triple-shot workflow has its
+	// own redundancy (3 independent attempts), so retrying individual tasks
+	// just creates duplicate instances that appear as a spurious second pass.
+	for i := range 3 {
+		t := mgr.Team(tc.attemptTeamIDs[i])
+		if t != nil {
+			taskID := fmt.Sprintf("attempt-%d-task", i)
+			if err := t.Hub().TaskQueue().SetMaxRetries(taskID, 0); err != nil {
+				tc.logger.Warn("failed to disable retries for attempt task",
+					"task_id", taskID, "error", err)
+			}
+		}
+	}
+
 	tc.subscribeEvents()
 
 	if err := mgr.Start(ctx); err != nil {
@@ -649,6 +663,11 @@ func (tc *TeamCoordinator) startJudge() {
 		return
 	}
 
+	// Disable retries for the judge task — same rationale as attempt tasks.
+	if err := judgeTeam.Hub().TaskQueue().SetMaxRetries("judge-task", 0); err != nil {
+		tc.logger.Warn("failed to disable retries for judge task", "error", err)
+	}
+
 	factory := newAttemptFactory(tc.orch, tc.session)
 	checker := newJudgeCompletionChecker()
 	recorder := tc.buildJudgeRecorder()
diff --git a/internal/orchestrator/workflows/tripleshot/types.go b/internal/orchestrator/workflows/tripleshot/types.go
@@ -114,17 +114,17 @@ type AttemptRoundHistory struct {
 type Evaluation struct {
 	WinnerIndex       int                     `json:"winner_index"`        // 0, 1, or 2 (-1 if merged)
 	MergeStrategy     MergeStrategy           `json:"merge_strategy"`      // Strategy for applying solution
-	Reasoning         string                  `json:"reasoning"`           // Explanation of the decision
+	Reasoning         FlexibleString          `json:"reasoning"`           // Explanation of the decision
 	AttemptEvaluation []AttemptEvaluationItem `json:"attempt_evaluations"` // Evaluation of each attempt
-	SuggestedChanges  []string                `json:"suggested_changes"`   // If merging, changes to make
+	SuggestedChanges  FlexibleStringSlice     `json:"suggested_changes"`   // If merging, changes to make
 }
 
 // AttemptEvaluationItem holds the evaluation for a single attempt
 type AttemptEvaluationItem struct {
-	AttemptIndex int      `json:"attempt_index"`
-	Score        int      `json:"score"` // 1-10
-	Strengths    []string `json:"strengths"`
-	Weaknesses   []string `json:"weaknesses"`
+	AttemptIndex int                 `json:"attempt_index"`
+	Score        int                 `json:"score"` // 1-10
+	Strengths    FlexibleStringSlice `json:"strengths"`
+	Weaknesses   FlexibleStringSlice `json:"weaknesses"`
 }
 
 // CompletionFileName is the sentinel file that attempts write when complete
@@ -158,6 +158,30 @@ func (f FlexibleString) String() string {
 	return string(f)
 }
 
+// FlexibleStringSlice is a custom type that can unmarshal either a JSON array of strings
+// or a single JSON string. When unmarshaling a single string, it wraps it in a slice.
+// This handles LLM output that writes a plain string where the schema expects []string.
+type FlexibleStringSlice []string
+
+// UnmarshalJSON implements json.Unmarshaler for FlexibleStringSlice.
+func (f *FlexibleStringSlice) UnmarshalJSON(data []byte) error {
+	// Try to unmarshal as an array of strings first (most common)
+	var arr []string
+	if err := json.Unmarshal(data, &arr); err == nil {
+		*f = arr
+		return nil
+	}
+
+	// Try to unmarshal as a single string
+	var s string
+	if err := json.Unmarshal(data, &s); err == nil {
+		*f = []string{s}
+		return nil
+	}
+
+	return fmt.Errorf("FlexibleStringSlice: expected string or []string, got %s", string(data))
+}
+
 // CompletionFile represents the completion report written by an attempt
 type CompletionFile struct {
 	AttemptIndex  int            `json:"attempt_index"`
@@ -407,15 +431,15 @@ func FindAdversarialReviewFile(worktreePath string) (string, error) {
 
 // AdversarialReviewFile represents the reviewer's feedback on an attempt
 type AdversarialReviewFile struct {
-	AttemptIndex    int      `json:"attempt_index"`    // Which attempt (0-2) this review is for
-	Round           int      `json:"round"`            // Review round (starts at 1)
-	Approved        bool     `json:"approved"`         // Whether the implementation is approved
-	Score           int      `json:"score"`            // Quality score (1-10)
-	Strengths       []string `json:"strengths"`        // What was done well
-	Issues          []string `json:"issues"`           // Problems that must be fixed
-	Suggestions     []string `json:"suggestions"`      // Optional improvements
-	Summary         string   `json:"summary"`          // Overall assessment
-	RequiredChanges []string `json:"required_changes"` // Specific changes needed (if not approved)
+	AttemptIndex    int                 `json:"attempt_index"`    // Which attempt (0-2) this review is for
+	Round           int                 `json:"round"`            // Review round (starts at 1)
+	Approved        bool                `json:"approved"`         // Whether the implementation is approved
+	Score           int                 `json:"score"`            // Quality score (1-10)
+	Strengths       FlexibleStringSlice `json:"strengths"`        // What was done well
+	Issues          FlexibleStringSlice `json:"issues"`           // Problems that must be fixed
+	Suggestions     FlexibleStringSlice `json:"suggestions"`      // Optional improvements
+	Summary         string              `json:"summary"`          // Overall assessment
+	RequiredChanges FlexibleStringSlice `json:"required_changes"` // Specific changes needed (if not approved)
 }
 
 // Validate checks that the AdversarialReviewFile has valid values.
diff --git a/internal/tui/view/tripleshot.go b/internal/tui/view/tripleshot.go
@@ -420,7 +420,7 @@ func RenderTripleShotEvaluation(ctx TripleShotRenderContext) string {
 	// Reasoning
 	lines = append(lines, tsSubtle.Render("Reasoning:"))
 	// Word wrap reasoning
-	words := strings.Fields(eval.Reasoning)
+	words := strings.Fields(eval.Reasoning.String())
 	var line string
 	for _, word := range words {
 		if len(line)+len(word)+1 > ctx.Width-4 {

`186`	`186`