Skip to content

Commit 8979c19

Browse files
authored
fix: resolve triple-shot second pass caused by merge recommendations (#662)
When the judge recommends "merge" strategy, it populates the suggested_changes field. LLMs frequently write this as a plain string instead of []string, causing json.Unmarshal to fail. The parse failure made VerifyWork return false, the bridge called gate.Fail(), and with defaultMaxRetries=2 the task retried — spawning a duplicate judge. Add FlexibleStringSlice type (mirrors existing FlexibleString) to tolerate string/array mismatches in all LLM-parsed sentinel file structs: Evaluation, AttemptEvaluationItem, AdversarialReviewFile. Also log SetMaxRetries errors instead of silently discarding, and consolidate the redundant Team("judge") lookup in startJudge.
1 parent 667e5a7 commit 8979c19

File tree

10 files changed

+197
-18
lines changed

10 files changed

+197
-18
lines changed

AGENTS.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -309,6 +309,7 @@ This is not exhaustive — update it when you add or discover undocumented packa
309309
- `internal/team/` — Multi-team orchestration with dependency ordering, budget tracking, and inter-team routing *(has `AGENTS.md`)*
310310
- `internal/bridge/` — Connects team Hubs to real Claude Code instances (worktree + tmux) *(has `AGENTS.md`)*
311311
- `internal/orchestrator/bridgewire/` — Adapter types that wire orchestrator infrastructure to bridge interfaces *(has `AGENTS.md`)*
312+
- `internal/orchestrator/workflows/tripleshot/` — Triple-shot workflow: 3 parallel attempts + judge evaluation. Defines sentinel file types (`CompletionFile`, `Evaluation`, `AdversarialReviewFile`) with flexible JSON unmarshaling *(has `AGENTS.md`)*
312313
- `internal/orchestrator/workflows/tripleshot/teamwire/` — Adapts TripleShot to Orchestration 2.0 teams via `TeamCoordinator` + bridge adapters *(has `AGENTS.md`)*
313314
- `internal/pipeline/` — Plan decomposer and multi-phase team pipeline *(has `AGENTS.md`)*
314315
- `internal/tui/` — Bubble Tea terminal UI components *(has `AGENTS.md`)*

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1919

2020
### Fixed
2121

22+
- **Triple-Shot Spurious Second Pass** - Fixed duplicate instance creation in triple-shot workflows. Two root causes: (1) TaskQueue's `defaultMaxRetries=2` caused failed attempt/judge tasks to retry, spawning new instances. Fixed by calling `SetMaxRetries(taskID, 0)` after team creation. (2) Judge "merge" recommendations caused `json.Unmarshal` to fail when the LLM wrote `suggested_changes` as a string instead of `[]string`. The evaluation file parse failure triggered a retry, creating a second judge. Fixed by adding `FlexibleStringSlice` type (mirrors existing `FlexibleString`) to tolerate string/array mismatches in all LLM-parsed sentinel file structs (`Evaluation`, `AttemptEvaluationItem`, `AdversarialReviewFile`).
23+
2224
- **Teamwire TUI Freeze** - Fixed TUI freeze when starting a triple-shot in teamwire mode. `coordinator.Start()` was called synchronously in the Bubble Tea `Update()` handler, blocking the event loop while bridges created git worktrees. Moved startup to an async `tea.Cmd` so the UI remains responsive during initialization.
2325

2426
- **Teamwire Channel Safety** - Fixed potential panic from closing `teamwireEventCh` while callbacks may still write to it (nil-guard before close), goroutine leak from re-subscribing after triple-shot completion, and channel overwrite leak when starting multiple sessions. Surfaced session error details in `PhaseFailed` handler instead of generic "Triple-shot failed" message.
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
# tripleshot — Agent Guidelines
2+
3+
> **Living document.** Update this file when you learn something specific to this package.
4+
> Same rules as the root `AGENTS.md` — see its Self-Improvement Protocol.
5+
6+
## Pitfalls
7+
8+
- **LLM output type mismatches in sentinel files** — LLMs frequently write a plain string where the JSON schema expects `[]string` (e.g., `"suggested_changes": "fix the bug"` instead of `"suggested_changes": ["fix the bug"]`). The `Evaluation`, `AttemptEvaluationItem`, and `AdversarialReviewFile` structs use `FlexibleStringSlice` for all `[]string` fields and `FlexibleString` for `Reasoning` to tolerate this. When adding new LLM-parsed fields of type `string` or `[]string`, use these flexible types instead of bare Go types. Without this, `json.Unmarshal` fails, `VerifyWork` returns false, and the bridge retries the task — spawning a duplicate instance.
9+
- **Sentinel file search in subdirectories**`FindCompletionFile`, `FindEvaluationFile`, and `FindAdversarialReviewFile` all search the worktree root *and* immediate subdirectories. LLM instances sometimes write files relative to their CWD rather than the worktree root. Don't bypass `Find*File` with a direct `filepath.Join(worktree, filename)`.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
AGENTS.md

internal/orchestrator/workflows/tripleshot/session.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -180,7 +180,7 @@ func (m *Manager) SetEvaluation(eval *Evaluation) {
180180

181181
m.emitEvent(Event{
182182
Type: EventEvaluationReady,
183-
Message: eval.Reasoning,
183+
Message: eval.Reasoning.String(),
184184
})
185185
}
186186

internal/orchestrator/workflows/tripleshot/session_test.go

Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -326,6 +326,129 @@ func TestParseEvaluationFile_InvalidJSON(t *testing.T) {
326326
}
327327
}
328328

329+
func TestParseEvaluationFile_FlexibleFields(t *testing.T) {
330+
tests := []struct {
331+
name string
332+
json string
333+
wantStrategy MergeStrategy
334+
wantReasoning string
335+
wantChangesLen int
336+
wantStrengthsLen int
337+
wantWeaknessesLen int
338+
wantFirstChange string
339+
wantFirstStrength string
340+
wantFirstWeakness string
341+
}{
342+
{
343+
name: "merge with suggested_changes as string",
344+
json: `{
345+
"winner_index": -1,
346+
"merge_strategy": "merge",
347+
"reasoning": "Combined approach is best",
348+
"attempt_evaluations": [
349+
{"attempt_index": 0, "score": 7, "strengths": ["Good"], "weaknesses": ["Missing tests"]}
350+
],
351+
"suggested_changes": "Combine A's error handling with B's test structure"
352+
}`,
353+
wantStrategy: MergeStrategyMerge,
354+
wantReasoning: "Combined approach is best",
355+
wantChangesLen: 1,
356+
wantFirstChange: "Combine A's error handling with B's test structure",
357+
wantStrengthsLen: 1,
358+
wantFirstStrength: "Good",
359+
wantWeaknessesLen: 1,
360+
wantFirstWeakness: "Missing tests",
361+
},
362+
{
363+
name: "merge with suggested_changes as array",
364+
json: `{
365+
"winner_index": -1,
366+
"merge_strategy": "merge",
367+
"reasoning": "Merging is ideal",
368+
"attempt_evaluations": [],
369+
"suggested_changes": ["Change A", "Change B"]
370+
}`,
371+
wantStrategy: MergeStrategyMerge,
372+
wantReasoning: "Merging is ideal",
373+
wantChangesLen: 2,
374+
wantFirstChange: "Change A",
375+
},
376+
{
377+
name: "reasoning as array of strings",
378+
json: `{
379+
"winner_index": 0,
380+
"merge_strategy": "select",
381+
"reasoning": ["First point.", "Second point."],
382+
"attempt_evaluations": []
383+
}`,
384+
wantStrategy: MergeStrategySelect,
385+
wantReasoning: "First point.\nSecond point.",
386+
},
387+
{
388+
name: "strengths and weaknesses as strings",
389+
json: `{
390+
"winner_index": 0,
391+
"merge_strategy": "select",
392+
"reasoning": "Best one",
393+
"attempt_evaluations": [
394+
{"attempt_index": 0, "score": 8, "strengths": "Clean implementation", "weaknesses": "No tests"}
395+
]
396+
}`,
397+
wantStrategy: MergeStrategySelect,
398+
wantStrengthsLen: 1,
399+
wantFirstStrength: "Clean implementation",
400+
wantWeaknessesLen: 1,
401+
wantFirstWeakness: "No tests",
402+
},
403+
}
404+
405+
for _, tt := range tests {
406+
t.Run(tt.name, func(t *testing.T) {
407+
tmpDir := t.TempDir()
408+
evalPath := filepath.Join(tmpDir, EvaluationFileName)
409+
if err := os.WriteFile(evalPath, []byte(tt.json), 0644); err != nil {
410+
t.Fatalf("failed to write evaluation file: %v", err)
411+
}
412+
413+
parsed, err := ParseEvaluationFile(tmpDir)
414+
if err != nil {
415+
t.Fatalf("ParseEvaluationFile() error = %v", err)
416+
}
417+
418+
if parsed.MergeStrategy != tt.wantStrategy {
419+
t.Errorf("MergeStrategy = %q, want %q", parsed.MergeStrategy, tt.wantStrategy)
420+
}
421+
if tt.wantReasoning != "" && parsed.Reasoning.String() != tt.wantReasoning {
422+
t.Errorf("Reasoning = %q, want %q", parsed.Reasoning, tt.wantReasoning)
423+
}
424+
if tt.wantChangesLen > 0 {
425+
if len(parsed.SuggestedChanges) != tt.wantChangesLen {
426+
t.Errorf("len(SuggestedChanges) = %d, want %d", len(parsed.SuggestedChanges), tt.wantChangesLen)
427+
}
428+
if len(parsed.SuggestedChanges) > 0 && parsed.SuggestedChanges[0] != tt.wantFirstChange {
429+
t.Errorf("SuggestedChanges[0] = %q, want %q", parsed.SuggestedChanges[0], tt.wantFirstChange)
430+
}
431+
}
432+
if tt.wantStrengthsLen > 0 && len(parsed.AttemptEvaluation) > 0 {
433+
if len(parsed.AttemptEvaluation[0].Strengths) != tt.wantStrengthsLen {
434+
t.Errorf("Strengths len = %d, want %d", len(parsed.AttemptEvaluation[0].Strengths), tt.wantStrengthsLen)
435+
}
436+
if len(parsed.AttemptEvaluation[0].Strengths) > 0 && parsed.AttemptEvaluation[0].Strengths[0] != tt.wantFirstStrength {
437+
t.Errorf("Strengths[0] = %q, want %q", parsed.AttemptEvaluation[0].Strengths[0], tt.wantFirstStrength)
438+
}
439+
}
440+
if tt.wantWeaknessesLen > 0 && len(parsed.AttemptEvaluation) > 0 {
441+
if len(parsed.AttemptEvaluation[0].Weaknesses) != tt.wantWeaknessesLen {
442+
t.Errorf("Weaknesses len = %d, want %d", len(parsed.AttemptEvaluation[0].Weaknesses), tt.wantWeaknessesLen)
443+
}
444+
if len(parsed.AttemptEvaluation[0].Weaknesses) > 0 && parsed.AttemptEvaluation[0].Weaknesses[0] != tt.wantFirstWeakness {
445+
t.Errorf("Weaknesses[0] = %q, want %q", parsed.AttemptEvaluation[0].Weaknesses[0], tt.wantFirstWeakness)
446+
}
447+
}
448+
})
449+
}
450+
}
451+
329452
func TestParseEvaluationFromOutput(t *testing.T) {
330453
tests := []struct {
331454
name string

internal/orchestrator/workflows/tripleshot/teamwire/AGENTS.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ TeamCoordinator
3232
- **Two-phase Start**`Start()` must not hold `tc.mu` when calling `Bridge.Start()`. The bridge's claim loop publishes `BridgeTaskStartedEvent` synchronously, and the handler `onBridgeTaskStarted` acquires `tc.mu`. Holding the lock through `Start()` → bridge claim → event publish → handler → lock = deadlock. The fix: `registerStart()` holds/releases the lock, then `Start()` creates bridges outside it.
3333
- **Event subscription timing** — Subscriptions must happen before `Bridge.Start()` launches the claim loop. Currently done in `registerStart()` (Phase 1, under lock, before Phase 2 bridge creation) — this is the safe window. Don't move subscriptions after Phase 2 begins. For test assertions where you need events, subscribe before calling `Start()`. For production callbacks, use `SetCallbacks` before `Start`.
3434
- **`onTeamCompleted` dispatches to goroutine** — The handler for `team.completed` dispatches `startJudge()` via `go` to avoid deadlock. The synchronous event bus would block if `startJudge` tried to publish events while the bus's `Publish` goroutine holds a lock.
35-
- **Bridge retry vs. completion file status**When `VerifyWork` returns `success=false` (e.g., completion file has `"failed"` status), the bridge calls `gate.Fail()`. Due to TaskQueue retry logic (`defaultMaxRetries=2`), the task returns to Pending and gets re-claimed by the bridge. Each re-claim creates a new instance with a new empty worktree. Tests that depend on failure being final must account for this retry cycle or test handler methods directly.
35+
- **Retries disabled for tripleshot tasks**`registerStart()` and `startJudge()` call `SetMaxRetries(taskID, 0)` to disable TaskQueue's default retry logic (`defaultMaxRetries=2`). Without this, failed attempt/judge tasks would return to Pending and spawn duplicate instances, appearing as a spurious "second pass." The triple-shot workflow has its own redundancy (3 independent attempts), so retrying individual tasks is counterproductive.
3636
- **Every `onJudgeCompleted` failure path must publish `TripleShotJudgeCompletedEvent`** — Use the `failJudge()` helper, which sets session error, transitions to `PhaseFailed`, fires callbacks, and publishes the event. Forgetting the event on one path breaks downstream listeners.
3737
- **Session mutation lock discipline**`tsManager.Session()` returns a raw `*Session` pointer; the `tsManager.mu` RLock only protects the pointer swap, not field access. All session field reads *and* mutations (`JudgeID`, `CompletedAt`, `Error`, `Attempts[i].*`) must hold `tc.mu`. `GetWinningBranch()` also holds `tc.mu` for reads. The lock order `tc.mu → tsManager.mu` is safe (no reverse path exists). Functions like `failJudge` and `startJudge` error paths acquire `tc.mu` for mutations, then release before `notifyCallbacks`/`bus.Publish` to avoid deadlock.
3838
- **`startJudge` snapshot-then-I/O pattern**`startJudge()` must snapshot attempt data (Status, InstanceID) under `tc.mu` before releasing the lock for I/O (GetInstance, ParseCompletionFile). After I/O completes, it re-acquires `tc.mu` to write results back (WorktreePath, Branch) and build the judge prompt. Without the snapshot, `onBridgeTaskCompleted` can write `Attempts[i].Status` concurrently, causing a data race.

internal/orchestrator/workflows/tripleshot/teamwire/teamcoordinator.go

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -247,6 +247,20 @@ func (tc *TeamCoordinator) registerStart(ctx context.Context) (*team.Manager, er
247247
}
248248
}
249249

250+
// Disable retries for attempt tasks. The triple-shot workflow has its
251+
// own redundancy (3 independent attempts), so retrying individual tasks
252+
// just creates duplicate instances that appear as a spurious second pass.
253+
for i := range 3 {
254+
t := mgr.Team(tc.attemptTeamIDs[i])
255+
if t != nil {
256+
taskID := fmt.Sprintf("attempt-%d-task", i)
257+
if err := t.Hub().TaskQueue().SetMaxRetries(taskID, 0); err != nil {
258+
tc.logger.Warn("failed to disable retries for attempt task",
259+
"task_id", taskID, "error", err)
260+
}
261+
}
262+
}
263+
250264
tc.subscribeEvents()
251265

252266
if err := mgr.Start(ctx); err != nil {
@@ -649,6 +663,11 @@ func (tc *TeamCoordinator) startJudge() {
649663
return
650664
}
651665

666+
// Disable retries for the judge task — same rationale as attempt tasks.
667+
if err := judgeTeam.Hub().TaskQueue().SetMaxRetries("judge-task", 0); err != nil {
668+
tc.logger.Warn("failed to disable retries for judge task", "error", err)
669+
}
670+
652671
factory := newAttemptFactory(tc.orch, tc.session)
653672
checker := newJudgeCompletionChecker()
654673
recorder := tc.buildJudgeRecorder()

internal/orchestrator/workflows/tripleshot/types.go

Lines changed: 39 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -114,17 +114,17 @@ type AttemptRoundHistory struct {
114114
type Evaluation struct {
115115
WinnerIndex int `json:"winner_index"` // 0, 1, or 2 (-1 if merged)
116116
MergeStrategy MergeStrategy `json:"merge_strategy"` // Strategy for applying solution
117-
Reasoning string `json:"reasoning"` // Explanation of the decision
117+
Reasoning FlexibleString `json:"reasoning"` // Explanation of the decision
118118
AttemptEvaluation []AttemptEvaluationItem `json:"attempt_evaluations"` // Evaluation of each attempt
119-
SuggestedChanges []string `json:"suggested_changes"` // If merging, changes to make
119+
SuggestedChanges FlexibleStringSlice `json:"suggested_changes"` // If merging, changes to make
120120
}
121121

122122
// AttemptEvaluationItem holds the evaluation for a single attempt
123123
type AttemptEvaluationItem struct {
124-
AttemptIndex int `json:"attempt_index"`
125-
Score int `json:"score"` // 1-10
126-
Strengths []string `json:"strengths"`
127-
Weaknesses []string `json:"weaknesses"`
124+
AttemptIndex int `json:"attempt_index"`
125+
Score int `json:"score"` // 1-10
126+
Strengths FlexibleStringSlice `json:"strengths"`
127+
Weaknesses FlexibleStringSlice `json:"weaknesses"`
128128
}
129129

130130
// CompletionFileName is the sentinel file that attempts write when complete
@@ -158,6 +158,30 @@ func (f FlexibleString) String() string {
158158
return string(f)
159159
}
160160

161+
// FlexibleStringSlice is a custom type that can unmarshal either a JSON array of strings
162+
// or a single JSON string. When unmarshaling a single string, it wraps it in a slice.
163+
// This handles LLM output that writes a plain string where the schema expects []string.
164+
type FlexibleStringSlice []string
165+
166+
// UnmarshalJSON implements json.Unmarshaler for FlexibleStringSlice.
167+
func (f *FlexibleStringSlice) UnmarshalJSON(data []byte) error {
168+
// Try to unmarshal as an array of strings first (most common)
169+
var arr []string
170+
if err := json.Unmarshal(data, &arr); err == nil {
171+
*f = arr
172+
return nil
173+
}
174+
175+
// Try to unmarshal as a single string
176+
var s string
177+
if err := json.Unmarshal(data, &s); err == nil {
178+
*f = []string{s}
179+
return nil
180+
}
181+
182+
return fmt.Errorf("FlexibleStringSlice: expected string or []string, got %s", string(data))
183+
}
184+
161185
// CompletionFile represents the completion report written by an attempt
162186
type CompletionFile struct {
163187
AttemptIndex int `json:"attempt_index"`
@@ -407,15 +431,15 @@ func FindAdversarialReviewFile(worktreePath string) (string, error) {
407431

408432
// AdversarialReviewFile represents the reviewer's feedback on an attempt
409433
type AdversarialReviewFile struct {
410-
AttemptIndex int `json:"attempt_index"` // Which attempt (0-2) this review is for
411-
Round int `json:"round"` // Review round (starts at 1)
412-
Approved bool `json:"approved"` // Whether the implementation is approved
413-
Score int `json:"score"` // Quality score (1-10)
414-
Strengths []string `json:"strengths"` // What was done well
415-
Issues []string `json:"issues"` // Problems that must be fixed
416-
Suggestions []string `json:"suggestions"` // Optional improvements
417-
Summary string `json:"summary"` // Overall assessment
418-
RequiredChanges []string `json:"required_changes"` // Specific changes needed (if not approved)
434+
AttemptIndex int `json:"attempt_index"` // Which attempt (0-2) this review is for
435+
Round int `json:"round"` // Review round (starts at 1)
436+
Approved bool `json:"approved"` // Whether the implementation is approved
437+
Score int `json:"score"` // Quality score (1-10)
438+
Strengths FlexibleStringSlice `json:"strengths"` // What was done well
439+
Issues FlexibleStringSlice `json:"issues"` // Problems that must be fixed
440+
Suggestions FlexibleStringSlice `json:"suggestions"` // Optional improvements
441+
Summary string `json:"summary"` // Overall assessment
442+
RequiredChanges FlexibleStringSlice `json:"required_changes"` // Specific changes needed (if not approved)
419443
}
420444

421445
// Validate checks that the AdversarialReviewFile has valid values.

internal/tui/view/tripleshot.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -420,7 +420,7 @@ func RenderTripleShotEvaluation(ctx TripleShotRenderContext) string {
420420
// Reasoning
421421
lines = append(lines, tsSubtle.Render("Reasoning:"))
422422
// Word wrap reasoning
423-
words := strings.Fields(eval.Reasoning)
423+
words := strings.Fields(eval.Reasoning.String())
424424
var line string
425425
for _, word := range words {
426426
if len(line)+len(word)+1 > ctx.Width-4 {

0 commit comments

Comments
 (0)