Scope reviewed:
- `src/team/runtime.ts`
- `src/team/runtime-cli.ts`
- `src/team/tmux-session.ts`
- `skills/omc-teams/SKILL.md`
- Evidence:
  - Tasks start as `pending`: `src/team/runtime.ts:133`.
  - Non-Claude startup path sends instructions but does not mark task file `in_progress`: `src/team/runtime.ts:193`, `src/team/runtime.ts:206`, `src/team/runtime.ts:213`.
  - Watchdog later writes terminal status directly: `src/team/runtime.ts:241`.
  - Phase inference marks `planning` when `inProgress=0`, `pending>0`, `completed=0`: `src/team/runtime.ts:317`.
- Impact:
  - During actual codex/gemini work, `monitorTeam()` can report `planning` instead of `executing`.
  - This skews runtime-cli logs and failure heuristics based on outstanding work.
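
The phase rule cited in the evidence can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical `inferPhase` helper and `TaskCounts` shape (neither name is taken from `runtime.ts`). Only the `planning` condition comes from the review; the other two branches are illustrative guesses.

```typescript
// Hypothetical reconstruction of the phase rule. The names and the
// non-"planning" branches are assumptions, not the code in src/team/runtime.ts.
type Phase = "planning" | "executing" | "completed";

interface TaskCounts {
  pending: number;
  inProgress: number;
  completed: number;
}

function inferPhase(c: TaskCounts): Phase {
  // Rule from the review: nothing in progress, nothing completed, work pending.
  if (c.inProgress === 0 && c.pending > 0 && c.completed === 0) return "planning";
  // Assumed: everything finished.
  if (c.pending === 0 && c.inProgress === 0 && c.completed > 0) return "completed";
  return "executing";
}

// Non-Claude path as reviewed: workers are busy, but task files still say "pending".
const withoutMarking = inferPhase({ pending: 2, inProgress: 0, completed: 0 });

// Same moment if the runtime had marked both tasks in_progress before notifying.
const withMarking = inferPhase({ pending: 0, inProgress: 2, completed: 0 });

console.log(withoutMarking, withMarking); // planning executing
```

Under these assumptions, the only difference between the two calls is whether the task files were updated before the workers were notified.
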
2) High: tasks[i] ?? tasks[0] with more workers than tasks duplicates work and can produce non-existent task IDs
- Evidence:
  - Fallback duplication: `src/team/runtime.ts:187`, `src/team/runtime.ts:206`.
  - Task ID still computed as `i+1`: `src/team/runtime.ts:189`, `src/team/runtime.ts:208`.
  - Only `tasks.length` task files are created: `src/team/runtime.ts:131`.
  - Completion writes to `tasks/{event.taskId}.json`, guarded by existence: `src/team/runtime.ts:239`, `src/team/runtime.ts:241`.
- Impact:
  - Multiple workers may execute same task content.
  - Worker may emit `taskId` with no corresponding task file; result is dropped from task-state accounting.
- Note:
  - Skill doc says decomposition should produce exactly N subtasks: `skills/omc-teams/SKILL.md:64`.
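
The mismatch between the fallback and the `i+1` ID can be shown with a small model of the startup loop. This is a sketch under assumptions: `Assignment`, `assignWithFallback`, and `assignStrict` are hypothetical names, and the real loop in `runtime.ts` may differ in detail.

```typescript
// Hypothetical model of the startup loop; not the code in src/team/runtime.ts.
interface Assignment {
  worker: number;      // worker index (0-based)
  taskId: string;      // ID the worker is told to report against
  description: string; // task content sent to the worker
}

// Mirrors the reviewed behaviour: fall back to tasks[0], but still use i + 1 as the ID.
function assignWithFallback(tasks: string[], workerCount: number): Assignment[] {
  if (tasks.length === 0) throw new Error("no tasks");
  const out: Assignment[] = [];
  for (let i = 0; i < workerCount; i++) {
    const description = tasks[i] ?? (tasks[0] as string);
    out.push({ worker: i, taskId: String(i + 1), description });
  }
  return out;
}

// Possible guard: refuse to start more workers than there are tasks.
function assignStrict(tasks: string[], workerCount: number): Assignment[] {
  if (workerCount > tasks.length) {
    throw new Error(`workerCount (${workerCount}) exceeds task count (${tasks.length})`);
  }
  return tasks
    .slice(0, workerCount)
    .map((description, i) => ({ worker: i, taskId: String(i + 1), description }));
}

const demoTasks = ["write parser", "write tests"];
// Only tasks.length task files exist, so only IDs "1" and "2" are valid.
const existingIds = new Set(demoTasks.map((_, i) => String(i + 1)));
const orphaned = assignWithFallback(demoTasks, 3).filter((a) => !existingIds.has(a.taskId));

console.log(orphaned); // one entry: worker 2, taskId "3", description "write parser"
```

With three workers and two tasks, the third worker repeats the first task's content and reports against an ID that has no task file.
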
- Evidence:
  - `collectTaskResults()` runs before watchdog stop + shutdown: `src/team/runtime-cli.ts:118`, `src/team/runtime-cli.ts:122`.
  - Stopping watchdog only clears interval; in-flight tick/callback may still run: `src/team/runtime.ts:379`, `src/team/runtime.ts:381`, `src/team/runtime.ts:365`.
  - Callback can update task files after results were already collected: `src/team/runtime.ts:245`.
- Impact:
  - CLI output may contain stale `pending`/`unknown` summaries despite work finishing moments later.
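
One way to close this window is a stop routine that waits for the in-flight tick before results are read. The sketch below is illustrative only: `DrainableWatchdog` and `shutdownThenCollect` are hypothetical names, and the simulated tick stands in for the real completion callback.

```typescript
// Sketch only: DrainableWatchdog is a hypothetical name, not the runtime's watchdog.
class DrainableWatchdog {
  private timer: ReturnType<typeof setInterval> | undefined;
  private inFlight: Promise<void> = Promise.resolve();
  private readonly tick: () => Promise<void>;
  private readonly intervalMs: number;

  constructor(tick: () => Promise<void>, intervalMs: number) {
    this.tick = tick;
    this.intervalMs = intervalMs;
  }

  start(): void {
    this.timer = setInterval(() => {
      // Chain ticks so they never overlap and stop() can await the tail.
      this.inFlight = this.inFlight.then(() => this.tick()).catch(() => undefined);
    }, this.intervalMs);
  }

  // Clears the interval AND waits for any tick that has already been scheduled.
  async stop(): Promise<void> {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
    await this.inFlight;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function shutdownThenCollect(): Promise<string> {
  // Simulated task state: the tick takes a while before it records completion.
  const task = { taskStatus: "pending" };
  const watchdog = new DrainableWatchdog(async () => {
    await sleep(20);
    task.taskStatus = "completed";
  }, 5);

  watchdog.start();
  await sleep(10);        // a tick is now in flight
  await watchdog.stop();  // drain before reading results
  return task.taskStatus; // read only after the drain
}

shutdownThenCollect().then((result) => console.log(result));
```

If `stop()` only cleared the interval, the read would happen while the tick was still sleeping and would observe `pending`.
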
- Evidence:
  - Fixed sleep for codex/gemini: `src/team/runtime.ts:195`.
  - Claude has an explicit readiness protocol (`.ready`) with timeout: `src/team/runtime.ts:184`.
- Impact:
  - If codex/gemini startup is slower than 4s, initial instruction may land before input is accepted.
  - If faster, every worker pays avoidable latency.
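
A bounded poll is one alternative to the fixed sleep. The sketch below assumes a hypothetical `waitForReady` helper with an injected probe. What the probe checks for codex/gemini (pane output, a marker file, or something else) is not covered by the review and is left open here.

```typescript
// Hypothetical helper: polls an injected probe instead of sleeping a fixed 4000ms.
interface ReadyOptions {
  timeoutMs: number;  // upper bound on total wait
  intervalMs: number; // delay between probes
}

async function waitForReady(
  probe: () => boolean | Promise<boolean>,
  opts: ReadyOptions,
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    if (await probe()) return true;
    // Give up if the next poll would land past the deadline.
    if (Date.now() + opts.intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
}

// Test double: reports ready on the n-th call.
function readyAfter(n: number): () => boolean {
  let calls = 0;
  return () => ++calls >= n;
}

waitForReady(readyAfter(3), { timeoutMs: 1000, intervalMs: 5 }).then((ready) =>
  console.log("ready:", ready),
);
```

A fast CLI returns on the first successful probe, and a slow one gets the full timeout instead of a guess.
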
- Evidence:
  - Hard truncation at 200 chars: `src/team/tmux-session.ts:300`.
  - Non-Claude trigger message includes relative path with full `teamName`: `src/team/runtime.ts:213`.
  - Runtime CLI does not constrain `teamName` length: `src/team/runtime-cli.ts:80`.
- Impact:
  - Long `teamName` can truncate the path instruction, causing worker to read an invalid/incomplete path.
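
The truncation effect is easy to reproduce. In this sketch the message template and path layout are placeholders (the real trigger text is not quoted in the review), and `buildTrigger`, `truncate`, and `checkedTrigger` are hypothetical names. Only the 200-character limit comes from the evidence.

```typescript
const MAX_MESSAGE_CHARS = 200; // limit cited in the review (src/team/tmux-session.ts:300)

// Placeholder template and path layout; the real trigger text may differ.
function buildTrigger(teamName: string, workerName: string): string {
  return `Read and follow the instructions in state/${teamName}/${workerName}/inbox.md`;
}

// Models the hard cut applied before the message is sent.
function truncate(message: string): string {
  return message.slice(0, MAX_MESSAGE_CHARS);
}

// Possible guard: refuse to send rather than deliver a cut-off path.
function checkedTrigger(teamName: string, workerName: string): string {
  const message = buildTrigger(teamName, workerName);
  if (message.length > MAX_MESSAGE_CHARS) {
    throw new Error(`trigger is ${message.length} chars; limit is ${MAX_MESSAGE_CHARS}`);
  }
  return message;
}

const longTeamName = "t".repeat(190);
const sent = truncate(buildTrigger(longTeamName, "worker-1"));

console.log(sent.length, sent.endsWith("inbox.md")); // 200 false
```

The worker receives a syntactically plausible but incomplete path, which is harder to diagnose than an explicit error at send time.
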
- Evidence:
  - Shutdown waits for all worker ACK files until deadline: `src/team/runtime.ts:437`, `src/team/runtime.ts:441`, `src/team/runtime.ts:445`.
  - Runtime CLI comment: non-Claude workers never write shutdown ACK; passes `2000ms`: `src/team/runtime-cli.ts:126`, `src/team/runtime-cli.ts:133`.
  - Poll sleep granularity: 500ms: `src/team/runtime.ts:451`.
- Effective timeout:
  - In runtime-cli path: approximately 2.0s to 2.5s before kill/cleanup.
  - In default callers: full default 30s (`src/team/runtime.ts:425`) can be wasted.
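
An agent-type-aware ACK set would avoid waiting on workers that never acknowledge. A minimal sketch, assuming hypothetical `TeamWorker`, `expectedAckWorkers`, and `ackWaitMs` names and taking the CLI comment at face value that only Claude workers write a shutdown ACK:

```typescript
type AgentType = "claude" | "codex" | "gemini";

interface TeamWorker {
  workerName: string;
  agentType: AgentType;
}

// Assumption from the reviewed CLI comment: only Claude workers write a shutdown ACK.
function expectedAckWorkers(workers: TeamWorker[]): string[] {
  return workers.filter((w) => w.agentType === "claude").map((w) => w.workerName);
}

// Wait budget: if no worker is expected to ACK, do not wait at all.
function ackWaitMs(workers: TeamWorker[], defaultTimeoutMs: number): number {
  return expectedAckWorkers(workers).length === 0 ? 0 : defaultTimeoutMs;
}

const mixedTeam: TeamWorker[] = [
  { workerName: "w1", agentType: "claude" },
  { workerName: "w2", agentType: "codex" },
];
const codexOnlyTeam: TeamWorker[] = [
  { workerName: "w1", agentType: "codex" },
  { workerName: "w2", agentType: "gemini" },
];

console.log(expectedAckWorkers(mixedTeam), ackWaitMs(codexOnlyTeam, 30000)); // [ 'w1' ] 0
```

With this shape, a team without Claude workers skips the wait entirely instead of burning the default timeout.
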
7) Medium: Watchdog partial done.json writes are tolerated, but there is no explicit atomic-write contract
- Evidence:
  - Watchdog polls every 3000ms: `src/team/runtime.ts:229`, `src/team/runtime.ts:379`.
  - Read/parse failures return `null` and are ignored: `src/team/runtime.ts:79`, `src/team/runtime.ts:351`.
- Impact:
  - Partial/incomplete JSON during write is not fatal; watchdog retries next tick.
  - Main downside is completion latency and dependency on eventual valid rewrite.
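
An explicit contract could pair a temp-file-plus-rename writer with a validating reader. The sketch below is an assumption-laden illustration: the `DoneSignal` fields are invented for the example (the real `done.json` schema is not shown in the review), and `writeDoneAtomic` and `readDone` are hypothetical names.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed payload shape; the real done.json fields are not shown in the review.
interface DoneSignal {
  taskId: string;
  status: "completed" | "failed";
}

// Writer side: write to a temp file in the same directory, then rename over the target.
function writeDoneAtomic(file: string, signal: DoneSignal): void {
  const tmp = `${file}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(signal));
  fs.renameSync(tmp, file); // atomic on POSIX when both paths share a filesystem
}

// Reader side: tolerate missing or partial files and reject payloads missing required fields.
function readDone(file: string): DoneSignal | null {
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(file, "utf8"));
    if (typeof parsed !== "object" || parsed === null) return null;
    const { taskId, status } = parsed as Record<string, unknown>;
    if (typeof taskId !== "string") return null;
    if (status !== "completed" && status !== "failed") return null;
    return { taskId, status };
  } catch {
    return null; // unreadable or invalid JSON: caller retries on the next tick
  }
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "done-"));
const doneFile = path.join(dir, "done.json");

fs.writeFileSync(doneFile, '{"taskId": "1", "sta'); // simulate a partial, non-atomic write
const partial = readDone(doneFile);

writeDoneAtomic(doneFile, { taskId: "1", status: "completed" });
const complete = readDone(doneFile);

console.log(partial, complete); // null { taskId: '1', status: 'completed' }
```

With the rename in place, a reader sees either the previous file or the complete new one, so the retry path is only exercised by writers that do not follow the contract.
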
- Evidence:
  - Liveness check: `src/team/tmux-session.ts:417`, `src/team/tmux-session.ts:420`.
  - Worker launch uses `exec` so shell is replaced by CLI process: `src/team/tmux-session.ts:246`.
- Conclusion:
  - For actual process exit, this check is appropriate.
  - It cannot detect "alive but unresponsive" states without heartbeat/IO heuristics.
- Evidence:
  - `processed` set is keyed by worker name and never cleared: `src/team/runtime.ts:341`, `src/team/runtime.ts:348`, `src/team/runtime.ts:355`.
  - `assignTask()` exists for additional assignments: `src/team/runtime.ts:387`.
- Impact:
  - If a worker is assigned multiple tasks sequentially, only first `done.json` is consumed.
  - Subsequent completions from the same worker are ignored.
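
The effect of the dedup key can be seen by running the same completion events through two key functions. This is a simplified model, not the watchdog itself: `DoneEvent` and `consume` are hypothetical names.

```typescript
interface DoneEvent {
  workerName: string;
  taskId: string;
}

// Returns the events a watchdog would consume, given a dedup key function.
function consume(events: DoneEvent[], keyOf: (e: DoneEvent) => string): DoneEvent[] {
  const processed = new Set<string>(); // never cleared, as in the reviewed design
  const consumed: DoneEvent[] = [];
  for (const e of events) {
    const key = keyOf(e);
    if (processed.has(key)) continue; // treated as already handled
    processed.add(key);
    consumed.push(e);
  }
  return consumed;
}

const doneEvents: DoneEvent[] = [
  { workerName: "worker-1", taskId: "1" },
  { workerName: "worker-1", taskId: "3" }, // second assignment to the same worker
];

const byWorker = consume(doneEvents, (e) => e.workerName);
const byWorkerAndTask = consume(doneEvents, (e) => `${e.workerName}:${e.taskId}`);

console.log(byWorker.length, byWorkerAndTask.length); // 1 2
```

Keying on worker name drops the second completion; keying on worker plus task ID keeps both while still deduplicating repeated reads of the same signal.
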
10) Medium: Initial assignment message tells worker to "claim tasks", but runtime pre-assigns task IDs inconsistently
- Evidence:
  - Initial inbox prompt says "claim tasks" from task dir: `src/team/runtime.ts:157`.
  - Runtime also pushes an explicit "Initial Task Assignment" with fixed `Task ID`: `src/team/runtime.ts:190`, `src/team/runtime.ts:209`.
- Impact:
  - Worker behavior may diverge (self-claim vs. follow fixed ID), increasing protocol ambiguity and state drift.
- Evidence:
  - Function signature includes `sessionName`: `src/team/tmux-session.ts:296`.
  - Body only targets pane ID; session name is ignored.
- Impact:
  - Not a functional bug, but indicates API drift and potential confusion for callers.
- Evidence:
  - State path uses `teamName` directly: `src/team/runtime.ts:71`.
  - `tmux` names are sanitized separately in `tmux-session.ts`, but file paths are not.
- Impact:
  - Path characters from `teamName` can make instructions/path lengths fragile; path traversal risk depends on upstream validation (not visible in reviewed scope).
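
If upstream validation turns out to be missing, a single check at the point where `teamName` enters the runtime would cover both the path and length concerns. The policy below is purely illustrative: the allowed character set and the 40-character cap are assumptions, and `isSafeTeamName` is a hypothetical name.

```typescript
// Hypothetical policy: lowercase letters, digits, and hyphens; 1 to 40 chars;
// must start with a letter or digit. Not taken from the reviewed code.
const TEAM_NAME_PATTERN = /^[a-z0-9][a-z0-9-]{0,39}$/;

function isSafeTeamName(teamName: string): boolean {
  return TEAM_NAME_PATTERN.test(teamName);
}

const candidates = ["review-team", "../../etc", "a".repeat(41), "Team One"];
const verdicts = candidates.map((c) => isSafeTeamName(c));

console.log(verdicts); // [ true, false, false, false ]
```

An allow-list like this rejects separators and `..` segments outright, and the length cap keeps state paths well inside any message-size limit.
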
- Non-Claude workers skipping `in_progress` does cause `monitorTeam` phase misreporting (`planning` while executing).
- Hardcoded 4000ms can be too short (lost/garbled first instruction) or too long (wasted latency).
- `tasks[i] ?? tasks[0]` is likely unintended for production; it can duplicate work and emit non-existent task IDs.
- Partial `done.json` is retried safely (parse failure -> ignore), but completion may be delayed.
- 200-char limit is not always safe for long `teamName` paths.
- Runtime-cli effectively waits ~2-2.5s; default shutdown path can still waste up to 30s.
- There is a real race around result collection vs in-flight watchdog completion updates.
- `pane_dead` correctly indicates exited codex worker process, but cannot detect hangs.
- Additional fragility exists (one-completion-per-worker watchdog design, protocol ambiguity, API drift).
- Mark non-Claude initial tasks `in_progress` before notifying workers.
- Enforce `workerCount <= tasks.length` (or generate concrete extra tasks) and remove `tasks[i] ?? tasks[0]` fallback.
- Replace fixed 4000ms with readiness probe per CLI type.
- Make done-signal writes atomic by contract (`tmp + rename`) and validate required fields.
- Remove 200-char truncation risk by using short trigger tokens + file-backed payload.
- In shutdown path, skip ACK waiting for non-Claude workers or use agent-type-aware expected ACK set.
- Serialize shutdown with watchdog completion drain before collecting task results.
- Redesign watchdog processed key to include task ID (or sequence), not only worker name.