Skip to content

Conversation

@Gladdonilli
Copy link
Contributor

@Gladdonilli Gladdonilli commented Jan 11, 2026

Summary

This PR improves background agent reliability by adding safety guards and simplifying completion detection to prevent stuck tasks.

Changes

1. Agent Safety Guards

  • max_steps limit: Added max_steps: 25 to explore agent and max_steps: 30 to librarian agent to prevent infinite loops
  • Tool blocking: Librarian now blocks sisyphus_task, call_omo_agent, and task tools to prevent unintended child spawning

2. Background Agent Completion Detection (PR #655 Implementation)

  • Global timeout: Added MAX_RUN_TIME_MS (15 minutes) to prevent tasks from running forever
  • Simplified idle handler: Removed validateSessionHasOutput() and checkSessionTodos() guards that were causing stuck tasks
  • Minimum idle time: MIN_IDLE_TIME_MS (5 seconds) prevents premature completion from early session.idle events
  • Timeout cleanup: Timer is properly cleared on task completion to prevent memory leaks

3. Configuration Improvements

  • JSONC support: Config paths now check for .jsonc files before .json, enabling comments in config files
  • Category model fix: sisyphus_task sync mode now correctly passes categoryModel for category-based tasks

Completion Detection Flow

┌─────────────────────────────────────────────────────────────┐
│                    COMPLETION FLOW                          │
├─────────────────────────────────────────────────────────────┤
│  1. Agent runs, does work (tool calls, thinking, etc.)      │
│  2. Agent goes idle (no more tool calls/responses)          │
│  3. OpenCode SDK fires session.idle event                   │
│  4. GUARD: Check if elapsed time >= MIN_IDLE_TIME_MS (5s)   │
│     - If < 5s: IGNORE (too early, agent still starting)     │
│     - If >= 5s: ACCEPT as complete                          │
│  5. Clear timeout timer (prevent memory leak)               │
│  6. Mark task.status = "completed"                          │
│  7. Notify parent via noReply batching pattern              │
└─────────────────────────────────────────────────────────────┘

Why Guards Were Removed

The previous guards (validateSessionHasOutput, checkSessionTodos) caused tasks to get stuck:

Guard Problem
validateSessionHasOutput If model returns empty (config issue), waits forever
checkSessionTodos If agent creates todos but never completes them, waits forever

The new approach: Fail fast, surface errors, don't hide them. An empty result is a visible signal of a problem the user can debug.

Safety Nets

Edge Case Protection
Early completion MIN_IDLE_TIME_MS (5s minimum)
Infinite loop max_steps: 25/30
Total hang MAX_RUN_TIME_MS (15 min timeout)
Bad model config Empty result surfaces issue
Unintended spawning Tool blocking in librarian
Memory leak Timeout timer cleared on completion

Testing

  • bun run typecheck passes
  • bun run build succeeds
  • ✅ All 45 background-agent tests pass
  • ✅ Manual testing: explore (background), librarian (background), explore (sync), librarian (sync) all complete successfully

References


Summary by cubic

Improves background agent stability and completion detection to prevent stuck tasks and runaway sessions. Adds model variant support, a workdir-constrained sisyphus_task, and a version-aware installer.

  • Bug Fixes

    • Background tasks: complete on session.idle after 5s; add 15‑min timeout (also on resume) with clear/unref; prevent double-release; reset startedAt on resume; clean session tracking.
    • Agent safety: cap max_steps (explore 25, librarian 30); block task-spawning tools in librarian.
    • Parent agent: fix parentAgent resolution for background tools to stop Prometheus→Build fallback.
    • Sisyphus: pass category model (including variant) in sync mode.
  • New Features

    • Config: JSONC support and a "variant" field for agents/categories; variant applied to messages and task models.
    • Sisyphus task: new workdir parameter to constrain agents to a directory (validated and injected into system/prompt and resume).
    • Installer: write a version-aware plugin entry (@latest/@beta or pinned) and replace existing entry.

Written for commit ec554fc. Summary will update on new commits.

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 6 files

Confidence score: 2/5

  • Double-releasing the concurrency key in src/features/background-agent/manager.ts risks oversubscribing background tasks and breaking the intended queue limits, making the change high risk.
  • Resumed tasks failing to reset startedAt in src/features/background-agent/manager.ts means long-running work can be wrongly marked complete on idle, potentially cutting off user sessions.
  • Global task timeouts in src/features/background-agent/manager.ts are left pending after completion, keeping long-lived timers around and threatening overall process stability.
  • Pay close attention to src/features/background-agent/manager.ts - concurrency handling, resume logic, and timeout cleanup all need fixes.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:189">
P2: Global task timeout is created but never cleared/unref’d on most completion paths, leaving 15‑minute timers alive after tasks finish and potentially keeping the event loop running.</violation>

<violation number="2" location="src/features/background-agent/manager.ts:197">
P1: Concurrency key is released twice on timeout (timeout handler and cleanup), but `release` is not idempotent—double-release can oversubscribe queued tasks</violation>

<violation number="3" location="src/features/background-agent/manager.ts:423">
P2: Resumed tasks can be prematurely completed: startedAt isn’t reset on resume, so the simplified session.idle handler will immediately complete any long-lived resumed task on the first idle event without verifying output/todos.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:361">
P2: Resume error path does not clear the newly created timeout or release the concurrency key, holding resources until retention cleanup and keeping the timer scheduled</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@motonari728
Copy link

Love this PR ❤️ I hope early merge.

Gladdonilli and others added 12 commits January 12, 2026 12:24
- Add max_steps limit to explore (25) and librarian (30) agents
- Block sisyphus_task/call_omo_agent tools in librarian to prevent spawning
- Add global 15-minute timeout for background tasks (MAX_RUN_TIME_MS)
- Simplify session.idle handler - remove validateSessionHasOutput/checkSessionTodos guards
- Add JSONC config file support (.jsonc checked before .json)
- Fix categoryModel passing in sisyphus_task sync mode

Reference: PR code-yeongyu#655, kdcokenny/opencode-background-agents
- Reset startedAt when resuming tasks to prevent immediate completion
  (MIN_IDLE_TIME_MS check was passing immediately for resumed tasks)
- Previous commit already fixed timeout.unref() and double-release prevention
- Set concurrencyKey = undefined after every release() call to prevent
  double-release when multiple code paths try to release the same key
- Add 15-minute timeout timer for resumed tasks (was missing)
- Fixes: promptAsync error, session.deleted, pruneStaleTasksAndNotifications
Address P2 feedback: resume() error handler now properly:
- Clears the timeout timer created for the resumed task
- Releases concurrency key to unblock queued tasks
Allow optional model variant config for agents and categories.
Propagate category variants into task model payloads so
category-driven runs inherit provider-specific variants.

Closes: code-yeongyu#647
Previously, the installer always wrote 'oh-my-opencode' without a version,
causing users who installed beta versions (e.g., bunx oh-my-opencode@beta)
to unexpectedly load the stable version on next OpenCode startup.

Now the installer queries npm dist-tags and writes:
- @latest when current version matches the latest tag
- @beta when current version matches the beta tag
- @<version> when no tag matches (pins to specific version)

This ensures:
- bunx oh-my-opencode install → @latest (tracks stable)
- bunx oh-my-opencode@beta install → @beta (tracks beta tag)
- bunx [email protected] install → @3.0.0-beta.2 (pinned)
Addresses cubic review feedback: installer now replaces existing
oh-my-opencode entries with the new version-aware entry, allowing
users to switch between @latest, @beta, or pinned versions.
Add optional `workdir` parameter that injects strict directory constraints
into spawned agents' system content. Validates absolute paths, existence,
and directory type. Enables orchestrators to delegate work to specific
git worktrees or project subdirectories.

- Validation: absolute path, exists, is directory
- Injection: system content (sync/background) or prompt prepend (resume)
- Documentation: updated tool description and schema
- Tests: validation, injection, and combination scenarios
- Created findFirstMessageWithAgent() to read original session agent from oldest message
- Updated parentAgent resolution in sisyphus_task, call_omo_agent, background_task
- Fixed message.updated handler to only track agent from user messages (not assistant/system)
- Added debug logging for parentAgent resolution troubleshooting

Fixes issue where Prometheus agent would switch to Build when background task notifications were injected, caused by OpenCode writing 'agent: build' to message files mid-session.
- Fixed a spelling error.
- Clarify when to use google_auth: true vs false based on plugin choice
@code-yeongyu
Copy link
Owner

@Gladdonilli i once modified in this way before but kinda worried about double-releasing as cubit bot, do you have any opinions regarding this

@Gladdonilli
Copy link
Contributor Author

@Gladdonilli i once modified in this way before but kinda worried about double-releasing as cubit bot, do you have any opinions regarding this

Ill polish this up more today, might have a more elegant solution

- Release concurrency key immediately in session.idle handler
- Clean up subagentSessions Set on normal completion
- Clean up sessionAgentMap on both completion paths
- Add documentation for validateSessionHasOutput usage
@Gladdonilli Gladdonilli force-pushed the fix/subagent-safety-minimal branch from 4833ebb to ec554fc Compare January 12, 2026 13:17
@Gladdonilli
Copy link
Contributor Author

making a new pr to retrigger reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants