Skip to content

feat: smart provider failover#971

Closed
pipi-1997 wants to merge 11 commits intocode-yeongyu:devfrom
pipi-1997:feat/smart-provider-failover
Closed

feat: smart provider failover#971
pipi-1997 wants to merge 11 commits intocode-yeongyu:devfrom
pipi-1997:feat/smart-provider-failover

Conversation

@pipi-1997
Copy link

@pipi-1997 pipi-1997 commented Jan 21, 2026

Summary

  • Add a smart provider failover system to minimize downtime when primary models become unavailable.

Changes

  • Add failover feature module (model chain resolver, error diagnoser, provider status manager).
  • Add smart-failover hook to swap models on retry loops / provider errors and show a throttled toast.
  • Extend config + JSON schema to support model as pipe syntax or array, plus failover config.
  • Update agent/model resolution paths to use the primary model when multiple are provided.
  • Add documentation for Smart Failover (EN + zh-CN).

Testing

bun run typecheck
bun run build

Related Issues


Summary by cubic

Adds smart provider failover that auto-switches to healthy fallback models when the primary hits rate limits, errors, or quota, cutting downtime and avoiding stuck retry loops. Sessions abort and resume on a compatible model with throttled toasts.

  • New Features

    • Smart failover engine: error diagnoser (with header-aware cooldowns), model chain resolver, and provider status manager with HEALTHY/COOLING/LOCKED/PROBATION.
    • Smart-failover hook: detects retry loops/provider errors, swaps to the next compatible model, aborts/resumes the session, and shows throttled toasts (including a provider-locked alert).
    • Config/schema: model supports pipe syntax or arrays at root, category, and agent levels; hooks list includes smart-failover.
    • Agent resolution: when multiple models are provided, the first is treated as primary; fallbacks respect context-window guardrails.
    • Docs and tests added for Smart Failover (EN + zh-CN) and hook behavior.
  • Migration

    • Enabled by default; disable by adding smart-failover to disabled_hooks.
    • Define model chains via pipe syntax (e.g., providerA/modelA | providerB/modelB) or arrays.

Written for commit 3324ea0. Summary will update on new commits.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 21, 2026

Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement (CLA).

To sign the CLA, please comment on this PR with:

I have read the CLA Document and I hereby sign the CLA

This is a one-time requirement. Once signed, all your future contributions will be automatically accepted.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ (pipi-1997)[https://github.com/pipi-1997]
❌ @庄权
庄权 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@pipi-1997
Copy link
Author

I have read the Contributor License Agreement (CLA) and I agree to its terms.

@pipi-1997 pipi-1997 force-pushed the feat/smart-provider-failover branch from 70be0a7 to d98e091 Compare January 21, 2026 17:06
@pipi-1997
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

@pipi-1997
Copy link
Author

recheck

github-actions bot added a commit that referenced this pull request Jan 21, 2026
@pipi-1997 pipi-1997 force-pushed the feat/smart-provider-failover branch from b9429e9 to e44a4c0 Compare January 21, 2026 17:11
@pipi-1997
Copy link
Author

recheck

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 17 files

Confidence score: 3/5

  • Potential user impact: in src/features/failover/diagnoser.ts, Retry-After/x-ratelimit-reset can be misparsed (HTTP-date or timestamps treated as seconds), leading to incorrect cooldown timing.
  • Score reflects a concrete behavior risk (7/10 severity) even though it is localized to failover cooldown handling.
  • Pay close attention to src/features/failover/diagnoser.ts - header parsing could cause incorrect backoff intervals.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/failover/diagnoser.ts">

<violation number="1" location="src/features/failover/diagnoser.ts:29">
P1: Retry-After/x-ratelimit-reset headers are parsed as raw seconds, misinterpreting HTTP-date or timestamp values and causing incorrect cooldowns.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@pipi-1997
Copy link
Author

Addressed the cooldown header parsing issue identified by cubic: Retry-After now supports both delta-seconds and HTTP-date, and x-ratelimit-reset supports epoch seconds/ms as well as delta-seconds. Added unit tests and regenerated schema; also removed the unused failover strategy config from the schema/docs.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 18 files

Confidence score: 5/5

  • This PR looks low risk; the only noted issue is a documentation mismatch rather than runtime behavior.
  • docs/features/smart-failover.zh-CN.md claims exponential backoff, but the code uses a fixed 5‑minute retry-loop cooling, which could mislead readers.
  • Pay close attention to docs/features/smart-failover.zh-CN.md - align the description with the fixed 5‑minute cooling behavior.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/features/smart-failover.zh-CN.md">

<violation number="1" location="docs/features/smart-failover.zh-CN.md:46">
P2: Docs claim exponential backoff for cooling, but retry-loop cooling is fixed 5 minutes in code</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a comprehensive smart provider failover system to automatically handle model unavailability due to rate limits, balance issues, or quota exhaustion. The system monitors session errors and retry loops, marks problematic providers with states (HEALTHY/COOLING/LOCKED/PROBATION), and seamlessly switches to fallback models while aborting and resuming sessions with minimal user disruption.

Changes:

  • Introduced a failover feature module with error diagnosis, provider status management, and model chain resolution supporting both pipe syntax (model1 | model2) and array syntax (["model1", "model2"])
  • Added smart-failover hook that detects provider issues, maintains provider state, performs automatic failover with exponential backoff, and shows throttled toast notifications
  • Extended configuration schema and JSON schema to accept model as either string or string array at root, category, and agent levels

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/hooks/smart-failover/index.ts Core hook implementing failover logic, session management, and error handling
src/hooks/smart-failover/index.test.ts Test coverage for hook behavior including swapping, cooling, and recovery
src/features/failover/types.ts TypeScript interfaces for provider states and diagnostic results
src/features/failover/status-manager.ts Singleton managing provider health states with automatic probation transitions
src/features/failover/resolver.ts Parses model configurations into primary/fallback chains
src/features/failover/diagnoser.ts Pattern-based error diagnosis with retry-after header parsing
src/features/failover/diagnoser.test.ts Tests for header-based cooldown calculation
src/index.ts Integration of smart-failover hook into plugin lifecycle
src/hooks/index.ts Export of createSmartFailoverHook function
src/config/schema.ts Schema updates allowing model as union of string or string array
src/plugin-handlers/config-handler.ts Extraction of primary model from arrays for system defaults
src/agents/utils.ts Agent resolution updated to extract primary from model arrays
src/agents/types.ts AgentOverrideConfig type updated to accept model as string or array
src/agents/sisyphus-junior.ts Sisyphus Junior agent updated to handle model arrays
src/tools/delegate-task/tools.ts Category config resolution updated to extract primary from arrays
docs/features/smart-failover.md English documentation for smart failover feature
docs/features/smart-failover.zh-CN.md Chinese documentation for smart failover feature
assets/oh-my-opencode.schema.json JSON schema updates for model field accepting string or array

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pipi-1997
Copy link
Author

Applied remaining Copilot review suggestions (13 inline comments) and pushed new commits:

  • refactor(smart-failover): address copilot review (constants, remove @ts-ignore, instance-local state, safer props, bounded cooldown/backoff, improved busy-retry behavior, tighter diagnoser patterns, ASI consistency)
  • docs(smart-failover): clarify enablement and cooldown

Local verification: typecheck + smart-failover/diagnoser tests pass.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 18 files

Confidence score: 3/5

  • Potential runtime bug: in src/agents/sisyphus-junior.ts, primaryModel can be undefined when override.model is an empty array, which may propagate into AgentConfig.model/isGptModel usage.
  • Moderate risk given the 6/10 severity and clear user-facing impact if the model selection ends up undefined.
  • Pay close attention to src/agents/sisyphus-junior.ts - guard against empty override.model to avoid undefined model selection.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/agents/sisyphus-junior.ts">

<violation number="1" location="src/agents/sisyphus-junior.ts:94">
P2: primaryModel can be undefined when override.model is an empty array, leading to invalid AgentConfig.model/isGptModel usage.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@pipi-1997
Copy link
Author

Fixed a new issue identified by cubic (confidence 3/5): createSisyphusJuniorAgentWithOverrides now guards against override.model being an empty array and falls back to system default / built-in default. Added unit tests.

Commit: 2fc548e (pushed to the PR branch)

@cubic-dev-ai could you please re-run the review for the latest HEAD? (aiming for 5/5)

@cubic-dev-ai
Copy link

cubic-dev-ai bot commented Jan 23, 2026

Fixed a new issue identified by cubic (confidence 3/5): createSisyphusJuniorAgentWithOverrides now guards against override.model being an empty array and falls back to system default / built-in default. Added unit tests.

Commit: 2fc548e (pushed to the PR branch)

@cubic-dev-ai could you please re-run the review for the latest HEAD? (aiming for 5/5)

@pipi-1997 I have started the AI code review. It will take a few minutes to complete.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 19 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

@pipi-1997
Copy link
Author

Resolved merge conflicts by syncing latest dev into the PR branch and fixing follow-up type issues from upstream model resolver changes (normalize model override arrays to primary strings where required). Verified: bun run typecheck + full bun test pass on the updated branch HEAD.

@code-yeongyu
Copy link
Owner

Thank you for this contribution! This PR has been superseded by #1408 which implements a more comprehensive runtime fallback system with automatic model switching on API errors. Closing in favor of that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants