
Harden provider routing and implement health failover#421

Merged
awsl233777 merged 8 commits into awsl-project:main from
luodaoyi:pr/provider-routing-health-hardening-codeonly
Mar 18, 2026

Conversation

@luodaoyi (Contributor) commented Mar 17, 2026

1. What this change solves

There were two problems:

  1. Route selection was mostly static.
    • priority only looked at Position
    • weighted_random was not previously true weighted randomness
  2. A bad provider amplified losses.
    • The current request got dragged down by slow failures, long timeouts, and long Retry-After waits
    • Subsequent requests kept hitting the same bad provider

The goals of this change:

  1. Stop the current request from being held up by a bad provider for long
  2. Automatically bias subsequent requests toward recently healthy providers
  3. Make weighted routing and health state take effect together

2. Final design overview

This is not a rewrite of the whole architecture; it closes the scheduling loop on top of the existing Router + Executor + Handler.

Routing layer

Router.Match now:

  1. Filters routes by tenant / clientType / project
  2. Skips providers in cooldown
  3. Skips providers whose breaker is already open
  4. Orders by strategy
    • priority: by Position
    • weighted_random: true weighted randomness
  5. If the health tracker is enabled, folds recent health into the final order

Execution layer

Executor.dispatch now:

  1. Applies a shared RequestTimeout budget to the whole request
  2. Applies a per-attempt TotalTimeout cap
  3. Monitors FirstByteTimeout
  4. Monitors StreamIdleTimeout
  5. Classifies failures uniformly
  6. Feeds every attempt's result back into the health tracker

Health layer

The Tracker:

  1. Maintains recent successes/failures
  2. Maintains EWMA latency and EWMA TTFT
  3. Maintains the in-flight count
  4. Maintains the closed / open / half_open breaker
  5. Provides health scores to the Router

3. The routing algorithm

3.1 priority

This strategy is unchanged:

Lower Position wins

3.2 True weighted_random

This change finally wires Route.Weight through.

A route now has:

Weight int

Default rule:

weight <= 0 is treated as 1

The algorithm is weighted random sorting without replacement:

u = Uniform(0, 1]
key = u^(1/weight)
sort by key, descending

Meaning:

  1. The higher the weight, the more likely a route ranks first
  2. Randomness is preserved, so the order never degenerates into a fixed sequence

3.3 Weight and health combined

Weighted randomness alone is not enough.
A hard sort by health score alone would defeat the weights.

So the two are now computed jointly:

effectiveWeight = routeWeight * healthWeightMultiplier(score)

Current health multiplier:

score clamped to [-1500, 1500]
multiplier = exp(score / 300)

The effect:

  1. High-weight routes are still more likely to be selected
  2. Recently healthier providers get a further weight boost
  3. Bad providers are never deterministically pinned dead last, but they are sharply down-weighted
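The joint computation can be sketched as follows (function name illustrative; the clamp bounds and divisor come from the description above):

```go
package main

import (
	"fmt"
	"math"
)

// effectiveWeight combines the configured route weight with the recent
// health score, per the clamp-then-exponentiate rule: scores clamp to
// [-1500, 1500] and map through exp(score/300), so a fully healthy
// provider gets up to ~148x of its configured weight and a fully
// unhealthy one down to ~1/148x, never exactly zero.
func effectiveWeight(routeWeight int, healthScore float64) float64 {
	if routeWeight <= 0 {
		routeWeight = 1 // same default as the routing layer
	}
	s := math.Max(-1500, math.Min(1500, healthScore))
	return float64(routeWeight) * math.Exp(s/300)
}

func main() {
	fmt.Printf("neutral:   %.2f\n", effectiveWeight(10, 0))    // 10.00
	fmt.Printf("healthy:   %.2f\n", effectiveWeight(10, 600))  // ~73.89
	fmt.Printf("unhealthy: %.2f\n", effectiveWeight(10, -600)) // ~1.35
}
```

Because the multiplier is strictly positive, a degraded provider is heavily down-weighted but can still be drawn occasionally, which is what lets it recover once its recent stats improve.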

4. Recent health

Health state is keyed by:

providerID + clientType

The Tracker maintains these metrics:

  1. recentSuccess
  2. recentFailure
  3. recentTimeout
  4. recentRateLimit
  5. recentServerErr
  6. ewmaLatency
  7. ewmaTTFT
  8. inFlight
  9. consecutiveUnhealthyFailures
  10. breaker state

Time decay

Recent stats are not accumulated forever; they decay over time:

factor = exp(-elapsed / halfLife)
recentX *= factor

The current half-life is:

2 minutes

So behavior over the last few minutes carries the most weight, and old failures gradually fade out.

Scoring

The health score is roughly composed of:

  1. A bonus for success rate
  2. A penalty for latency / TTFT
  3. A penalty for timeouts / 429s / 5xx
  4. A penalty for in-flight load
  5. An extra penalty while the breaker is open / half-open

This is not a complex model; it is an interpretable heuristic score.

5. The breaker

There are three states:

  1. closed
  2. open
  3. half_open

Transitions:

closed --(consecutive unhealthy failures)--> open
open --(cooldown elapsed)--> half_open
half_open --(probe succeeds)--> closed
half_open --(probe fails unhealthily)--> open
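The transition table above can be sketched as a small state machine (names and the consecutive-failure threshold are illustrative simplifications of the PR's tracker; a benign 4xx during probing, which only releases the probe without reopening, is omitted here):

```go
package main

import "fmt"

type breakerState int

const (
	closed breakerState = iota
	open
	halfOpen
)

// step applies the documented transition table. "Unhealthy" means a
// timeout, network error, or 5xx (and, while probing, a 429 as well);
// cooldown expiry is modeled as an explicit event for simplicity.
func step(s breakerState, consecutiveFailures, threshold int, cooldownDone, probeOK bool) breakerState {
	switch s {
	case closed:
		if consecutiveFailures >= threshold {
			return open // closed --(consecutive unhealthy failures)--> open
		}
	case open:
		if cooldownDone {
			return halfOpen // open --(cooldown elapsed)--> half_open
		}
	case halfOpen:
		if probeOK {
			return closed // half_open --(probe succeeds)--> closed
		}
		return open // half_open --(probe fails unhealthily)--> open
	}
	return s
}

func main() {
	s := step(closed, 5, 5, false, false)
	fmt.Println(s == open) // tripped after hitting the failure threshold
	s = step(s, 0, 5, true, false)
	fmt.Println(s == halfOpen) // cooldown elapsed, one probe allowed
	fmt.Println(step(halfOpen, 0, 5, false, true) == closed)
}
```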

What counts as an unhealthy failure

In normal operation:

  1. timeout
  2. network error
  3. 5xx

During half_open probing, a 429 also reopens the breaker.

Why an ordinary 4xx does not reopen

An ordinary 4xx is most likely a problem with the user's request, not a broken provider.

So now:

  1. When half_open sees an ordinary 4xx
  2. Only the probe is released
  3. The breaker is not reopened

This keeps a single bad request from knocking a healthy provider back out.

Why only one probe is allowed

The whole point of half_open is to let exactly one recovery probe through.

The current implementation guarantees:

  1. Only one probe at a time
  2. The probe's lifetime ends only when Record() consumes its result
  3. If the parent context is cancelled, the probe is released explicitly so it does not pollute the health score

6. How the execution chain avoids being dragged down by bad providers

Shared request budget

Newly added:

type AttemptBudget struct {
    RequestTimeout    time.Duration
    TotalTimeout      time.Duration
    FirstByteTimeout  time.Duration
    StreamIdleTimeout time.Duration
    MaxRetryAfter     time.Duration
    MaxRetryWait      time.Duration
}

Defaults:

RequestTimeout    = 45s
TotalTimeout      = 30s
FirstByteTimeout  = 10s
StreamIdleTimeout = 15s
MaxRetryAfter     = 30s
MaxRetryWait      = 10s

Purpose:

  1. RequestTimeout caps the total budget of the whole request
  2. TotalTimeout caps a single attempt
  3. FirstByteTimeout guards against an upstream that never sends a first byte
  4. StreamIdleTimeout guards against a stream stalling midway
  5. MaxRetryAfter / MaxRetryWait keep waits from eating the whole budget

Net effect

The previous worst case approached:

routes * retries * provider timeout

It is now hard-capped by:

RequestTimeout

7. First byte / stream idle

A new attempt watchdog was added.

It:

  1. Starts when the attempt starts
  2. Checks whether FirstByteTimeout has been exceeded
  3. Checks whether StreamIdleTimeout has been exceeded
  4. Cancels the current attempt once a timeout fires

To avoid false positives, activity is recorded not only in ResponseWriter.Write but also while reading the upstream body:

  1. First byte read: record the first byte
  2. Subsequent bytes read: record activity
  3. EOF read: stop idle monitoring

This distinguishes:

  1. An upstream that truly never produced a first byte
  2. An upstream that has responded, where only local post-processing is slow

8. Error semantics

Timeouts, deadlines, and errors occurring after the response has started are now normalized into a single ProxyError.

The rules:

  1. Before the response has started:
    • a timeout returns a real 504 to the client
  2. After the response has started:
    • no switching to the next route
    • no second error body, which would corrupt the stream

This fixes the earlier bug where an error body was appended after part of the response had already been written.
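The rule above can be sketched as follows (function name and the 502 fallback status are illustrative; only the 504-before-start and abort-silently-after-start behaviors come from the description above):

```go
package main

import (
	"fmt"
	"net/http"
)

// writeDispatchError decides how an attempt failure surfaces: before any
// byte has been written we can still send a real error status (504 for a
// true timeout); once the response has started we must not write a second
// error body or fail over, because that would corrupt the stream.
func writeDispatchError(started bool, isTimeout bool) (status int, wrote bool) {
	if started {
		return 0, false // response already started: abort silently, no failover
	}
	if isTimeout {
		return http.StatusGatewayTimeout, true // real 504 for a timeout
	}
	return http.StatusBadGateway, true // illustrative fallback for other failures
}

func main() {
	s, wrote := writeDispatchError(false, true)
	fmt.Println(s, wrote) // 504 true
	_, wrote = writeDispatchError(true, true)
	fmt.Println(wrote) // false
}
```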

9. Key code locations

For a quick code review, start with these files:

  1. Route ordering:
    • internal/router/router.go
  2. Health tracking and breaker:
    • internal/health/tracker.go
  3. Execution chain and timeout budget:
    • internal/executor/middleware_dispatch.go
    • internal/executor/attempt_budget.go
    • internal/executor/attempt_watchdog.go
    • internal/executor/attempt_health.go
  4. Upstream body activity observation:
    • internal/flow/attempt_activity.go
  5. Error write-back:
    • internal/handler/proxy.go

10. Test coverage

Unit tests

Covered:

  1. True weighted_random
  2. Combined weight + health ordering
  3. Breaker open / half-open / closed
  4. The single-probe constraint
  5. Ordinary 4xx does not reopen
  6. Request budget
  7. First-byte timeout
  8. Stream idle timeout

e2e

Covered:

  1. Healthy providers preferred after consecutive timeouts
  2. Bad providers skipped while the breaker is open
  3. Request budget cutting off long serial failover

11. Results

With this change in place, the core gains are:

  1. The current request is no longer easily held up by a bad provider for long
  2. Subsequent requests automatically favor recently healthy providers
  3. weighted_random is finally true weighted randomness
  4. Weights, health, the breaker, and recovery probes now form a closed loop

Summary by CodeRabbit

  • New features

    • Weight-based routing: weights are preserved across create/import/backup and applied by the weighted-random strategy.
    • Provider-level health tracking and circuit breaking, feeding routing and failover decisions.
    • New attempt budgets and a request watchdog that monitors first-byte/idle activity and limits retries.
  • Improvements

    • Finer-grained timeouts and error classification/recording; non-streaming paths can disable the first-byte timeout.
    • Upstream response bodies are wrapped to emit activity events and metrics; improved response-started detection and error handling.
  • Tests

    • Extensive unit and end-to-end coverage of health, routing, budgets, timeouts, streaming, and failover behavior.


coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Integrates a provider health tracker (circuit breaker/scoring), health-aware weighted routing, attempt-level budgets with first-byte/idle timeout watchdogs, response-body wrapping with activity callbacks, and related error normalization and health-recording logic into the router, executor, and adapter paths, along with extensive unit and end-to-end tests and startup/test-environment wiring.

Changes

Cohort / File(s) Summary
Health tracking core
internal/health/tracker.go, internal/health/tracker_test.go
New ProviderTracker/Tracker implementation with comprehensive unit tests: breaker states, BeginAttempt/Record/ReleaseHalfOpenProbe, scoring/snapshots, and decay logic.
Router and strategies
internal/router/router.go, internal/router/router_test.go
Inject healthTracker, skip open-circuit providers, and reorder by weight + health score; new weighted-random and health-weighted ordering plus tests.
Executor core and dispatch
internal/executor/executor.go, internal/executor/middleware_dispatch.go, internal/executor/middleware_dispatch_test.go
Executor gains healthTracker injection, AttemptBudget, attemptWatchdog, and attempt-level contexts and timeouts; adds BeginAttempt/Record health recording, error normalization, capRetryWait, and probe release, with extensive tests.
Attempt budget and watchdog
internal/executor/attempt_budget.go, internal/executor/attempt_budget_test.go, internal/executor/attempt_watchdog.go, internal/executor/attempt_watchdog_test.go
New AttemptBudget (defaults and computation helpers) and attemptWatchdog (first-byte/stream-idle timeout monitoring, activity marking, writer wrapping) plus unit tests.
Attempt health refinement
internal/executor/attempt_health.go, internal/executor/attempt_health_test.go
New error normalization, normalizeResponseStartedError, recordAttemptHealth, capRetryWait, releaseHalfOpenProbeIfNeeded, and other Executor-bound methods with tests.
Response activity and flow-layer wrapping
internal/flow/attempt_activity.go, internal/flow/keys.go, internal/flow/attempt_activity_test.go
New AttemptActivityObserver interface, WrapResponseBody, DisableFirstByteTimeout, GetAttemptActivityObserver, and a wrapping ReadCloser that emits activity events on read/close, with tests.
Adapter response wrapping and timeout tweaks
internal/adapter/provider/.../adapter.go
Several provider adapters call flow.WrapResponseBody on successful upstream responses, and some non-streaming paths call flow.DisableFirstByteTimeout (CPA non-stream cases).
Response-state writing and error dispatch
internal/handler/response_state_writer.go, internal/handler/proxy.go, internal/handler/proxy_test.go, internal/handler/response_state_writer_test.go
New responseStateWriter tracks whether the response has started; centralized writeDispatchError; error status is taken from proxyErr.HTTPStatusCode; tests updated and extended.
Route weight persistence and management
internal/domain/model.go, internal/domain/backup.go, internal/repository/sqlite/models.go, internal/repository/sqlite/route.go
Add a Weight field to Route/BackupRoute and the models; non-positive values default to 1 on persistence/conversion.
Admin and backup API
internal/service/admin.go, internal/service/backup.go, internal/service/backup_test.go, internal/handler/admin.go
Normalize route.Weight (set to 1 when <= 0) across create/update/import/export flows; the admin path supports partial updates of weight.
Startup and test-environment wiring
cmd/maxx/main.go, internal/core/database.go, tests/e2e/proxy_setup_test.go
Startup and test environments create providerHealthTracker := health.NewTracker() and inject it into Router and Executor; the test environment can optionally inject an AttemptBudget via Executor.SetAttemptBudget.
End-to-end and behavior tests
tests/e2e/provider_failover_health_test.go, tests/e2e/routes_test.go
New e2e tests covering failover, breaker skip, and request-budget behavior, plus create/update assertions for route weights.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Handler
    participant Router
    participant Executor
    participant Watchdog
    participant HealthTracker
    participant Adapter

    Client->>Handler: HTTP request
    Handler->>Router: Match routes (health/weight aware)
    Router->>HealthTracker: IsCircuitOpen / Score
    HealthTracker-->>Router: State and scores
    Router-->>Handler: Selected, ordered routes

    Handler->>Executor: Execute (with AttemptBudget)
    Executor->>HealthTracker: BeginAttempt(providerID, clientType)
    HealthTracker-->>Executor: releaseFunc

    Executor->>Watchdog: Create attemptWatchdog (first-byte/idle timeouts)
    Executor->>Adapter: Upstream request
    Adapter->>Adapter: WrapResponseBody(resp.Body)
    Adapter-->>Executor: Upstream response (wrapped body)

    Executor->>Watchdog: NoteFirstByte()/NoteActivity()
    Executor->>Handler: Forward/stream response
    Handler->>Client: Write response

    rect rgba(200,150,100,0.5)
    Executor->>HealthTracker: Record(AttemptResult)
    HealthTracker->>HealthTracker: Update breaker/score/stats
    end

    Executor->>Watchdog: Stop()
    Watchdog-->>Executor: Possible TimeoutErr

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • awsl233777

Poem

🐰 A little rabbit in the code garden, hopping between weight and health,
wrapping every response body, logging the heartbeat of each attempt,
the breaker guards the gate, budgets measure every retry, first byte and idle both watched,
routes shift with health, failures quietly reroute, and green tests await our meeting.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and accurately reflects the main changes: introducing a health-tracking mechanism and hardening provider routing.


Copilot AI (Contributor) left a comment

Pull request overview

This PR hardens provider routing and execution failover by combining weighted routing with a new provider health tracker (scoring + circuit breaker), and by enforcing shared request budgets plus per-attempt watchdog timeouts to prevent slow/bad providers from dragging requests.

Changes:

  • Add provider health tracking (score, EWMA latency/TTFT, in-flight penalty, circuit breaker w/ half-open probe) and integrate it into router ordering + executor attempt gating.
  • Implement true weighted_random routing using Route.Weight (with default weight=1) and persist weight through admin APIs, DB, and backup import/export.
  • Add shared request budget + per-attempt timeouts (total/first-byte/idle) with attempt watchdog and normalized error handling to stop serial failover from exceeding the overall request budget.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/e2e/routes_test.go E2E coverage for default and explicit route weight create/update behavior.
tests/e2e/proxy_setup_test.go Wire health tracker into router/executor in proxy e2e env; add attempt-budget override helper.
tests/e2e/provider_failover_health_test.go New e2e tests for health-based failover, breaker open skip, and shared request budget cutoff.
internal/service/backup_test.go Ensure backup roundtrip preserves route weight.
internal/service/backup.go Export/import route weight in backups.
internal/service/admin.go Default Route.Weight to 1 on create/update when unset/invalid.
internal/router/router_test.go Unit tests for health-aware ordering, open-circuit filtering, and weighted_random bias + stability.
internal/router/router.go Implement true weighted random ordering; integrate health score reordering and circuit-open filtering.
internal/repository/sqlite/route.go Persist weight (defaulting to 1) in sqlite route model/domain conversions.
internal/repository/sqlite/models.go Add weight column to Route model with default 1.
internal/health/tracker_test.go Unit tests for scoring, in-flight penalty, breaker open/half-open probe semantics.
internal/health/tracker.go New health tracker implementation (decayed recents, EWMA, breaker state machine, probe reservation).
internal/handler/response_state_writer.go New ResponseWriter wrapper to detect whether response has started.
internal/handler/proxy_test.go Tests ensuring dispatch error handling doesn’t write after response started; status code propagation for ProxyError.
internal/handler/proxy.go Wrap writer to track “response started”; normalize dispatch error writing and use ProxyError HTTP status codes.
internal/handler/admin.go Support updating route weight via admin routes handler.
internal/flow/keys.go Add flow key for attempt activity observer.
internal/flow/attempt_activity.go Add attempt activity observer plumbing + response body wrapper to report first-byte/activity/EOF.
internal/executor/response_capture.go Track whether response has started in response capture wrapper.
internal/executor/middleware_dispatch_test.go Extensive dispatch tests for budgets, timeouts, retry-after capping, warm token non-blocking, and health recording/release semantics.
internal/executor/middleware_dispatch.go Enforce request budget + per-attempt timeouts; add watchdog; gate attempts via health tracker; record attempt health.
internal/executor/executor.go Add attempt budget + health tracker to executor; helper to clear cooldown for original/converted client types; pass watchdog to event processor.
internal/executor/attempt_watchdog.go New watchdog for first-byte and stream-idle timeouts + activity tracking writer.
internal/executor/attempt_health.go Normalize attempt errors (timeouts/started-response semantics), record health results, cap retry waits, release half-open probe when needed.
internal/executor/attempt_budget_test.go Tests for default budget values and budget clamp/remaining behavior.
internal/executor/attempt_budget.go Define attempt budgets and remaining-time helpers for attempt/request scopes.
internal/domain/model.go Add Route.Weight to domain model with semantics comment.
internal/domain/backup.go Add Weight to backup route schema.
internal/core/database.go Create and pass health tracker into router/executor during core initialization.
internal/adapter/provider/kiro/adapter.go Wrap upstream response body to report activity to watchdog.
internal/adapter/provider/custom/adapter.go Wrap upstream response body to report activity to watchdog.
internal/adapter/provider/codex/adapter.go Wrap upstream response body to report activity to watchdog.
internal/adapter/provider/cliproxyapi_codex/adapter.go Disable first-byte timeout for buffered non-stream CPA execution path.
internal/adapter/provider/cliproxyapi_antigravity/adapter.go Disable first-byte timeout for buffered non-stream CPA execution path.
internal/adapter/provider/claude/adapter.go Wrap upstream response body to report activity to watchdog.
internal/adapter/provider/antigravity/adapter.go Wrap upstream response body to report activity to watchdog.
cmd/maxx/main.go Create and pass health tracker into router/executor in CLI server bootstrap.


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (13)
internal/health/tracker_test.go (1)

434-454: Consider pinning the 2-minute half-life with exact numeric assertions.

These cases verify the decay trend but never lock in the contract that values drop to 50% after 2 minutes, so a miswritten decay formula could still pass. Add a fixed-clock assertion that the recent* metrics fall to 0.5x after exactly one half-life.

Also applies to: 456-490

tests/e2e/routes_test.go (1)

69-71: Weight-related e2e coverage is complete and well-directed.

Consider adding a weight <= 0 boundary case (expected to normalize to 1) to lock in the defaulting behavior as well.
Also applies to: 143-157, 160-180

internal/handler/admin.go (1)

364-368: Validate weight explicitly and return 400 on invalid input.

The current branch assigns only when the value is a float64; invalid types and fractional values are silently ignored or truncated, which makes the API less predictable.

Optional change:
+import "math"
...
  if v, ok := updates["weight"]; ok {
-   if f, ok := v.(float64); ok {
-     existing.Weight = int(f)
-   }
+   f, ok := v.(float64)
+   if !ok || f != math.Trunc(f) {
+     writeJSON(w, http.StatusBadRequest, map[string]string{"error": "weight must be an integer"})
+     return
+   }
+   existing.Weight = int(f)
  }
internal/handler/response_state_writer.go (2)

5-8: The started field lacks concurrency protection

started is a plain bool; if concurrent writes ever occur (for example in some SSE streaming edge cases), there is a potential race. HTTP handlers usually process a request on a single goroutine, but consider whether atomic.Bool or sync.Once is needed for thread safety.

♻️ Optional: use atomic.Bool for thread safety
+import "sync/atomic"
+
 type responseStateWriter struct {
 	http.ResponseWriter
-	started bool
+	started atomic.Bool
 }

Update the reads and writes accordingly to use Store(true) and Load().


27-31: Consider preserving optional interfaces like http.Hijacker

The implementation only forwards http.Flusher, but the underlying http.ResponseWriter may also implement optional interfaces such as http.Hijacker (needed for WebSocket upgrades). Wrapping loses those capabilities when the underlying writer supports them.

For the current proxy use case this can be ignored if WebSocket support is not needed.

internal/executor/attempt_watchdog.go (2)

175-180: Write triggers NoteFirstByte on every call

The current implementation calls NoteFirstByte() on every Write, not only the first. Given the watchdog implementation, NoteFirstByte sets activitySeen=true and updates lastActivity, which is functionally correct but semantically muddled.

For more precision, keep a firstByteSent flag in attemptActivityWriter and call NoteFirstByte only on the first write, then NoteActivity afterwards. The current implementation still works, though.


110-130: The 5ms polling interval may be too aggressive.

A 5ms ticker is used to detect timeouts; under high concurrency every request carries one such goroutine, which can add CPU overhead. If performance testing shows a problem, consider raising it to 50-100ms or replacing polling with time.AfterFunc.

internal/executor/executor.go (1)

81-83: SetAttemptBudget lacks concurrency protection.

If this method is ever called at runtime (rather than only at startup), there is a potential data race. If it is intended for startup-only configuration, add a comment saying so; otherwise guard it with a mutex.

♻️ Optional: document the contract or add a mutex
+// SetAttemptBudget updates the executor's attempt budget.
+// This method is not thread-safe and should only be called during initialization.
 func (e *Executor) SetAttemptBudget(budget AttemptBudget) {
 	e.attemptBudget = budget
 }
internal/executor/attempt_budget_test.go (1)

1-79: Good test coverage; optionally add boundary cases.

The tests cover the core scenarios well. For extra robustness, consider adding:

  • ClampRetryWait returns 0 when wait <= 0
  • RemainingSince / RequestRemainingSince return 0 when the timeout field is <= 0
  • ClampRetryWait does not clamp when MaxRetryWait <= 0
internal/executor/attempt_budget.go (1)

35-45: Returning 0 when TotalTimeout <= 0 can be misread.

When TotalTimeout is 0 (possibly meaning "unlimited"), RemainingSince returns 0, which callers may misread as "budget exhausted". Consider a sentinel value (such as the duration for math.MaxInt64) or a negative value to signal "unlimited", or document the behavior explicitly.

The current call site in middleware_dispatch.go:102-103 (if e.attemptBudget.RequestTimeout > 0) handles this correctly, but future callers may miss the check.

♻️ Optional: document the behavior
+// RemainingSince calculates remaining time from TotalTimeout.
+// Returns 0 if TotalTimeout is not positive (disabled) or if the budget is exhausted.
+// Callers should check if TotalTimeout > 0 before relying on this method.
 func (b AttemptBudget) RemainingSince(start time.Time) time.Duration {
internal/executor/middleware_dispatch_test.go (1)

46-55: These wall-clock assertions are too tight and will flake on CI

The core semantics are already covered by the callCount, fallbackCalls, and ctx.Err assertions; bounding 5-20ms timeouts with <25/40/60ms upper limits will randomly go red with any scheduler or GC jitter on the runner. Prefer signal-based assertions, or at least relax the thresholds to several multiples of the timeout.

Also applies to: 116-128, 193-205, 234-243

tests/e2e/provider_failover_health_test.go (2)

67-69: Dropping this relative-duration comparison would be more stable

The hit-count assertions already prove the second request never reached slowProvider. Comparing secondDuration < firstDuration mostly fights environment noise, especially on shared e2e runners, and can randomly invert. Keep the provider-hit assertions and drop the relative-duration comparison, or replace it with a looser absolute upper bound.

Also applies to: 82-96


245-257: Don't pin the budget semantics to "exactly the first two providers"

The shared request budget only guarantees that serial failover stops before the budget is exhausted; it does not guarantee the second slow provider is tried exactly once. Depending on scheduling/cancellation timing on different machines, hitting only the first provider is also within spec. It is more robust to assert healthyHits == 0 and cap the total slow-provider hit count, rather than fixing an exact per-slot distribution.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/executor/attempt_health.go`:
- Around line 27-35: The timeout branch currently only sets
proxyErr.HTTPStatusCode = http.StatusGatewayTimeout when HTTPStatusCode == 0;
change it to always overwrite the underlying status code to 504 for true
timeouts: in the timeout handling block where err is type-asserted to
*domain.ProxyError (symbols: proxyErr, domain.ProxyError, timeoutErr,
http.StatusGatewayTimeout), unconditionally set proxyErr.HTTPStatusCode =
http.StatusGatewayTimeout and keep setting proxyErr.Retryable = true,
proxyErr.IsNetworkError = true and proxyErr.Err = timeoutErr; apply the same
unconditional overwrite to the second identical timeout branch around the other
occurrence referenced (lines ~66-74).
- Around line 42-43: The code currently returns the original err when attemptCtx
is nil or not deadline-exceeded, which lets raw transport failures
(dial/TLS/reset) bypass network/breaker stats; instead, before returning, run
the same network-detection/normalization used elsewhere and ensure such
transport errors are marked as network errors (e.g., wrap the err into a
ProxyError with IsNetworkError=true or call the existing normalization/isNetwork
detection used around the ProxyError handling at lines 151-154) so
dial/TLS/reset-style errors are classified as network failures and will
contribute to network/timeout and breaker logic rather than being returned raw.

In `@internal/flow/attempt_activity.go`:
- Around line 43-61: In attemptActivityReadCloser, implement Close() and avoid
triggering CompleteResponseBody() twice: add a boolean field (e.g.,
responseBodyCompleted bool), check/set it in both the Read EOF branch and the
Close method, and only call observer.CompleteResponseBody() when the flag is
false, setting it to true afterwards; also implement Close() so it calls the
underlying r.ReadCloser.Close() and fires one guarded CompleteResponseBody()
before returning. If the type may be used concurrently, guard access to the flag
(e.g., with sync.Mutex or atomic) to prevent re-entry.

In `@internal/health/tracker.go`:
- Around line 325-330: The current decay uses
math.Exp(-float64(elapsed)/float64(t.decayHalfLife)) which yields
e^(-t/halfLife) instead of a true half-life drop; change the factor calculation
to use 0.5^(t/halfLife) by replacing math.Exp(...) with math.Pow(0.5,
float64(elapsed)/float64(t.decayHalfLife)), then apply that factor to
stats.recentSuccess, stats.recentFailure, stats.recentTimeout,
stats.recentRateLimit, and stats.recentServerErr so values halve after t ==
t.decayHalfLife.

---

Nitpick comments:
In `@internal/executor/attempt_budget_test.go`:
- Around line 1-79: Add unit tests covering the suggested edge cases: create
tests in attempt_budget_test.go that verify AttemptBudget.ClampRetryWait returns
0 when the input wait is <= 0, that RemainingSince returns 0 when TotalTimeout
<= 0, that RequestRemainingSince returns 0 when RequestTimeout <= 0, and that
ClampRetryWait does not clamp when MaxRetryWait <= 0 (i.e., returns the original
wait). Reference the AttemptBudget struct and its fields MaxRetryWait,
RequestTimeout, TotalTimeout and the methods ClampRetryWait, RemainingSince,
RequestRemainingSince to locate where to add the tests.

In `@internal/executor/attempt_budget.go`:
- Around line 35-45: Change RemainingSince on AttemptBudget to signal
"unlimited" instead of "exhausted" when TotalTimeout <= 0: update
AttemptBudget.RemainingSince to return a sentinel long duration (e.g.
time.Duration(math.MaxInt64)) for the unlimited case and add a doc comment to
AttemptBudget.RemainingSince explaining that a non‑positive TotalTimeout means
unlimited and will return the sentinel; also keep/mention the existing consumer
pattern (middleware_dispatch.go check of e.attemptBudget.RequestTimeout > 0) so
callers know to test RequestTimeout rather than treating zero as "no time left".

In `@internal/executor/attempt_watchdog.go`:
- Around line 175-180: The Write method currently calls watchdog.NoteFirstByte()
on every write; change attemptActivityWriter to track a boolean flag (e.g.,
firstByteSent) and in Write only call watchdog.NoteFirstByte() when
firstByteSent is false, set firstByteSent=true, and on subsequent writes call
watchdog.NoteActivity() instead; update the attemptActivityWriter struct to
include the firstByteSent field and adjust the Write method accordingly to
preserve semantics but make the first-byte vs subsequent-activity semantics
explicit.
- Around line 110-130: The 5ms polling interval in attemptWatchdog.run
(time.NewTicker(5 * time.Millisecond)) is too aggressive and can cause CPU
overhead under high concurrency; update the ticker to a larger interval (e.g.,
50-100ms) or replace the polling mechanism with a timer-based approach such as
time.AfterFunc to schedule the timeout check for each attempt, ensuring you
preserve the same stopCh handling and cancel() behavior in the attemptWatchdog
struct.

In `@internal/executor/executor.go`:
- Around line 81-83: SetAttemptBudget currently writes e.attemptBudget without
concurrency protection which can cause data races if called at runtime; either
document that SetAttemptBudget must only be used at startup or add mutex
protection (e.g., a sync.Mutex or sync.RWMutex on the Executor struct) and
lock/unlock around the write in SetAttemptBudget; update the Executor struct to
include the mutex (e.g., mu sync.Mutex) and use mu.Lock()/mu.Unlock() (or
mu.Lock()/defer mu.Unlock()) around the assignment to e.attemptBudget, and add a
brief comment on the method explaining the concurrency semantics.

In `@internal/executor/middleware_dispatch_test.go`:
- Around line 46-55: The wall-clock assertions around exec.dispatch(ctx)
(variables start, elapsed) are too tight and flaky; update the tests that
currently assert elapsed < 25/40/60ms to instead rely on existing logical
signals (callCount, fallbackCalls, ctx.Err) or relax timing checks to a more
generous bound (e.g., several times the configured attempt timeout) or replace
them with signal-based waits (channel/condition) that assert the intended event
occurred rather than a strict short duration; apply the same change pattern to
the other occurrences referenced by the test (the blocks around lines asserting
elapsed and using callCount/fallbackCalls/ctx.Err).

In `@internal/handler/admin.go`:
- Around line 364-368: When "weight" appears in the updates map, validate it
explicitly and return HTTP 400 on invalid input instead of silently
ignoring/truncating: inspect updates["weight"] (the updates map) and accept only
numeric values that are whole integers (e.g., float64 where f == math.Trunc(f)
or a json.Number parsed to int) and in the allowed range (e.g., non-negative if
that applies), then assign to existing.Weight; for any other type or fractional
value respond with a 400 error and a clear message rather than silently skipping
assignment.
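The validation described above can be sketched like this; `validateWeight` is a hypothetical helper (the real handler works on its own updates map and error-response plumbing):

```go
package main

import (
	"fmt"
	"math"
)

// validateWeight accepts only whole, non-negative numeric values from the
// decoded updates map and rejects everything else with an error the handler
// can turn into an HTTP 400 response.
func validateWeight(updates map[string]any) (int, error) {
	raw, ok := updates["weight"]
	if !ok {
		return 0, fmt.Errorf("weight not present")
	}
	f, ok := raw.(float64) // encoding/json decodes JSON numbers as float64
	if !ok {
		return 0, fmt.Errorf("weight must be a number, got %T", raw)
	}
	if f != math.Trunc(f) || f < 0 {
		return 0, fmt.Errorf("weight must be a non-negative integer, got %v", f)
	}
	return int(f), nil
}

func main() {
	w, err := validateWeight(map[string]any{"weight": float64(3)})
	fmt.Println(w, err) // 3 <nil>
	_, err = validateWeight(map[string]any{"weight": 2.5})
	fmt.Println(err != nil) // true
}
```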

In `@internal/handler/response_state_writer.go`:
- Around line 5-8: The responseStateWriter struct's started bool is not
concurrency-safe; replace it with an atomic.Bool (or use sync.Once) to avoid
race conditions in edge cases (e.g., SSE). Update the responseStateWriter type
to use atomic.Bool and change all places that set or check started (e.g., in
WriteHeader, Write, and any helper methods on responseStateWriter) to call
Store(true) when marking started and Load() when checking it; also add the
necessary import for sync/atomic (or sync for Once) and ensure any conditional
logic uses the atomic methods to avoid non-atomic reads/writes.
- Around line 27-31: responseStateWriter currently only exposes Flush by
type-asserting w.ResponseWriter to http.Flusher in Flush(); this wrapper can
drop optional interfaces like http.Hijacker (and optionally http.CloseNotifier,
http.Pusher) if the underlying ResponseWriter supports them. Update the
responseStateWriter implementation to forward those optional interfaces: add
type assertions and method implementations (e.g., implement Hijack() when the
embedded ResponseWriter satisfies http.Hijacker, likewise for CloseNotify/Push)
so the wrapper preserves behavior of the underlying writer; look for the
responseStateWriter type, its Flush() method, and ResponseWriter field to locate
where to add the forwarded interface checks and methods.

In `@internal/health/tracker_test.go`:
- Around line 434-454: Add a deterministic assertion that the decay half-life is
exactly 2 minutes by advancing the fixed clock (tracker.now) by 2 minutes and
asserting the Tracker's internal recent decay metric (e.g., recentFailures or
the relevant recent* field used by NewTracker) has dropped to approximately 0.5x
its previous value; do this inside
TestTrackerDoesNotOpenBreakerForFailuresSpreadAcrossDecayWindow (and the other
affected test block around 456-490) after calling
tracker.Record(AttemptResult...) so the decay formula is explicitly validated
rather than only relying on behavior via IsCircuitOpen.

In `@tests/e2e/provider_failover_health_test.go`:
- Around line 67-69: The relative timing assertion comparing secondDuration <
firstDuration is flaky; remove that comparison and keep the provider hit-count
assertions (the checks proving the second request didn't hit slowProvider), or
replace the relative check with a relaxed absolute bound (e.g., assert
secondDuration < some generous duration) if you need a timing assertion; update
the test code around the firstStarted/firstDuration and secondDuration variables
and the block that checks provider hits (the same region covering the earlier
hit-count assertions referenced in the comment) accordingly.
- Around line 245-257: The assertions pin the budget semantics to an exact
per-slot distribution (slowHits[0]==1 && slowHits[1]==1); instead, change the
checks to assert that healthyHits.Load() == 0 and that the total number of slow
provider hits (sum of slowHits[i].Load()) is within an acceptable range (e.g.,
>=1 and <=2) so the test only verifies that healthy fallback did not occur and
that the shared request budget limited serial failover without requiring a
precise per-slot hit pattern; keep the existing checks on resp.StatusCode and
elapsed as-is.

In `@tests/e2e/routes_test.go`:
- Around line 69-71: Add explicit boundary tests to verify weights <= 0 are
normalized to 1: in tests/e2e/routes_test.go add a subtest (near the existing
created["weight"] assertion) that creates a route with weight 0 and another with
a negative weight and assert int(created["weight"].(float64)) == 1 for both;
repeat the same assertions for the update flows referenced around the other
blocks (the tests using the same created/updated response objects) so both
create and update paths confirm normalization to 1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc601bce-e451-45fd-a0fb-7283912a50f0

📥 Commits

Reviewing files that changed from the base of the PR and between beb041c and 36148c2.

📒 Files selected for processing (37)
  • cmd/maxx/main.go
  • internal/adapter/provider/antigravity/adapter.go
  • internal/adapter/provider/claude/adapter.go
  • internal/adapter/provider/cliproxyapi_antigravity/adapter.go
  • internal/adapter/provider/cliproxyapi_codex/adapter.go
  • internal/adapter/provider/codex/adapter.go
  • internal/adapter/provider/custom/adapter.go
  • internal/adapter/provider/kiro/adapter.go
  • internal/core/database.go
  • internal/domain/backup.go
  • internal/domain/model.go
  • internal/executor/attempt_budget.go
  • internal/executor/attempt_budget_test.go
  • internal/executor/attempt_health.go
  • internal/executor/attempt_watchdog.go
  • internal/executor/executor.go
  • internal/executor/middleware_dispatch.go
  • internal/executor/middleware_dispatch_test.go
  • internal/executor/response_capture.go
  • internal/flow/attempt_activity.go
  • internal/flow/keys.go
  • internal/handler/admin.go
  • internal/handler/proxy.go
  • internal/handler/proxy_test.go
  • internal/handler/response_state_writer.go
  • internal/health/tracker.go
  • internal/health/tracker_test.go
  • internal/repository/sqlite/models.go
  • internal/repository/sqlite/route.go
  • internal/router/router.go
  • internal/router/router_test.go
  • internal/service/admin.go
  • internal/service/backup.go
  • internal/service/backup_test.go
  • tests/e2e/provider_failover_health_test.go
  • tests/e2e/proxy_setup_test.go
  • tests/e2e/routes_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Agent
  • GitHub Check: e2e
  • GitHub Check: playwright
🔇 Additional comments (29)
internal/health/tracker.go (1)

92-100: AllowAttempt() is called on the production path, so the concern in the comment does not hold.

The original comment claimed that the production path never calls AllowAttempt(), leaving the half-open semantics unused. The actual situation is:

internal/executor/middleware_dispatch.go line 114 does call AllowAttempt():

if e.healthTracker != nil && !e.healthTracker.AllowAttempt(matchedRoute.Provider.ID, string(clientType)) {
    break
}

The full production flow is wired correctly:

  • router.go line 221: IsCircuitOpen() performs the initial filter (skips providers with an open breaker)
  • middleware_dispatch.go line 114: AllowAttempt() performs the half-open state check and the single-probe limit
  • middleware_dispatch.go line 192: BeginAttempt() starts recording the attempt
  • attempt_health.go line 189: ReleaseHalfOpenProbe() records the probe result and decides the breaker transition

The half-open probe semantics are fully enabled; there is no missing integration.

> Likely an incorrect or invalid review comment.
internal/service/backup_test.go (1)

109-110: The weight round-trip checks are a solid addition.

Seeding the Weight data and adding the roundtrip assertion together effectively prevents the backup import/export chain from dropping the field.

Also applies to: 236-241

internal/domain/backup.go (1)

70-70: Adding the weight field to BackupRoute is a sound design.

The field stays consistent with the route-weight capability, and the serialization semantics are clear.

internal/adapter/provider/antigravity/adapter.go (1)

242-242: The response body is wrapped both before and after the retry; the change is correct.

This guarantees both success paths feed into the unified activity observation and timeout coordination.

Also applies to: 277-277

internal/executor/response_capture.go (1)

15-15: The Started() signal is a good addition.

It makes the "has writing back to the client started" check more direct without affecting the existing capture behavior.

Also applies to: 30-30, 36-36, 63-65

internal/core/database.go (1)

20-20: The health tracker injection chain is consistent; the wiring is correct overall.

The same providerHealthTracker is passed to both Router and Executor, matching this PR's health-aware routing goal.

Also applies to: 258-266, 328-328

internal/adapter/provider/cliproxyapi_codex/adapter.go (1)

274-276: Disabling the first-byte timeout here is reasonable.

It matches the CPA non-streaming "buffer fully, then return" behavior and reduces spurious timeouts.

internal/adapter/provider/cliproxyapi_antigravity/adapter.go (1)

130-132: LGTM!

The non-streaming CPA endpoint buffers the full response and returns it in one shot, so disabling the first-byte timeout and falling back to TotalTimeout is correct. flow.DisableFirstByteTimeout already handles a nil observer internally, so the call is safe.

internal/adapter/provider/kiro/adapter.go (1)

140-141: LGTM!

The response body wrapping matches the pattern of the other adapters (claude, antigravity, codex) and enables first-byte and activity tracking at the flow layer. WrapResponseBody safely handles a nil observer internally.

Also applies to: 175-176

internal/repository/sqlite/route.go (1)

113-116: LGTM!

Both toModel and toDomain normalize weight, consistent with the admin.go service layer and the defaults in models.go, ensuring non-positive weights never propagate into the routing logic.

Also applies to: 139-142

internal/adapter/provider/claude/adapter.go (1)

138-139: LGTM!

The response body wrapping pattern matches the kiro adapter: both the initial request and the post-401 retry are wrapped correctly, supporting activity tracking at the flow layer.

Also applies to: 169-170

internal/service/backup.go (1)

676-676: Weight normalization on the import path is guaranteed by the repository layer

The import calls routeRepo.Create(r) directly, bypassing the weight normalization in AdminService.CreateRoute. toModel in sqlite/route.go does handle weight <= 0, but that relies on implicit repository-layer behavior.

Consider normalizing explicitly during import to stay consistent with the service layer; the current implementation is still safe thanks to the repository-layer guarantee.

internal/handler/proxy_test.go (1)

12-52: LGTM!

The tests cover the key error-handling scenarios: skipping the error write when the response has already started, and using the ProxyError HTTPStatusCode. The test structure is clear and the naming explicit.

internal/handler/proxy.go (3)

238-264: LGTM! The error handling is more robust now

The new dispatch error-handling logic:

  • Aborts early when the response has already started, avoiding double writes
  • Maps DeadlineExceeded to 504 Gateway Timeout
  • Aborts silently on Canceled, writing nothing extra
  • Returns 500 for all other errors

The logic is clear and matches HTTP semantics.


266-275: The double check in writeDispatchError is defensive design

Re-checking responseHasStarted on line 267 duplicates line 238, but as a defensive check inside a standalone helper it is reasonable and guards against mistakes by future direct callers.


328-332: Defaulting HTTPStatusCode to 502 matches expectations

When ProxyError.HTTPStatusCode is unset (0), defaulting to 502 Bad Gateway fits the upstream-error semantics. This is consistent with NewProxyError and NewProxyErrorWithMessage in internal/domain/errors.go not setting HTTPStatusCode proactively.

Also applies to: 352-356

cmd/maxx/main.go (1)

196-199: LGTM! Health tracker integration is correct.

Per the context in internal/health/tracker.go, the tracker returned by NewTracker() is a purely in-memory implementation with no background goroutines to manage, so no extra cleanup is needed in shutdownServer.

internal/executor/executor.go (2)

85-105: LGTM! The de-duplication logic is correct.

clearSuccessCooldowns uses a seen map to avoid clearing cooldown state twice when currentClientType and originalClientType are identical; the implementation is sound.


425-434: LGTM! Watchdog integration is correct.

Calling watchdog.NoteFirstByte() on the EventFirstToken event is reasonable. Per the context snippet, NoteFirstByte is idempotent, so repeated calls cause no problems.

internal/router/router_test.go (2)

366-399: LGTM! The fake health tracker implementation is complete.

routerFakeHealthTracker correctly implements every method of the health.ProviderTracker interface and supports simulated health scores and breaker states, making it well suited for unit tests.


206-206: The project is configured for Go 1.26.0, which fully supports the `for range` over integers syntax. No further action needed.

internal/executor/attempt_budget.go (1)

14-23: LGTM! The defaults match the PR spec.

The default timeout configuration matches the PR description: RequestTimeout 45s, TotalTimeout 30s, FirstByteTimeout 10s, StreamIdleTimeout 15s, MaxRetryAfter 30s, MaxRetryWait 10s.

internal/executor/middleware_dispatch.go (5)

101-116: LGTM! The ordering of the budget and health checks is correct.

Both the budget-exhaustion check (lines 104-111) and the health check (lines 114-115) run before BeginAttempt (line 192). This guarantees an early exit never leaks the in-flight counter, satisfying the requirement from the context snippet that the cleanup function returned by BeginAttempt must be called.


190-224: LGTM! The resource cleanup path is complete.

The success path correctly runs:

  1. recordAttemptHealth (lines 219-222)
  2. attemptDone() (line 223), the cleanup function returned by BeginAttempt
  3. releaseHalfOpenProbeIfNeeded (line 224)

The cleanup order is sound: health metrics are recorded before resources are released.


302-311: LGTM! The failure path cleans up resources the same way as the success path.

The failure path also runs recordAttemptHealth, attemptDone(), and releaseHalfOpenProbeIfNeeded, so the in-flight counter is decremented correctly whether the attempt succeeds or fails.


460-462: LGTM! The routing loop exits correctly.

Breaking out of the routing loop when the request budget is exhausted or the response has started avoids trying other routes. This matches the PR spec's error semantics: no route switching after the response has started.


427-435: ✓ The capRetryWait method is implemented correctly in internal/executor/attempt_health.go, with the full budget-clamping logic (the MaxRetryAfter cap, the ClampRetryWait constraint, and request-timeout handling).

tests/e2e/proxy_setup_test.go (2)

42-56: LGTM! The test environment options are well designed.

The ProxyTestEnvOptions struct and the NewProxyTestEnvWithAttemptBudget factory give e2e tests a flexible way to customize the timeout budget. The pattern also makes it easy to add more optional settings later.


147-150: LGTM! Health tracker integration matches main.go.

The test environment correctly creates a providerHealthTracker and passes it to router.NewRouter, consistent with the wiring in cmd/maxx/main.go.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/executor/attempt_watchdog.go (1)

109-129: The polling interval could be tuned.

A 5ms polling interval may add CPU overhead under high concurrency. Given the timeout precision required (first byte at the 10s level, stream idle at 15s), it can safely be relaxed to 50-100ms without hurting the user experience.

That said, the current implementation is functionally correct; the tuning can land in a follow-up.

♻️ Optional tuning: widen the polling interval
 func (w *attemptWatchdog) run() {
-	ticker := time.NewTicker(5 * time.Millisecond)
+	ticker := time.NewTicker(50 * time.Millisecond)
 	defer ticker.Stop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/executor/attempt_watchdog.go` around lines 109 - 129, The run method
of attemptWatchdog currently uses a very tight polling interval
(time.NewTicker(5 * time.Millisecond)) which may cause unnecessary CPU overhead
under high concurrency; update the ticker creation in attemptWatchdog.run to use
a larger interval (e.g., 50-100ms) that still meets the timeout precision for
nextTimeoutErr/first-byte (10s) and stream-idle (15s), keeping the existing
logic with stopCh, nextTimeoutErr(), w.mu, w.timeoutErr, and w.cancel unchanged.
tests/e2e/provider_failover_health_test.go (1)

257-275: Consider unifying defer resp.Body.Close() in the helpers

The success path currently calls Close() manually, while the failure path relies on t.Fatalf exiting early, which is not obvious at a glance. A single defer here removes the duplication and avoids missed closes.

Optional refactor example
 func createRouteWithPosition(t *testing.T, env *ProxyTestEnv, clientType string, providerID uint64, position int) uint64 {
 	t.Helper()
 	resp := env.AdminPost("/api/admin/routes", map[string]any{
 		"isEnabled":  true,
 		"clientType": clientType,
 		"providerID": providerID,
 		"position":   position,
 	})
+	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusCreated {
 		body, _ := io.ReadAll(resp.Body)
 		t.Fatalf("Failed to create route: status=%d body=%s", resp.StatusCode, body)
 	}
@@
 	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
 		t.Fatalf("Failed to decode route create response: %v", err)
 	}
-	resp.Body.Close()
 	return result.ID
 }
@@
 func createProviderWithErrorCooldownDisabled(t *testing.T, env *ProxyTestEnv, name, baseURL string, supportedTypes []string) uint64 {
 	t.Helper()
 	resp := env.AdminPost("/api/admin/providers", map[string]any{
@@
 	})
+	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusCreated {
 		body, _ := io.ReadAll(resp.Body)
 		t.Fatalf("Failed to create provider %s: status=%d body=%s", name, resp.StatusCode, body)
 	}
@@
 	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
 		t.Fatalf("Failed to decode provider create response: %v", err)
 	}
-	resp.Body.Close()
 	return result.ID
 }

Also applies to: 280-304

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/provider_failover_health_test.go` around lines 257 - 275, The
response body isn't always closed on all paths; after calling env.AdminPost(...)
in the route-creation helper, immediately add defer resp.Body.Close() to ensure
the body is closed whether the test fails or succeeds, and remove the manual
resp.Body.Close() at the end of the successful path; apply the same change to
the similar block around lines 280-304 so both helper usages (the resp variable
in this function) consistently use defer for closing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/executor/attempt_health.go`:
- Around line 42-46: The current branch order calls isLikelyNetworkError(err)
before checking for attemptCtx deadline, causing context.DeadlineExceeded to be
treated as a network error and mapped to newNetworkProxyError (502); change the
logic in the functions/blocks around where attemptCtx and err are evaluated so
you first check for attemptCtx != nil && errors.Is(attemptCtx.Err(),
context.DeadlineExceeded) and handle/normalize it to the timeout path (504)
before calling isLikelyNetworkError(err), and apply the same reorder to the
other symmetric spot that currently checks isLikelyNetworkError first; reference
attemptCtx, errors.Is(..., context.DeadlineExceeded), isLikelyNetworkError, and
newNetworkProxyError when locating and fixing the code.

In `@tests/e2e/provider_failover_health_test.go`:
- Around line 169-175: The test's timing budgets (RequestTimeout, TotalTimeout,
FirstByteTimeout, StreamIdleTimeout, MaxRetryAfter, MaxRetryWait) are too tight
and cause flakiness; increase these values to larger, CI-safe numbers (e.g.,
RequestTimeout ~300ms, TotalTimeout/FirstByteTimeout/StreamIdleTimeout ~200ms,
MaxRetryAfter/MaxRetryWait ~50ms) in the test setup so retries/latency in CI
won't exhaust the budget, and update the expectation logic that checks provider
hit counts (the assertion that the first two providers "must each hit 1 time")
to accept >=1 hits instead of exactly 1 to avoid brittle timing-dependent
failures.

---

Nitpick comments:
In `@internal/executor/attempt_watchdog.go`:
- Around line 109-129: The run method of attemptWatchdog currently uses a very
tight polling interval (time.NewTicker(5 * time.Millisecond)) which may cause
unnecessary CPU overhead under high concurrency; update the ticker creation in
attemptWatchdog.run to use a larger interval (e.g., 50-100ms) that still meets
the timeout precision for nextTimeoutErr/first-byte (10s) and stream-idle (15s),
keeping the existing logic with stopCh, nextTimeoutErr(), w.mu, w.timeoutErr,
and w.cancel unchanged.

In `@tests/e2e/provider_failover_health_test.go`:
- Around line 257-275: The response body isn't always closed on all paths; after
calling env.AdminPost(...) in the route-creation helper, immediately add defer
resp.Body.Close() to ensure the body is closed whether the test fails or
succeeds, and remove the manual resp.Body.Close() at the end of the successful
path; apply the same change to the similar block around lines 280-304 so both
helper usages (the resp variable in this function) consistently use defer for
closing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d314791f-f1a6-479c-a057-13a39feeefe6

📥 Commits

Reviewing files that changed from the base of the PR and between 36148c2 and 417f83e.

📒 Files selected for processing (13)
  • cmd/maxx/main.go
  • internal/core/database.go
  • internal/executor/attempt_health.go
  • internal/executor/attempt_health_test.go
  • internal/executor/attempt_watchdog.go
  • internal/executor/attempt_watchdog_test.go
  • internal/flow/attempt_activity.go
  • internal/flow/attempt_activity_test.go
  • internal/handler/response_state_writer.go
  • internal/handler/response_state_writer_test.go
  • internal/health/tracker.go
  • internal/health/tracker_test.go
  • tests/e2e/provider_failover_health_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/response_state_writer.go
📜 Review details
🔇 Additional comments (12)
internal/flow/attempt_activity_test.go (1)

1-52: LGTM!

The test cases clearly verify the completion-signal semantics of WrapResponseBody:

  • The completion callback fires exactly once on Close (when EOF was not read)
  • The completion callback fires only once in the EOF + Close scenario

The test observer is concise, and the counter approach effectively verifies idempotency.

internal/handler/response_state_writer_test.go (1)

1-23: LGTM!

The test concisely verifies that Flush correctly marks the response as started and confirms the http.Flusher implementation.

cmd/maxx/main.go (1)

196-199: LGTM! Health tracker integration is consistent with internal/core/database.go.

Both places create and pass providerHealthTracker the same way, ensuring identical behavior across entry points (direct CLI startup vs. initialization through core).

Also applies to: 285-285

internal/executor/attempt_watchdog_test.go (1)

1-51: LGTM! The tests cover the key edge cases.

  • The concurrency test (200 iterations × 16 concurrent calls) effectively verifies the thread safety of Stop()
  • The nil-receiver test ensures a nil-pointer call does not panic
  • The timeout-error test neatly triggers an immediate timeout by setting a start time in the past
internal/executor/attempt_watchdog.go (3)

174-178: The NoteFirstByte call semantics in Write are correct.

Every non-empty Write calls NoteFirstByte; the first call sets activitySeen=true and subsequent calls update lastActivity. Although the name suggests it only records the first byte, it also updates the activity timestamp, which the stream-idle timeout detection requires.


181-186: Recording activity in Flush only after the first byte has been seen is the right design.

NoteActivity updates the timestamp only when activitySeen=true. This guarantees that:

  1. Flushing response headers does not falsely register activity
  2. Only actual response-body transfer counts as valid activity

28-48: LGTM! Watchdog lifecycle management is correct.

  • The nil-return guard (cancel==nil or both timeouts ≤ 0) avoids creating a pointless watchdog
  • The run() goroutine starts at construction and stops gracefully via stopCh
  • stopOnce makes Stop() idempotent
internal/health/tracker_test.go (1)

10-578: LGTM! The test suite is comprehensive and well designed.

It covers the tracker's key behaviors:

  • Score preference (low latency, high success rate)
  • Breaker state transitions (closed → open → half-open → closed/open)
  • The single-concurrency constraint on half-open probes
  • Correct handling of stale successes
  • Decay half-life behavior

The concurrency test (TestTrackerHalfOpenAllowsOnlySingleConcurrentProbe) correctly uses channel synchronization to verify the single-probe constraint. The project is configured for Go 1.26.0, which fully supports the `for range 5` syntax.

internal/executor/attempt_health_test.go (1)

1-80: Solid test coverage of error handling and health recording.

The test design is clear and covers three key paths:

  • Timeout errors are correctly normalized to 504 Gateway Timeout
  • Raw network errors are wrapped into ProxyError with the IsNetworkError and Retryable flags set correctly
  • The health tracker correctly classifies network errors

testNetError uses a compile-time interface check (line 20) to guarantee it implements net.Error; the test logic is rigorous.

internal/core/database.go (1)

258-266: The integration is correct, and no explicit cleanup is needed. The health tracker is a purely in-memory structure that starts no background goroutines and needs no explicit Stop/Close/Shutdown. providerHealthTracker is correctly shared between the router and the executor, and the construction order is sound.

internal/flow/attempt_activity.go (1)

53-76: The one-shot wind-down in CompleteResponseBody is implemented correctly

Both Read(EOF) and Close go through complete(), and sync.Once guarantees it fires only once; the state-machine wind-down semantics are clear.

internal/health/tracker.go (1)

325-330: The half-life decay formula matches the intended semantics

This decay implementation guarantees the value drops to 50% after one half-life period, consistent with the "2-minute half-life" design goal.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/executor/attempt_health.go (1)

61-64: The redundant branch can be dropped to reduce maintenance cost.

The `if attemptCtx == nil` branch on lines 61-64 currently behaves exactly like the code that follows (both return err); merging them into a single return reduces the risk of the branches drifting apart in later changes.

Suggested change
-	if attemptCtx == nil {
-		return err
-	}
 	return err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/executor/attempt_health.go` around lines 61 - 64, The if branch
checking attemptCtx == nil is redundant because both branches simply return err;
remove the entire if statement and replace with a single return err at the end
of the function (i.e., eliminate the duplicate conditional that references
attemptCtx and leave one unconditional return err), ensuring no other logic
depends on that check.
internal/executor/attempt_health_test.go (1)

41-60: Consider also asserting the network-error status code.

The test already verifies the raw network error is wrapped and flagged network/retryable; asserting HTTPStatusCode == 502 as well would guard against a later accidental change to the network-error mapping strategy.

Suggested addition
 	if !proxyErr.Retryable {
 		t.Fatal("ProxyError.Retryable = false, want true for raw network error")
 	}
+	if proxyErr.HTTPStatusCode != http.StatusBadGateway {
+		t.Fatalf("ProxyError.HTTPStatusCode = %d, want %d", proxyErr.HTTPStatusCode, http.StatusBadGateway)
+	}
 	if !errors.Is(proxyErr.Err, rawErr) {
 		t.Fatalf("ProxyError.Err = %v, want wrapped raw network error", proxyErr.Err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/executor/attempt_health_test.go` around lines 41 - 60, Update the
test TestNormalizeAttemptErrorWrapsRawNetworkError to also assert that the
wrapped domain.ProxyError has HTTPStatusCode == 502: after casting to
*domain.ProxyError (from exec.normalizeAttemptError) add a check on
proxyErr.HTTPStatusCode and call t.Fatalf (or t.Fatalf with a clear message) if
it is not 502; this ensures the network error mapping in
Executor.normalizeAttemptError remains 502 for raw network errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/executor/attempt_health_test.go`:
- Around line 41-60: Update the test
TestNormalizeAttemptErrorWrapsRawNetworkError to also assert that the wrapped
domain.ProxyError has HTTPStatusCode == 502: after casting to *domain.ProxyError
(from exec.normalizeAttemptError) add a check on proxyErr.HTTPStatusCode and
call t.Fatalf (or t.Fatalf with a clear message) if it is not 502; this ensures
the network error mapping in Executor.normalizeAttemptError remains 502 for raw
network errors.

In `@internal/executor/attempt_health.go`:
- Around line 61-64: The if branch checking attemptCtx == nil is redundant
because both branches simply return err; remove the entire if statement and
replace with a single return err at the end of the function (i.e., eliminate the
duplicate conditional that references attemptCtx and leave one unconditional
return err), ensuring no other logic depends on that check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22343134-5d51-4658-acd5-ca31b710f42b

📥 Commits

Reviewing files that changed from the base of the PR and between 417f83e and 1b29cdf.

📒 Files selected for processing (3)
  • internal/executor/attempt_health.go
  • internal/executor/attempt_health_test.go
  • tests/e2e/provider_failover_health_test.go
📜 Review details
🔇 Additional comments (13)
tests/e2e/provider_failover_health_test.go (7)

1-14: LGTM!

The imports and package declaration are correct, and every dependency is used in the tests.


16-90: LGTM!

The test logic is clear: the first request verifies failover behavior, and the second verifies that health-aware routing skips the unhealthy provider.

slowProvider returns 502 with a 60ms delay, which correctly depresses its health score so subsequent requests prefer the healthy provider.


92-165: LGTM!

The breaker test is well designed:

  1. The warm-up phase sends 3 requests to trip the breaker open
  2. A healthy provider is added after the breaker opens
  3. The test verifies subsequent requests skip the unhealthy provider and route successfully to the healthy one

167-254: LGTM!

The budget test logic is correct:

  • RequestTimeout=120ms total budget, TotalTimeout=60ms per-attempt timeout
  • Each slow provider delays 100ms > 60ms, triggering a timeout
  • The assertions flexibly allow 1-2 hits (lines 247-249), accommodating CI timing jitter
  • The test verifies a 504 is returned once the budget is exhausted, with no further providers attempted

The timing budget was adjusted per the earlier review feedback, improving test stability.


256-277: LGTM!

The helper functions are cleanly structured, correctly marked with t.Helper(), and handle errors thoroughly.


279-306: LGTM!

The JSON payload correctly matches the ProviderConfig struct definition: the disableErrorCooldown field sits inside the config object, at the same level as custom, consistent with the field definitions in internal/domain/model.go.


308-314: LGTM!

A graceful-shutdown helper for the test server: it disconnects clients before closing the server, preventing connection leaks. The nil check adds defensive protection.

internal/executor/attempt_health.go (5)

15-60: The error-normalization order and the 504 mapping are correct.

Timeout/deadline is handled before the network-error check, and timeout cases are uniformly overwritten to 504, reliably preventing a deadline from being misclassified as 502.


67-106: The error policy for the response-started case is consistent and clear.

This branch maps timeout/deadline uniformly to 504 and sets Retryable to false, consistent with the semantics of not switching routes after the response has started.


115-159: The health-record field mapping is complete, with sufficient classification signals.

The IsTimeout/IsNetwork/IsRateLimit/IsServerErr extraction covers both the proxy-error and raw-error paths and fits the tracker's AttemptResult struct.


195-214: The retry-wait clamping chain is sound.

The MaxRetryAfter -> MaxRetryWait -> RequestRemaining order is explicit and constrains both the protocol-suggested wait and the global request budget.


216-226: The half-open probe release condition is controlled correctly.

The probe is released as compensation only when no health result was recorded, avoiding double release and reducing the risk of state-machine skew.

internal/executor/attempt_health_test.go (1)

22-39: The key regression paths are well covered.

Both the 504 overwrite and the deadline priority (including the response-started case) are asserted directly; this set of cases is very valuable against error-normalization regressions.

Also applies to: 82-119

@ymkiux
Copy link
Contributor

ymkiux commented Mar 17, 2026

@coderabbitai If the current suggestion is not adopted, will it lead to disruptive changes?

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Tip

For best results, initiate chat on the files or code changes.

@ymkiux: Could you clarify which specific suggestion you're referring to? Since no particular diff hunk or comment was linked to this question, I'm not sure which suggestion you mean.

If you can point me to the specific review comment (e.g., by quoting it or sharing a link to it), I can give you a precise assessment of whether skipping it would cause breaking or disruptive behavior.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/handler/proxy.go (1)

74-79: ⚠️ Potential issue | 🟡 Minor

responseStateWriter does not implement the http.Hijacker or http.Pusher interfaces.

The current implementation handles the Flusher interface correctly (delegating to the underlying writer via a type assertion), and WriteHeader and Write both mark the started state correctly. However, the Hijacker and Pusher interfaces are not implemented: the embedded http.ResponseWriter field cannot satisfy type assertions against them.

Although the current code (e.g., writeStreamError) only uses Flusher and the basic Write/WriteHeader methods, so there is no immediate runtime regression, this is still an interface-contract gap. If WebSocket or HTTP/2 Push support is added later, code receiving this writer cannot reach those capabilities. Consider implementing both interfaces, or at least providing an Unwrap() method so callers can access the underlying writer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/handler/proxy.go` around lines 74 - 79, The responseStateWriter
currently returned by newResponseStateWriter doesn't implement http.Hijacker or
http.Pusher causing type-assert failures for callers expecting those interfaces;
update the responseStateWriter type to implement Hijacker (Hijack() (net.Conn,
*bufio.ReadWriter, error)) and Pusher (Push(target string, opts
*http.PushOptions) error) by delegating to the embedded underlying
http.ResponseWriter when it supports those interfaces, and add an Unwrap()
http.ResponseWriter method returning the underlying writer so callers (and
future WebSocket/HTTP2 code like writeStreamError consumers) can access the raw
writer if needed. Ensure Hijack/Push return appropriate errors when the
underlying writer does not implement the interface.
🧹 Nitpick comments (2)
internal/handler/proxy.go (1)

244-250: Prefer errors.As here to extract *domain.ProxyError.

The current direct type assertion only matches the outermost error, while the two branches below already use errors.Is to accept unwrapped DeadlineExceeded / Canceled. If a later layer wraps the ProxyError with %w, this path would bypass writeDispatchError and lose HTTPStatusCode, Retry-After, and the stream/non-stream specific error bodies.

🔧 Suggested change
-	proxyErr, ok := err.(*domain.ProxyError)
-	if ok {
+	var proxyErr *domain.ProxyError
+	if errors.As(err, &proxyErr) {
 		h.writeDispatchError(c, proxyErr, stream)
 		c.Err = err
 		c.Abort()
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/handler/proxy.go` around lines 244 - 250, Replace the direct type
assertion with errors.As so wrapped *domain.ProxyError values are detected:
attempt to unwrap err into a variable (e.g., proxyErr) using errors.As(err,
&proxyErr) and, if true, call h.writeDispatchError(c, proxyErr, stream), set
c.Err = err and c.Abort() as before; this ensures HTTPStatusCode / Retry-After
and stream/non-stream error bodies from domain.ProxyError are not lost when err
is wrapped with %w.
internal/handler/proxy_test.go (1)

36-52: This test case does not cover the real regression point in dispatch.

It calls writeDispatchError directly, which only proves the helper no longer writes a second error when responseHasStarted=true; if the started check at the top of dispatch regresses, the DeadlineExceeded and generic error branches could still write a 504/500 after a partial response, and this test would not fail. Consider adding a case that goes through dispatch, and while at it cover the scenario where the response is started by WriteHeader alone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/handler/proxy_test.go` around lines 36 - 52, The current test calls
writeDispatchError directly and doesn't exercise the dispatch entrypoint; add a
new test that invokes ProxyHandler.dispatch (or the exported method that routes
to dispatch) so the started-check at the top of dispatch is exercised
end-to-end, and also add a subcase where the response is considered started by
calling WriteHeader (but not Write) before triggering the error to ensure
dispatch does not write a 504/500 after a partial response; reference
ProxyHandler.dispatch, writeDispatchError, and the existing
TestWriteDispatchErrorSkipsWhenResponseAlreadyStarted to mirror setup and
assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccf74aaa-68a8-4a1d-ba2b-64cfe5b5bf5a

📥 Commits

Reviewing files that changed from the base of the PR and between 1b29cdf and 21820c2.

📒 Files selected for processing (7)
  • internal/adapter/provider/custom/adapter.go
  • internal/domain/model.go
  • internal/handler/proxy.go
  • internal/handler/proxy_test.go
  • internal/repository/sqlite/models.go
  • internal/service/admin.go
  • internal/service/backup.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • internal/domain/model.go
  • internal/repository/sqlite/models.go
  • internal/adapter/provider/custom/adapter.go
  • internal/service/backup.go
  • internal/service/admin.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ymkiux
Repo: awsl-project/maxx PR: 0
File: :0-0
Timestamp: 2026-03-11T07:29:12.071Z
Learning: Reviewer ymkiux (repo: awsl-project/maxx) requires strict attention to breaking changes in all PRs. Always thoroughly audit public API, interface, and behavioral contract changes — especially in sidebar/auth/layout components — before raising review comments. Do not flag superficial style issues; focus on correctness, regressions, and breaking changes.
🔇 Additional comments (1)
internal/handler/proxy_test.go (1)

54-64: This set of status-code regression tests is good.

It locks in the pass-through behavior of ProxyError.HTTPStatusCode for both plain and SSE responses, making it hard to accidentally regress to the default 502 later.

Also applies to: 66-76


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/handler/response_state_writer.go (1)

59-61: responseHasStarted should support unwrapping a wrapper chain.

It currently does only a direct type assertion; if c.Writer is later wrapped by another middleware layer, it will be misjudged as not started, reintroducing the "write another error body after the response has started" regression risk.

Suggested change: make the check unwrap-aware
 func responseHasStarted(w http.ResponseWriter) bool {
-	state, ok := w.(*responseStateWriter)
-	return ok && state.started
+	for {
+		if state, ok := w.(*responseStateWriter); ok {
+			return state.started
+		}
+		uw, ok := w.(interface{ Unwrap() http.ResponseWriter })
+		if !ok {
+			return false
+		}
+		w = uw.Unwrap()
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/handler/response_state_writer.go` around lines 59 - 61, support
unwrapping a wrapper chain in responseHasStarted: a single assertion against
*responseStateWriter misjudges the state when the Writer is wrapped again by
middleware. Change responseHasStarted to loop: first check whether the current
w is a *responseStateWriter and return its started field; otherwise check
whether w implements Unwrap() http.ResponseWriter (or a similar unwrap method),
and if so set w to the return value of Unwrap() and continue, until it can no
longer be unwrapped, finally returning false. This correctly detects
responseStateWriter.started even when multiple wrappers are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/handler/response_state_writer.go`:
- Around line 31-35: The Flush method on responseStateWriter currently sets
started = true before checking support for http.Flusher; change it to only set
w.started = true after confirming the wrapped ResponseWriter implements
http.Flusher (same pattern used by Hijack), so if the type assertion to
(http.Flusher) fails you do not mark the response as started; update
responseStateWriter.Flush (and any related assumptions in responseHasStarted())
to follow this behavior so downstream error handling in proxy.go correctly
detects an unstarted response.

---

Nitpick comments:
In `@internal/handler/response_state_writer.go`:
- Around line 59-61: Support unwrapping a wrapper chain in responseHasStarted:
a single assertion against *responseStateWriter misjudges the state when the
Writer is wrapped again by middleware. Change responseHasStarted to loop: first
check whether the current w is a *responseStateWriter and return its started
field; otherwise check whether w implements Unwrap() http.ResponseWriter (or a
similar unwrap method), and if so set w to the return value of Unwrap() and
continue until it can no longer be unwrapped, finally returning false. This
correctly detects responseStateWriter.started even with multiple wrappers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99b04724-52ec-44b8-b939-7bcfff486dc9

📥 Commits

Reviewing files that changed from the base of the PR and between 21820c2 and db91550.

📒 Files selected for processing (2)
  • internal/handler/response_state_writer.go
  • internal/handler/response_state_writer_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/response_state_writer_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ymkiux
Repo: awsl-project/maxx PR: 0
File: :0-0
Timestamp: 2026-03-11T07:29:12.071Z
Learning: Reviewer ymkiux (repo: awsl-project/maxx) requires strict attention to breaking changes in all PRs. Always thoroughly audit public API, interface, and behavioral contract changes — especially in sidebar/auth/layout components — before raising review comments. Do not flag superficial style issues; focus on correctness, regressions, and breaking changes.
🔇 Additional comments (1)
internal/handler/response_state_writer.go (1)

14-19: The double-wrapping guard is correct.

Checking for *responseStateWriter first and returning the original object preserves existing started state and avoids behavioral drift.

@awsl233777 awsl233777 merged commit 67ab9e0 into awsl-project:main Mar 18, 2026
3 checks passed
luodaoyi added a commit to luodaoyi/maxx that referenced this pull request Mar 18, 2026
awsl233777 pushed a commit that referenced this pull request Mar 18, 2026
