Harden provider routing and implement health failover #421
Conversation
📝 Walkthrough
Integrates a provider health tracker (circuit breaker/scoring), health-based weighted routing, attempt-level budgets with first-byte/idle timeout guards, response-body wrapping with activity callbacks, and the related error-normalization and health-recording logic into the router, executor, and adapter paths, together with extensive unit and e2e tests and the startup/test-environment wiring.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Handler
    participant Router
    participant Executor
    participant Watchdog
    participant HealthTracker
    participant Adapter
    Client->>Handler: send HTTP request
    Handler->>Router: match route (health/weight aware)
    Router->>HealthTracker: IsCircuitOpen / Score
    HealthTracker-->>Router: state and score
    Router-->>Handler: selected, ordered routes
    Handler->>Executor: start execution (with AttemptBudget)
    Executor->>HealthTracker: BeginAttempt(providerID, clientType)
    HealthTracker-->>Executor: releaseFunc
    Executor->>Watchdog: create attemptWatchdog (first-byte/idle timeouts)
    Executor->>Adapter: issue upstream request
    Adapter->>Adapter: WrapResponseBody(resp.Body)
    Adapter-->>Executor: upstream response (body wrapped)
    Executor->>Watchdog: NoteFirstByte()/NoteActivity()
    Executor->>Handler: forward/stream response
    Handler->>Client: write response
    rect rgba(200,150,100,0.5)
        Executor->>HealthTracker: Record(AttemptResult)
        HealthTracker->>HealthTracker: update breaker/score/stats
    end
    Executor->>Watchdog: Stop()
    Watchdog-->>Executor: possible TimeoutErr
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Pull request overview
This PR hardens provider routing and execution failover by combining weighted routing with a new provider health tracker (scoring + circuit breaker), and by enforcing shared request budgets plus per-attempt watchdog timeouts to prevent slow/bad providers from dragging requests.
Changes:
- Add provider health tracking (score, EWMA latency/TTFT, in-flight penalty, circuit breaker w/ half-open probe) and integrate it into router ordering + executor attempt gating.
- Implement true `weighted_random` routing using `Route.Weight` (with default weight=1) and persist `weight` through admin APIs, DB, and backup import/export.
- Add a shared request budget plus per-attempt timeouts (total/first-byte/idle) with an attempt watchdog and normalized error handling, so serial failover cannot exceed the overall request budget.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/routes_test.go | E2E coverage for default and explicit route weight create/update behavior. |
| tests/e2e/proxy_setup_test.go | Wire health tracker into router/executor in proxy e2e env; add attempt-budget override helper. |
| tests/e2e/provider_failover_health_test.go | New e2e tests for health-based failover, breaker open skip, and shared request budget cutoff. |
| internal/service/backup_test.go | Ensure backup roundtrip preserves route weight. |
| internal/service/backup.go | Export/import route weight in backups. |
| internal/service/admin.go | Default Route.Weight to 1 on create/update when unset/invalid. |
| internal/router/router_test.go | Unit tests for health-aware ordering, open-circuit filtering, and weighted_random bias + stability. |
| internal/router/router.go | Implement true weighted random ordering; integrate health score reordering and circuit-open filtering. |
| internal/repository/sqlite/route.go | Persist weight (defaulting to 1) in sqlite route model/domain conversions. |
| internal/repository/sqlite/models.go | Add weight column to Route model with default 1. |
| internal/health/tracker_test.go | Unit tests for scoring, in-flight penalty, breaker open/half-open probe semantics. |
| internal/health/tracker.go | New health tracker implementation (decayed recents, EWMA, breaker state machine, probe reservation). |
| internal/handler/response_state_writer.go | New ResponseWriter wrapper to detect whether response has started. |
| internal/handler/proxy_test.go | Tests ensuring dispatch error handling doesn’t write after response started; status code propagation for ProxyError. |
| internal/handler/proxy.go | Wrap writer to track “response started”; normalize dispatch error writing and use ProxyError HTTP status codes. |
| internal/handler/admin.go | Support updating route weight via admin routes handler. |
| internal/flow/keys.go | Add flow key for attempt activity observer. |
| internal/flow/attempt_activity.go | Add attempt activity observer plumbing + response body wrapper to report first-byte/activity/EOF. |
| internal/executor/response_capture.go | Track whether response has started in response capture wrapper. |
| internal/executor/middleware_dispatch_test.go | Extensive dispatch tests for budgets, timeouts, retry-after capping, warm token non-blocking, and health recording/release semantics. |
| internal/executor/middleware_dispatch.go | Enforce request budget + per-attempt timeouts; add watchdog; gate attempts via health tracker; record attempt health. |
| internal/executor/executor.go | Add attempt budget + health tracker to executor; helper to clear cooldown for original/converted client types; pass watchdog to event processor. |
| internal/executor/attempt_watchdog.go | New watchdog for first-byte and stream-idle timeouts + activity tracking writer. |
| internal/executor/attempt_health.go | Normalize attempt errors (timeouts/started-response semantics), record health results, cap retry waits, release half-open probe when needed. |
| internal/executor/attempt_budget_test.go | Tests for default budget values and budget clamp/remaining behavior. |
| internal/executor/attempt_budget.go | Define attempt budgets and remaining-time helpers for attempt/request scopes. |
| internal/domain/model.go | Add Route.Weight to domain model with semantics comment. |
| internal/domain/backup.go | Add Weight to backup route schema. |
| internal/core/database.go | Create and pass health tracker into router/executor during core initialization. |
| internal/adapter/provider/kiro/adapter.go | Wrap upstream response body to report activity to watchdog. |
| internal/adapter/provider/custom/adapter.go | Wrap upstream response body to report activity to watchdog. |
| internal/adapter/provider/codex/adapter.go | Wrap upstream response body to report activity to watchdog. |
| internal/adapter/provider/cliproxyapi_codex/adapter.go | Disable first-byte timeout for buffered non-stream CPA execution path. |
| internal/adapter/provider/cliproxyapi_antigravity/adapter.go | Disable first-byte timeout for buffered non-stream CPA execution path. |
| internal/adapter/provider/claude/adapter.go | Wrap upstream response body to report activity to watchdog. |
| internal/adapter/provider/antigravity/adapter.go | Wrap upstream response body to report activity to watchdog. |
| cmd/maxx/main.go | Create and pass health tracker into router/executor in CLI server bootstrap. |
internal/executor/attempt_watchdog.go

```go
func (w *attemptWatchdog) Stop() {
	if w == nil {
		return
	}
	select {
	case <-w.stopCh:
	default:
		close(w.stopCh)
	}
}
```
internal/handler/response_state_writer.go

```go
func (w *responseStateWriter) Flush() {
	if flusher, ok := w.ResponseWriter.(http.Flusher); ok {
		flusher.Flush()
	}
}
```
internal/health/tracker.go (outdated)

```go
factor := math.Exp(-float64(elapsed) / float64(t.decayHalfLife))
stats.recentSuccess *= factor
stats.recentFailure *= factor
stats.recentTimeout *= factor
stats.recentRateLimit *= factor
stats.recentServerErr *= factor
```
Actionable comments posted: 4
🧹 Nitpick comments (13)
internal/health/tracker_test.go (1)
434-454: Pin the 2-minute half-life with an exact numeric assertion. These cases verify the decay trend, but they never lock in the contract that values halve after 2 minutes, so a miswritten decay formula could still pass all of them. Add a fixed-clock assertion that the recent* metrics drop to exactly 0.5x after one half-life.
Also applies to: 456-490
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/health/tracker_test.go` around lines 434 - 454, add a deterministic assertion that the decay half-life is exactly 2 minutes by advancing the fixed clock (tracker.now) by 2 minutes and asserting the tracker's internal recent decay metric (e.g., recentFailures or the relevant recent* field used by NewTracker) has dropped to approximately 0.5x its previous value; do this inside TestTrackerDoesNotOpenBreakerForFailuresSpreadAcrossDecayWindow (and the other affected test block around 456-490) after calling tracker.Record(AttemptResult...), so the decay formula is explicitly validated rather than relying only on behavior via IsCircuitOpen.
tests/e2e/routes_test.go (1)
69-71: The weight-related e2e coverage is thorough and heading in the right direction. Consider adding a boundary case for weight <= 0 (expected to be normalized to 1) to also lock in the defaulting behavior.
Also applies to: 143-157, 160-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/routes_test.go` around lines 69 - 71, add explicit boundary tests to verify weights <= 0 are normalized to 1: add a subtest (near the existing created["weight"] assertion) that creates a route with weight 0 and another with a negative weight and assert int(created["weight"].(float64)) == 1 for both; repeat the same assertions for the update flows referenced around the other blocks (the tests using the same created/updated response objects) so both create and update paths confirm normalization to 1.
internal/handler/admin.go (1)
364-368: Validate weight explicitly and return 400 on invalid values. The current branch only assigns when the value is a float64, so invalid types or fractional values are silently ignored/truncated, which hurts API predictability.
Optional change:

```diff
+import "math"
 ...
 if v, ok := updates["weight"]; ok {
-	if f, ok := v.(float64); ok {
-		existing.Weight = int(f)
-	}
+	f, ok := v.(float64)
+	if !ok || f != math.Trunc(f) {
+		writeJSON(w, http.StatusBadRequest, map[string]string{"error": "weight must be an integer"})
+		return
+	}
+	existing.Weight = int(f)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/admin.go` around lines 364 - 368, when "weight" appears in the updates map, validate it explicitly and return HTTP 400 on invalid input instead of silently ignoring/truncating: inspect updates["weight"] and accept only numeric values that are whole integers (e.g., float64 where f == math.Trunc(f), or a json.Number parsed to int) and in the allowed range (e.g., non-negative if that applies), then assign to existing.Weight; for any other type or fractional value respond with a 400 error and a clear message rather than silently skipping assignment.
internal/handler/response_state_writer.go (2)
5-8: The started field lacks concurrency protection. started is a plain bool; if concurrent writes ever occur (e.g., certain edge cases of SSE streaming responses), there is a potential data race. HTTP handlers normally process a request on a single goroutine, but consider whether atomic.Bool or sync.Once is needed for thread safety.
♻️ Optional: use atomic.Bool for thread safety

```diff
+import "sync/atomic"
+
 type responseStateWriter struct {
 	http.ResponseWriter
-	started bool
+	started atomic.Bool
 }
```

Update the reads and writes accordingly to use Store(true) and Load().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/response_state_writer.go` around lines 5 - 8, the responseStateWriter struct's started bool is not concurrency-safe; replace it with an atomic.Bool (or use sync.Once) to avoid race conditions in edge cases (e.g., SSE). Update the responseStateWriter type to use atomic.Bool and change all places that set or check started (e.g., in WriteHeader, Write, and any helper methods on responseStateWriter) to call Store(true) when marking started and Load() when checking it; also add the necessary import for sync/atomic (or sync for Once) and ensure any conditional logic uses the atomic methods to avoid non-atomic reads/writes.
27-31: Consider preserving http.Hijacker and other optional interfaces. The current implementation only explicitly forwards http.Flusher, but the underlying http.ResponseWriter may also implement optional interfaces such as http.Hijacker (used for WebSocket upgrades). If the underlying writer supports them, wrapping loses those capabilities. For the current proxy use case, this suggestion can be ignored if WebSocket support is not needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/response_state_writer.go` around lines 27 - 31, responseStateWriter currently only exposes Flush by type-asserting w.ResponseWriter to http.Flusher in Flush(); this wrapper can drop optional interfaces like http.Hijacker (and optionally http.CloseNotifier, http.Pusher) if the underlying ResponseWriter supports them. Update the responseStateWriter implementation to forward those optional interfaces: add type assertions and method implementations (e.g., implement Hijack() when the embedded ResponseWriter satisfies http.Hijacker, likewise for CloseNotify/Push) so the wrapper preserves the behavior of the underlying writer; look for the responseStateWriter type, its Flush() method, and ResponseWriter field to locate where to add the forwarded interface checks and methods.
internal/executor/attempt_watchdog.go (2)
175-180: The Write method triggers NoteFirstByte on every call, not just the first. Given the watchdog implementation, NoteFirstByte sets activitySeen=true and updates lastActivity, so this is functionally correct, but the semantics are slightly muddled. For more precision, attemptActivityWriter could maintain a firstByteSent flag, call NoteFirstByte only on the first write, and NoteActivity afterwards. That said, the current implementation works.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_watchdog.go` around lines 175 - 180, the Write method currently calls watchdog.NoteFirstByte() on every write; change attemptActivityWriter to track a boolean flag (e.g., firstByteSent), and in Write only call watchdog.NoteFirstByte() when firstByteSent is false, set firstByteSent=true, and on subsequent writes call watchdog.NoteActivity() instead; update the attemptActivityWriter struct to include the firstByteSent field and adjust the Write method accordingly to preserve semantics while making the first-byte vs subsequent-activity distinction explicit.
110-130: The 5ms polling interval may be aggressive. A 5ms ticker is used to detect timeouts; under high concurrency every request carries one such goroutine, which can add CPU overhead. If performance testing shows a problem, consider raising the interval to 50-100ms or replacing the polling with time.AfterFunc.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_watchdog.go` around lines 110 - 130, the 5ms polling interval in attemptWatchdog.run (time.NewTicker(5 * time.Millisecond)) is aggressive and can cause CPU overhead under high concurrency; update the ticker to a larger interval (e.g., 50-100ms) or replace the polling mechanism with a timer-based approach such as time.AfterFunc to schedule the timeout check for each attempt, ensuring you preserve the same stopCh handling and cancel() behavior in the attemptWatchdog struct.
internal/executor/executor.go (1)
81-83: SetAttemptBudget lacks concurrency protection. If this method is called at runtime (rather than only at startup), there is a potential data race. If it is intended only for startup configuration, add a comment saying so; otherwise consider guarding it with a mutex.
♻️ Optional: add documentation or a mutex

```diff
+// SetAttemptBudget updates the executor's attempt budget.
+// This method is not thread-safe and should only be called during initialization.
 func (e *Executor) SetAttemptBudget(budget AttemptBudget) {
 	e.attemptBudget = budget
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/executor.go` around lines 81 - 83, SetAttemptBudget currently writes e.attemptBudget without concurrency protection, which can cause data races if called at runtime; either document that SetAttemptBudget must only be used at startup, or add mutex protection (e.g., a sync.Mutex or sync.RWMutex on the Executor struct) and lock/unlock around the write in SetAttemptBudget; update the Executor struct to include the mutex (e.g., mu sync.Mutex), use mu.Lock()/defer mu.Unlock() around the assignment to e.attemptBudget, and add a brief comment on the method explaining the concurrency semantics.
internal/executor/attempt_budget_test.go (1)
1-79: Good test coverage; optionally add boundary cases. The current tests cover the core scenarios well. For extra robustness, consider adding:
- ClampRetryWait returns 0 when wait <= 0
- RemainingSince/RequestRemainingSince return 0 when the corresponding timeout field is <= 0
- ClampRetryWait does not clamp when MaxRetryWait <= 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_budget_test.go` around lines 1 - 79, add unit tests covering the suggested edge cases: verify that AttemptBudget.ClampRetryWait returns 0 when the input wait is <= 0, that RemainingSince returns 0 when TotalTimeout <= 0, that RequestRemainingSince returns 0 when RequestTimeout <= 0, and that ClampRetryWait does not clamp when MaxRetryWait <= 0 (i.e., returns the original wait). Reference the AttemptBudget struct, its fields MaxRetryWait, RequestTimeout, TotalTimeout, and the methods ClampRetryWait, RemainingSince, RequestRemainingSince to locate where to add the tests.
internal/executor/attempt_budget.go (1)
35-45: Returning 0 when TotalTimeout <= 0 is semantically ambiguous. When TotalTimeout is 0 (possibly meaning "unlimited"), RemainingSince returns 0, which callers could misread as "budget exhausted". Consider returning a sentinel (e.g., the duration corresponding to math.MaxInt64) or a negative value to signal "unlimited", or document the behavior explicitly. The current call site in middleware_dispatch.go:102-103 (if e.attemptBudget.RequestTimeout > 0) handles this correctly, but future callers could miss the check.
♻️ Optional: add a doc comment describing the behavior

```diff
+// RemainingSince calculates remaining time from TotalTimeout.
+// Returns 0 if TotalTimeout is not positive (disabled) or if the budget is exhausted.
+// Callers should check TotalTimeout > 0 before relying on this method.
 func (b AttemptBudget) RemainingSince(start time.Time) time.Duration {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_budget.go` around lines 35 - 45, change RemainingSince on AttemptBudget to signal "unlimited" instead of "exhausted" when TotalTimeout <= 0: update AttemptBudget.RemainingSince to return a sentinel long duration (e.g., time.Duration(math.MaxInt64)) for the unlimited case and add a doc comment explaining that a non-positive TotalTimeout means unlimited and will return the sentinel; also mention the existing consumer pattern (the middleware_dispatch.go check of e.attemptBudget.RequestTimeout > 0) so callers know to test RequestTimeout rather than treating zero as "no time left".
internal/executor/middleware_dispatch_test.go (1)
46-55: These wall-clock assertions are too tight and will flake on CI. The core semantics are already covered by the callCount, fallbackCalls, and ctx.Err assertions; additionally pinning 5-20ms timeouts under bounds like <25/40/60ms means any scheduler or GC jitter on the runner turns the tests red at random. Prefer signal-based assertions, or at least relax the thresholds to several multiples of the timeout.
Also applies to: 116-128, 193-205, 234-243
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/middleware_dispatch_test.go` around lines 46 - 55, the wall-clock assertions around exec.dispatch(ctx) (variables start, elapsed) are too tight and flaky; update the tests that currently assert elapsed < 25/40/60ms to instead rely on existing logical signals (callCount, fallbackCalls, ctx.Err), relax timing checks to a more generous bound (e.g., several times the configured attempt timeout), or replace them with signal-based waits (channel/condition) that assert the intended event occurred rather than a strict short duration; apply the same pattern to the other occurrences referenced by the test (the blocks asserting elapsed and using callCount/fallbackCalls/ctx.Err).
tests/e2e/provider_failover_health_test.go (2)
67-69: Dropping this relative-duration comparison would be more stable. The earlier hit-count assertions already prove the second request never reached slowProvider. Comparing secondDuration < firstDuration mostly fights environment noise, and easily inverts at random when the e2e suite runs on shared runners. Keep the provider-hit assertions and drop the relative timing comparison, or replace it with a generous absolute upper bound.
Also applies to: 82-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/provider_failover_health_test.go` around lines 67 - 69, the relative timing assertion comparing secondDuration < firstDuration is flaky; remove that comparison and keep the provider hit-count assertions (the checks proving the second request didn't hit slowProvider), or replace the relative check with a relaxed absolute bound (e.g., assert secondDuration < some generous duration) if a timing assertion is needed; update the test code around the firstStarted/firstDuration and secondDuration variables and the block that checks provider hits accordingly.
245-257: Don't pin the budget semantics to "exactly the first two providers". The shared request budget only guarantees that serial failover stops before the budget is exhausted; it does not guarantee the second slow provider is attempted exactly once. Under different scheduling/cancellation timing on other machines, hitting only the first provider is still within the semantics. A more stable check is to assert healthyHits == 0 and bound the slow providers' total hit count, rather than fixing an exact per-slot distribution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/provider_failover_health_test.go` around lines 245 - 257, the assertions pin the budget semantics to an exact per-slot distribution (slowHits[0]==1 && slowHits[1]==1); instead, change the checks to assert that healthyHits.Load() == 0 and that the total number of slow provider hits (sum of slowHits[i].Load()) is within an acceptable range (e.g., >=1 and <=2), so the test only verifies that healthy fallback did not occur and that the shared request budget limited serial failover, without requiring a precise per-slot hit pattern; keep the existing checks on resp.StatusCode and elapsed as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/executor/attempt_health.go`:
- Around line 27-35: The timeout branch currently only sets
proxyErr.HTTPStatusCode = http.StatusGatewayTimeout when HTTPStatusCode == 0;
change it to always overwrite the underlying status code to 504 for true
timeouts: in the timeout handling block where err is type-asserted to
*domain.ProxyError (symbols: proxyErr, domain.ProxyError, timeoutErr,
http.StatusGatewayTimeout), unconditionally set proxyErr.HTTPStatusCode =
http.StatusGatewayTimeout and keep setting proxyErr.Retryable = true,
proxyErr.IsNetworkError = true and proxyErr.Err = timeoutErr; apply the same
unconditional overwrite to the second identical timeout branch around the other
occurrence referenced (lines ~66-74).
- Around line 42-43: The code currently returns the original err when attemptCtx
is nil or not deadline-exceeded, which lets raw transport failures
(dial/TLS/reset) bypass network/breaker stats; instead, before returning, run
the same network-detection/normalization used elsewhere and ensure such
transport errors are marked as network errors (e.g., wrap the err into a
ProxyError with IsNetworkError=true or call the existing normalization/isNetwork
detection used around the ProxyError handling at lines 151-154) so
dial/TLS/reset-style errors are classified as network failures and will
contribute to network/timeout and breaker logic rather than being returned raw.
In `@internal/flow/attempt_activity.go`:
- Around line 43-61: In attemptActivityReadCloser, implement Close() and avoid triggering
CompleteResponseBody() twice: add a boolean field (e.g., responseBodyCompleted bool), check and
set it in both the Read EOF branch and the Close method, and only call
observer.CompleteResponseBody() when the flag is false, setting it to true afterwards; implement
Close() to call the underlying r.ReadCloser.Close() and fire the guarded
CompleteResponseBody() once before returning. If the type can be used concurrently, protect the
flag (e.g., with sync.Mutex or atomic) to prevent reentrancy.
In `@internal/health/tracker.go`:
- Around line 325-330: The current decay uses
math.Exp(-float64(elapsed)/float64(t.decayHalfLife)) which yields
e^(-t/halfLife) instead of a true half-life drop; change the factor calculation
to use 0.5^(t/halfLife) by replacing math.Exp(...) with math.Pow(0.5,
float64(elapsed)/float64(t.decayHalfLife)), then apply that factor to
stats.recentSuccess, stats.recentFailure, stats.recentTimeout,
stats.recentRateLimit, and stats.recentServerErr so values halve after t ==
t.decayHalfLife.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bc601bce-e451-45fd-a0fb-7283912a50f0
📒 Files selected for processing (37)
- cmd/maxx/main.go
- internal/adapter/provider/antigravity/adapter.go
- internal/adapter/provider/claude/adapter.go
- internal/adapter/provider/cliproxyapi_antigravity/adapter.go
- internal/adapter/provider/cliproxyapi_codex/adapter.go
- internal/adapter/provider/codex/adapter.go
- internal/adapter/provider/custom/adapter.go
- internal/adapter/provider/kiro/adapter.go
- internal/core/database.go
- internal/domain/backup.go
- internal/domain/model.go
- internal/executor/attempt_budget.go
- internal/executor/attempt_budget_test.go
- internal/executor/attempt_health.go
- internal/executor/attempt_watchdog.go
- internal/executor/executor.go
- internal/executor/middleware_dispatch.go
- internal/executor/middleware_dispatch_test.go
- internal/executor/response_capture.go
- internal/flow/attempt_activity.go
- internal/flow/keys.go
- internal/handler/admin.go
- internal/handler/proxy.go
- internal/handler/proxy_test.go
- internal/handler/response_state_writer.go
- internal/health/tracker.go
- internal/health/tracker_test.go
- internal/repository/sqlite/models.go
- internal/repository/sqlite/route.go
- internal/router/router.go
- internal/router/router_test.go
- internal/service/admin.go
- internal/service/backup.go
- internal/service/backup_test.go
- tests/e2e/provider_failover_health_test.go
- tests/e2e/proxy_setup_test.go
- tests/e2e/routes_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Agent
- GitHub Check: e2e
- GitHub Check: playwright
🔇 Additional comments (29)
internal/health/tracker.go (1)
92-100: `AllowAttempt()` is already called on the production path; the concern raised in the earlier comment does not hold. The original comment claimed there was no production call to `AllowAttempt()`, leaving the half-open semantics unused. In fact, `internal/executor/middleware_dispatch.go` line 114 does call `AllowAttempt()`: if e.healthTracker != nil && !e.healthTracker.AllowAttempt(matchedRoute.Provider.ID, string(clientType)) { break }. The full production flow is wired correctly:
- `router.go` line 221: `IsCircuitOpen()` pre-filters (skips providers with an open breaker)
- `middleware_dispatch.go` line 114: `AllowAttempt()` performs the half-open check and single-probe limit
- `middleware_dispatch.go` line 192: `BeginAttempt()` starts recording the attempt
- `attempt_health.go` line 189: `ReleaseHalfOpenProbe()` records the probe result and decides the breaker transition
Half-open probe semantics are correctly enabled; there is no missing integration.
> Likely an incorrect or invalid review comment.
internal/service/backup_test.go (1)
109-110: The weight roundtrip checks are a solid addition. Seeding the `Weight` data and asserting the roundtrip together effectively guards against the backup import/export path dropping the field.
Also applies to: 236-241
internal/domain/backup.go (1)
70-70: Adding the `weight` field to `BackupRoute` is a sound design. It is consistent with the route weighting capability and has clear serialization semantics.
internal/adapter/provider/antigravity/adapter.go (1)
242-242: Response body wrapping is now applied both before and after the retry; the change is correct. This ensures both success paths feed into the unified activity observation and timeout coordination.
Also applies to: 277-277
internal/executor/response_capture.go (1)
15-15: The `Started()` signal is a good addition. It makes the "has writing back to the client begun" check more direct without affecting existing capture behavior.
Also applies to: 30-30, 36-36, 63-65
internal/core/database.go (1)
20-20: The health-tracker injection chain is consistent and the wiring is correct overall. The same `providerHealthTracker` is passed to both the Router and the Executor, matching this PR's health-aware routing goal.
Also applies to: 258-266, 328-328
internal/adapter/provider/cliproxyapi_codex/adapter.go (1)
274-276: Disabling the first-byte timeout here is reasonable. It matches CPA's non-streaming "buffer fully, then return" behavior and reduces spurious timeouts.
internal/adapter/provider/cliproxyapi_antigravity/adapter.go (1)
130-132: LGTM! The non-streaming CPA endpoint buffers the full response before returning it in one shot, so disabling the first-byte timeout and falling back to TotalTimeout is the right approach. `flow.DisableFirstByteTimeout` already handles a nil observer internally, so the call is safe.
internal/adapter/provider/kiro/adapter.go (1)
140-141: LGTM! The response body wrapping follows the same pattern as the other adapters (claude, antigravity, codex), enabling first-byte and activity tracking in the flow layer. `WrapResponseBody` safely handles a nil observer.
Also applies to: 175-176
internal/repository/sqlite/route.go (1)
113-116: LGTM! Weight normalization is done in both directions, `toModel` and `toDomain`, consistent with the `admin.go` service layer and the defaults in `models.go`, ensuring non-positive weights never propagate into the routing logic.
Also applies to: 139-142
internal/adapter/provider/claude/adapter.go (1)
138-139: LGTM! The response body wrapping matches the kiro adapter: it wraps correctly both on the initial request and after the 401 retry, supporting flow-layer activity tracking.
Also applies to: 169-170
internal/service/backup.go (1)
676-676: Weight normalization on the import path is guaranteed by the repository layer. Import calls `routeRepo.Create(r)` directly, bypassing the weight normalization in `AdminService.CreateRoute`. Although `toModel` in `sqlite/route.go` handles the `weight <= 0` case, this relies on implicit repository-layer behavior. Consider normalizing explicitly on import for consistency with the service layer; that said, the current implementation is safe thanks to the repository-layer guarantee.
internal/handler/proxy_test.go (1)
12-52: LGTM! The tests cover the key error-handling scenarios: skipping the error write once the response has started, and using `ProxyError`'s `HTTPStatusCode`. The structure is clear and the names are explicit.
internal/handler/proxy.go (3)
238-264: LGTM! Error handling is more robust. The new dispatch error handling:
- aborts early once the response has started, avoiding duplicate writes
- maps `DeadlineExceeded` to 504 Gateway Timeout
- aborts silently on `Canceled`, writing nothing extra
- returns 500 for other errors
The logic is clear and matches HTTP semantics.
266-275: The double check in writeDispatchError is defensive design. Re-checking `responseHasStarted` at line 267 duplicates line 238, but as a defensive check inside a standalone helper it is reasonable and prevents mistakes if the helper is ever called directly in the future.
328-332: A default HTTPStatusCode of 502 is as expected. When `ProxyError.HTTPStatusCode` is unset (0), defaulting to 502 Bad Gateway matches upstream-error semantics, consistent with `NewProxyError` and `NewProxyErrorWithMessage` in `internal/domain/errors.go` not setting HTTPStatusCode.
Also applies to: 352-356
cmd/maxx/main.go (1)
196-199: LGTM! The health-tracker integration is correct. Per the context from `internal/health/tracker.go`, the tracker returned by `NewTracker()` is a pure in-memory implementation with no background goroutine to manage, so no extra cleanup is needed in `shutdownServer`.
internal/executor/executor.go (2)
85-105: LGTM! The dedup logic is correct. `clearSuccessCooldowns` uses a `seen` map to avoid clearing cooldown state twice when `currentClientType` and `originalClientType` are identical; the implementation is sound.
425-434: LGTM! The watchdog integration is correct. Calling `watchdog.NoteFirstByte()` on the `EventFirstToken` event is reasonable, and per the context snippets `NoteFirstByte` is idempotent, so repeated calls cause no problems.
internal/router/router_test.go (2)
366-399: LGTM! The fake health tracker implementation is complete. `routerFakeHealthTracker` correctly implements every method of the `health.ProviderTracker` interface, supports simulating health scores and breaker states, and fits unit-test usage well.
206-206: The project is configured for Go 1.26.0, which fully supports `for range` over an integer. No further action needed.
internal/executor/attempt_budget.go (1)
14-23: LGTM! The defaults match the PR spec: RequestTimeout 45s, TotalTimeout 30s, FirstByteTimeout 10s, StreamIdleTimeout 15s, MaxRetryAfter 30s, MaxRetryWait 10s.
internal/executor/middleware_dispatch.go (5)
101-116: LGTM! The ordering of the budget and health checks is correct. The budget-exhaustion check (104-111) and the health check (114-115) both run before `BeginAttempt` (line 192), so an early exit cannot leak the in-flight counter, satisfying the requirement from the context snippets that the cleanup function returned by `BeginAttempt` must be called.
190-224: LGTM! The resource-cleanup path is complete. The success path correctly runs:
- `recordAttemptHealth` (lines 219-222)
- `attemptDone()` (line 223), the cleanup function returned by BeginAttempt
- `releaseHalfOpenProbeIfNeeded` (line 224)
The cleanup order is sound: health metrics are recorded before resources are released.
302-311: LGTM! The failure path's cleanup mirrors the success path. It likewise runs `recordAttemptHealth`, `attemptDone()`, and `releaseHalfOpenProbeIfNeeded`, so the in-flight counter is decremented correctly whether the attempt succeeds or fails.
460-462: LGTM! The route loop is exited correctly. Breaking out when the request budget is exhausted or the response has started avoids trying further routes, matching the PR spec's "error semantics: no route switching once the response has started".
427-435: ✓ The `capRetryWait` method is correctly implemented in `internal/executor/attempt_health.go`, including the full budget-truncation logic (the MaxRetryAfter cap, the ClampRetryWait constraint, and request-timeout handling).
tests/e2e/proxy_setup_test.go (2)
42-56: LGTM! The test-environment options are well designed. The `ProxyTestEnvOptions` struct and the `NewProxyTestEnvWithAttemptBudget` factory give e2e tests a flexible way to customize timeout budgets, and the pattern makes it easy to add further optional settings later.
147-150: LGTM! The health-tracker wiring matches main.go. The test environment correctly creates a `providerHealthTracker` and passes it to `router.NewRouter`, consistent with the wiring in `cmd/maxx/main.go`.
Actionable comments posted: 2
🧹 Nitpick comments (2)
internal/executor/attempt_watchdog.go (1)
109-129: The polling interval could be tuned. A 5ms poll may add CPU overhead under high concurrency. Given the required timeout precision (first byte at 10s, stream idle at 15s), relaxing it to 50-100ms would not hurt user experience.
That said, the current implementation is functionally correct; this can be adjusted in a follow-up.
♻️ Optional optimization: increase the polling interval
func (w *attemptWatchdog) run() { - ticker := time.NewTicker(5 * time.Millisecond) + ticker := time.NewTicker(50 * time.Millisecond) defer ticker.Stop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_watchdog.go` around lines 109 - 129, The run method of attemptWatchdog currently uses a very tight polling interval (time.NewTicker(5 * time.Millisecond)) which may cause unnecessary CPU overhead under high concurrency; update the ticker creation in attemptWatchdog.run to use a larger interval (e.g., 50-100ms) that still meets the timeout precision for nextTimeoutErr/first-byte (10s) and stream-idle (15s), keeping the existing logic with stopCh, nextTimeoutErr(), w.mu, w.timeoutErr, and w.cancel unchanged.
tests/e2e/provider_failover_health_test.go (1)
257-275: Consider using `defer resp.Body.Close()` consistently in the helpers. The success path currently closes the body manually, while the failure path relies on `t.Fatalf` exiting early, which is not obvious. A single defer reduces duplication and avoids missed closes.
Optional refactor example
func createRouteWithPosition(t *testing.T, env *ProxyTestEnv, clientType string, providerID uint64, position int) uint64 { t.Helper() resp := env.AdminPost("/api/admin/routes", map[string]any{ "isEnabled": true, "clientType": clientType, "providerID": providerID, "position": position, }) + defer resp.Body.Close() if resp.StatusCode != http.StatusCreated { body, _ := io.ReadAll(resp.Body) t.Fatalf("Failed to create route: status=%d body=%s", resp.StatusCode, body) } @@ if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { t.Fatalf("Failed to decode route create response: %v", err) } - resp.Body.Close() return result.ID } @@ func createProviderWithErrorCooldownDisabled(t *testing.T, env *ProxyTestEnv, name, baseURL string, supportedTypes []string) uint64 { t.Helper() resp := env.AdminPost("/api/admin/providers", map[string]any{ @@ }) + defer resp.Body.Close() if resp.StatusCode != http.StatusCreated { body, _ := io.ReadAll(resp.Body) t.Fatalf("Failed to create provider %s: status=%d body=%s", name, resp.StatusCode, body) } @@ if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { t.Fatalf("Failed to decode provider create response: %v", err) } - resp.Body.Close() return result.ID }Also applies to: 280-304
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/provider_failover_health_test.go` around lines 257 - 275, The response body isn't always closed on all paths; after calling env.AdminPost(...) in the route-creation helper, immediately add defer resp.Body.Close() to ensure the body is closed whether the test fails or succeeds, and remove the manual resp.Body.Close() at the end of the successful path; apply the same change to the similar block around lines 280-304 so both helper usages (the resp variable in this function) consistently use defer for closing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/executor/attempt_health.go`:
- Around line 42-46: The current branch order calls isLikelyNetworkError(err)
before checking for attemptCtx deadline, causing context.DeadlineExceeded to be
treated as a network error and mapped to newNetworkProxyError (502); change the
logic in the functions/blocks around where attemptCtx and err are evaluated so
you first check for attemptCtx != nil && errors.Is(attemptCtx.Err(),
context.DeadlineExceeded) and handle/normalize it to the timeout path (504)
before calling isLikelyNetworkError(err), and apply the same reorder to the
other symmetric spot that currently checks isLikelyNetworkError first; reference
attemptCtx, errors.Is(..., context.DeadlineExceeded), isLikelyNetworkError, and
newNetworkProxyError when locating and fixing the code.
In `@tests/e2e/provider_failover_health_test.go`:
- Around line 169-175: The test's timing budgets (RequestTimeout, TotalTimeout,
FirstByteTimeout, StreamIdleTimeout, MaxRetryAfter, MaxRetryWait) are too tight
and cause flakiness; increase these values to larger, CI-safe numbers (e.g.,
RequestTimeout ~300ms, TotalTimeout/FirstByteTimeout/StreamIdleTimeout ~200ms,
MaxRetryAfter/MaxRetryWait ~50ms) in the test setup so retries/latency in CI
won't exhaust the budget, and update the expectation logic that checks provider
hit counts (the assertion that the first two providers "must each hit 1 time")
to accept >=1 hits instead of exactly 1 to avoid brittle timing-dependent
failures.
---
Nitpick comments:
In `@internal/executor/attempt_watchdog.go`:
- Around line 109-129: The run method of attemptWatchdog currently uses a very
tight polling interval (time.NewTicker(5 * time.Millisecond)) which may cause
unnecessary CPU overhead under high concurrency; update the ticker creation in
attemptWatchdog.run to use a larger interval (e.g., 50-100ms) that still meets
the timeout precision for nextTimeoutErr/first-byte (10s) and stream-idle (15s),
keeping the existing logic with stopCh, nextTimeoutErr(), w.mu, w.timeoutErr,
and w.cancel unchanged.
In `@tests/e2e/provider_failover_health_test.go`:
- Around line 257-275: The response body isn't always closed on all paths; after
calling env.AdminPost(...) in the route-creation helper, immediately add defer
resp.Body.Close() to ensure the body is closed whether the test fails or
succeeds, and remove the manual resp.Body.Close() at the end of the successful
path; apply the same change to the similar block around lines 280-304 so both
helper usages (the resp variable in this function) consistently use defer for
closing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d314791f-f1a6-479c-a057-13a39feeefe6
📒 Files selected for processing (13)
- cmd/maxx/main.go
- internal/core/database.go
- internal/executor/attempt_health.go
- internal/executor/attempt_health_test.go
- internal/executor/attempt_watchdog.go
- internal/executor/attempt_watchdog_test.go
- internal/flow/attempt_activity.go
- internal/flow/attempt_activity_test.go
- internal/handler/response_state_writer.go
- internal/handler/response_state_writer_test.go
- internal/health/tracker.go
- internal/health/tracker_test.go
- tests/e2e/provider_failover_health_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/handler/response_state_writer.go
📜 Review details
🔇 Additional comments (12)
internal/flow/attempt_activity_test.go (1)
1-52: LGTM! The tests clearly verify the completion-signal semantics of `WrapResponseBody`:
- the completion callback fires exactly once on Close (without reading to EOF)
- the completion callback fires only once in the EOF + Close scenario
The test observer implementation is concise, and the counter approach effectively verifies idempotency.
internal/handler/response_state_writer_test.go (1)
1-23: LGTM! The test concisely verifies that `Flush` correctly marks the response as started, and confirms the `http.Flusher` interface implementation.
cmd/maxx/main.go (1)
196-199: LGTM! The health-tracker integration matches `internal/core/database.go`. Both places create and pass `providerHealthTracker` the same way, ensuring consistent behavior across entry points (direct CLI startup vs. initialization through core).
Also applies to: 285-285
internal/executor/attempt_watchdog_test.go (1)
1-51: LGTM! The tests cover the key edge cases.
- The concurrency test (200 iterations × 16 concurrent calls) effectively verifies that `Stop()` is thread-safe
- The nil-receiver test ensures calls on a nil pointer do not panic
- The timeout-error test cleverly triggers an immediate timeout by setting a start time in the past
internal/executor/attempt_watchdog.go (3)
174-178: The `NoteFirstByte` call semantics in `Write` are correct. Every non-empty `Write` calls `NoteFirstByte`; the first call sets `activitySeen=true`, and later calls update `lastActivity`. Although the name suggests it only records the first byte, it also updates the activity timestamp, which is necessary for stream idle-timeout detection.
181-186: Recording activity in `Flush` only after the first byte has been seen is the right design. `NoteActivity` only updates the timestamp when `activitySeen=true`. This ensures:
- flushing response headers does not falsely register activity
- only actual response-body transfer counts as valid activity
28-48: LGTM! Watchdog lifecycle management is correct.
- The guard returning `nil` (when `cancel==nil` or both timeouts are ≤0) avoids creating a pointless watchdog
- The `run()` goroutine starts at construction and stops gracefully via `stopCh`
- `stopOnce` makes `Stop()` idempotent
10-578: LGTM! The test suite is comprehensive and well designed. It covers the tracker's key behaviors:
- scoring preferences (low latency, high success rate)
- breaker state transitions (closed → open → half-open → closed/open)
- the single-concurrency constraint on half-open probes
- correct handling of stale successes
- decay half-life behavior
The concurrency test (`TestTrackerHalfOpenAllowsOnlySingleConcurrentProbe`) correctly uses channel synchronization to verify the single-probe constraint. The project is configured for Go 1.26.0, which fully supports the `for range 5` syntax.
1-80: High-quality coverage of error handling and health recording. The tests are well structured and cover three key paths:
- timeout errors are correctly normalized to 504 Gateway Timeout
- raw network errors are wrapped as ProxyError with the `IsNetworkError` and `Retryable` flags set correctly
- the health tracker correctly recognizes network errors
`testNetError` uses a compile-time interface check (line 20) to guarantee it implements `net.Error`; the test logic is rigorous.
internal/core/database.go (1)
258-266: The integration is correct and needs no explicit cleanup. The health tracker is a pure in-memory structure with no background goroutine, so no explicit Stop/Close/Shutdown is needed. `providerHealthTracker` is correctly shared between the router and the executor, and the construction order is sound.
internal/flow/attempt_activity.go (1)
53-76: The one-shot completion in `CompleteResponseBody` is implemented correctly. Both `Read` (EOF) and `Close` go through `complete()`, and `sync.Once` guarantees it fires only once; the state-machine teardown semantics are clear.
internal/health/tracker.go (1)
325-330: The half-life decay formula matches the intended semantics. The decay implementation guarantees a drop to 50% after one half-life period, consistent with the "2-minute half-life" design goal.
🧹 Nitpick comments (2)
internal/executor/attempt_health.go (1)
61-64: The redundant branch can be removed to lower maintenance cost. The `if attemptCtx == nil` branch on lines 61-64 currently behaves exactly like the branch after it (both return `err`); merging them into a single return reduces the risk of the branches drifting apart in later changes.
Suggested change
- if attemptCtx == nil { - return err - } return err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_health.go` around lines 61 - 64, The if branch checking attemptCtx == nil is redundant because both branches simply return err; remove the entire if statement and replace with a single return err at the end of the function (i.e., eliminate the duplicate conditional that references attemptCtx and leave one unconditional return err), ensuring no other logic depends on that check.
internal/executor/attempt_health_test.go (1)
41-60: Consider adding a status-code assertion for network errors. The test already verifies that a raw network error is wrapped and flagged network/retryable; also asserting `HTTPStatusCode == 502` would guard against a future accidental change to the network-error mapping policy.
Suggested addition
if !proxyErr.Retryable { t.Fatal("ProxyError.Retryable = false, want true for raw network error") } + if proxyErr.HTTPStatusCode != http.StatusBadGateway { + t.Fatalf("ProxyError.HTTPStatusCode = %d, want %d", proxyErr.HTTPStatusCode, http.StatusBadGateway) + } if !errors.Is(proxyErr.Err, rawErr) { t.Fatalf("ProxyError.Err = %v, want wrapped raw network error", proxyErr.Err) }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/executor/attempt_health_test.go` around lines 41 - 60, Update the test TestNormalizeAttemptErrorWrapsRawNetworkError to also assert that the wrapped domain.ProxyError has HTTPStatusCode == 502: after casting to *domain.ProxyError (from exec.normalizeAttemptError) add a check on proxyErr.HTTPStatusCode and call t.Fatalf (or t.Fatalf with a clear message) if it is not 502; this ensures the network error mapping in Executor.normalizeAttemptError remains 502 for raw network errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/executor/attempt_health_test.go`:
- Around line 41-60: Update the test
TestNormalizeAttemptErrorWrapsRawNetworkError to also assert that the wrapped
domain.ProxyError has HTTPStatusCode == 502: after casting to *domain.ProxyError
(from exec.normalizeAttemptError) add a check on proxyErr.HTTPStatusCode and
call t.Fatalf (or t.Fatalf with a clear message) if it is not 502; this ensures
the network error mapping in Executor.normalizeAttemptError remains 502 for raw
network errors.
In `@internal/executor/attempt_health.go`:
- Around line 61-64: The if branch checking attemptCtx == nil is redundant
because both branches simply return err; remove the entire if statement and
replace with a single return err at the end of the function (i.e., eliminate the
duplicate conditional that references attemptCtx and leave one unconditional
return err), ensuring no other logic depends on that check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 22343134-5d51-4658-acd5-ca31b710f42b
📒 Files selected for processing (3)
- internal/executor/attempt_health.go
- internal/executor/attempt_health_test.go
- tests/e2e/provider_failover_health_test.go
📜 Review details
🔇 Additional comments (13)
tests/e2e/provider_failover_health_test.go (7)
1-14: LGTM! The imports and package declaration are correct, and every dependency is used in the tests.
16-90: LGTM! The test logic is clear: the first request verifies failover behavior, and the second verifies that health-aware routing skips the unhealthy provider. `slowProvider` returns 502 with a 60ms delay, which correctly lowers the health score so subsequent requests prefer the healthy provider.
92-165: LGTM! The circuit-breaker test is well designed:
- the warm-up phase sends 3 requests to trip the breaker open
- a healthy provider is added after the breaker opens
- subsequent requests are verified to skip the unhealthy provider and route successfully to the healthy one
167-254: LGTM! The budget test logic is correct:
- RequestTimeout=120ms overall budget, TotalTimeout=60ms per-attempt timeout
- each slow provider delays 100ms > 60ms, triggering the timeout
- the assertions flexibly allow 1-2 hits (lines 247-249), tolerating timing jitter in CI
- budget exhaustion is verified to return 504 without trying further providers
The timing budgets were adjusted per earlier review feedback, improving test stability.
256-277: LGTM! The helpers are well structured, correctly use the `t.Helper()` marker, and handle errors thoroughly.
279-306: LGTM! The JSON payload correctly matches the `ProviderConfig` struct definition: the `disableErrorCooldown` field sits inside the `config` object at the same level as `custom`, matching the field definitions in `internal/domain/model.go`.
308-314: LGTM! A helper for gracefully shutting down the test server: it disconnects clients before closing the server, preventing connection leaks. The nil check adds defensive protection.
internal/executor/attempt_health.go (5)
15-60: The error-normalization order and the 504 mapping are implemented correctly. timeout/deadline is handled before the network-error check, and all timeout scenarios are uniformly mapped to 504, reliably preventing a deadline from being misclassified as 502.
67-106: The error policy for the response-started case is consistent and clear. This branch maps timeout/deadline uniformly to 504 and sets `Retryable` to false, matching the "no route switching once the response has started" semantics.
115-159: The health-record field mapping is complete, with sufficient classification signals. The extraction of `IsTimeout/IsNetwork/IsRateLimit/IsServerErr` covers both the proxy-error and raw-error paths and fits the tracker's `AttemptResult` structure.
195-214: The retry-wait clamping chain is sound. The `MaxRetryAfter -> MaxRetryWait -> RequestRemaining` order is explicit and constrains both the protocol-suggested wait and the global request budget.
216-226: The half-open probe release condition is controlled correctly. The probe is released as a fallback only when no health result was recorded, avoiding double release and reducing the risk of state-machine drift.
internal/executor/attempt_health_test.go (1)
22-39: The key regression paths are well covered. The 504 override and the deadline priority (including response-started) are asserted directly; this set of cases is very valuable against error-normalization regressions.
Also applies to: 82-119
|
@coderabbitai If the current suggestion is not adopted, will it lead to disruptive changes? |
|
Tip For best results, initiate chat on the files or code changes.
If you can point me to the specific review comment (e.g., by quoting it or sharing a link to it), I can give you a precise assessment of whether skipping it would cause breaking or disruptive behavior. |
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/handler/proxy.go (1)
74-79:⚠️ Potential issue | 🟡 Minor
`responseStateWriter` does not implement the `http.Hijacker` and `http.Pusher` interfaces. The current implementation handles the `Flusher` interface correctly (delegating to the underlying writer via a type assertion), and `WriteHeader` and `Write` correctly mark the started state. However, `Hijacker` and `Pusher` are not implemented — the embedded `http.ResponseWriter` field cannot satisfy type assertions for these two interfaces. Although current code (such as `writeStreamError`) only uses `Flusher` and the basic `Write`/`WriteHeader` methods, so there is no immediate runtime regression, this is an interface-contract gap. If WebSocket or HTTP/2 Push support is added later, code receiving this writer will be unable to access those features. Consider implementing both interfaces, or at least providing an `Unwrap()` method so callers can reach the underlying writer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/proxy.go` around lines 74 - 79, The responseStateWriter currently returned by newResponseStateWriter doesn't implement http.Hijacker or http.Pusher causing type-assert failures for callers expecting those interfaces; update the responseStateWriter type to implement Hijacker (Hijack() (net.Conn, *bufio.ReadWriter, error)) and Pusher (Push(target string, opts *http.PushOptions) error) by delegating to the embedded underlying http.ResponseWriter when it supports those interfaces, and add an Unwrap() http.ResponseWriter method returning the underlying writer so callers (and future WebSocket/HTTP2 code like writeStreamError consumers) can access the raw writer if needed. Ensure Hijack/Push return appropriate errors when the underlying writer does not implement the interface.
🧹 Nitpick comments (2)
internal/handler/proxy.go (1)
244-250: Better to use `errors.As` here to extract `*domain.ProxyError`. The direct type assertion only matches the outermost error, while the two blocks below already use `errors.Is` to accept the unwrapped `DeadlineExceeded`/`Canceled`. If a later layer wraps a `ProxyError` with `%w`, this path would bypass `writeDispatchError` and lose the `HTTPStatusCode`, `Retry-After`, and the stream/non-stream-specific error body.
🔧 Suggested change
- proxyErr, ok := err.(*domain.ProxyError) - if ok { + var proxyErr *domain.ProxyError + if errors.As(err, &proxyErr) { h.writeDispatchError(c, proxyErr, stream) c.Err = err c.Abort() return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/proxy.go` around lines 244 - 250, Replace the direct type assertion with errors.As so wrapped *domain.ProxyError values are detected: attempt to unwrap err into a variable (e.g., proxyErr) using errors.As(err, &proxyErr) and, if true, call h.writeDispatchError(c, proxyErr, stream), set c.Err = err and c.Abort() as before; this ensures HTTPStatusCode / Retry-After and stream/non-stream error bodies from domain.ProxyError are not lost when err is wrapped with %w.
internal/handler/proxy_test.go (1)
36-52: This case does not cover the real regression point in `dispatch`. It calls `writeDispatchError` directly, which only proves the helper stops writing a second error when `responseHasStarted=true`; if the started check at the top of `dispatch` regresses, the `DeadlineExceeded` and plain-error branches could still write a 504/500 after a partial response, and this test would not fail. Consider adding a case that goes through `dispatch`, and while at it, cover the scenario where the response is started by `WriteHeader` alone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/proxy_test.go` around lines 36 - 52, The current test calls writeDispatchError directly and doesn't exercise the dispatch entrypoint; add a new test that invokes ProxyHandler.dispatch (or the exported method that routes to dispatch) so the started-check at the top of dispatch is exercised end-to-end, and also add a subcase where the response is considered started by calling WriteHeader (but not Write) before triggering the error to ensure dispatch does not write a 504/500 after a partial response; reference ProxyHandler.dispatch, writeDispatchError, and the existing TestWriteDispatchErrorSkipsWhenResponseAlreadyStarted to mirror setup and assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/handler/proxy.go`:
- Around line 74-79: The responseStateWriter currently returned by
newResponseStateWriter doesn't implement http.Hijacker or http.Pusher causing
type-assert failures for callers expecting those interfaces; update the
responseStateWriter type to implement Hijacker (Hijack() (net.Conn,
*bufio.ReadWriter, error)) and Pusher (Push(target string, opts
*http.PushOptions) error) by delegating to the embedded underlying
http.ResponseWriter when it supports those interfaces, and add an Unwrap()
http.ResponseWriter method returning the underlying writer so callers (and
future WebSocket/HTTP2 code like writeStreamError consumers) can access the raw
writer if needed. Ensure Hijack/Push return appropriate errors when the
underlying writer does not implement the interface.
---
Nitpick comments:
In `@internal/handler/proxy_test.go`:
- Around line 36-52: The current test calls writeDispatchError directly and
doesn't exercise the dispatch entrypoint; add a new test that invokes
ProxyHandler.dispatch (or the exported method that routes to dispatch) so the
started-check at the top of dispatch is exercised end-to-end, and also add a
subcase where the response is considered started by calling WriteHeader (but not
Write) before triggering the error to ensure dispatch does not write a 504/500
after a partial response; reference ProxyHandler.dispatch, writeDispatchError,
and the existing TestWriteDispatchErrorSkipsWhenResponseAlreadyStarted to mirror
setup and assertions.
In `@internal/handler/proxy.go`:
- Around line 244-250: Replace the direct type assertion with errors.As so
wrapped *domain.ProxyError values are detected: attempt to unwrap err into a
variable (e.g., proxyErr) using errors.As(err, &proxyErr) and, if true, call
h.writeDispatchError(c, proxyErr, stream), set c.Err = err and c.Abort() as
before; this ensures HTTPStatusCode / Retry-After and stream/non-stream error
bodies from domain.ProxyError are not lost when err is wrapped with %w.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ccf74aaa-68a8-4a1d-ba2b-64cfe5b5bf5a
📒 Files selected for processing (7)
- internal/adapter/provider/custom/adapter.go
- internal/domain/model.go
- internal/handler/proxy.go
- internal/handler/proxy_test.go
- internal/repository/sqlite/models.go
- internal/service/admin.go
- internal/service/backup.go
🚧 Files skipped from review as they are similar to previous changes (5)
- internal/domain/model.go
- internal/repository/sqlite/models.go
- internal/adapter/provider/custom/adapter.go
- internal/service/backup.go
- internal/service/admin.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: e2e
- GitHub Check: playwright
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ymkiux
Repo: awsl-project/maxx PR: 0
File: :0-0
Timestamp: 2026-03-11T07:29:12.071Z
Learning: Reviewer ymkiux (repo: awsl-project/maxx) requires strict attention to breaking changes in all PRs. Always thoroughly audit public API, interface, and behavioral contract changes — especially in sidebar/auth/layout components — before raising review comments. Do not flag superficial style issues; focus on correctness, regressions, and breaking changes.
🔇 Additional comments (1)
internal/handler/proxy_test.go (1)
54-64: This set of status-code regression tests is good. It pins the pass-through behavior of `ProxyError.HTTPStatusCode` for both plain and SSE responses, making an accidental fallback to the default 502 hard to reintroduce later.
Also applies to: 66-76
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/handler/response_state_writer.go (1)
59-61: `responseHasStarted` should support unwrapping a wrapper chain. It currently does only a direct type assertion; if `c.Writer` is later wrapped by another middleware layer, it will be misjudged as not started, reintroducing the "write an error body after the response has started" regression risk. Suggest an unwrappable check
func responseHasStarted(w http.ResponseWriter) bool { - state, ok := w.(*responseStateWriter) - return ok && state.started + for { + if state, ok := w.(*responseStateWriter); ok { + return state.started + } + uw, ok := w.(interface{ Unwrap() http.ResponseWriter }) + if !ok { + return false + } + w = uw.Unwrap() + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handler/response_state_writer.go` around lines 59 - 61, support unwrapping a wrapper chain in responseHasStarted: the current single assertion to *responseStateWriter misjudges the state when the Writer is wrapped by another layer. Change responseHasStarted to unwrap in a loop: first check whether the current w is a *responseStateWriter and return its started field; otherwise check whether w implements Unwrap() http.ResponseWriter (or a similar unwrap method), and if so set w to the return value of Unwrap() and continue the loop until no further unwrapping is possible, finally returning false. This way the responseStateWriter.started state is still detected correctly when multiple wrappers are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/handler/response_state_writer.go`:
- Around line 31-35: The Flush method on responseStateWriter currently sets
started = true before checking support for http.Flusher; change it to only set
w.started = true after confirming the wrapped ResponseWriter implements
http.Flusher (same pattern used by Hijack), so if the type assertion to
(http.Flusher) fails you do not mark the response as started; update
responseStateWriter.Flush (and any related assumptions in responseHasStarted())
to follow this behavior so downstream error handling in proxy.go correctly
detects an unstarted response.
---
Nitpick comments:
In `@internal/handler/response_state_writer.go`:
- Around line 59-61: Support unwrapping a wrapper chain in responseHasStarted: the current single assertion to *responseStateWriter misjudges the state when the Writer is wrapped by another middleware layer. Change the responseHasStarted function to unwrap in a loop: first check whether the current w is a *responseStateWriter and return its started field; otherwise check whether w implements Unwrap() http.ResponseWriter (or a similar unwrap method), and if it does, set w to the return value of Unwrap() and continue the loop until no further unwrapping is possible, finally returning false. This correctly detects the responseStateWriter.started state even when multiple wrappers are present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 99b04724-52ec-44b8-b939-7bcfff486dc9
📒 Files selected for processing (2)
- internal/handler/response_state_writer.go
- internal/handler/response_state_writer_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/handler/response_state_writer_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: e2e
- GitHub Check: playwright
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ymkiux
Repo: awsl-project/maxx PR: 0
File: :0-0
Timestamp: 2026-03-11T07:29:12.071Z
Learning: Reviewer ymkiux (repo: awsl-project/maxx) requires strict attention to breaking changes in all PRs. Always thoroughly audit public API, interface, and behavioral contract changes — especially in sidebar/auth/layout components — before raising review comments. Do not flag superficial style issues; focus on correctness, regressions, and breaking changes.
🔇 Additional comments (1)
internal/handler/response_state_writer.go (1)
14-19: The double-wrapping guard is correct. Checking for `*responseStateWriter` first and returning the original object preserves the existing `started` state and avoids behavioral drift.
…roject#421)" This reverts commit 67ab9e0.
1. What this change solves
The original problems were:
- priority only looked at Position
- weighted_random was not previously true weighted random
- Retry-After could drag requests out
The goal of this change is:
2. Solution overview
Rather than rewriting the whole architecture, this change builds on the existing
Router + Executor + Handler stack to close the scheduling loop. Routing layer:
Router.Match now handles: priority by Position; weighted_random by true weighted random. Execution layer:
Executor.dispatch now enforces: RequestTimeout, TotalTimeout, FirstByteTimeout, StreamIdleTimeout. Health layer:
The Tracker is responsible for: the closed / open / half_open breaker. 3. How the routing algorithm works
3.1
priority这个策略不变,仍然是:
3.2 真实
weighted_random这次把
Route.Weight真正打通了。现在 route 已经有:
默认规则:
算法采用加权随机无放回排序:
含义:
3.3 权重和健康同时生效
如果只是权重随机,不够。
如果只是按健康分硬排序,权重又会失效。
所以现在是联合计算:
当前健康乘子:
这样做的效果是:
4. 近期健康怎么做
健康状态的维度是:
Tracker 维护这些指标:
recentSuccessrecentFailurerecentTimeoutrecentRateLimitrecentServerErrewmaLatencyewmaTTFTinFlightconsecutiveUnhealthyFailures时间衰减
近期统计不是永久累计,而是带衰减:
当前半衰期是:
意思是最近几分钟的表现权重更高,老失败会逐步淡出。
评分思路
健康分大致由这些项组成:
这不是复杂模型,而是一个可解释的启发式打分。
5. Breaker 怎么做
状态有三种:
closedopenhalf_open状态流转:
什么算不健康失败
正常阶段:
在
half_open探测阶段,429也会 reopen。为什么普通 4xx 不 reopen
普通 4xx 很可能是用户请求问题,不是 provider 坏了。
所以现在:
half_open遇到普通 4xx这样不会把健康 provider 因为一条 bad request 再次摘掉。
Why only one probe is allowed
The whole point of half_open is to admit a single recovery probe. The current implementation guarantees:
the probe only ends once Record() consumes its result
6. How the execution chain avoids being dragged down by a bad provider
Shared request budget
Newly added:
Defaults:
Purpose:
RequestTimeout caps the total budget for the whole request; TotalTimeout caps a single attempt; FirstByteTimeout guards against an upstream that never sends a first byte; StreamIdleTimeout guards against a stream stalling midway; MaxRetryAfter / MaxRetryWait keep waiting from eating the whole budget.
Actual effect
The previous worst case was close to:
It is now:
hard-truncated.
7. How first byte / stream idle is handled
A new attempt watchdog was added.
It will:
enforce FirstByteTimeout and StreamIdleTimeout. To avoid false positives, activity is recorded not only in
ResponseWriter.Write, but also while reading the upstream body. This distinguishes:
8. How error semantics are handled
Errors after timeout / deadline / response-started are now uniformly normalized into
ProxyError. The rules are:
504. This fixes the earlier problem of appending an error body after part of the response had already been written.
9. Key code locations
If reviewers want a quick tour of the code, start with these files:
internal/router/router.go
internal/health/tracker.go
internal/executor/middleware_dispatch.go
internal/executor/attempt_budget.go
internal/executor/attempt_watchdog.go
internal/executor/attempt_health.go
internal/flow/attempt_activity.go
internal/handler/proxy.go
10. Main test coverage
Unit tests
Cover:
weighted_random
e2e
Cover:
11. Results
With this change complete, the core gains are:
weighted_random finally became truly weight-based random
Summary by CodeRabbit
New features
Improvements
Tests