fix: (ai-proxy-multi) health check not working #12968
elizax wants to merge 2 commits into apache:master
Conversation
Hi @elizax, please use English in the public channel. Could you add a test case for your changes?
@elizax I ran your changes in a 3.15.0 image and found that after a few seconds the state could not be read from SHM, so all instances were added back, and traffic was still routed to the unhealthy instances.
Pull request overview
This PR fixes health check filtering in the ai-proxy-multi plugin to ensure unhealthy LLM instances are properly excluded from load balancing. The root cause was inconsistent health state synchronization across workers and nil checker handling issues.
Changes:
- Modified health check to read state directly from shared memory (SHM) instead of relying on per-worker local caches
- Fixed nil checker handling to prevent invalid checkers from being added to the checkers table
- Updated LRU cache key generation to include health status version for immediate picker refresh when health states change
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test-nginx | Added subproject reference for test infrastructure |
| apisix/plugins/ai-proxy-multi.lua | Core changes: SHM-based health status reading, nil checker filtering, calculate_dns_node extraction, LRU cache key optimization |
    @@ -35,7 +35,7 @@ local endpoint_regex = "^(https?)://([^:/]+):?(%d*)/?.*$"
    local pickers = {}
The TTL reduction from 300 to 10 seconds significantly increases SHM access frequency. Consider documenting why this aggressive TTL is needed, especially since status_ver changes should already invalidate the cache when health states change.
Suggested change:

    local pickers = {}
    -- NOTE:
    -- The TTL here is intentionally kept small (10s) instead of the more typical 300s.
    -- Health state changes are already handled via status_ver-based invalidation in the
    -- healthcheck manager, but this cache also needs to reflect other dynamic changes
    -- (for example, configuration updates, priority/weight adjustments, or backend
    -- endpoint rotation) in a timely manner for AI traffic routing.
    --
    -- Using a 10s TTL trades slightly higher SHM access frequency for faster convergence
    -- towards the latest upstream selection state, which is acceptable for this plugin's
    -- usage pattern. If you change this value, consider the impact on both performance
    -- and how quickly picker state reflects configuration and routing changes.
    -- Calculate DNS node from instance config without modifying the input
    -- Returns a node table with host, port, scheme fields
The function comment should clarify that this is intended for use when _dns_value is not available (e.g., when called from timer context), as this is a critical use case mentioned in the PR description.
Suggested change:

    -- Calculate DNS node from instance config without modifying the input.
    -- Intended for use when _dns_value is not available (e.g., when called
    -- from timer context) to recompute the target node.
    -- Returns a node table with host, port, scheme fields.
    -- built-in ai driver always use https
    scheme = "https"
    host = ai_driver.host
    port = ai_driver.port
The removed code handled ai_driver.get_node() which appears to be a valid driver interface. This removal could break custom AI drivers that implement get_node(). Consider preserving this logic or documenting why it was removed.
Suggested change:

    -- built-in ai driver always use https; custom drivers may implement get_node()
    if ai_driver.get_node then
        local driver_node = ai_driver.get_node(instance_conf)
        if driver_node then
            scheme = driver_node.scheme or "https"
            host = driver_node.host
            port = driver_node.port
        else
            scheme = "https"
            host = ai_driver.host
            port = ai_driver.port
        end
    else
        scheme = "https"
        host = ai_driver.host
        port = ai_driver.port
    end
    -- State not found in SHM (checker not yet created), default to healthy
    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to healthy")
    return true, nil
Defaulting to healthy when state is not found contradicts the stated goal of excluding unhealthy instances. This could allow requests to uninitialized instances. Consider defaulting to unhealthy until the first health check completes, or document why healthy is the correct default.
Suggested change:

    -- State not found in SHM (checker not yet created), default to unhealthy to avoid routing to uninitialized instances
    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to unhealthy")
    return false, nil
    end
    -- Fallback: use another checker's SHM and construct the checker_name
    local checker_ref = checkers and next(checkers)
next(checkers) returns (key, value), so checker_ref is actually the key (instance name), not the checker object. This code attempts to access checker_ref.shm which would fail. Should be: local _, checker_ref = next(checkers)
Suggested change:

    -- `checkers and next(checkers)` would truncate next()'s two return values to one,
    -- so guard the nil-table case with `or {}` instead:
    local _, checker_ref = next(checkers or {})
    -- Fallback: use another checker's SHM and construct the checker_name
    local checker_ref = checkers and next(checkers)
    if checker_ref and checker_ref.shm then
The array index calculation (i - 1) suggests Lua's 1-based indexing is being converted to 0-based. Add a comment explaining this is for compatibility with the checker naming convention, as it's not immediately obvious why the adjustment is needed.
Suggested change:

    if checker_ref and checker_ref.shm then
        -- Note: (i - 1) converts Lua's 1-based index to 0-based to match the
        -- existing checker naming convention used when the health checker is created.
    end
    -    local node = {
    +    local upstream_node = {
The variable name upstream_node shadows the outer scope variable node. While functionally correct, this could be confusing. Consider renaming to something like node_config to distinguish the constructed upstream node from the DNS-resolved node.

ai-proxy-multi Health Check Filtering Solution
Problem Description
Symptom
Expected Behavior
Unhealthy instances should be removed automatically, so that only healthy instances are added to the load-balancing pool.
Test Scenario
Root Cause Analysis
Core Problem: Health State Is Not Synchronized Across Workers
1. Architectural Difference Between Upstream and ai-proxy-multi
- Upstream: health checks are configured directly on res_conf.value.upstream
- ai-proxy-multi: constructs its own upstream (plugin.construct_upstream) and obtains checkers via healthcheck_manager.fetch_checker
2. Problem 1: Nil Checkers Break Health Check Filtering
Original code:
Problems:
- fetch_checker may return nil (the checker has not been created yet)
- the table then ends up as checkers = {["instance1"] = nil, ["instance2"] = nil}
- next(checkers) returns ("instance1", nil)
- in get_shm_info, checker_ref is nil, so .shm cannot be accessed
Solution:
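A minimal sketch of this "only add valid checkers" guard, assuming a hypothetical collection helper; the fetch_checker argument shape and field names are illustrative, not the PR's exact code:

```lua
local healthcheck_manager = require("apisix.healthcheck_manager")

-- Collect one checker per instance, skipping instances whose checker has not
-- been created yet, so the resulting table never maps a name to nil.
local function collect_valid_checkers(conf)
    local checkers = {}
    for _, instance in ipairs(conf.instances or {}) do
        local checker = healthcheck_manager.fetch_checker(instance)  -- may return nil
        if checker then
            checkers[instance.name] = checker
        end
    end
    return checkers
end
```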
3. Problem 2: Missing _dns_value Runtime Field
Original code:
Problems:
- _dns_value is a runtime field generated dynamically by resolve_endpoint() during request processing
- the timer context does not have the _dns_value field
- construct_upstream checks _dns_value and returns nil when it does not exist
Timer flow for creating the checker:
Solution:
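A sketch of what such a calculate_dns_node helper could look like, reusing the endpoint_regex shown in the diff context above. The instance fields (provider, override.endpoint) and the driver module path are assumptions for illustration, not the PR's exact implementation:

```lua
local endpoint_regex = "^(https?)://([^:/]+):?(%d*)/?.*$"

-- Derive host/port/scheme for an instance straight from its configuration,
-- without touching the request-time _dns_value field, so it can also be
-- called from the health-check timer.
local function calculate_dns_node(instance_conf)
    local ai_driver = require("apisix.plugins.ai-drivers." .. instance_conf.provider)
    local endpoint = instance_conf.override and instance_conf.override.endpoint
    local scheme, host, port

    if endpoint then
        scheme, host, port = endpoint:match(endpoint_regex)
    else
        -- built-in ai driver always uses https
        scheme = "https"
        host = ai_driver.host
        port = ai_driver.port
    end

    return {
        scheme = scheme,
        host = host,
        port = tonumber(port) or (scheme == "https" and 443 or 80),
    }
end
```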
Effect:
- the endpoint can be computed automatically even when _dns_value is absent
4. Problem 3: The LRU Cache Reuses Stale Pickers
Original mechanism:
Problem:
- conf_version only changes when the configuration is updated, so a health state change alone never invalidates the cached picker
Solution:
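A sketch of folding the health-status version into the picker cache key. core.lrucache is the standard APISIX cache helper; get_status_ver and create_server_picker are assumed helper names used only for illustration:

```lua
local core = require("apisix.core")
local healthcheck_manager = require("apisix.healthcheck_manager")

-- Picker cache: the key embeds the health-status version, so a health state
-- change produces a new key and the stale picker is no longer looked up.
local lrucache_server_picker = core.lrucache.new({
    ttl = 10, count = 64,
})

local function fetch_server_picker(ctx, conf, checkers)
    -- status_ver is bumped whenever any instance flips healthy/unhealthy;
    -- create_server_picker (assumed) builds a picker from the healthy instances
    local status_ver = healthcheck_manager.get_status_ver(conf) or 0
    local key = tostring(ctx.conf_version) .. "#s" .. tostring(status_ver)
    return lrucache_server_picker(key, nil, create_server_picker, conf, checkers)
end
```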
Effect:
- status_ver increments → the version changes → the LRU cache is invalidated → a new picker is created
Detailed Solutions
Approach 1: Synchronize Health State Across Workers via SHM
Problem
Each worker has its own independent checker object, and the local caches are only synchronized asynchronously via worker events, which leads to inconsistent state.
Solution
Read the authoritative health state directly from SHM (shared memory), bypassing the local caches:
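A hedged sketch of the SHM-direct lookup. The shared-dict name and key layout below are placeholders purely for illustration; the actual key format used by lua-resty-healthcheck is internal and version dependent:

```lua
local core = require("apisix.core")

-- Return (healthy, err). Reads the target status straight out of the health
-- checker's shared dict so every worker sees the same authoritative state.
local function get_instance_health_from_shm(shm_name, checker_name, host, port, instance_name)
    local shm = ngx.shared[shm_name]
    if not shm then
        return true, "shared dict not found: " .. shm_name
    end

    -- hypothetical key layout: "<checker_name>:status:<host>:<port>"
    local status = shm:get(checker_name .. ":status:" .. host .. ":" .. port)
    if status == nil then
        -- state not found in SHM (checker not yet created); the diff above
        -- defaults to healthy here, see the review discussion about
        -- defaulting to unhealthy instead
        core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name)
        return true, nil
    end

    return status ~= "unhealthy", nil
end
```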
Key points:
Complete Code Changes
File:
ai-proxy-multi.lua
Change 1: Only Add Valid Checkers
Change 2: Add an SHM State-Reading Function
Change 3: Obtain SHM and Checker Information
Change 4: Use the Status Version as Part of the Cache Key
Test Results
Test 1: Immediately After Restart
Test 2: After Health Checks Complete
Log Evidence
Key Takeaways
1. Cross-Worker State Synchronization Is the Core Problem
How Upstream works:
Challenges for ai-proxy-multi:
- checkers are created asynchronously by healthcheck_manager
Solution:
2. The LRU Cache Invalidation Mechanism Determines How Quickly Pickers Refresh
Evolution of the cache key:
- before: conf_version
- after: conf_version .. "#s" .. status_ver
Effect:
- status_ver increments → the cache key changes → the LRU entry is invalidated → a new picker is created
3. Nil Checker Handling Is a Key Detail
Problem:
Consequences:
- next(checkers) returns ("key", nil)
- checker_ref is nil
- accessing checker_ref.shm fails
Solution:
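A minimal sketch of picking any existing checker as a reference for its SHM while guarding the nil and empty cases (local names assumed):

```lua
-- Take an arbitrary valid checker so its shared dict can be reused to look up
-- the state of instances whose checker object is missing in this worker.
local checker_ref
if checkers then
    local _, first_checker = next(checkers)
    checker_ref = first_checker
end

if checker_ref and checker_ref.shm then
    -- safe to read health state for other instances via checker_ref.shm
end
```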
404s in the First Few Seconds After a Restart
Symptom
In the first few seconds after a restart (roughly 3 to 5 seconds), a small number of 404 responses can still occur.
Cause
This is expected behavior, because:
- the failing instance is only marked unhealthy (after the fix) once the first health checks have completed
Optional Optimizations
If the post-restart 404s need to be eliminated completely, consider:
- adding warm-up logic at startup in healthcheck_manager.lua
- raising healthy.successes (more successes are required before an instance is marked healthy)
- lowering unhealthy.http_failures (instances are marked unhealthy faster)
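For reference, those knobs follow the standard APISIX health check schema; a sketch of such a configuration as a Lua table (values are illustrative only):

```lua
local checks = {
    active = {
        type = "http",
        http_path = "/healthz",     -- illustrative probe path
        healthy = {
            interval = 1,
            successes = 3,          -- require more consecutive successes before marking healthy
        },
        unhealthy = {
            interval = 1,
            http_failures = 1,      -- mark an instance unhealthy after fewer failures
        },
    },
}
```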
Related Files
- apisix/plugins/ai-proxy-multi.lua
- apisix/healthcheck_manager.lua
- deps/share/lua/5.1/resty/healthcheck.lua
Summary
Combining fixes at four levels resolved the ai-proxy-multi health check filtering problem:
1. Cross-worker state synchronization (the core problem)
2. Checker creation failing in the timer (the underlying problem)
- the _dns_value runtime field does not exist in the timer context
- a calculate_dns_node() function was added to compute the endpoint automatically from the configuration
3. LRU cache invalidation mechanism (the efficiency guarantee)
4. Nil checker handling (the key detail)
Core improvement: instead of relying on runtime-generated fields and per-worker local caches, the plugin now supports reading the authoritative health state from SHM and uses status_ver to invalidate the picker cache immediately.
This allows the system to reach a 100% success rate once health checks complete, correctly excluding unhealthy instances.