
fix: (ai-proxy-multi) health check not work #12968

Open
elizax wants to merge 2 commits into apache:master from elizax:master

Conversation


@elizax elizax commented Feb 4, 2026

ai-proxy-multi health check filtering solution

Problem description

Symptoms

  • The ai-proxy-multi plugin is configured with multiple LLM instances
  • The health check detects that some instances return 404 and marks them unhealthy
  • The load balancer nevertheless still selects these unhealthy instances to handle requests

Expected behavior

Unhealthy instances should be removed automatically, so that only healthy instances are added to the load-balancing pool.

Test scenario


Root cause analysis

Core issue: health state is not synchronized across workers

1. Architectural differences between Upstream and ai-proxy-multi

| Aspect | Upstream | ai-proxy-multi |
| --- | --- | --- |
| Config source | res_conf.value.upstream | Built dynamically (plugin.construct_upstream) |
| Checker creation | Built-in mechanism, used directly | Must be fetched via healthcheck_manager.fetch_checker |
| Worker synchronization | Guaranteed by the balance module | No synchronization mechanism |
| Health state reads | Unified interface | Each worker reads independently |

2. Issue 1: a nil checker breaks health checking

Original code

local checker = healthcheck_manager.fetch_checker(resource_path, resource_version)
checkers = checkers or {}
checkers[instance.name] = checker  -- ❌ added even when checker is nil

Problem

  • After a restart, fetch_checker may return nil (the checker has not been created yet)
  • checkers = {["instance1"] = nil, ["instance2"] = nil} — in Lua, assigning nil stores nothing, so the table holds no usable entries
  • next(checkers) therefore yields no valid checker
  • checker_ref in get_shm_info is nil, so .shm cannot be read
  • All instances are added to the picker by default (a minimal standalone illustration follows this list)
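
A minimal, standalone illustration of this failure mode (plain Lua, not plugin code; the instance names are made up):

-- Plain-Lua sketch of the failure mode: nil assignments leave the table empty.
local checkers = {}
checkers["instance1"] = nil    -- fetch_checker returned nil; nothing is stored
checkers["instance2"] = nil

local name, checker_ref = next(checkers)
print(name, checker_ref)       --> nil  nil: no checker to borrow .shm from

-- With no checker_ref.shm to consult, the health filter has nothing to go on,
-- so every instance falls through to the default path and joins the picker.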

Solution

local checker = healthcheck_manager.fetch_checker(resource_path, resource_version)
if checker then  -- ✅ only non-nil checkers are added
    checkers = checkers or {}
    checkers[instance.name] = checker
end

3. Issue 2: the runtime field _dns_value is missing

Original code

function _M.construct_upstream(instance)
    local node = instance._dns_value  -- ❌ reads the runtime-only field directly

    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end
    -- ...
end

Problem

  • _dns_value is a runtime field generated dynamically by resolve_endpoint() during request processing
  • The timer reads the raw configuration from the config center (etcd/Admin API), which does not contain _dns_value
  • When the timer creates a checker it calls construct_upstream, which returns nil because _dns_value is missing
  • Checker creation therefore fails, the entry is removed from waiting_pool, and later requests can never obtain a checker

Timer flow for creating a checker

1. resource.fetch_latest_conf(resource_path)
   → reads the raw config from etcd (no _dns_value)

2. jp.value(res_conf.value, json_path)
   → extracts the instance config (still no _dns_value)

3. plugin.construct_upstream(instance_config)
   → checks instance._dns_value
   → missing → returns nil ❌

4. create_checker(upstream)
   → upstream is nil → creation fails ❌

Solution

-- New helper: compute the DNS node from the instance config
local function calculate_dns_node(instance_conf)
    local scheme, host, port
    local endpoint = core.table.try_read_attr(instance_conf, "override", "endpoint")
    if endpoint then
        scheme, host, port = endpoint:match(endpoint_regex)
        if port == "" then
            port = (scheme == "https") and "443" or "80"
        end
        port = tonumber(port)
    else
        local ai_driver = require("apisix.plugins.ai-drivers." .. instance_conf.provider)
        scheme = "https"
        host = ai_driver.host
        port = ai_driver.port
    end
    local node = {
        host = host,
        port = tonumber(port),
        scheme = scheme,
    }
    parse_domain_for_node(node)
    return node
end

-- Modified construct_upstream
function _M.construct_upstream(instance)
    local upstream = {}
    local node = instance._dns_value

    -- ✅ if _dns_value is missing, compute it from the config
    if not node then
        core.log.info("instance._dns_value not found, calculating from config for instance: ", instance.name)
        node = calculate_dns_node(instance)
        if not node then
            return nil, "failed to calculate endpoint for instance: " .. instance.name
        end
    end
    -- ...
end

Effect

  • When the timer creates a checker, the node is computed automatically even if _dns_value is missing
  • No more dependency on a field that is only generated at request time
  • The checker can be created normally (a small example of how the endpoint is split follows this list)
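
For illustration, this is how the existing endpoint_regex (already defined near the top of ai-proxy-multi.lua) splits an override endpoint; the URLs below are made-up examples:

-- endpoint_regex as defined in ai-proxy-multi.lua
local endpoint_regex = "^(https?)://([^:/]+):?(%d*)/?.*$"

local scheme, host, port = ("https://api.example.com/v1/chat"):match(endpoint_regex)
-- scheme = "https", host = "api.example.com", port = ""  -> defaults to "443"

scheme, host, port = ("http://10.0.0.5:8080/llm"):match(endpoint_regex)
-- scheme = "http", host = "10.0.0.5", port = "8080"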

4. Issue 3: the LRU cache causes stale pickers to be reused

Original mechanism

local version = plugin.conf_version(conf)  -- version only changes when the config changes
local server_picker = lrucache_server_picker(ctx.matched_route.key, version, ...)

问题

  • conf_version only changes when the configuration changes
  • The LRU cache TTL is 10 seconds
  • When health state changes, the cache key stays the same and the stale picker keeps being used

Solution

local version = plugin.conf_version(conf)
if checkers then
    local status_ver = get_checkers_status_ver(checkers)  -- incremented when health state changes
    version = version .. "#s" .. status_ver
end

Effect

  • Health state changes → status_ver is bumped → version changes → the LRU entry is invalidated → a new picker is created (a sketch of the helper follows)
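
The helper get_checkers_status_ver is not shown in the snippets above. A minimal sketch of what it could look like, assuming each lua-resty-healthcheck checker exposes a status_ver counter that is bumped on every status change (the same field the core balancer uses for its cache key); the actual implementation in the PR may differ:

-- Sketch only: fold every checker's status_ver into one number, so that any
-- single health-state change produces a different combined value.
local function get_checkers_status_ver(checkers)
    local ver = 0
    for _, checker in pairs(checkers) do
        ver = ver + (checker.status_ver or 0)
    end
    return ver
end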

Solution details

Approach 1: synchronize health state across workers via SHM

Problem

Each worker has its own independent checker object, and local caches are synchronized asynchronously via worker events, which leads to inconsistent state.

Solution

Read the authoritative health state directly from SHM (shared memory), bypassing the local caches:

local function fetch_health_status_from_shm(shm, checker_name, ip, port, hostname, instance_name)
    local lookup_hostname = hostname or ip
    local state_key = string.format("lua-resty-healthcheck:%s:state:%s:%s:%s",
        checker_name, ip, port, lookup_hostname)

    local state = shm:get(state_key)
    if state then
        -- State: 1=healthy, 2=unhealthy, 3=mostly_healthy, 4=mostly_unhealthy
        local ok = (state == 1 or state == 3)
        return ok, state
    end

    -- State not found, default to healthy
    return true, nil
end

Key points

  • SHM is a memory region shared by all workers
  • Health check state is written to SHM, so every worker reads a consistent value
  • This bypasses any propagation delay in the per-worker local caches (see the short illustration below)
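
A short OpenResty illustration of why this works; the shared dict name used here is an assumption for the example, not necessarily the one used by the plugin:

-- Shared dicts are declared once in nginx.conf and are visible to every worker:
--   lua_shared_dict upstream-healthcheck 10m;
local shm = ngx.shared["upstream-healthcheck"]

-- Whichever worker runs this read sees the value most recently written by the
-- worker that performed the active health check, using the key format shown
-- in fetch_health_status_from_shm above.
local state = shm:get("lua-resty-healthcheck:<checker_name>:state:<ip>:<port>:<hostname>")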

Full code changes

File: ai-proxy-multi.lua

Change 1: only add valid checkers

-- lines 381-385
local checker = healthcheck_manager.fetch_checker(resource_path, resource_version)
if checker then  -- ✅ key change
    checkers = checkers or {}
    checkers[instance.name] = checker
end

Change 2: add an SHM state reading function

-- lines 228-250
local function fetch_health_status_from_shm(shm, checker_name, ip, port, hostname, instance_name)
    local lookup_hostname = hostname or ip
    local state_key = string.format("lua-resty-healthcheck:%s:state:%s:%s:%s",
        checker_name, ip, port, lookup_hostname)

    core.log.info("[SHM-DIRECT] instance=", instance_name, " key=", state_key)
    local state = shm:get(state_key)
    if state then
        local ok = (state == 1 or state == 3)
        core.log.info("[SHM-DIRECT] instance=", instance_name, " state=", state, " ok=", ok)
        return ok, state
    end

    core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to healthy")
    return true, nil
end

Change 3: obtain SHM and checker information

-- lines 253-273
local function get_shm_info(checkers, conf, i, ins)
    local checker = checkers and checkers[ins.name]

    -- Prefer the instance's own checker
    if checker and checker.shm then
        core.log.info("[SHM-DEBUG] instance=", ins.name, " using own checker")
        return checker.shm, checker.name
    end

    -- Fallback: use another checker's SHM
    local checker_ref = checkers and next(checkers)
    if checker_ref and checker_ref.shm then
        local checker_name = "upstream#" .. conf._meta.parent.resource_key .. "#plugins['ai-proxy-multi'].instances[" .. (i - 1) .. "]"
        core.log.info("[SHM-DEBUG] instance=", ins.name, " using fallback checker_ref, checker_name=", checker_name)
        return checker_ref.shm, checker_name
    end

    core.log.warn("[SHM-DEBUG] instance=", ins.name, " checkers=", checkers and "exists" or "nil", " checker_ref=", checker_ref and "exists" or "nil")
    return nil, nil
end

Change 4: use the status version in the cache key

-- lines 389-405
-- Include status_ver in the cache key so the picker refreshes as soon as health state changes
local version = plugin.conf_version(conf)
if checkers then
    local status_ver = get_checkers_status_ver(checkers)
    version = version .. "#s" .. status_ver
end

-- Use the LRU cache to reduce SHM accesses
local server_picker = ctx.server_picker
if not server_picker then
    server_picker = lrucache_server_picker(ctx.matched_route.key, version,
                                           create_server_picker, conf, ups_tab, checkers)
end
if not server_picker then
    return nil, nil, "failed to create server picker"
end
ctx.server_picker = server_picker

Test results

Test 1: immediately after restart

Wait time: 2 seconds
Result: 88/100 requests succeeded (88%)
Notes: 7 of the first 14 requests returned 404; all later requests succeeded
Reason: health checks are triggered by requests, and the checker had not completed its first probe when the first requests arrived

Test 2: after health checks have completed

Result: 100/100 requests succeeded (100%)
Notes: every request was routed to a healthy instance

Log evidence

# The picker is created with only the healthy instance
[info] fetch health instances: {"_priority_index":[0],"0":{"deepseek-instance2":1}}

# SHM reads show the correct health state
[info] [SHM-DIRECT] instance=deepseek-instance1 state=2 ok=false  (unhealthy)
[info] [SHM-DIRECT] instance=deepseek-instance2 state=1 ok=true   (healthy)

# Instance selection
[info] picked instance: deepseek-instance2  ✓ (only the healthy instance is picked)

Key takeaways

1. Cross-worker state synchronization is the core problem

How Upstream works

  • Managed uniformly by the built-in balance module
  • Checker creation and state updates go through a single mechanism
  • Workers synchronize state automatically

Why ai-proxy-multi is harder

  • The plugin builds its upstream dynamically
  • Checkers are created asynchronously through healthcheck_manager
  • There is no built-in cross-worker synchronization mechanism

Solution

  • Read the authoritative health state directly from SHM
  • All workers read from the same data source, guaranteeing consistency

2. The LRU invalidation mechanism determines how quickly the picker is refreshed

Evolution of the cache key

| Approach | Cache key | Problem | Fix |
| --- | --- | --- | --- |
| Original | conf_version | Key does not change unless the config changes | Add status_ver |
| Final | conf_version .. "#s" .. status_ver | — | Refreshes immediately when health state changes |

Effect

  • Health state changes → status_ver is bumped → cache key changes → LRU entry is invalidated → a new picker is created
  • The 10-second LRU TTL acts as a safety net in case a status_ver change slips through

3. Handling nil checkers is a crucial detail

Problem

checkers[instance.name] = nil  -- assigning nil stores nothing; next(checkers) finds no checker

Consequences

  • next(checkers) yields no usable checker value
  • checker_ref is nil
  • checker_ref.shm cannot be accessed
  • Health checking is effectively disabled

Solution

if checker then
    checkers[instance.name] = checker  -- only add valid checkers
end
-- next(checkers) now returns either a valid checker or nil

404s during the first few seconds after a restart

Symptom

For the first few seconds after a restart (roughly 3-5 seconds), a small number of requests still return 404.

Cause

This is expected behavior, because:

  1. Health checks are triggered by requests rather than running on a fixed schedule
  2. Targets start in the unhealthy state (after this fix)
  3. A target is only marked healthy/unhealthy after the first health check completes
  4. Until the health checks complete, the picker may contain all instances

Optimization suggestions (optional)

To completely eliminate the 404s after a restart, consider:

  1. Adding warm-up logic at startup in healthcheck_manager.lua
  2. Tuning the health check parameters (a rough sketch follows this list)
    • Increase healthy.successes (require more successes before marking an instance healthy)
    • Decrease unhealthy.http_failures (mark an instance unhealthy sooner)
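
As a rough sketch of the parameters in question, using the standard APISIX active health-check schema — where exactly such a checks block attaches in the ai-proxy-multi configuration, and the probe path, are assumptions here:

-- Hypothetical tuning sketch; only the healthy/unhealthy thresholds are the point.
local checks = {
    active = {
        http_path = "/v1/models",    -- assumed probe path
        healthy = {
            interval = 2,
            successes = 3,           -- require more successes before marking healthy
        },
        unhealthy = {
            interval = 1,
            http_failures = 1,       -- mark unhealthy after fewer failures
        },
    },
}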

Related files

| File | Role | Modified |
| --- | --- | --- |
| apisix/plugins/ai-proxy-multi.lua | Multi-instance routing plugin | ✅ Yes |
| apisix/healthcheck_manager.lua | Checker management | ❌ No |
| deps/share/lua/5.1/resty/healthcheck.lua | Health check library | ❌ No |

Summary

The combination of fixes at four levels resolves the ai-proxy-multi health check filtering problem:

1. Cross-worker state synchronization (core problem)

  • Problem: ai-proxy-multi lacks the built-in synchronization mechanism that Upstream has
  • Fix: read the authoritative health state directly from SHM

2. Timer fails to create the checker (underlying problem)

  • Problem: when the timer reads the configuration from the config center, the runtime field _dns_value does not exist
  • Fix: add a calculate_dns_node() function that computes the endpoint from the configuration

3. LRU cache invalidation (efficiency)

  • Problem: an unchanged cache key caused stale pickers to be reused
  • Fix: include status_ver in the cache key so the picker is refreshed as soon as health state changes

4. nil checker handling (crucial detail)

  • Problem: nil values added to the checkers table broke health checking
  • Fix: only valid checkers are added to the table


Core improvement: instead of depending on request-time fields and per-worker local caches, the plugin now reads the authoritative health state from SHM and uses status_ver to invalidate the picker cache immediately.

As a result, once health checks have completed, the system reaches a 100% success rate and correctly excludes unhealthy instances.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Feb 4, 2026
@elizax elizax changed the title fix(ai-proxy-multi):health check not work fix:(ai-proxy-multi) health check not work Feb 4, 2026
@elizax elizax changed the title fix:(ai-proxy-multi) health check not work fix: (ai-proxy-multi) health check not work Feb 4, 2026
@Baoyuantop
Contributor

Hi @elizax, please use English in the public channel. Could you add a test case for your changes?

@guilongyang

guilongyang commented Feb 10, 2026

@elizax I applied the change to a 3.15.0 image and ran it. After a few seconds the SHM state can no longer be fetched, all instances get added back, and traffic is still routed to the unhealthy instances.
[screenshot attached]

@Baoyuantop Baoyuantop requested a review from Copilot February 13, 2026 07:58

Copilot AI left a comment


Pull request overview

This PR fixes health check filtering in the ai-proxy-multi plugin to ensure unhealthy LLM instances are properly excluded from load balancing. The root cause was inconsistent health state synchronization across workers and nil checker handling issues.

Changes:

  • Modified health check to read state directly from shared memory (SHM) instead of relying on per-worker local caches
  • Fixed nil checker handling to prevent invalid checkers from being added to the checkers table
  • Updated LRU cache key generation to include health status version for immediate picker refresh when health states change

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test-nginx | Added subproject reference for test infrastructure |
| apisix/plugins/ai-proxy-multi.lua | Core changes: SHM-based health status reading, nil checker filtering, calculate_dns_node extraction, LRU cache key optimization |


@@ -35,7 +35,7 @@ local endpoint_regex = "^(https?)://([^:/]+):?(%d*)/?.*$"

local pickers = {}

Copilot AI Feb 13, 2026


The TTL reduction from 300 to 10 seconds significantly increases SHM access frequency. Consider documenting why this aggressive TTL is needed, especially since status_ver changes should already invalidate the cache when health states change.

Suggested change
local pickers = {}
local pickers = {}
-- NOTE:
-- The TTL here is intentionally kept small (10s) instead of the more typical 300s.
-- Health state changes are already handled via status_ver-based invalidation in the
-- healthcheck manager, but this cache also needs to reflect other dynamic changes
-- (for example, configuration updates, priority/weight adjustments, or backend
-- endpoint rotation) in a timely manner for AI traffic routing.
--
-- Using a 10s TTL trades slightly higher SHM access frequency for faster convergence
-- towards the latest upstream selection state, which is acceptable for this plugin's
-- usage pattern. If you change this value, consider the impact on both performance
-- and how quickly picker state reflects configuration and routing changes.

Comment on lines +173 to +174
-- Calculate DNS node from instance config without modifying the input
-- Returns a node table with host, port, scheme fields

Copilot AI Feb 13, 2026


The function comment should clarify that this is intended for use when _dns_value is not available (e.g., when called from timer context), as this is a critical use case mentioned in the PR description.

Suggested change
-- Calculate DNS node from instance config without modifying the input
-- Returns a node table with host, port, scheme fields
-- Calculate DNS node from instance config without modifying the input.
-- Intended for use when _dns_value is not available (e.g., when called
-- from timer context) to recompute the target node.
-- Returns a node table with host, port, scheme fields.

Comment on lines +186 to +189
-- built-in ai driver always use https
scheme = "https"
host = ai_driver.host
port = ai_driver.port

Copilot AI Feb 13, 2026


The removed code handled ai_driver.get_node() which appears to be a valid driver interface. This removal could break custom AI drivers that implement get_node(). Consider preserving this logic or documenting why it was removed.

Suggested change
-- built-in ai driver always use https
scheme = "https"
host = ai_driver.host
port = ai_driver.port
-- built-in ai driver always use https; custom drivers may implement get_node()
if ai_driver.get_node then
    local driver_node = ai_driver.get_node(instance_conf)
    if driver_node then
        scheme = driver_node.scheme or "https"
        host = driver_node.host
        port = driver_node.port
    else
        scheme = "https"
        host = ai_driver.host
        port = ai_driver.port
    end
else
    scheme = "https"
    host = ai_driver.host
    port = ai_driver.port
end

Comment on lines +247 to +249
-- State not found in SHM (checker not yet created), default to healthy
core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to healthy")
return true, nil

Copilot AI Feb 13, 2026


Defaulting to healthy when state is not found contradicts the stated goal of excluding unhealthy instances. This could allow requests to uninitialized instances. Consider defaulting to unhealthy until the first health check completes, or document why healthy is the correct default.

Suggested change
-- State not found in SHM (checker not yet created), default to healthy
core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to healthy")
return true, nil
-- State not found in SHM (checker not yet created), default to unhealthy to avoid routing to uninitialized instances
core.log.warn("[SHM-DIRECT] state not found for instance=", instance_name, ", defaulting to unhealthy")
return false, nil

end

-- Fallback: use another checker's SHM and construct the checker_name
local checker_ref = checkers and next(checkers)

Copilot AI Feb 13, 2026


next(checkers) returns (key, value), so checker_ref is actually the key (instance name), not the checker object. This code attempts to access checker_ref.shm which would fail. Should be: local _, checker_ref = next(checkers)

Suggested change
local checker_ref = checkers and next(checkers)
local _, checker_ref = checkers and next(checkers)


-- Fallback: use another checker's SHM and construct the checker_name
local checker_ref = checkers and next(checkers)
if checker_ref and checker_ref.shm then

Copilot AI Feb 13, 2026


The array index calculation (i - 1) suggests Lua's 1-based indexing is being converted to 0-based. Add a comment explaining this is for compatibility with the checker naming convention, as it's not immediately obvious why the adjustment is needed.

Suggested change
if checker_ref and checker_ref.shm then
if checker_ref and checker_ref.shm then
-- Note: (i - 1) converts Lua's 1-based index to 0-based to match the
-- existing checker naming convention used when the health checker is created.

end

local node = {
local upstream_node = {

Copilot AI Feb 13, 2026


The variable name upstream_node shadows the outer scope variable node. While functionally correct, this could be confusing. Consider renaming to something like node_config to distinguish the constructed upstream node from the DNS-resolved node.

