feat(probe): add Kubernetes probe support with liveness, readiness, and startup checks #3213

Open
Alanxtl wants to merge 4 commits into apache:develop from Alanxtl:develop

@Alanxtl Alanxtl commented Feb 13, 2026

This is the implementation of #2039,
which is a rewrite of #3047.
Usage is demonstrated in apache/dubbo-go-samples#1033.
Docs are written in apache/dubbo-website#3193.

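Kubernetes Probe Support

This module provides a standalone HTTP probe server for the three Kubernetes probe types: liveness, readiness, and startup.
It supports user-defined health check logic and can optionally align its internal state with the Dubbo Server lifecycle.

Design goals

  1. Extensible: custom check logic is registered through callbacks.
  2. Controlled risk: liveness carries no internal logic by default, to avoid unwarranted restarts.
  3. Lifecycle alignment: readiness and startup can optionally use the internal state.

Default HTTP paths

When probe is enabled, the following paths are exposed on port 22222 by default:

  • GET /live: liveness probe
  • GET /ready: readiness probe
  • GET /startup: startup probe

Response rules:

  • All checks pass: HTTP 200
  • Any check fails: HTTP 503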
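
The response rule above can be sketched as a handler that maps check results to a status code. This is a minimal illustration rather than the handler in this PR; `checkFunc`, `probeHandler`, `statusOf`, `passing`, and `failing` are hypothetical names introduced only for the example.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
)

// checkFunc is a hypothetical stand-in for a registered health check.
type checkFunc func(ctx context.Context) error

// probeHandler answers 200 when every check passes and 503 on the first failure.
func probeHandler(checks ...checkFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, check := range checks {
			if err := check(r.Context()); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusOf issues an in-memory GET against h and reports the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

// Two sample checks used to exercise the handler.
var (
	passing checkFunc = func(context.Context) error { return nil }
	failing checkFunc = func(context.Context) error { return errors.New("dependency down") }
)
```

Calling `statusOf(probeHandler(passing, failing), "/ready")` exercises the failure branch without opening a real socket.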

Configuration with the new API

Pass the following options through metrics.NewOptions(...):

ins, err := dubbo.NewInstance(
  dubbo.WithMetrics(
    metrics.WithProbeEnabled(),
    metrics.WithProbePort(22222),
    metrics.WithProbeLivenessPath("/live"),
    metrics.WithProbeReadinessPath("/ready"),
    metrics.WithProbeStartupPath("/startup"),
    metrics.WithProbeUseInternalState(true),
  ),
)

Configuration with the old API

A probe sub-section is added under the metrics configuration:

metrics:
  probe:
    enabled: true
    port: "22222"
    liveness-path: "/live"
    readiness-path: "/ready"
    startup-path: "/startup"
    use-internal-state: true

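Configuration options:

  • enabled: whether to start the probe server
  • port: HTTP port of the probe server
  • liveness-path: path of the liveness endpoint
  • readiness-path: path of the readiness endpoint
  • startup-path: path of the startup endpoint
  • use-internal-state: whether to include the internal lifecycle state checks; enabled by default

Internal state (UseInternalState)

When use-internal-state: true, the probes add internal state checks:

  • readiness depends on probe.SetReady(true/false)
  • startup depends on probe.SetStartupComplete(true/false)

Default behavior:

  • Once the application has started (Server.Serve() succeeds), ready=true and startup=true are set
  • On graceful shutdown, ready is set to false

When it is set to false, probe results are determined entirely by the callbacks the user registers.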

Custom health checks (recommended)

Register callbacks to extend the probe logic:

import "dubbo.apache.org/dubbo-go/v3/metrics/probe"

// liveness example
probe.RegisterLiveness("db", func(ctx context.Context) error {
    // check the database connection
    return nil
})

// readiness example
probe.RegisterReadiness("cache", func(ctx context.Context) error {
    // check the cache or other middleware dependencies
    return nil
})

// startup example
probe.RegisterStartup("warmup", func(ctx context.Context) error {
    // check whether warm-up has finished
    return nil
})

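Notes

  • liveness risk: a liveness failure triggers a Pod restart, so configure it carefully; it is best limited to process-level or core-dependency checks.
  • readiness: can be tied to the health of the registry, database, cache, downstream dependencies, and so on.
  • startup: recommended for cold-start, warm-up, or dependency-initialization scenarios.

Kubernetes example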

livenessProbe:
  httpGet:
    path: /live
    port: 22222
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 22222
  initialDelaySeconds: 5
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /startup
    port: 22222
  failureThreshold: 30
  periodSeconds: 10

Description

Fixes # (issue)

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

@codecov-commenter

Codecov Report

❌ Patch coverage is 46.42857% with 105 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.93%. Comparing base (60d1c2a) to head (53a5f19).
⚠️ Report is 735 commits behind head on develop.

Files with missing lines    Patch %   Lines
metrics/probe/server.go     0.00%     57 Missing ⚠️
metrics/options.go          0.00%     19 Missing ⚠️
config/metric_config.go     0.00%     8 Missing and 1 partial ⚠️
global/metric_config.go     63.63%    5 Missing and 3 partials ⚠️
compat.go                   81.81%    2 Missing and 2 partials ⚠️
metrics/probe/probe.go      87.09%    2 Missing and 2 partials ⚠️
server/options.go           0.00%     1 Missing and 1 partial ⚠️
server/server.go            0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3213      +/-   ##
===========================================
+ Coverage    46.76%   47.93%   +1.17%     
===========================================
  Files          295      467     +172     
  Lines        17172    33943   +16771     
===========================================
+ Hits          8031    16272    +8241     
- Misses        8287    16355    +8068     
- Partials       854     1316     +462     


@AlexStocks AlexStocks left a comment

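The initialization path needs adjusting: probe Init can currently be triggered from both config/metric_config.go and server/options.go. sync.Once prevents double initialization, but it introduces an implicit priority (whichever path runs first wins), which can easily make configuration behave inconsistently. I suggest converging on a single entry point and having the other site only build parameters, or removing it.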

mu.RLock()
defer mu.RUnlock()
for name, fn := range checks {
if err := fn(ctx); err != nil {

runChecks calls the user-supplied CheckFunc while holding mu.RLock(). If a CheckFunc calls RegisterReadiness/RegisterLiveness/RegisterStartup (which need the write lock), the same goroutine deadlocks, because sync.RWMutex supports neither lock upgrading nor re-entrancy.

Suggestion: only copy the map while holding the lock, then run the checks after releasing it:

mu.RLock()
snapshot := make(map[string]CheckFunc, len(checks))
for k, v := range checks { snapshot[k] = v }
mu.RUnlock()
for name, fn := range snapshot {
    if err := fn(ctx); err != nil {
        return fmt.Errorf("probe %s: %w", name, err)
    }
}

)

func livenessHandler(w http.ResponseWriter, r *http.Request) {
if err := CheckLiveness(r.Context()); err != nil {

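None of the three handlers restricts the HTTP method. Kubernetes probes only send GET, yet requests such as POST /ready or DELETE /startup also trigger the health check logic and return 200/503, which goes against the principle of least privilege and the Kubernetes convention.

Suggestion: add the following at the top of each handler:

if r.Method != http.MethodGet {
    http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
    return
}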

if cfg.StartupPath != "" {
mux.HandleFunc(cfg.StartupPath, startupHandler)
}
srv := &http.Server{Addr: ":" + cfg.Port, Handler: mux}

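http.Server is created without any timeouts (ReadHeaderTimeout/WriteTimeout/IdleTimeout). If a user-registered CheckFunc blocks (for example, waiting on a database connection), the probe response hangs, Kubernetes treats the probe as failed, and the Pod is restarted repeatedly.

Suggestion: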

srv := &http.Server{
    Addr:              ":" + cfg.Port,
    Handler:           mux,
    ReadHeaderTimeout: 5 * time.Second,
    WriteTimeout:      10 * time.Second,
    IdleTimeout:       30 * time.Second,
}

}

var (
startOnce sync.Once

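startOnce is a package-level global that cannot be reset. resetProbeState() in the test file probe_test.go does not reset it, so test cases that exercise Init() cannot be isolated: only the first Init() takes effect, and every later call is skipped by the Once.

Suggestion: encapsulate startOnce, the server instance, and related state in a ProbeServer struct and inject it through an interface; or at least provide a way to reset it in a test helper.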
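
As a rough sketch of that encapsulation, each instance below owns its own sync.Once, so a test can create a fresh instance instead of resetting globals. The `ProbeServer` fields, the `Config` type, and the injected `listen` function are assumptions made for illustration and do not reflect the types in the PR.

```go
package main

import "sync"

// Config is a hypothetical, trimmed-down probe configuration.
type Config struct {
	Port string
}

// ProbeServer owns the state that would otherwise live in package-level variables.
type ProbeServer struct {
	once   sync.Once
	cfg    Config
	err    error
	listen func(addr string) error // injected so tests can avoid real sockets
}

// NewProbeServer builds an instance around the given listen function.
func NewProbeServer(listen func(addr string) error) *ProbeServer {
	return &ProbeServer{listen: listen}
}

// Init applies cfg on the first call only; later calls on the same
// instance are ignored and return the result of the first call.
func (p *ProbeServer) Init(cfg Config) error {
	p.once.Do(func() {
		p.cfg = cfg
		p.err = p.listen(":" + cfg.Port)
	})
	return p.err
}
```

In production the injected function could wrap http.Server.ListenAndServe, while a test passes a fake that records the address.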

metrics.Init(mc.toURL())
}
if mc.Probe != nil && mc.Probe.Enabled != nil && *mc.Probe.Enabled {
probe.Init(&probe.Config{

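This constructs probe.Config directly and calls probe.Init, bypassing BuildProbeConfig in server/options.go. The two paths fill in defaults differently, and when BuildProbeConfig gains new logic in the future, this old path will not pick it up, which easily leads to behavioral differences.

Suggestion: use BuildProbeConfig everywhere to eliminate the duplicated logic.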

Comment on lines +107 to +111
useInternal := probeCfg.UseInternalState == nil || *probeCfg.UseInternalState
port := probeCfg.Port
if port == "" {
port = constant.ProbeDefaultPort
}

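cfg.Port is a string with no validation at all. When "abc" or "99999" is passed in, the ListenAndServe error is only logged and the probe server fails silently; Kubernetes then restarts the Pod repeatedly because the probe can never connect, which is very costly to debug.

Suggestion: validate the port with strconv.Atoi against the range [1, 65535] in BuildProbeConfig or at the Init entry point, and return an error directly for an invalid port.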
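
A minimal sketch of such a check follows; `validatePort` is a hypothetical helper name, and where it would be called from is left open.

```go
package main

import (
	"fmt"
	"strconv"
)

// validatePort parses a port string and checks that it lies in [1, 65535].
func validatePort(port string) (int, error) {
	n, err := strconv.Atoi(port)
	if err != nil {
		return 0, fmt.Errorf("probe port %q is not a number: %w", port, err)
	}
	if n < 1 || n > 65535 {
		return 0, fmt.Errorf("probe port %d is out of range [1, 65535]", n)
	}
	return n, nil
}
```

Returning the error to the caller lets startup fail loudly instead of leaving a probe endpoint that never listens.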

Comment on lines +333 to +334
probe.SetStartupComplete(true)
probe.SetReady(true)

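After probe.SetReady(true) and probe.SetStartupComplete(true) are set, the code immediately enters select{} while the write lock on s.mu is still held (defer s.mu.Unlock() only runs when the function returns, and the function never returns). Any later call to GetServiceOptions(), GetServiceInfo(), or registerServiceOptions() will deadlock.

Suggestion: assign serve = true inside the lock, release the lock immediately, and only then enter the blocking wait.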
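
The shape of that fix can be sketched with a stripped-down server. The `Server` type here is hypothetical and only mirrors the locking pattern; a stop channel replaces the bare select{} so that the example can terminate.

```go
package main

import "sync"

// Server is a hypothetical, minimal stand-in for the real server type.
type Server struct {
	mu     sync.RWMutex
	serve  bool
	stopCh chan struct{}
}

func NewServer() *Server {
	return &Server{stopCh: make(chan struct{})}
}

// Serve flips the flag under the lock, releases the lock explicitly,
// and only then blocks. A deferred Unlock would never run while blocked.
func (s *Server) Serve() {
	s.mu.Lock()
	s.serve = true
	s.mu.Unlock()

	<-s.stopCh // blocking wait happens without holding s.mu
}

// IsServing takes the read lock; it would hang if Serve still held the write lock.
func (s *Server) IsServing() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.serve
}

// Stop unblocks Serve.
func (s *Server) Stop() {
	close(s.stopCh)
}
```

Other methods that take s.mu remain callable while Serve is blocked, which is the property the review comment asks for.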

Development

Successfully merging this pull request may close these issues.

Expose readiness and liveness apis so the process's status can be detected by the scheduling cluster like K8S.

3 participants