fix: readiness probe remove tenant manager call #581
gandalf-at-lerian merged 3 commits into develop
The K8s readiness probe was calling GetDatabaseForTenant("__readiness_probe__")
every 5s, which triggered ~720 WARN-level "tenant not found" logs per hour
since the fake tenant ID doesn't exist in Tenant Manager.
Per Ring multi-tenant standards, health/readiness probes MUST NOT call
Tenant Manager for tenant resolution. Replace with lightweight nil-checks
on tmClient — the circuit breaker on the client is the correct resilience
mechanism for detecting Tenant Manager failures at runtime.
Changes:
- Worker: replace checkTenantResolution() with checkTenantManagerClient()
- Manager: replace GetDatabaseForTenant probe with tmClient nil-check
- Remove unused constants, imports (tmcore, errors)
- Update tests for new signatures, add TestCheckTenantManagerClient
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Walkthrough: This pull request simplifies health check and readiness probe logic across the manager and worker components by removing dependencies on the Tenant Manager's MongoDB manager. The readiness probe in both components is now a lightweight check that only verifies whether the Tenant Manager client is initialized, eliminating direct database queries. Function signatures in the worker's health check builders are updated to drop the tenant MongoDB manager parameter and introduce a new tmClientCheck callback. Documentation files receive build-trigger HTML comments. The lib-commons dependency is bumped to a newer beta version.
🚥 Pre-merge checks: ✅ 2 passed
Consider updating CHANGELOG.md to document this change. If this change doesn't need a changelog entry, add the
🔒 Security Scan Results —
📊 Unit Test Coverage Report:
| Metric | Value |
|---|---|
| Overall Coverage | 87.9% ✅ PASS |
| Threshold | 85% |
Coverage by Package
| Package | Coverage |
|---|---|
| github.com/LerianStudio/reporter/components/manager/internal/adapters/http/in | 88.2% |
| github.com/LerianStudio/reporter/components/manager/internal/services | 89.3% |
Generated by Go PR Analysis workflow
📊 Unit Test Coverage Report:
| Metric | Value |
|---|---|
| Overall Coverage | 90.7% ✅ PASS |
| Threshold | 85% |
Coverage by Package
| Package | Coverage |
|---|---|
| github.com/LerianStudio/reporter/components/worker/internal/services | 92.6% |
Generated by Go PR Analysis workflow
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/manager/internal/bootstrap/init_tenant.go`:
- Around line 68-82: readinessCheck currently only verifies tmClient != nil
which doesn't guarantee Tenant Manager is reachable; update readinessCheck to
perform a lightweight real health call against the Tenant Manager (instead of
just the pointer check) — use a short timeout context and call a minimal
health/ping method on tmClient (e.g., tmClient.Health(ctx) or
tmClient.Ping(ctx)) or issue a GET to the Tenant Manager health endpoint with
the configured API key if no helper exists; if the call returns an error or
non-2xx status, return that error so /ready reflects actual Tenant Manager
availability (this affects readinessCheck, tmClient, and upstream behavior in
tenantMid.WithTenantDB).
In `@components/worker/internal/bootstrap/config_runtime.go`:
- Around line 256-269: The current checkTenantManagerClient function only checks
tmClient != nil which falsely reports readiness; modify checkTenantManagerClient
to perform a lightweight health/heartbeat call on the tmClient (e.g., a
ping/health/status method on tmclient.Client or a minimal GET to the
tenant-manager health endpoint using a short context timeout) instead of a
pointer check, treat non-2xx / error / timeout as not_ready and include the
error/message in the dependencyStatus, and keep the existing nil-case handling
as a fallback; update return values from checkTenantManagerClient to reflect the
actual health response and ensure the call is non-blocking and bounded by a
short context timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f604b1ef-083f-41ee-82d6-cd214daf4f89
⛔ Files ignored due to path filters (1)
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (6)
- `components/manager/README.md`
- `components/manager/internal/bootstrap/init_tenant.go`
- `components/worker/README.md`
- `components/worker/internal/bootstrap/config_runtime.go`
- `components/worker/internal/bootstrap/config_runtime_test.go`
- `go.mod`
gandalf-at-lerian left a comment:
Reviewed — removes readiness probe log spam (GetDatabaseForTenant with fake tenant ID every 5s), replaces with lightweight nil check. Bump to lib-commons v4.1.0-beta.6. Clean refactor with updated tests. LGTM ✅
Additional Notes
Problem
The K8s readiness probe was calling `GetDatabaseForTenant("__readiness_probe__")` every 5s in multi-tenant mode. Since `__readiness_probe__` is not a real tenant, the Tenant Manager responded with 404 and `lib-commons` logged a WARN-level "tenant not found" message on every probe: ~720 warnings per hour per pod.
Solution
Replace the Tenant Manager HTTP call in readiness probes with a lightweight nil-check on `tmClient`. The circuit breaker on the client, not the readiness probe, is the correct resilience mechanism for detecting Tenant Manager failures at runtime.
Per Ring multi-tenant standards, health/readiness probes MUST NOT call Tenant Manager for tenant resolution.
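On the worker side, the walkthrough mentions a `tmClientCheck` callback, and the review mentions a `dependencyStatus` result with a `not_ready` state. Here is a minimal sketch of how those pieces could fit together. The struct fields (`Status`, `Message`), the exact `checkTenantManagerClient` signature, and the `buildReadiness` aggregator are all hypothetical and are not taken from the worker code.

```go
package main

import "fmt"

// tenantManagerClient is a hypothetical stand-in for the lib-commons client type.
type tenantManagerClient struct{}

// dependencyStatus is an assumed shape for a per-dependency readiness entry.
type dependencyStatus struct {
	Status  string
	Message string
}

// checkTenantManagerClient does no I/O: it only verifies that the client
// was initialized at startup.
func checkTenantManagerClient(tmClient *tenantManagerClient) (dependencyStatus, bool) {
	if tmClient == nil {
		return dependencyStatus{Status: "not_ready", Message: "tenant manager client not initialized"}, false
	}
	return dependencyStatus{Status: "ok"}, true
}

// buildReadiness runs each named check callback and reports overall readiness.
// Passing the tenant-manager check in as a callback keeps the builder free of
// tenant-manager types, in the spirit of the tmClientCheck parameter.
func buildReadiness(checks map[string]func() (dependencyStatus, bool)) (map[string]dependencyStatus, bool) {
	results := make(map[string]dependencyStatus, len(checks))
	ready := true
	for name, check := range checks {
		st, ok := check()
		results[name] = st
		if !ok {
			ready = false
		}
	}
	return results, ready
}

func main() {
	var tmClient *tenantManagerClient // nil: simulates missing startup wiring
	tmClientCheck := func() (dependencyStatus, bool) { return checkTenantManagerClient(tmClient) }

	results, ready := buildReadiness(map[string]func() (dependencyStatus, bool){
		"tenant_manager": tmClientCheck,
	})
	fmt.Println(ready, results["tenant_manager"].Status) // false not_ready
}
```

Because the callback closes over the client, tests can exercise the builder with a stub check instead of a real tenant-manager client.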
Changes
- Worker (`config_runtime.go`): Replace `checkTenantResolution()` with `checkTenantManagerClient()` (nil-check, no I/O). Remove the unused `workerTenantReadinessProbeID` constant and the `tmcore`/`errors` imports.
- Manager (`init_tenant.go`): Replace the `GetDatabaseForTenant` probe with a `tmClient` nil-check. Remove the unused `tenantReadinessProbeID` constant and the `tmcore`/`errors` imports.
- Tests (`config_runtime_test.go`): Update for the new signatures and add `TestCheckTenantManagerClient`.
- `go.mod`: Bump `lib-commons` v4.1.0-beta.5 → v4.1.0-beta.6.
Follow-up tasks
- lib-commons TASK-001: Add a `Ping`/`Health` method to the tenant-manager client (allows future enhancement of this check without triggering tenant resolution warnings).
- reporter TASK-012: Evaluate whether external datasources should be per-tenant in multi-tenant mode.