Skip to content

fix: readiness probe remove tenant manager call#581

Merged
gandalf-at-lerian merged 3 commits intodevelopfrom
fix/readiness-probe-remove-tenant-manager-call
Mar 17, 2026
Merged

fix: readiness probe remove tenant manager call#581
gandalf-at-lerian merged 3 commits intodevelopfrom
fix/readiness-probe-remove-tenant-manager-call

Conversation

@brunobls
Copy link
Member

@brunobls brunobls commented Mar 17, 2026

Pull Request Checklist

Pull Request Type

  • Manager
  • Worker
  • Frontend
  • Infrastructure
  • Packages
  • Pipeline
  • Tests
  • Documentation

Checklist

  • I have tested these changes locally.
  • I have updated the documentation accordingly.
  • I have added necessary comments to the code, especially in complex areas.
  • I have ensured that my changes adhere to the project's coding standards.
  • I have checked for any potential security issues.
  • I have ensured that all tests pass.
  • I have updated the version appropriately (if applicable).
  • I have confirmed this code is ready for review.

Additional Notes

Problem

The K8s readiness probe was calling GetDatabaseForTenant("__readiness_probe__") every 5s in multi-tenant mode. Since __readiness_probe__ is not a real tenant, the Tenant Manager responded with 404 and lib-commons logged a WARN-level "tenant not found" message on every probe — ~720 warnings per hour per pod.

Solution

Replace the Tenant Manager HTTP call in readiness probes with a lightweight nil-check on tmClient. The circuit breaker on the client is the correct resilience mechanism for detecting Tenant Manager failures at runtime, not the readiness probe.

Per Ring multi-tenant standards, health/readiness probes MUST NOT call Tenant Manager for tenant resolution.

Changes

  • Worker (config_runtime.go): Replace checkTenantResolution() with checkTenantManagerClient() (nil-check, no I/O). Remove unused workerTenantReadinessProbeID constant and tmcore/errors imports.
  • Manager (init_tenant.go): Replace GetDatabaseForTenant probe with tmClient nil-check. Remove unused tenantReadinessProbeID constant and tmcore/errors imports.
  • Tests (config_runtime_test.go): Update for new signatures, add TestCheckTenantManagerClient.
  • Deps: Bump lib-commons v4.1.0-beta.5 → v4.1.0-beta.6.

Follow-up tasks

  • lib-commons TASK-001: Add Ping/Health method to tenant-manager client (allows future enhancement of this check without triggering tenant resolution warnings).
  • reporter TASK-012: Evaluate whether external datasources should be per-tenant in multi-tenant mode.

brunobls and others added 3 commits March 17, 2026 14:15
The K8s readiness probe was calling GetDatabaseForTenant("__readiness_probe__")
every 5s, which triggered ~720 WARN-level "tenant not found" logs per hour
since the fake tenant ID doesn't exist in Tenant Manager.

Per Ring multi-tenant standards, health/readiness probes MUST NOT call
Tenant Manager for tenant resolution. Replace with lightweight nil-checks
on tmClient — the circuit breaker on the client is the correct resilience
mechanism for detecting Tenant Manager failures at runtime.

Changes:
- Worker: replace checkTenantResolution() with checkTenantManagerClient()
- Manager: replace GetDatabaseForTenant probe with tmClient nil-check
- Remove unused constants, imports (tmcore, errors)
- Update tests for new signatures, add TestCheckTenantManagerClient

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@brunobls brunobls requested a review from a team as a code owner March 17, 2026 17:18
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Walkthrough

This pull request simplifies health check and readiness probe logic across the manager and worker components by removing dependencies on the Tenant Manager's MongoDB manager. The readiness probe in both components is now a lightweight check that verifies only whether the Tenant Manager client is initialized, eliminating direct database queries. Function signatures in the worker's health check builders are updated to reflect the removal of the tenant MongoDB manager parameter and introduce a new tmClientCheck callback. Documentation files receive build-trigger HTML comments. A transitive dependency (lib-commons) is bumped to a newer beta version.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: readiness probe remove tenant manager call' clearly and concisely summarizes the main change: removing Tenant Manager calls from readiness probes.
Description check ✅ Passed The pull request description follows the required template structure with all major sections completed, including pull request type, comprehensive checklist items marked as done, and detailed additional notes explaining the problem, solution, and changes made.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@lerian-studio
Copy link
Contributor

Consider updating CHANGELOG.md to document this change. If this change doesn't need a changelog entry, add the skip-changelog label.

@github-actions
Copy link

🔒 Security Scan Results — manager

Filesystem Scan

✅ No vulnerabilities or secrets found.

Docker Image Scan

✅ No vulnerabilities found.

All security checks passed.

@lerian-studio
Copy link
Contributor

📊 Unit Test Coverage Report: reporter-manager

Metric Value
Overall Coverage 87.9% ✅ PASS
Threshold 85%

Coverage by Package

Package Coverage
github.com/LerianStudio/reporter/components/manager/internal/adapters/http/in 88.2%
github.com/LerianStudio/reporter/components/manager/internal/services 89.3%

Generated by Go PR Analysis workflow

@lerian-studio
Copy link
Contributor

📊 Unit Test Coverage Report: reporter-worker

Metric Value
Overall Coverage 90.7% ✅ PASS
Threshold 85%

Coverage by Package

Package Coverage
github.com/LerianStudio/reporter/components/worker/internal/services 92.6%

Generated by Go PR Analysis workflow

@github-actions
Copy link

🔒 Security Scan Results — worker

Filesystem Scan

✅ No vulnerabilities or secrets found.

Docker Image Scan

✅ No vulnerabilities found.

All security checks passed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/manager/internal/bootstrap/init_tenant.go`:
- Around line 68-82: readinessCheck currently only verifies tmClient != nil
which doesn't guarantee Tenant Manager is reachable; update readinessCheck to
perform a lightweight real health call against the Tenant Manager (instead of
just the pointer check) — use a short timeout context and call a minimal
health/ping method on tmClient (e.g., tmClient.Health(ctx) or
tmClient.Ping(ctx)) or issue a GET to the Tenant Manager health endpoint with
the configured API key if no helper exists; if the call returns an error or
non-2xx status, return that error so /ready reflects actual Tenant Manager
availability (this affects readinessCheck, tmClient, and upstream behavior in
tenantMid.WithTenantDB).

In `@components/worker/internal/bootstrap/config_runtime.go`:
- Around line 256-269: The current checkTenantManagerClient function only checks
tmClient != nil which falsely reports readiness; modify checkTenantManagerClient
to perform a lightweight health/heartbeat call on the tmClient (e.g., a
ping/health/status method on tmclient.Client or a minimal GET to the
tenant-manager health endpoint using a short context timeout) instead of a
pointer check, treat non-2xx / error / timeout as not_ready and include the
error/message in the dependencyStatus, and keep the existing nil-case handling
as a fallback; update return values from checkTenantManagerClient to reflect the
actual health response and ensure the call is non-blocking and bounded by a
short context timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f604b1ef-083f-41ee-82d6-cd214daf4f89

📥 Commits

Reviewing files that changed from the base of the PR and between 99b6411 and 8b32db4.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • components/manager/README.md
  • components/manager/internal/bootstrap/init_tenant.go
  • components/worker/README.md
  • components/worker/internal/bootstrap/config_runtime.go
  • components/worker/internal/bootstrap/config_runtime_test.go
  • go.mod

Copy link
Contributor

@gandalf-at-lerian gandalf-at-lerian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed — removes readiness probe log spam (GetDatabaseForTenant with fake tenant ID every 5s), replaces with lightweight nil check. Bump to lib-commons v4.1.0-beta.6. Clean refactor with updated tests. LGTM ✅

@gandalf-at-lerian gandalf-at-lerian merged commit b594f9e into develop Mar 17, 2026
30 checks passed
@gandalf-at-lerian gandalf-at-lerian deleted the fix/readiness-probe-remove-tenant-manager-call branch March 17, 2026 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants