Skip to content

🎨 Services using RabbitMQ and Redis have their health checks tied to the respective clients#8977

Draft
GitHK wants to merge 25 commits intoITISFoundation:masterfrom
GitHK:pr-osparc-restart-on-failing-healthchecks
Draft

🎨 Services using RabbitMQ and Redis have their health checks tied to the respective clients#8977
GitHK wants to merge 25 commits intoITISFoundation:masterfrom
GitHK:pr-osparc-restart-on-failing-healthchecks

Conversation

@GitHK
Copy link
Copy Markdown
Contributor

@GitHK GitHK commented Mar 27, 2026

@GitHK GitHK self-assigned this Mar 27, 2026
@GitHK GitHK added this to the Cassata milestone Mar 27, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 88.07339% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.08%. Comparing base (ff5ffe4) to head (4cbd199).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8977      +/-   ##
==========================================
- Coverage   87.48%   85.08%   -2.40%     
==========================================
  Files        2068     1186     -882     
  Lines       81581    50420   -31161     
  Branches     1434      176    -1258     
==========================================
- Hits        71368    42900   -28468     
+ Misses       9809     7464    -2345     
+ Partials      404       56     -348     
Flag Coverage Δ
integrationtests 64.31% <64.70%> (+0.26%) ⬆️
unittests 84.99% <90.00%> (-1.39%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.79% <ø> (ø)
agent 92.85% <100.00%> (ø)
api_server ∅ <ø> (∅)
autoscaling 95.48% <71.42%> (-0.07%) ⬇️
catalog 92.10% <ø> (ø)
clusters_keeper 98.59% <66.66%> (-0.12%) ⬇️
dask_sidecar ∅ <ø> (∅)
datcore_adapter 97.95% <ø> (ø)
director 75.62% <ø> (+0.08%) ⬆️
director_v2 78.52% <66.66%> (-13.14%) ⬇️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 87.82% <ø> (-0.47%) ⬇️
efs_guardian 89.73% <80.00%> (-0.13%) ⬇️
invitations 90.93% <ø> (ø)
payments 92.87% <100.00%> (+<0.01%) ⬆️
resource_usage_tracker 91.88% <87.50%> (+0.07%) ⬆️
storage 86.77% <96.29%> (+0.07%) ⬆️
webclient ∅ <ø> (∅)
webserver 82.77% <62.50%> (-4.05%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff5ffe4...4cbd199. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns multiple services’ /health endpoints (and the webserver healthcheck signal) with the actual RabbitMQ/Redis client connectivity state, so orchestration can detect backend outages more reliably.

Changes:

  • Add Redis/RabbitMQ client health assertions to various service health endpoints and webserver healthcheck hooks.
  • Consolidate Redis usage in storage around RedisClientsManager and wire Celery/task-store + DSM cleaner to specific Redis DB clients.
  • Extend RedisClientsManager with a .healthy aggregate and propagate Redis client health-check interval into the underlying redis-py client.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
services/web/server/src/simcore_service_webserver/redis.py Registers Redis manager health with aiohttp healthcheck signal.
services/storage/src/simcore_service_storage/modules/redis.py Switches storage Redis wiring to RedisClientsManager for multiple DBs.
services/storage/src/simcore_service_storage/modules/celery/init.py Uses Redis client from the shared manager for Celery task store.
services/storage/src/simcore_service_storage/dsm_cleaner.py Uses the Redis LOCKS client from the manager for exclusive periodic locking.
services/storage/src/simcore_service_storage/core/settings.py Cleans up settings formatting and removes unused Rabbit settings import/field.
services/storage/src/simcore_service_storage/core/application.py Ensures Redis is set up before Celery so dependencies are available.
services/storage/src/simcore_service_storage/api/rest/dependencies/redis.py Adds FastAPI dependency to retrieve the Redis clients manager.
services/storage/src/simcore_service_storage/api/rest/_health.py Adds Redis manager health check to storage health endpoint.
services/resource-usage-tracker/src/simcore_service_resource_usage_tracker/services/modules/redis.py Adds request-based dependency accessor for Redis lock client.
services/resource-usage-tracker/src/simcore_service_resource_usage_tracker/api/rest/_health.py Enforces Redis client health in resource-usage-tracker health endpoint.
services/payments/tests/unit/api/test_rest_meta.py Refactors tests to cover both RabbitMQ and RPC client health combinations.
services/payments/src/simcore_service_payments/services/rabbitmq.py Adds request-based dependency for RPC client and tightens typing.
services/payments/src/simcore_service_payments/api/rest/_health.py Makes payments health depend on both RabbitMQ and RPC client health.
services/notifications/tests/unit/test_api_rest__health.py Reworks health tests to override dependencies and validate Redis/Rabbit/Postgres checks.
services/notifications/src/simcore_service_notifications/core/events.py Adds Redis lifespan to the app lifespan chain.
services/notifications/src/simcore_service_notifications/clients/redis.py Introduces a dedicated Redis lifespan/client getter for celery tasks backend.
services/notifications/src/simcore_service_notifications/clients/celery.py Reuses Redis client from the new Redis lifespan rather than creating its own.
services/notifications/src/simcore_service_notifications/api/rest/dependencies.py Adds dependency to inject Redis client into request handlers.
services/notifications/src/simcore_service_notifications/api/rest/_health.py Adds Redis health check to notifications health endpoint.
services/efs-guardian/tests/unit/test_api_health.py Updates health mocks to include Redis lock client.
services/efs-guardian/src/simcore_service_efs_guardian/api/rest/health.py Adds Redis lock client health check alongside RabbitMQ checks.
services/director-v2/src/simcore_service_director_v2/api/routes/health.py Adds Redis manager health check to director-v2 health route.
services/director-v2/src/simcore_service_director_v2/api/dependencies/rabbitmq.py Adds request-based dependency to retrieve Redis clients manager.
services/clusters-keeper/src/simcore_service_clusters_keeper/api/health.py Adds Redis client health check to clusters-keeper health endpoint.
services/autoscaling/src/simcore_service_autoscaling/api/health.py Adds RabbitMQ + Redis client health checks to autoscaling health endpoint.
services/agent/src/simcore_service_agent/api/rest/_health.py Switches agent health check to use the RPC RabbitMQ client dependency.
services/agent/src/simcore_service_agent/api/rest/_dependencies.py Renames dependency to get_rabbitmq_rpc_client to reflect the actual client.
packages/service-library/src/servicelib/redis/_clients_manager.py Adds aggregate .healthy property and improves context manager typing.
packages/service-library/src/servicelib/redis/_client.py Passes configured health-check interval into redis-py client creation.
packages/celery-library/src/celery_library/app.py Adjusts Celery broker retry/heartbeat settings and Redis SSL backend config.

Comment on lines 33 to 35
if celery_settings.CELERY_REDIS_RESULT_BACKEND.REDIS_SECURE:
base_config["redis_backend_use_ssl"] = {"ssl_cert_reqs": ssl.CERT_NONE}
base_config["redis_backend_use_ssl"] = {"ssl_cert_reqs": TypeAdapter(bool).validate_python(ssl.CERT_NONE)}
return base_config
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ssl.CERT_NONE is an int constant (0), but this code validates it as bool, which converts it to False. Even though False == 0 in Python, this loses the intent/semantics of the TLS setting and can be misinterpreted by Celery/kombu config parsing. Prefer passing ssl.CERT_NONE directly (or explicitly casting to int if needed) instead of TypeAdapter(bool) here.

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +102
@pytest.mark.parametrize(
"rabbit_healthy, postgres_healthy, redis_healthy, expecteded_msg",
[
(True, True, False, REDIS_CLIENT_UNHEALTHY_MSG),
(True, False, True, POSRGRES_DATABASE_UNHEALTHY_MSG),
(False, True, True, RABBITMQ_CLIENT_UNHEALTHY_MSG),
],
)
def test_unhealthy_services(mock_services_health: None, test_client: TestClient, expecteded_msg: str):
with pytest.raises(HealthCheckError) as exc:
test_client.get("/")
assert RABBITMQ_CLIENT_UNHEALTHY_MSG in f"{exc.value}"
assert expecteded_msg in f"{exc.value}"
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in parameter name expecteded_msg (double 'ed') reduces readability and makes the test API harder to understand. Rename it consistently in both the parametrization and the test function signature/usages.

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +32
def get_redis_client_from_request(app: Annotated[FastAPI, Depends(get_application)]) -> RedisClientSDK:
return get_redis_client(app)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_redis_client_from_request does not accept a Request (it takes an app dependency instead), so the name is misleading. Consider renaming to something like get_redis_client / get_redis_client_from_app (and update the callers) to match the actual signature and reduce confusion with other dependency helpers that truly take a request.

Suggested change
def get_redis_client_from_request(app: Annotated[FastAPI, Depends(get_application)]) -> RedisClientSDK:
return get_redis_client(app)
def get_redis_client_from_app(
app: Annotated[FastAPI, Depends(get_application)],
) -> RedisClientSDK:
return get_redis_client(app)
def get_redis_client_from_request(request: Request) -> RedisClientSDK:
app = get_application(request)
return get_redis_client_from_app(app)

Copilot uses AI. Check for mistakes.
Comment on lines 19 to +22
async def check_service_health(
rabbitmq_client: Annotated[RabbitMQClient, Depends(get_rabbitmq_client)],
rabbitmq_rpc_client: Annotated[RabbitMQClient, Depends(get_rabbitmq_rpc_client)],
):
if not rabbitmq_client.healthy:
if not rabbitmq_rpc_client.healthy:
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependency get_rabbitmq_rpc_client returns a RabbitMQRPCClient, but the endpoint parameter is annotated as RabbitMQClient. This type mismatch can confuse readers and static type checkers; update the annotation/import to RabbitMQRPCClient (or to the common base type if that's the intent).

Copilot uses AI. Check for mistakes.
Comment on lines 32 to 42
@router.get("/", include_in_schema=True, response_class=PlainTextResponse)
async def health_check():
async def health_check(app: Annotated[FastAPI, Depends(get_app)]):
# NOTE: sync url in docker/healthcheck.py with this entrypoint!

if not get_rabbitmq_client(app).healthy:
raise HealthCheckError(RABBITMQ_CLIENT_UNHEALTHY_MSG)

if not get_redis_client(app).is_healthy:
raise HealthCheckError(REDIS_CLIENT_UNHEALTHY_MSG)

return f"{__name__}.health_check@{datetime.datetime.now(datetime.UTC).isoformat()}"
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handler signature no longer declares a return type. Since this module already uses return type annotations elsewhere (e.g. /status), consider adding -> str here for consistency and better static checking.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AWS Redis: Connection Reset by peer 104

2 participants