
[Locket] Add configurable Locket DB Health Check #1105

@rositsa-popova

Proposed Change

As a Platform Operator,
I want a configurable Locket DB health check,
So that I can proactively detect connectivity or availability issues and react before they impact platform stability.

Problem Details

Currently the Locket service has no mechanism to actively verify its database connectivity at runtime.

During past operational incidents we observed that Locket could enter a degraded or silently broken state when its database became unresponsive or unreachable. Because no internal health verification exists, the failure could only be detected indirectly, through platform symptoms or external monitoring. By that point, platform behavior had already been impacted and recovery required manual intervention.

If Locket had been able to detect its own loss of database connectivity, the process could have been restarted automatically by BOSH, significantly reducing impact and recovery time.

The same operational gap was recently addressed for the BBS component by introducing a DB health check. This mechanism has proven valuable and could be extended to Locket to provide consistent resilience across Diego control plane components.

Solution Proposal

Implement a Locket DB health check that follows the model introduced for the BBS DB health check, adapted to the Locket database connection.
The following merged PRs serve as the reference implementation:

The Locket equivalent should expose analogous BOSH properties (e.g. diego.locket.enable_db_health_check, along with timeout, interval, and failure threshold settings) and implement the same internal runner pattern within the Locket process:

  • Periodically verify DB connectivity using a simple write/read operation
  • Exit the process after a configurable number of consecutive failures
  • Allow BOSH to restart the process for recovery
  • Be disabled by default

Acceptance criteria

Scenario: Health check detects a healthy database
Given the Locket DB health check is enabled via configuration
When Locket successfully performs a DB insert and retrieve within the configured timeout
Then Locket continues operating normally
And no restart is triggered

Scenario: Health check detects consecutive database failures and triggers a restart
Given the Locket DB health check is enabled
And configured with a failure threshold of N consecutive failures
When Locket fails to complete a DB insert and retrieve within the configured timeout for N consecutive attempts
Then the Locket process exits so that BOSH can restart it and Locket can re-establish its database connection

Scenario: Health check is disabled by default
Given a Locket deployment with no explicit health check configuration
When the Locket process starts
Then the DB health check is not active
And Locket behaves as it did prior to this feature

Scenario: Operator can configure health check parameters
Given the Locket DB health check is enabled
When an operator sets custom values for the check interval, per-check timeout, and consecutive failure threshold
Then the health check runs using those values
And database connectivity is evaluated according to the operator-specified parameters

Related links
