@discostur discostur commented Nov 21, 2025

Update ClickHouse Keeper Prometheus Alert Rules

fix #1328

Summary

This PR updates the ClickHouse Keeper Prometheus alert rules to use modern metrics exposed by recent ClickHouse Keeper versions (23.x+) and adds several new critical alerts for better observability.

Problem

The existing alert rules used ZooKeeper-compatible metrics (zk_*) that modern ClickHouse Keeper versions no longer expose. An expression over a metric that does not exist returns no series, so these alerts never fired and failed silently, leaving the cluster without proper monitoring.
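To illustrate the silent-failure mode described above, here is a minimal sketch of a guard rule built on PromQL's `absent()` function. It is not part of this PR: the group name, alert name, `for` duration, and severity label are all hypothetical, and the metric is just one of the names listed later in this description.

```yaml
groups:
  - name: keeper-metric-presence          # hypothetical group name
    rules:
      - alert: ClickHouseKeeperMetricsMissing   # hypothetical alert, not in this PR
        # absent() returns 1 when no series with this name exists. That is the
        # situation in which a threshold rule on the same metric never fires.
        expr: absent(ClickHouseAsyncMetrics_KeeperMaxLatency)
        for: 10m                          # assumed duration
        labels:
          severity: warning               # assumed severity
        annotations:
          summary: "Keeper latency metric is missing, so rules that depend on it cannot fire"
```

A guard like this would have surfaced the obsolete zk_* metrics as soon as they stopped being exported.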

Fixes #1328

Changes

Updated Existing Alerts (5)

  1. ClickHouseKeeperDown

    • Removed obsolete zk_ruok metric
    • Now uses only up{} metric for availability monitoring
  2. ClickHouseKeeperHighLatency

    • Replaced zk_max_latency with ClickHouseAsyncMetrics_KeeperMaxLatency
    • Updated unit from "ticks" to "ms" for clarity
  3. ClickHouseKeeperOutstandingRequests

    • Replaced zk_outstanding_requests with ClickHouseMetrics_KeeperOutstandingRequests
  4. ClickHouseKeeperHighEphemeralNodes

    • Replaced zk_ephemerals_count with ClickHouseAsyncMetrics_KeeperEphemeralsCount
    • Updated documentation link to ClickHouse Keeper docs
  5. ClickHouseKeeperHighFileDescriptors

    • Removed (old metric zk_open_file_descriptor_count not available)
    • Replaced with new percentage-based alert (see below)
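As a rough sketch of what two of the updated rules could look like with the new metric names, assuming a standard Prometheus rule file: the alert names, metric names, and severities come from this description, while the group name, thresholds, and `for` durations are placeholders and not the values merged in this PR.

```yaml
groups:
  - name: ClickHouseKeeperRules           # hypothetical group name
    rules:
      - alert: ClickHouseKeeperHighLatency
        # Threshold is an assumed placeholder; the metric is reported in ms.
        expr: ClickHouseAsyncMetrics_KeeperMaxLatency > 500
        for: 10m                          # assumed duration
        labels:
          severity: warning
        annotations:
          description: "Max latency is {{ $value }} ms on {{ $labels.instance }}"

      - alert: ClickHouseKeeperOutstandingRequests
        # Threshold is an assumed placeholder.
        expr: ClickHouseMetrics_KeeperOutstandingRequests > 10
        for: 10m                          # assumed duration
        labels:
          severity: high
        annotations:
          description: "{{ $value }} outstanding requests on {{ $labels.instance }}"
```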

New Alerts Added (5)

  1. ClickHouseKeeperCommitsFailed (Critical)

    • Detects failed Raft commits indicating consensus issues
    • Uses: ClickHouseProfileEvents_KeeperCommitsFailed
  2. ClickHouseKeeperSnapshotCreationsFailed (High)

    • Monitors failed snapshot creations that could lead to log accumulation
    • Uses: ClickHouseProfileEvents_KeeperSnapshotCreationsFailed
  3. ClickHouseKeeperLostQuorum (Critical)

    • Alerts when leader loses quorum (cluster cannot commit operations)
    • Uses: ClickHouseAsyncMetrics_KeeperSyncedFollowers + KeeperIsLeader
  4. ClickHouseKeeperMemorySoftLimitExceeded (Warning)

    • Monitors memory soft limit violations
    • Uses: ClickHouseAsyncMetrics_KeeperIsExceedingMemorySoftLimitHit
  5. ClickHouseKeeperHighFileDescriptorUsage (Warning)

    • Alerts when file descriptor usage exceeds 80% of available FDs
    • Uses: ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount / KeeperMaxFileDescriptorCount
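The new alerts listed above could be expressed roughly as follows. This is a sketch under stated assumptions, not the rule text from the PR: the metric names, severities, and the 80% threshold come from this description, while the group name, rate windows, `for` durations, and the quorum condition (which assumes a three-node ensemble where the leader needs at least one synced follower) are illustrative.

```yaml
groups:
  - name: ClickHouseKeeperNewRules        # hypothetical group name
    rules:
      - alert: ClickHouseKeeperCommitsFailed
        # ProfileEvents are counters, so alert on growth over a window.
        expr: increase(ClickHouseProfileEvents_KeeperCommitsFailed[5m]) > 0   # assumed window
        labels:
          severity: critical
        annotations:
          description: "Failed Raft commits on {{ $labels.instance }}"

      - alert: ClickHouseKeeperLostQuorum
        # Assumes a 3-node ensemble: quorum is the leader plus one follower.
        # "and" keeps only instances for which both conditions hold.
        expr: |
          ClickHouseAsyncMetrics_KeeperIsLeader == 1
          and
          ClickHouseAsyncMetrics_KeeperSyncedFollowers < 1
        for: 1m                           # assumed duration
        labels:
          severity: critical
        annotations:
          description: "Leader {{ $labels.instance }} has too few synced followers"

      - alert: ClickHouseKeeperHighFileDescriptorUsage
        # 0.8 corresponds to the 80% threshold stated in this description.
        expr: |
          ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount
            / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount > 0.8
        for: 10m                          # assumed duration
        labels:
          severity: warning
        annotations:
          description: "FD usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
```

For ensembles of a different size, the follower threshold in the quorum rule would need to be adjusted to match the majority requirement.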

Alert Severity Distribution

  • Critical (3): Down, CommitsFailed, LostQuorum
  • High (2): OutstandingRequests, SnapshotCreationsFailed
  • Warning (4): HighLatency, HighEphemeralNodes, MemorySoftLimitExceeded, HighFileDescriptorUsage

Regards,
Kilian

@Slach Slach left a comment

Looks nice, thanks for the contribution. Let's wait for the CI/CD checks.

@Slach Slach changed the base branch from master to 0.25.6 November 21, 2025 10:36
@sunsingerus sunsingerus merged commit 2521922 into Altinity:0.25.6 Dec 3, 2025
1 of 2 checks passed
@discostur discostur deleted the improve-prometheus-rules branch December 3, 2025 12:01
Development

Successfully merging this pull request may close these issues.

need to update clickhouse-keeper alert rules for actual prometheus metircs