
Proposal: Embedded TSDB #4813

@siavashs

Description


Proposal

Background

Alertmanager has multiple internal stores:

  • Alerts stores
  • Alerts provider
  • Markers
  • Silences
  • ...

There are multiple API endpoints which expose the above (raw or mutated):

  • /alerts
  • /alerts/groups
  • /silences
  • /metrics
  • ...

The above APIs have the following issues:

  • they are process-intensive
  • they cause contention, blocking Alertmanager from doing its job
  • they only show the current state (no history)
    • in the case of /alerts/groups we even predict which alerts will be suppressed by which silences in the future!

Proposal

Prometheus already provides the ALERTS time series, which carries both current and historical information about the status of alerts.
My proposal is to embed a TSDB into Alertmanager so that it can also expose an ALERTS time series through a Prometheus-compatible API.
Alertmanager would add at least one extra label value for alertstate, such as suppressed or muted, plus extra labels like muted_why and muted_by, etc.
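For illustration, a suppressed alert could surface as a series like the one below. The muted_by and muted_why label names (and their example values) are part of this proposal, not existing labels:

```
ALERTS{alertname="HighRequestLatency", alertstate="suppressed",
       muted_by="silence-1b2c", muted_why="maintenance-window"} 1
```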

This has multiple benefits:

  • it provides both the current and historic status of all alerts
    • it removes the need to hammer the Alertmanager API to capture and record such data
    • the ALERTS time series from Prometheus and Alertmanager can be joined in a PromQL query for analysis
  • the TSDB comes with PromQL, an interface already familiar to Prometheus users
  • this can be extended to more things like silences, notifications, etc.
  • each cluster member can expose its own ALERTS time series including a member label, which would allow detecting inconsistencies across the cluster
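As a sketch of the join mentioned above, assuming both Prometheus' and Alertmanager's ALERTS series are queryable from one backend (e.g. via Thanos) and that Alertmanager's series are distinguishable by an external label such as source="alertmanager" (illustrative, not an existing label):

```promql
# Alerts currently firing in Prometheus that Alertmanager reports as suppressed
ALERTS{alertstate="firing", source!="alertmanager"}
  and on (alertname, instance)
ALERTS{alertstate="suppressed", source="alertmanager"}
```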

There is an existing example of TSDB embedding in Thanos Ruler, which exposes similar ALERTS time series data through its embedded TSDB.

We could optionally:

  • support remote_write
  • persist the TSDB to disk, which would allow Thanos Sidecar to upload this data to long-term storage
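A hypothetical configuration sketch for these options; none of these fields exist in Alertmanager today, and all names and values are illustrative:

```yaml
# Hypothetical alertmanager.yml fragment (field names are not real)
tsdb:
  path: /alertmanager/data/tsdb
  retention: 15d
remote_write:
  - url: http://thanos-receive.example:19291/api/v1/receive
```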

Risks

The only risk I see so far is high cardinality of the ALERTS time series, especially in setups with hundreds or thousands of Prometheus instances.
Alertmanager can be protected from high cardinality by applying limits per alertname (similar to how Prometheus applies per-scrape-job limits, etc.)
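A minimal sketch of such a per-alertname limit, assuming the TSDB appender consults a limiter before accepting a new series. The type, names, and limit value here are illustrative, not Alertmanager code:

```go
package main

import "fmt"

// seriesLimiter caps the number of distinct series tracked per alertname,
// analogous to Prometheus' per-scrape sample limits.
type seriesLimiter struct {
	limit  int
	series map[string]map[string]struct{} // alertname -> set of series fingerprints
}

func newSeriesLimiter(limit int) *seriesLimiter {
	return &seriesLimiter{limit: limit, series: make(map[string]map[string]struct{})}
}

// admit reports whether a series, identified by its label fingerprint, may be
// appended for the given alertname without exceeding the per-alertname limit.
func (l *seriesLimiter) admit(alertname, fingerprint string) bool {
	set, ok := l.series[alertname]
	if !ok {
		set = make(map[string]struct{})
		l.series[alertname] = set
	}
	if _, seen := set[fingerprint]; seen {
		return true // already tracked: adds no new cardinality
	}
	if len(set) >= l.limit {
		return false // would exceed the per-alertname series limit
	}
	set[fingerprint] = struct{}{}
	return true
}

func main() {
	l := newSeriesLimiter(2)
	fmt.Println(l.admit("HighLatency", `{instance="a"}`)) // true
	fmt.Println(l.admit("HighLatency", `{instance="b"}`)) // true
	fmt.Println(l.admit("HighLatency", `{instance="c"}`)) // false: limit reached
	fmt.Println(l.admit("HighLatency", `{instance="a"}`)) // true: known series
}
```

Samples for an already-known series are always accepted, so hitting the limit stops cardinality growth without dropping updates to existing series.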
