
Proposal: Embedded TSDB #4813

@siavashs

Description


Proposal

Background

Alertmanager has multiple internal stores:

  • Alerts stores
  • Alerts provider
  • Markers
  • Silences
  • ...

There are multiple API endpoints which expose the above (raw or mutated):

  • /alerts
  • /alerts/groups
  • /silences
  • /metrics
  • ...

The above APIs have the following issues:

  • they are process-intensive
  • they cause contention, blocking Alertmanager from doing its job
  • they only show the current state (no history)
    • in the case of /alerts/groups we even predict which alerts will be suppressed by which silences in the future!

Proposal

Prometheus already provides the ALERTS time series, which carries both current and historical information about the status of alerts.
My proposal is to embed a TSDB into Alertmanager so that it can also expose an ALERTS time series through a Prometheus-compatible API.
Alertmanager would add at least one extra label value for alertstate, such as suppressed or muted, plus extra labels like muted_why and muted_by, etc.
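For illustration, a suppressed alert could surface as a series like the one below. The muted_by and muted_why label names (and their example values) are part of this proposal, not existing labels:

```
ALERTS{alertname="HighRequestLatency", alertstate="suppressed",
       muted_by="silence-1b2c", muted_why="maintenance-window"} 1
```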

This has multiple benefits:

  • it provides both the current and historic status of all alerts
    • it removes the need to hammer the Alertmanager API to capture and record such data
    • the ALERTS time series from Prometheus and Alertmanager can be joined in a PromQL query for analysis
  • the TSDB comes with PromQL, an interface already familiar to Prometheus users
  • this can be extended to more things like silences, notifications, etc.
  • each cluster member can expose its own ALERTS time series including a member label, which would allow detecting inconsistencies across the cluster
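As a sketch of the join mentioned above, assuming both Prometheus' and Alertmanager's ALERTS series are queryable from one backend (e.g. via Thanos) and that Alertmanager's series are distinguishable by an external label such as source="alertmanager" (illustrative, not an existing label):

```promql
# Alerts currently firing in Prometheus that Alertmanager reports as suppressed
ALERTS{alertstate="firing", source!="alertmanager"}
  and on (alertname, instance)
ALERTS{alertstate="suppressed", source="alertmanager"}
```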

There is an existing example of TSDB embedding in Thanos Ruler, which exposes similar ALERTS time series data through its embedded TSDB.

We could optionally:

  • support remote_write
  • persist the TSDB to disk, which would allow Thanos Sidecar to upload this data to long-term storage
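A hypothetical configuration sketch for these options; none of these fields exist in Alertmanager today, and all names and values are illustrative:

```yaml
# Hypothetical alertmanager.yml fragment (field names are not real)
tsdb:
  path: /alertmanager/data/tsdb
  retention: 15d
remote_write:
  - url: http://thanos-receive.example:19291/api/v1/receive
```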

Risks

The only risk I see so far is high cardinality of the ALERTS time series, especially in setups with hundreds or thousands of Prometheus instances.
Alertmanager can be protected from high cardinality by applying limits per alertname (similar to how Prometheus applies per-scrape-job limits, etc.)
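A minimal sketch of such a per-alertname limit, assuming the TSDB appender consults a limiter before accepting a new series. The type, names, and limit value here are illustrative, not Alertmanager code:

```go
package main

import "fmt"

// seriesLimiter caps the number of distinct series tracked per alertname,
// analogous to Prometheus' per-scrape sample limits.
type seriesLimiter struct {
	limit  int
	series map[string]map[string]struct{} // alertname -> set of series fingerprints
}

func newSeriesLimiter(limit int) *seriesLimiter {
	return &seriesLimiter{limit: limit, series: make(map[string]map[string]struct{})}
}

// admit reports whether a series, identified by its label fingerprint, may be
// appended for the given alertname without exceeding the per-alertname limit.
func (l *seriesLimiter) admit(alertname, fingerprint string) bool {
	set, ok := l.series[alertname]
	if !ok {
		set = make(map[string]struct{})
		l.series[alertname] = set
	}
	if _, seen := set[fingerprint]; seen {
		return true // already tracked: adds no new cardinality
	}
	if len(set) >= l.limit {
		return false // would exceed the per-alertname series limit
	}
	set[fingerprint] = struct{}{}
	return true
}

func main() {
	l := newSeriesLimiter(2)
	fmt.Println(l.admit("HighLatency", `{instance="a"}`)) // true
	fmt.Println(l.admit("HighLatency", `{instance="b"}`)) // true
	fmt.Println(l.admit("HighLatency", `{instance="c"}`)) // false: limit reached
	fmt.Println(l.admit("HighLatency", `{instance="a"}`)) // true: known series
}
```

Samples for an already-known series are always accepted, so hitting the limit stops cardinality growth without dropping updates to existing series.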
