state label in bird_protocol_up leads to a unique time series for each state #63

@sgrade

Overview

In Prometheus, every time series is uniquely identified by its metric name and its set of labels (source). So when the state label changes in the bird_protocol_up metric, a new time series is created in addition to the one with the previous state. This ruins the metric: instead of one bird_protocol_up time series per BIRD protocol, we see several in parallel. And when the state changes regularly (e.g. when a session flaps), each series has gaps.

How to replicate

If the BGP peer on the other side becomes unavailable, BIRD tries to reconnect and goes through several states. In the example below, Prometheus shows three different bird_protocol_up time series for one peer. They correspond to the BGP states (state label values):

  • "Idle Socket: No route to host"
  • "Connect Socket: No route to host"
  • "Active Socket: No route to host"

All three exist in the TSDB in parallel.
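
A minimal sketch of why this happens, not Prometheus's actual storage code: a series is keyed by the metric name plus the full label set, so each new state value produces a new key. The label names (name, proto, state) and the peer name peer1 are assumptions made for illustration.

```python
# Toy model of series identity. Label names and the peer name "peer1"
# are hypothetical; only the three state strings come from the report.
def series_key(metric, labels):
    """A series is identified by its metric name plus its full label set."""
    return (metric, frozenset(labels.items()))

states = [
    "Idle Socket: No route to host",
    "Connect Socket: No route to host",
    "Active Socket: No route to host",
]

tsdb = {}  # series key -> list of (scrape number, value) samples
for scrape_number, state in enumerate(states):
    labels = {"name": "peer1", "proto": "BGP", "state": state}
    key = series_key("bird_protocol_up", labels)
    tsdb.setdefault(key, []).append((scrape_number, 0))

series_count = len(tsdb)
print(series_count)  # one peer, but three separate series
```

Each of the three series holds a single sample here, which is also where the gaps come from: a series only receives samples while its state is the current one.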

Problems this approach creates

  • When the protocol state changes (e.g. flaps), bird_exporter reports only the current state. So at the moment of scraping the protocol can be in one state, and a second later the state is different, but we don't see that in Prometheus. Different combinations of scrape intervals and protocol timers produce different (weird) results in monitoring.
  • In a complex environment with thousands of peers (thus many labels per peer), an unstable (unpredictable) number of metrics per protocol is difficult to manage. Idempotence is difficult to achieve. Automation breaks.
  • It is difficult to tell which BGP state is current. Prometheus returns all time series (three in the example above) for a single BIRD protocol. The same happens with instant queries, as series with different states are considered unique.
  • It is difficult to count the peers for which bird_protocol_up == 0. Instead of the actual number of down peers, count shows the number of unique time series, which is not what we want to see. I still managed to do it using count(group by (state) {}), but IMHO this is more of a workaround than a proper solution.
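
The counting problem in the last point can be illustrated with a small sketch. It assumes a query for bird_protocol_up == 0 returns every parallel series; the label name "name" and the peers peer1 and peer2 are hypothetical.

```python
# Toy query result: (labels, value) pairs for series where the value is 0.
# Peer names and the "name" label are made up for illustration.
samples = [
    ({"name": "peer1", "state": "Idle Socket: No route to host"}, 0),
    ({"name": "peer1", "state": "Connect Socket: No route to host"}, 0),
    ({"name": "peer1", "state": "Active Socket: No route to host"}, 0),
    ({"name": "peer2", "state": "Idle Socket: No route to host"}, 0),
]

down = [labels for labels, value in samples if value == 0]

# Counting series counts every state variant of the same peer separately.
naive_count = len(down)

# Dropping the state label before counting gives the number of down peers.
peer_count = len({labels["name"] for labels in down})

print(naive_count, peer_count)
```

With two peers down, the naive count is 4 while the deduplicated count is 2. Any correct count has to ignore the state label first, which is why the query ends up looking like a workaround.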

Suggestion

Who will do it

I can implement it myself once we agree on the approach.
