Initialize counter metrics to zero #18964

@brandon-at-wrk

Request

Set telemetry counters, such as aztec_sequencer_slot_filled_count, to 0 (or an appropriate default value) upon node startup, rather than omitting the metric until the action occurs.

Here are the 4 metrics I'm primarily looking to instrument, but this change probably ought to be applied universally:

  • aztec_sequencer_slot_filled_count
  • aztec_sequencer_slot_total_count
  • aztec_validator_attestation_failed_node_issue_count
  • aztec_validator_attestation_failed_bad_proposal_count
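
To illustrate the request, here is a minimal, dependency-free sketch of creating each counter and recording 0 once at startup. The `Counter` and `Meter` interfaces, `InMemoryMeter`, and `createZeroedCounters` are all hypothetical names invented for this example; they only loosely mirror the createCounter/add shape of OpenTelemetry-style APIs and are not taken from the Aztec codebase.

```typescript
// Hypothetical minimal interfaces, loosely modeled on an OpenTelemetry-style
// metrics API (createCounter / add). This is NOT Aztec's actual telemetry code.
interface Counter {
  add(value: number): void;
}

interface Meter {
  createCounter(name: string): Counter;
}

// In-memory stand-in for a real meter so the sketch runs on its own.
class InMemoryMeter implements Meter {
  readonly series: Record<string, number> = {};

  createCounter(name: string): Counter {
    return {
      add: (value: number): void => {
        // A series only becomes visible once something has been recorded.
        const current = name in this.series ? this.series[name] : 0;
        this.series[name] = current + value;
      },
    };
  }
}

// Create each counter and record 0 once, so the series exists from startup
// instead of appearing only after the first real event.
function createZeroedCounters(
  meter: Meter,
  names: readonly string[],
): Record<string, Counter> {
  const counters: Record<string, Counter> = {};
  for (const name of names) {
    const counter = meter.createCounter(name);
    counter.add(0);
    counters[name] = counter;
  }
  return counters;
}

const COUNTER_NAMES = [
  "aztec_sequencer_slot_filled_count",
  "aztec_sequencer_slot_total_count",
  "aztec_validator_attestation_failed_node_issue_count",
  "aztec_validator_attestation_failed_bad_proposal_count",
] as const;

const meter = new InMemoryMeter();
const counters = createZeroedCounters(meter, COUNTER_NAMES);

// Later, a real event increments the already-visible series.
counters["aztec_sequencer_slot_filled_count"].add(1);
```

If a counter carries attributes (labels), the same idea would presumably need to be repeated once per attribute combination that should show up at 0, since each combination is its own time series.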

Reason

When an Aztec node is restarted, counter metrics are omitted until the first event occurs, rather than being set to 0. Prometheus interprets an omitted counter as "No Data" instead of a default value, which makes graphs and alerts built on functions like increase or rate unreliable for low-volume events such as aztec_sequencer_slot_filled_count. If a node has only 5 sequencer keys in its keystore, there's a chance it won't publish or attest to any blocks for days, given the randomness of committee selection, so "No Data" can persist for quite some time. And because rate calculations need an earlier sample to compare against, going from "No Data" to 1 (or N) is interpreted as a change of 0 rather than a step up of 1 (or N). That makes it difficult to have alerts fire rapidly, or at all when there is only a single change within the interval.
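
The effect described above can be reproduced with a small simulation. This is a deliberately simplified model: `simpleIncrease` is a hypothetical helper that returns the last sample minus the first sample in a window, whereas Prometheus's real increase also extrapolates to the window boundaries and handles counter resets. The timestamps and scrape schedule are made up for illustration.

```typescript
interface Sample {
  t: number; // scrape timestamp (arbitrary units)
  v: number; // counter value at that scrape
}

// Simplified stand-in for increase(): last sample minus first sample in the
// window. Returns undefined ("No Data") when fewer than two samples exist.
// Real Prometheus additionally extrapolates and handles counter resets.
function simpleIncrease(
  samples: Sample[],
  start: number,
  end: number,
): number | undefined {
  const inWindow = samples.filter((s) => s.t >= start && s.t <= end);
  const first = inWindow[0];
  const last = inWindow[inWindow.length - 1];
  if (inWindow.length < 2 || first === undefined || last === undefined) {
    return undefined;
  }
  return last.v - first.v;
}

// Scrapes happen at t = 0..4; a single event occurs between t = 2 and t = 3.

// Case 1: the counter is omitted until the first event, so the series
// starts at t = 3 with a value of 1.
const omittedUntilFirstEvent: Sample[] = [
  { t: 3, v: 1 },
  { t: 4, v: 1 },
];

// Case 2: the counter is exported as 0 from startup.
const zeroAtStartup: Sample[] = [
  { t: 0, v: 0 },
  { t: 1, v: 0 },
  { t: 2, v: 0 },
  { t: 3, v: 1 },
  { t: 4, v: 1 },
];

const missedStep = simpleIncrease(omittedUntilFirstEvent, 0, 4); // 0
const seenStep = simpleIncrease(zeroAtStartup, 0, 4); // 1
const singleSample = simpleIncrease(omittedUntilFirstEvent, 0, 3); // undefined
```

In the first case the step from nothing to 1 is never observed as a change, so an alert on a non-zero increase would not fire; in the second case the same event shows up as an increase of 1.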

Prometheus docs noting how to treat 0 vs no data: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
OTEL's docs, which I believe imply the same stance but use more abstract language: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#timeseries-model

Examples

A screenshot of our graph, taken after I restarted the machine to pick up a kernel update, and the accompanying Grafana JSON for that graph. There ought to be two other lines at 0 for the last 6 hours, and the visible line ought to carry through the entire 6-hour range, but they are omitted because there are no time series data points to reference.

Image
Grafana JSON
{
  "id": 55,
  "type": "timeseries",
  "title": "Attestation rate",
  "gridPos": {
    "x": 16,
    "y": 1,
    "h": 8,
    "w": 8
  },
  "fieldConfig": {
    "defaults": {
      "custom": {
        "drawStyle": "line",
        "lineInterpolation": "linear",
        "barAlignment": 0,
        "barWidthFactor": 0.6,
        "lineWidth": 1,
        "fillOpacity": 0,
        "gradientMode": "none",
        "spanNulls": false,
        "insertNulls": false,
        "showPoints": "auto",
        "pointSize": 5,
        "stacking": {
          "mode": "none",
          "group": "A"
        },
        "axisPlacement": "auto",
        "axisLabel": "",
        "axisColorMode": "text",
        "axisBorderShow": false,
        "scaleDistribution": {
          "type": "linear"
        },
        "axisCenteredZero": false,
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        },
        "thresholdsStyle": {
          "mode": "off"
        }
      },
      "color": {
        "mode": "palette-classic"
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      }
    },
    "overrides": []
  },
  "pluginVersion": "11.6.0",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "editorMode": "code",
      "expr": "sum by (namespace) (aztec_validator_attestation_failed_bad_proposal_count)",
      "legendFormat": "Failed Attestations - Proposal Issues",
      "range": true,
      "refId": "A"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "editorMode": "code",
      "expr": "sum by (namespace) (aztec_validator_attestation_failed_node_issue_count)",
      "hide": false,
      "instant": false,
      "legendFormat": "Failed Attestations - Node Issue",
      "range": true,
      "refId": "B"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "editorMode": "code",
      "expr": "increase(aztec_validator_attestation_success_count{namespace=~\"$namespace\"}[15m])",
      "hide": false,
      "instant": false,
      "legendFormat": "Successful Attestation Rate",
      "range": true,
      "refId": "C"
    }
  ],
  "datasource": {
    "type": "prometheus",
    "uid": "prometheus"
  },
  "options": {
    "tooltip": {
      "mode": "single",
      "sort": "none",
      "hideZeros": false
    },
    "legend": {
      "showLegend": true,
      "displayMode": "list",
      "placement": "bottom",
      "calcs": []
    }
  }
}
