Description
Request
Set telemetry counters, such as aztec_sequencer_slot_filled_count, to 0 (or an appropriate default value) upon node startup, rather than omitting the metric until the action occurs.
These are the 4 metrics I'm primarily looking to instrument, but this change probably ought to be applied universally:
- aztec_sequencer_slot_filled_count
- aztec_sequencer_slot_total_count
- aztec_validator_attestation_failed_node_issue_count
- aztec_validator_attestation_failed_bad_proposal_count
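A minimal sketch of what the request could look like, written against small structural types that mirror the shape of the OpenTelemetry JS API (`meter.createCounter(name)` and `counter.add(value)`). The names `CounterLike`, `MeterLike`, `createZeroedCounters`, and `InMemoryMeter` are hypothetical, and the metric names are the Prometheus-side names from this issue, which may differ from the instrument names used in the node's code. It also assumes the exporter emits a data point once a 0 has been recorded.

```typescript
// Structural stand-ins for the OpenTelemetry JS Counter/Meter shapes.
// These are illustrative types, not the node's real telemetry client.
interface CounterLike {
  add(value: number, attributes?: Record<string, string>): void;
}

interface MeterLike {
  createCounter(name: string): CounterLike;
}

// Counters named in this issue (Prometheus-side names; assumed here).
const STARTUP_COUNTERS: readonly string[] = [
  "aztec_sequencer_slot_filled_count",
  "aztec_sequencer_slot_total_count",
  "aztec_validator_attestation_failed_node_issue_count",
  "aztec_validator_attestation_failed_bad_proposal_count",
];

// Create each counter and record a 0 immediately, so a series exists
// from startup instead of appearing only after the first event.
function createZeroedCounters(
  meter: MeterLike,
  names: readonly string[] = STARTUP_COUNTERS,
): Map<string, CounterLike> {
  const counters = new Map<string, CounterLike>();
  for (const name of names) {
    const counter = meter.createCounter(name);
    counter.add(0);
    counters.set(name, counter);
  }
  return counters;
}

// In-memory meter used only to demonstrate the behaviour.
class InMemoryMeter implements MeterLike {
  readonly values = new Map<string, number>();

  createCounter(name: string): CounterLike {
    return {
      add: (value: number): void => {
        this.values.set(name, (this.values.get(name) ?? 0) + value);
      },
    };
  }
}
```

With this in place, `new InMemoryMeter()` passed to `createZeroedCounters` ends up holding all four names at 0 before any slot is filled or any attestation fails. Counters that carry labels would need the 0 recorded once per expected label combination.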
Reason
When an Aztec node is restarted, counter metrics are omitted until an event occurs, rather than being set to 0. Prometheus interprets an omitted counter as "No Data" rather than as a default value, which makes graphs or alerts built on functions like increase or rate unreliable for low-volume events such as aztec_sequencer_slot_filled_count. If a node only has 5 sequencer keys in its keystore, there's a chance it won't publish or attest to any blocks for days, given the randomness of committee selection, so "No Data" can persist for quite some time. And when using rate calculations, going from "No Data" to 1 (or N) is interpreted as a change of 0 rather than a step up of 1 (or N), because the first sample has no earlier sample to compare against. That makes it difficult to have alerts fire rapidly, or at all when there's only a single change within an interval.
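To make the "No Data" to 1 problem concrete, here is a simplified model of how an increase over a window is derived from scraped samples. It is a sketch, not Prometheus's actual implementation: `simpleIncrease` is a hypothetical helper that sums sample-to-sample deltas and ignores the extrapolation Prometheus applies at window edges. The 60-second scrape interval and the timestamps are made up for illustration.

```typescript
// One scraped data point: t = scrape time in seconds, v = counter value.
interface Sample {
  t: number;
  v: number;
}

// Simplified increase over [start, end]; samples must be sorted by time.
// Returns undefined ("No Data") when fewer than two samples fall in the
// window, because a delta needs an earlier sample to compare against.
function simpleIncrease(
  samples: readonly Sample[],
  start: number,
  end: number,
): number | undefined {
  const inWindow = samples.filter((s) => s.t >= start && s.t <= end);
  if (inWindow.length < 2) {
    return undefined;
  }
  let total = 0;
  let prev: number | undefined;
  for (const s of inWindow) {
    if (prev !== undefined) {
      // A drop in value is treated as a counter reset (restart from 0).
      total += s.v >= prev ? s.v - prev : s.v;
    }
    prev = s.v;
  }
  return total;
}

// Series omitted until the first event at t=180: the first sample is already 1.
const omittedUntilEvent: Sample[] = [
  { t: 180, v: 1 },
  { t: 240, v: 1 },
  { t: 300, v: 1 },
];

// Series initialized to 0 at startup, same event at t=180.
const zeroedAtStartup: Sample[] = [
  { t: 0, v: 0 },
  { t: 60, v: 0 },
  { t: 120, v: 0 },
  { t: 180, v: 1 },
  { t: 240, v: 1 },
  { t: 300, v: 1 },
];

const increaseWhenOmitted = simpleIncrease(omittedUntilEvent, 0, 300); // 0
const increaseWhenZeroed = simpleIncrease(zeroedAtStartup, 0, 300); // 1
```

In the omitted case the event is never visible as a change, so an alert on `increase(...) > 0` would not fire; with the zero-initialized series the same event shows up as an increase of 1.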
Prometheus docs noting how to treat 0 vs no data: https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
OTEL's docs, which I believe imply the same stance but use more abstract language: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#timeseries-model
Examples
A screenshot of our graph, taken after I restarted the machine to pick up a kernel update, and the accompanying Grafana JSON representing that graph. There ought to be two other lines at 0 for the last 6 hours, and the visible line ought to carry through the entire 6-hour range, but they are omitted instead because there are no time series data points to reference.
Grafana JSON
{
"id": 55,
"type": "timeseries",
"title": "Attestation rate",
"gridPos": {
"x": 16,
"y": 1,
"h": 8,
"w": 8
},
"fieldConfig": {
"defaults": {
"custom": {
"drawStyle": "line",
"lineInterpolation": "linear",
"barAlignment": 0,
"barWidthFactor": 0.6,
"lineWidth": 1,
"fillOpacity": 0,
"gradientMode": "none",
"spanNulls": false,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 5,
"stacking": {
"mode": "none",
"group": "A"
},
"axisPlacement": "auto",
"axisLabel": "",
"axisColorMode": "text",
"axisBorderShow": false,
"scaleDistribution": {
"type": "linear"
},
"axisCenteredZero": false,
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"thresholdsStyle": {
"mode": "off"
}
},
"color": {
"mode": "palette-classic"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"pluginVersion": "11.6.0",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "sum by (namespace) (aztec_validator_attestation_failed_bad_proposal_count)",
"legendFormat": "Failed Attestations - Proposal Issues",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "sum by (namespace) (aztec_validator_attestation_failed_node_issue_count)",
"hide": false,
"instant": false,
"legendFormat": "Failed Attestations - Node Issue",
"range": true,
"refId": "B"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "increase(aztec_validator_attestation_success_count{namespace=~\"$namespace\"}[15m])",
"hide": false,
"instant": false,
"legendFormat": "Successful Attestation Rate",
"range": true,
"refId": "C"
}
],
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"options": {
"tooltip": {
"mode": "single",
"sort": "none",
"hideZeros": false
},
"legend": {
"showLegend": true,
"displayMode": "list",
"placement": "bottom",
"calcs": []
}
}
}