---
theme: seriph
background: https://cover.sli.dev
hideInToc: true
title: Observability Training
author: Mischa Taylor
info: |
  ## Slidev Starter Template
  Presentation slides for developers.
  Learn more at [Sli.dev](https://sli.dev)
class: text-center
drawings:
transition: slide-left
mdc: true
themeConfig:
---
Mischa Taylor | 📧 taylor@linux.com
cat >prometheus.yml <<EOF
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

docker container run -it --rm \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml

docker container run -it --rm \
-p 9090:9090 \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=20GB \
--web.listen-address=:9090

Visit http://localhost:9090/
Metrics are served via http://localhost:9090/metrics
Check the health of targets at http://localhost:9090/targets
To use the expression browser, visit http://localhost:9090/query
Try the following query. This expression tells you the CPU usage in cores of Prometheus itself, averaged over a 1-minute window:
rate(process_cpu_seconds_total{job="prometheus"}[1m])
This expression reports the resident memory usage of Prometheus itself in mebibytes (MiB):
process_resident_memory_bytes{job="prometheus"} / 1024 / 1024
This expression shows the number of samples per second stored by Prometheus, as averaged over a 1-minute window:
rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[1m])
This is a synthetic up metric that Prometheus records for every target. It is scoped to the prometheus job:
up{job="prometheus"}
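The rate() queries above can be checked by hand. Here is the counter arithmetic that rate() roughly performs, using two made-up samples of a counter taken 60 seconds apart:

```shell
# Two hypothetical samples of a counter, 60 seconds apart
t0=1000; v0=5000
t1=1060; v1=5300
# rate() is roughly (last - first) / (window in seconds)
awk -v v0="$v0" -v v1="$v1" -v t0="$t0" -v t1="$t1" \
  'BEGIN { printf "%.1f samples/sec\n", (v1 - v0) / (t1 - t0) }'
```

The real rate() function also handles counter resets and extrapolates to the window boundaries, but the core idea is this difference quotient.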
- Pull-based - endpoints don't push data out; instead, Prometheus scrapes endpoints to fetch it
- The consumer (Prometheus) controls the timing, not the producer
- The scrape interval is fixed; it does not adapt to the rate at which the producer generates metrics
- Data is collected on a timer; metrics do not flow immediately when published
- Precision is best-effort, periodic sampling. It is not intended for high-frequency, real-time, or sub-second metrics
- Metrics collection is passive, not active - Prometheus chooses when to pull the data
- Not intended to collect every event - it's about compact summaries of what's happening over time
- It collects state, not raw streams
- Prometheus metrics are like "snapshots" of current counters or gauges.
- Example: instead of storing every sensor message, it stores
  sensor_msgs_received_total 145723
- Analogy: "It's like glancing at a dashboard every 15 seconds and writing down what the odometer says - not recording the entire drive."
- It's pull-based and bounded - you define how often Prometheus collects (scrape_interval), e.g., every 15s.
- It only collects what's exposed - usually a compact, flat list of counters and gauges.
- This puts an upper bound on the CPU, memory, and I/O used by monitoring - it won't run wild the way logs or bag files can.
- It stores only numbers and labels
- Prometheus stores structured numeric time series: a float64 value and a timestamp, with optional key/value tags (labels).
- It does not store logs, strings, or individual events.
- This is why Prometheus databases stay compact - even thousands of time series can be held in a few hundred MB.
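Some back-of-the-envelope math makes the claim concrete. Assuming roughly 2 bytes per compressed sample (a commonly cited figure for Prometheus's TSDB; your actual ratio depends on the data):

```shell
# 1000 series scraped every 15s for 30 days, at ~2 bytes per sample
series=1000; interval=15; days=30; bytes_per_sample=2
samples=$(( series * (86400 / interval) * days ))
echo "$samples samples, ~$(( samples * bytes_per_sample / 1024 / 1024 )) MiB"
```

So even a month of data for a thousand series fits in a few hundred MiB.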
- It uses time- and space-efficient storage (TSDB)
- Prometheus uses its own highly optimized time-series database:
  - Stores data in chunks
  - Compresses samples using delta and XOR compression
  - Avoids duplication of labels
  - Prunes old data automatically (--storage.tsdb.retention.time)
- Most deployments can store weeks of metrics in a few GB, even across hundreds of machines.
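A toy sketch of the delta idea (not Prometheus's actual chunk format): store the first value as-is, then only the differences between neighboring samples. Slowly changing counters produce small deltas, which compress well.

```shell
# Delta-encode a run of counter samples
samples="100 105 110 112 112"
prev=""
encoded=""
for s in $samples; do
  if [ -z "$prev" ]; then
    encoded="$s"                      # first value stored as-is
  else
    encoded="$encoded $((s - prev))"  # later values stored as deltas
  fi
  prev=$s
done
echo "$encoded"
```

Prometheus goes further (delta-of-delta for timestamps, XOR for float values), but the principle is the same: encode small differences, not large absolutes.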
Prometheus is designed to collect summaries of system state over time, not high-frequency raw data. It stores only numeric values at fixed intervals and compresses them efficiently, so it’s great for monitoring trends, spotting failures, and diagnosing performance issues — without overwhelming your system with logging or bandwidth.
Stop Prometheus container with ctrl+c
cat >prometheus.yml <<EOF
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'dynamic'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.yml
          - /etc/prometheus/targets.json
        refresh_interval: 30s
EOF

cat >targets.yml <<EOF
- targets: [ 'node-exporter:9100' ]
  labels:
    job: 'node'
EOF

docker volume create prometheus-data
docker network create monitoring

docker container run -it --rm \
-d \
--name prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=bind,source="$(pwd)/targets.yml",target=/etc/prometheus/targets.yml,readonly \
--mount type=volume,source=prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus

docker container run -it --rm \
--name node-exporter \
-p 9100:9100 \
--network monitoring \
docker.io/boxcutter/node-exporter \
--collector.systemd \
--collector.processes \
--no-collector.infiniband \
--no-collector.nfs \
--web.listen-address=:9100

https://www.robustperception.io/using-the-textfile-collector-from-a-shell-script/
#!/bin/bash
# Adjust as needed.
TEXTFILE_COLLECTOR_DIR=/var/lib/node_exporter/textfile_collector/
# Note the start time of the script.
START="$(date +%s)"
# Your code goes here.
sleep 10
# Write out metrics to a temporary file.
END="$(date +%s)"
cat << EOF > "$TEXTFILE_COLLECTOR_DIR/myscript.prom.$$"
myscript_duration_seconds $(($END - $START))
myscript_last_run_seconds $END
EOF
# Rename the temporary file atomically.
# This avoids the node exporter seeing half a file.
mv "$TEXTFILE_COLLECTOR_DIR/myscript.prom.$$" \
  "$TEXTFILE_COLLECTOR_DIR/myscript.prom"

# /etc/systemd/system/my_script.service
[Unit]
Description=Run my bash script
[Service]
Type=oneshot
ExecStart=/usr/local/bin/my_script.sh

# /etc/systemd/system/my_script.timer
[Unit]
Description=Timer for my_script.sh
[Timer]
OnCalendar=*:0/15
Persistent=true
[Install]
WantedBy=timers.target

OnCalendar=*:0/15 means: every 15 minutes. You can use any systemd time expression, e.g.:
- daily → once a day
- hourly → once an hour
- Mon..Fri 09:00 → weekdays at 9am
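A quick sanity check of what *:0/15 matches: the minute field starts at 0 and repeats every 15, so it fires on minutes 0, 15, 30, and 45. (You can also verify any expression with `systemd-analyze calendar '*:0/15'`, which prints the next elapse time.)

```shell
# Which minutes does *:0/15 fire on? Those where (minute - 0) % 15 == 0.
for m in 0 7 15 22 30 45 59; do
  if [ $(( m % 15 )) -eq 0 ]; then echo "minute $m: fires"; fi
done
```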
Enable and start the timer
sudo systemctl daemon-reexec   # only needed after upgrading systemd itself
sudo systemctl daemon-reload   # reload unit files after edits
sudo systemctl enable --now my_script.timer

Check that it works
- See timer status: systemctl list-timers --all
- See last run: systemctl status my_script.service
- See logs: journalctl -u my_script.service
docker container stop prometheus

docker volume rm prometheus-data
docker network rm monitoring

docker volume create prometheus-data
docker network create monitoring

cat >prometheus.yml <<EOF
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'dynamic'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.yml
          - /etc/prometheus/targets.json
        refresh_interval: 30s
EOF

cat >targets.yml <<EOF
- targets: [ 'node-exporter:9100' ]
  labels:
    job: 'node'
EOF

docker container run -it --rm \
-d \
--name prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=bind,source="$(pwd)/targets.yml",target=/etc/prometheus/targets.yml,readonly \
--mount type=volume,source=prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus

docker container run -it --rm \
-d \
--name node-exporter \
-p 9100:9100 \
--network monitoring \
docker.io/boxcutter/node-exporter

docker volume create grafana-data

# https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/
docker container run -it --rm \
--name grafana \
-p 3000:3000 \
--network monitoring \
--env GF_AUTH_ANONYMOUS_ENABLED=true \
--env GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
--mount type=volume,source=grafana-data,target=/var/lib/grafana,volume-driver=local \
docker.io/grafana/grafana

Navigate to Connections → Data Sources → Add data source in the left side panel. Choose Prometheus. Prometheus server URL: http://prometheus:9090. Click the "Save & test" button.
Navigate to Dashboards → New → Import. Paste in 1860 and click the "Load" button.
Navigate to Dashboards → New Dashboard → click the "Add visualization" button. Select Prometheus from the data source dropdown.
docker container stop node-exporter
docker container stop prometheus

docker volume rm prometheus-data
docker network rm monitoring

The blackbox exporter uses the multi-target exporter pattern to monitor the availability of multiple entities from a single instance.
Multi-target exporters are used when you can't (or shouldn't) run an exporter directly on a device.
cat >blackbox.yml <<EOF
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
  resolve_prometheus:
    prober: dns
    dns:
      query_name: prometheus.io
      query_type: A
EOF

docker network create monitoring

docker container run -it --rm \
--name blackbox-exporter \
-p 9115:9115 \
--network monitoring \
docker.io/boxcutter/blackbox_exporter

There is the usual /metrics endpoint, where Prometheus can monitor the health of the blackbox exporter service itself, at http://localhost:9115/metrics. But this isn't the endpoint that you generally use with the blackbox exporter:

% curl 'localhost:9115/metrics'

There is another endpoint called /probe. A probe can generate information for another target besides the
blackbox exporter service itself. You specify a "target" and a "module" to be queried by a probe,
configured in the blackbox.yml file.
If you visit http://localhost:9115/ - you'll see that there are no recent probes yet. Probes only happen when
something visits the /probe endpoint, specifying a "target" and "module". Let's do that now:
% curl 'localhost:9115/probe?target=prometheus.io&module=http_2xx'
We can get Prometheus to trigger the same kind of probe that we just did with curl. Here's what that configuration looks like:
cat >prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module:
        - http_2xx
      target:
        - prometheus.io
    static_configs:
      - targets:
          - blackbox-exporter:9115
EOF
docker container run -it --rm \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml

docker container run -it --rm \
-d \
--name prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
docker.io/boxcutter/prometheus

Check http://localhost:9090/targets and verify that Prometheus sent the probe successfully. It should say the target is up.
Visit http://localhost:9090/query and perform queries for probe_http_status_code and probe_http_duration_seconds.
probe_http_status_code{instance="blackbox-exporter:9115", job="blackbox"}
probe_http_duration_seconds{instance="blackbox-exporter:9115", job="blackbox", phase="connect"}
So now we see that Prometheus stores the result of a probe. But there is a problem with our config. Note that the instance label shows blackbox-exporter:9115, which is correct. But we have no idea which host was probed, in this case prometheus.io. We could look in the config file under the scrape config params to find this information, but then there's no way to query it later in the database.
We can fix this with relabeling. Here's what we need to know about relabeling to understand the upcoming config:
- All labels starting with `__` are dropped after the scrape. Most internal labels start with `__`.
- There is an internal label `__address__` which is set by the targets under `static_configs` and whose value is the hostname for the scrape request. By default it is later used to set the value for the label `instance`, which is attached to each metric and tells you where the metrics came from.
- You can set custom labels to be scrape params that are defined with `__param_<name>`.
cat >prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module:
        - http_2xx
    static_configs:
      - targets:
          - 'http://prometheus.io'     # Target to probe with HTTP.
          - 'https://prometheus.io'    # Target to probe with HTTPS.
          - 'http://promlabs.com:8080' # Unreachable target to probe with HTTP on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115 # Blackbox exporter address.
EOF

So what changed in this new config?
The target is no longer included in the params:
Old:
params:
  module:
    - http_2xx
  target:
    - prometheus.io
New:
params:
  module:
    - http_2xx
This lets us include multiple targets under static_configs instead of just one:
Old:
static_configs:
  - targets:
      - blackbox-exporter:9115
New:
static_configs:
  - targets:
      - 'http://prometheus.io'     # Target to probe with HTTP.
      - 'https://prometheus.io'    # Target to probe with HTTPS.
      - 'http://promlabs.com:8080' # Unreachable target to probe with HTTP on port 8080.
And now we have a bunch of new relabeling rules under relabel_configs:
relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115 # Blackbox exporter address.
Before applying the relabeling rules, the request Prometheus would make would look like: http://prometheus.io/probe?module=http_2xx.
After relabeling, the request would look like this: http://blackbox-exporter:9115/probe?target=http://prometheus.io&module=http_2xx.
The relabel rules are applied in sequence. First we take the value of the label __address__ (which contains the value from targets) and write it to a new label called __param_target, which adds a target parameter to the Prometheus scrape request:
relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
After this, the Prometheus request URI now has a target parameter: http://prometheus.io/probe?target=http://prometheus.io&module=http_2xx.
Next we take the value from the label __param_target and create a label instance with that value:
relabel_configs:
  - source_labels: [__param_target]
    target_label: instance
This relabel config won't change the request, but the metrics that come back from the request will now have a label instance="http://prometheus.io".
This lets us determine which host was probed.
And finally, we write the value blackbox-exporter:9115 to the label __address__ (now that we have saved a copy in __param_target). This is the hostname and port where Prometheus will send the probe request: http://blackbox-exporter:9115/probe?target=http://prometheus.io&module=http_2xx
relabel_configs:
  - target_label: __address__
    replacement: blackbox-exporter:9115
To make things easier to copy, here is the complete configuration again:
cat >prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module:
        - http_2xx
    static_configs:
      - targets:
          - 'http://prometheus.io'     # Target to probe with HTTP.
          - 'https://prometheus.io'    # Target to probe with HTTPS.
          - 'http://promlabs.com:8080' # Unreachable target to probe with HTTP on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115 # Blackbox exporter address.
EOF
docker container run -it --rm \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml
docker container run -it --rm \
-d \
--name prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus
Now if you go to http://localhost:9090/query and query probe_http_status_code, you should see the address of the URI that was probed in the instance label, instead of blackbox-exporter:9115:
probe_http_status_code{instance="https://prometheus.io", job="blackbox"}
And probe_http_duration_seconds still shows valid data, so Prometheus is using blackbox-exporter:9115 to send the probe, not the host listed in instance (which shouldn't have the blackbox exporter installed).
docker stop prometheus
% docker run -it --rm docker.io/boxcutter/snmp \
snmpstatus -v3 -l authPriv \
-u snmpreader \
-a SHA -A superseekret \
-x AES -X superseekret \
10.137.56.1
[UDP: [10.137.56.1]:161->[0.0.0.0]:45090]=>[Peplink MAX BR2 Pro] Up: 1 day, 2:41:05.34
Interfaces: 9, Recv/Trans packets: 44205108/42666635 | IP: 0/0
2 interfaces are down!

% docker run -it --rm docker.io/boxcutter/snmp \
snmpwalk -v3 -l authPriv \
-u snmpreader \
-a SHA -A superseekret \
-x AES -X superseekret \
10.137.56.1
iso.3.6.1.2.1.1.1.0 = STRING: "Peplink MAX BR2 Pro"
iso.3.6.1.2.1.1.2.0 = OID: iso.3.6.1.4.1.23695
iso.3.6.1.2.1.1.3.0 = Timeticks: (28708) 0:04:47.08
iso.3.6.1.2.1.1.4.0 = STRING: "support@peplink.com"
iso.3.6.1.2.1.1.5.0 = STRING: "MAX-BR2-2013"
iso.3.6.1.2.1.1.6.0 = STRING: "Peplink"
iso.3.6.1.2.1.1.8.0 = Timeticks: (1) 0:00:00.01
...

git clone https://github.com/prometheus/snmp_exporter.git
cd snmp_exporter/generator
cp generator.yml generator.yml.orig

---
auths:
  public_v3:
    version: 3
    username: snmpreader
    password: superseekret
    auth_protocol: SHA
    priv_protocol: AES
    priv_password: superseekret
modules:
  # Default IF-MIB interfaces table with ifIndex.
  if_mib:
    walk: [sysUpTime, interfaces, ifXTable]
    lookups:
      - source_indexes: [ifIndex]
        lookup: "IF-MIB::ifAlias"
      - source_indexes: [ifIndex]
        # Disambiguate from PaloAlto PAN-COMMON-MIB::ifDescr.
        lookup: "IF-MIB::ifDescr"
      - source_indexes: [ifIndex]
        # Disambiguate from Netscaler NS-ROOT-MIB::ifName.
        lookup: "IF-MIB::ifName"
    overrides:
      ifAlias:
        ignore: true # Lookup metric
      ifDescr:
        ignore: true # Lookup metric
      ifName:
        ignore: true # Lookup metric
      ifType:
        type: EnumAsInfo

docker run -it --rm \
--mount type=bind,source="$(pwd)",target=/code \
--entrypoint /bin/bash \
docker.io/boxcutter/snmp-generator
# cd /code
# generator -m /opt/snmp_exporter/mibs/mibs -m /opt/snmp_exporter/mibs/peplink/peplink generate

docker network create monitoring

docker run -it --rm \
--name snmp-exporter \
-p 9116:9116 \
--network monitoring \
--mount type=bind,source="$(pwd)/snmp.yml",target=/etc/snmp_exporter/snmp.yml \
docker.io/boxcutter/snmp-exporter \
--config.file=/etc/snmp_exporter/snmp.yml

Target: 10.137.56.1
Auth: public_v3
Module: if_mib

cat >prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'network'
    metrics_path: /snmp
    params:
      auth: [public_v3]
      module: [if_mib]
    static_configs:
      - targets: ['10.137.56.1']
        labels:
          job: 'peplink'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116 # The SNMP exporter real host
EOF

docker container run -it --rm \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml

docker container run -it --rm \
--name prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
docker.io/boxcutter/prometheus

Remote write lets you send time series data to a remote system.
docker volume create central-prometheus-data
docker volume create client-prometheus-data
docker network create monitoring
cat >central-prometheus.yml <<EOF
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'central-prometheus-self'
EOF
docker container run -it --rm \
-d \
--name central-prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/central-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=central-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9090 \
--web.enable-remote-write-receiver
Note that the central prometheus is not scraping anything and there are no metrics:
scrape_samples_scraped
count({__name__=~".+"})
And if you click on the triple dots in the search menu and click on "Explore metrics" there are no metric names yet.
cat >client-prometheus.yml <<EOF
global:
  scrape_interval: 5s
  evaluation_interval: 5s
  external_labels:
    source: client-prometheus
scrape_configs:
  - job_name: 'client-prometheus-self'
    static_configs:
      - targets: ['localhost:9091']
remote_write:
  - url: http://central-prometheus:9090/api/v1/write
EOF
docker container run -it --rm \
--name prometheus-client \
-p 9091:9091 \
--network monitoring \
--mount type=bind,source="$(pwd)/client-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=client-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9091
You should see remote_write messages (and any errors) in the log:
time=2026-01-03T23:31:40.024Z level=INFO source=watcher.go:292 msg="Replaying WAL" component=remote remote_name=7f0bb5 url=http://central-prometheus:9090/api/v1/write queue=7f0bb5
time=2026-01-03T23:31:45.057Z level=INFO source=watcher.go:538 msg="Done replaying WAL" component=remote remote_name=7f0bb5 url=http://central-prometheus:9090/api/v1/write duration=5.032317753s
Prometheus collects metrics starting with prometheus_remote_storage_*
Visit the client Prometheus at http://localhost:9091
rate(prometheus_remote_storage_samples_total{job="client-prometheus-self"}[5m])
Visit the central prometheus at http://localhost:9090
The up metric should have the source label:
up{instance="localhost:9091", job="client-prometheus-self", source="client-prometheus"}
You'll definitely want to prune the metrics sent to the central Prometheus with relabeling (write_relabel_configs) down to just the essential ones you want centrally.
docker volume rm client-prometheus-data
docker volume rm central-prometheus-data
docker network rm monitoring
If the internet link is flaky, you want two complementary signals:
1. “Is the robot itself up?” (local liveness)
2. “Can I reach it right now?” (end-to-end reachability)
Prometheus by itself mostly measures (2) unless you design for (1).
Run Prometheus on each robot (or at least a tiny agent that speaks remote_write), scrape node_exporter locally over localhost, and remote_write to your central Prometheus when it can.
Why it works: even if the robot is offline from the internet, local Prometheus still records that the computer was up; when the link comes back it ships the backlog.
What to use
- Full Prometheus on-robot + remote_write
- Or an agent (e.g., Grafana Agent / Alloy / VictoriaMetrics vmagent / OTel Collector with Prom remote_write) if you don’t want full Prometheus per robot
A. Scrape success: up
Prometheus already provides:
- up{job="node-exporter", instance="robot-123"} = 1 when scrape succeeds, 0 when it fails.
This is reachability, not necessarily “the robot is down”.
B. Host boot health: node_exporter + time drift
From node_exporter, use:
- node_boot_time_seconds (changes only on reboot)
- node_time_seconds (current time)
- node_uptime_seconds (if enabled/available; otherwise derive from boot time)
In your central Prometheus, record:
- “Last time we got any sample from this robot”
- “Last observed boot time” (to detect reboots)
Even if you can’t get local buffering, this lets you distinguish:
- “robot unreachable” vs “robot rebooted recently” vs “robot stable but link flaky”
Create a recording rule that turns scrapes into a clean “last seen timestamp”:
- For each robot/instance, compute the most recent sample time from a stable metric (boot time is good)
- Then alert if “now - last_seen > threshold”
This is more actionable than raw up==0 because it naturally debounces brief blips.
What you’ll alert on:
- “Not seen for 5 minutes” (warning)
- “Not seen for 30 minutes” (critical)
And separately:
- “Rebooted within last 10 minutes” (info/warn)
For flaky internet, you also want: “Robot is up, but internet is bad.”
Options:
- Run blackbox_exporter on the robot to probe:
- DNS resolution
- HTTPS to a known endpoint (your backend or a public endpoint you control)
- ICMP ping to a few anchors
- Or a lightweight cron/script that updates a textfile collector metric with:
- packet loss %
- latency
- default route present
- Wi-Fi RSSI / cellular signal metrics
Then you can classify outages:
- robot down (no local metrics even on-robot)
- robot up but uplink down (local metrics show poor internet)
- robot up and internet OK, but your central can’t reach (routing/firewall)
For a fleet, ensure your series identity doesn’t change when IPs change:
- Set instance to something stable (robot_id + hw serial)
- Add labels like robot_id, site, fleet, role
This makes “last seen” and “uptime” meaningful across network changes.
Don’t page on up == 0 immediately.
Use a multi-stage approach:
- Warning: unreachable for 5–10 minutes
- Critical: unreachable for 30–60 minutes
- Info: reboot detected (boot time changed)
- Correlate: if robot reports uplink loss at same time, downgrade severity
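A sketch of what the staged policy could look like as Prometheus alerting rules. The alert names and thresholds are illustrative, built on the up series for a hypothetical robot-node-exporter job and the standard node_boot_time_seconds metric:

```yaml
groups:
  - name: robot-availability
    rules:
      - alert: RobotUnreachableWarning
        expr: max_over_time(up{job="robot-node-exporter"}[10m]) == 0
        labels:
          severity: warning
      - alert: RobotUnreachableCritical
        expr: max_over_time(up{job="robot-node-exporter"}[30m]) == 0
        labels:
          severity: critical
      - alert: RobotRebooted
        expr: changes(node_boot_time_seconds[10m]) > 0
        labels:
          severity: info
```

Routing and de-duplication (e.g., suppressing the warning once the critical fires) would then be handled in Alertmanager.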
docker volume create robot-prometheus-data
docker volume create central-prometheus-data
docker network create monitoring
# central-prometheus.yml
cat >central-prometheus.yml <<EOF
global:
  scrape_interval: 15s
  external_labels:
    prometheus: central
    site: hq
EOF
docker container run -it --rm \
--mount type=bind,source="$(pwd)/central-prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml

docker container run -it --rm \
-d \
--name central-prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/central-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=central-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9090 \
--web.enable-remote-write-receiver
docker container run -it --rm \
-d \
--name robot001-node-exporter \
-p 9100:9100 \
--network monitoring \
docker.io/boxcutter/node-exporter

# robot-prometheus.yml
cat >robot-prometheus.yml <<EOF
global:
  scrape_interval: 15s
  external_labels:
    robot_id: robot001
    fleet: alpha
    site: field
scrape_configs:
  # Signal 1: "Robot is up" (local scrape; works even if WAN is down)
  - job_name: robot-node-exporter
    static_configs:
      - targets: ["robot001-node-exporter:9100"] # node_exporter on the robot
        labels:
          role: compute
remote_write:
  - url: http://central-prometheus:9090/api/v1/write
EOF
docker container run -it --rm \
--mount type=bind,source="$(pwd)/robot-prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml

docker container run -it --rm \
-d \
--name robot001-prometheus \
-p 9091:9091 \
--network monitoring \
--mount type=bind,source="$(pwd)/robot-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=robot-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9091
Go to the central-prometheus on http://localhost:9090/
All the metrics come from the robot via remote_write
# node_memory_MemAvailable_bytes
node_memory_MemAvailable_bytes{fleet="alpha", instance="robot001-node-exporter:9100", job="robot-node-exporter", robot_id="robot001", role="compute", site="field"} 7451455488

On a non-flaky network, you'd just query up{job="robot-node-exporter"}:
up{fleet="alpha", instance="robot001-node-exporter:9100", job="robot-node-exporter",
robot_id="robot001", role="compute", site="field"}
A good "alive in the last 5 minutes" query:
max_over_time(up{job="robot-node-exporter", robot_id="robot001"}[5m])
- Returns 1 if central has seen up==1 at least once in the last 5 minutes.
- Returns 0 if it has seen only up==0 samples (exporter scraped but down) and no 1s.
- Returns no series if nothing arrived at all (WAN down, robot down, remote_write blocked, etc).
To treat "no data" as down:
max_over_time(up{job="robot-node-exporter", robot_id="robot001"}[5m]) or on() vector(0)
This distinguishes "robot/exporter down" from "no remote_write data arriving":
time() - max(timestamp(up{job="robot-node-exporter", robot_id="robot001"}))
Small number: central is receiving fresh samples (<30s). Large number: data stopped arriving (WAN down, robot down, remote_write queue jammed, central receiver issue).
A simple boolean “alive within 90s”:
(time() - max(timestamp(up{job="robot-node-exporter", robot_id="robot001"}))) < 90
“Robot exporter not OK” (exporter reachable from robot Prometheus)
max_over_time(up{job="robot-node-exporter", robot_id="robot001"}[2m]) == 0
“Robot telemetry missing” (remote_write stopped or robot down). This is usually the true "robot missing from central" alarm.
(time() - max(timestamp(up{job="robot-node-exporter", robot_id="robot001"}))) > 120
Sometimes up{job="robot-node-exporter"} is fine, but you might prefer a “robot Prometheus heartbeat” series that always exists even if node_exporter changes.
On the robot, add a scrape for itself:
- job_name: prometheus
  static_configs:
    - targets: ["localhost:9091"] # the robot Prometheus's own listen address
Then on central use
time() - max(timestamp(up{job="prometheus", robot_id="robot001"}))
This works, but it can be expensive if you have lots of robots/series:
(time() - max by (robot_id) (max_over_time(timestamp(up{job="robot-node-exporter"})[90d]))) > 86400
It’s scanning 90 days of raw samples per robot.
Instead: record a “last_seen” series that is cheap to query long-term.
Create a recording rule on central (evaluates every, say, 1m):
groups:
  - name: robot-presence
    interval: 1m
    rules:
      - record: robot_last_seen_timestamp_seconds
        expr: max by (robot_id, fleet, site) (timestamp(up{job="robot-node-exporter"}))
When robots are sending metrics, this records “the last timestamp we’ve seen up for that robot”.
When a robot disappears, this series stops updating, but the previous samples remain in TSDB until your retention window expires - exactly what you want for “weeks/months”.
Now the “missing robots” query is fast and clean:
Robots missing for > 7 days
(time() - max_over_time(robot_last_seen_timestamp_seconds[365d])) > 7*24*60*60
Robots that have ever been seen in the last year, but are missing now (> 1 hour)
(
time() - max_over_time(robot_last_seen_timestamp_seconds[365d])
) > 3600
That returns one series per missing robot, value = seconds since last seen.
Why max_over_time(...[365d]) here is okay: it’s now one low-rate series per robot, not millions of raw scrapes across many metrics.
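Numerically, the missing-robot check is just an age computation on the recorded timestamp. A hand-worked example (the timestamps are made up):

```shell
now=1700000000          # time() at rule evaluation
last_seen=1699990000    # robot_last_seen_timestamp_seconds for one robot
age=$(( now - last_seen ))
# The "missing now (> 1 hour)" condition from the query above
if [ "$age" -gt 3600 ]; then
  echo "missing for ${age}s"
fi
```

The PromQL version does the same per robot, with max_over_time supplying the most recent recorded timestamp within the retention window.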
docker stop robot001-prometheus
# robot-prometheus.yml
cat >robot-prometheus.yml <<EOF
global:
scrape_interval: 15s
external_labels:
robot_id: robot001
fleet: alpha
site: field
rule_files:
- rules/robot-presence-rule.yml
scrape_configs:
# Signal 1: "Robot is up" (local scrape; works even if WAN is down)
- job_name: robot-node-exporter
static_configs:
- targets: ["robot001-node-exporter:9100"] # node_exporter on the robot
labels:
role: compute
remote_write:
- url: http://central-prometheus:9090/api/v1/write
EOF
mkdir -p rules
# robot-presence-rule.yml
cat >rules/robot-presence-rule.yml <<EOF
groups:
- name: robot-presence
interval: 1m
rules:
- record: robot_last_seen_timestamp_seconds
expr: max by (robot_id, fleet, site) (timestamp(up{job="robot-node-exporter"}))
EOF
docker container run -it --rm \
--mount type=bind,source="$(pwd)/robot-prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--mount type=bind,source="$(pwd)/rules/",target=/prometheus/rules/,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml
docker container run -d --rm \
--name robot001-prometheus \
-p 9091:9091 \
--network monitoring \
--mount type=bind,source="$(pwd)/robot-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=bind,source="$(pwd)/rules/",target=/etc/prometheus/rules/,readonly \
--mount type=volume,source=robot-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9091
docker volume create grafana-data
# https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/
docker container run -it --rm \
--name grafana \
-p 3000:3000 \
--network monitoring \
--env GF_AUTH_ANONYMOUS_ENABLED=true \
--env GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
--mount type=volume,source=grafana-data,target=/var/lib/grafana,volume-driver=local \
docker.io/grafana/grafana
Table panel
- Code query:
time() - max by (robot_id, fleet, site) (robot_last_seen_timestamp_seconds)
- Options: Type - Instant
- Transformations:
- Labels to fields
- Organize fields (rename "Value" to "Last seen age")
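To make this reproducible, the central Prometheus can be provisioned as a Grafana data source instead of being added by hand in the UI. A sketch; the file name is arbitrary, the target path follows Grafana's provisioning convention:

```yaml
# datasources.yml - mount into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Central Prometheus
    type: prometheus
    access: proxy
    url: http://central-prometheus:9090
    isDefault: true
```

Add a mount such as `--mount type=bind,source="$(pwd)/datasources.yml",target=/etc/grafana/provisioning/datasources/datasources.yml,readonly` to the `docker container run` command for Grafana.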
# central-prometheus.yml
cat >central-prometheus.yml <<EOF
global:
scrape_interval: 15s
external_labels:
prometheus: central
site: hq
scrape_configs:
# Signal 2: end-to-end reachability from HQ/site to robot
- job_name: robot-reachability
scrape_interval: 15s
static_configs:
- targets:
# Probe the robot's node_exporter over the network path you care about
- robot001-node-exporter:9100
labels:
fleet: alpha
EOF
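With both signals in place, central can tell a network partition apart from a dead robot. A sketch; the `on (fleet)` join assumes one robot per fleet, which holds in this lab but would need a per-robot label in a real fleet:

```
# Unreachable from HQ, but telemetry still arriving: likely a network-path problem
up{job="robot-reachability"} == 0
and on (fleet) max by (fleet) (up{job="robot-node-exporter"}) == 1
```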
docker container run -it --rm \
--mount type=bind,source="$(pwd)/central-prometheus.yml",target=/prometheus/prometheus.yml,readonly \
--entrypoint promtool \
docker.io/boxcutter/prometheus check config prometheus.yml
docker container run -d --rm \
--name central-prometheus \
-p 9090:9090 \
--network monitoring \
--mount type=bind,source="$(pwd)/central-prometheus.yml",target=/etc/prometheus/prometheus.yml,readonly \
--mount type=volume,source=central-prometheus-data,target=/prometheus,volume-driver=local \
docker.io/boxcutter/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9090 \
--web.enable-remote-write-receiver
# robot-prometheus.yml
cat >robot-prometheus.yml <<EOF
global:
scrape_interval: 15s
external_labels:
robot_id: robot001
fleet: alpha
site: field
scrape_configs:
# Signal 1: "Robot is up" (local scrape; works even if WAN is down)
- job_name: robot-node-exporter
static_configs:
- targets: ["robot001-node-exporter:9100"] # node_exporter on the robot
labels:
role: compute
remote_write:
- url: "http://central-prometheus:9090/api/v1/write"
queue_config:
max_samples_per_send: 1000
max_shards: 20
capacity: 5000
EOF
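Remote write itself can fall behind. On the robot, the delta between the newest sample ingested and the newest sample actually sent gives the queue lag in seconds; these are standard Prometheus remote-storage metrics, though names can shift between versions:

```
# Remote write lag in seconds (per remote endpoint)
prometheus_remote_storage_highest_timestamp_in_seconds
  - ignoring(remote_name, url) group_right
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
```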
docker stop robot001-node-exporter
docker stop robot001-prometheus
docker stop central-prometheus
docker volume rm robot-prometheus-data
docker volume rm central-prometheus-data
docker volume rm grafana-data
docker network rm monitoring