Prometheus
- Overview
- Metrics Export
- ExaBGP Metrics
- Prometheus Configuration
- Exporters
- Grafana Dashboards
- Alert Rules
- BGP-Specific Metrics
- Complete Setup
- Troubleshooting
- Best Practices
- See Also
Monitoring ExaBGP with Prometheus provides visibility into BGP session health, route announcements, and operational metrics. This guide covers metrics export, Prometheus configuration, Grafana dashboards, and alerting.
Monitoring Strategy:
- Session Health: BGP session state and uptime
- Route Metrics: Announced/withdrawn routes, routing table size
- Performance: Message processing time, API latency
- Health Checks: Application health status via ExaBGP
- Alerts: Proactive notification of issues
Remember: ExaBGP does NOT manipulate routing tables. It only announces routes via BGP. Monitor the actual route installation on network devices separately.
ExaBGP doesn't include built-in Prometheus metrics, but you can export metrics via:
- Custom API Process: Python script that exports metrics
- External Exporter: Standalone exporter that reads ExaBGP state
- Log Parsing: Parse logs for metrics (not recommended)
- Process Metrics: Generic process metrics (CPU, memory)
Create a Python process that exposes metrics:
#!/usr/bin/env python3
# prometheus_exporter.py
"""
ExaBGP process that exports Prometheus metrics
"""
import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram
from threading import Thread

# Define metrics
bgp_session_state = Gauge(
    'exabgp_bgp_session_state',
    'BGP session state (1=up, 0=down)',
    ['peer', 'local_as', 'peer_as']
)
bgp_messages_received = Counter(
    'exabgp_bgp_messages_received_total',
    'Total BGP messages received',
    ['peer', 'message_type']
)
bgp_messages_sent = Counter(
    'exabgp_bgp_messages_sent_total',
    'Total BGP messages sent',
    ['peer', 'message_type']
)
routes_announced = Gauge(
    'exabgp_routes_announced',
    'Number of routes currently announced',
    ['peer', 'afi', 'safi']
)
routes_received = Gauge(
    'exabgp_routes_received',
    'Number of routes received',
    ['peer', 'afi', 'safi']
)
api_messages_processed = Counter(
    'exabgp_api_messages_processed_total',
    'Total API messages processed',
    ['message_type']
)
health_check_status = Gauge(
    'exabgp_health_check_status',
    'Health check status (1=healthy, 0=unhealthy)',
    ['service']
)

def process_exabgp_messages():
    """Read JSON messages from ExaBGP and update metrics"""
    for line in sys.stdin:
        try:
            msg = json.loads(line)
            msg_type = msg.get('type')
            if msg_type == 'state':
                # BGP session state change
                peer = msg.get('neighbor', {}).get('address', {}).get('peer')
                state = msg.get('neighbor', {}).get('state')
                if state == 'up':
                    bgp_session_state.labels(
                        peer=peer,
                        local_as=msg.get('neighbor', {}).get('asn', {}).get('local'),
                        peer_as=msg.get('neighbor', {}).get('asn', {}).get('peer')
                    ).set(1)
                else:
                    bgp_session_state.labels(
                        peer=peer,
                        local_as=msg.get('neighbor', {}).get('asn', {}).get('local'),
                        peer_as=msg.get('neighbor', {}).get('asn', {}).get('peer')
                    ).set(0)
            elif msg_type == 'update':
                # Route update (announce or withdraw)
                peer = msg.get('neighbor', {}).get('address', {}).get('peer')
                if 'announce' in msg:
                    for afi in msg['announce']:
                        for safi in msg['announce'][afi]:
                            route_count = len(msg['announce'][afi][safi])
                            routes_announced.labels(
                                peer=peer,
                                afi=afi,
                                safi=safi
                            ).inc(route_count)
                if 'withdraw' in msg:
                    for afi in msg['withdraw']:
                        for safi in msg['withdraw'][afi]:
                            route_count = len(msg['withdraw'][afi][safi])
                            routes_announced.labels(
                                peer=peer,
                                afi=afi,
                                safi=safi
                            ).dec(route_count)
            elif msg_type in ['open', 'keepalive', 'notification']:
                # BGP message counters
                peer = msg.get('neighbor', {}).get('address', {}).get('peer')
                direction = msg.get('direction', 'unknown')
                if direction == 'receive':
                    bgp_messages_received.labels(
                        peer=peer,
                        message_type=msg_type
                    ).inc()
                elif direction == 'send':
                    bgp_messages_sent.labels(
                        peer=peer,
                        message_type=msg_type
                    ).inc()
            # Track API message processing
            api_messages_processed.labels(
                message_type=msg_type
            ).inc()
        except json.JSONDecodeError:
            pass
        except Exception as e:
            print(f"Error processing message: {e}", file=sys.stderr)

def main():
    # Start Prometheus HTTP server
    start_http_server(9576)
    print("Prometheus metrics server started on :9576", file=sys.stderr)
    # Process ExaBGP messages in main thread
    process_exabgp_messages()

if __name__ == '__main__':
    main()
# exabgp.conf
process metrics-exporter {
    run /opt/scripts/prometheus_exporter.py;
    encoder json;
}
neighbor 192.168.1.1 {
    router-id 10.0.0.1;
    local-address 10.0.0.2;
    local-as 65000;
    peer-as 65001;
    family {
        ipv4 unicast;
        ipv6 unicast;
    }
    api {
        processes [ metrics-exporter ];
        receive {
            parsed;
            update;
            neighbor-changes;
        }
    }
    static {
        route 203.0.113.1/32 next-hop self;
    }
}
# Session state (1=up, 0=down)
exabgp_bgp_session_state{peer="192.168.1.1", local_as="65000", peer_as="65001"} 1
# BGP messages
exabgp_bgp_messages_received_total{peer="192.168.1.1", message_type="update"} 1523
exabgp_bgp_messages_received_total{peer="192.168.1.1", message_type="keepalive"} 8492
exabgp_bgp_messages_sent_total{peer="192.168.1.1", message_type="update"} 42
exabgp_bgp_messages_sent_total{peer="192.168.1.1", message_type="keepalive"} 8490
# Routes
exabgp_routes_announced{peer="192.168.1.1", afi="ipv4", safi="unicast"} 5
exabgp_routes_received{peer="192.168.1.1", afi="ipv4", safi="unicast"} 1523
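The exporter adjusts `exabgp_routes_announced` with `inc()` and `dec()`, so a prefix that is re-announced without an intervening withdraw is counted twice. One way around this, sketched below under the assumption that each prefix can be reduced to a plain string, is to keep a set of prefixes per (peer, afi, safi) and set the gauge to the size of that set. The `RouteTracker` class and its method names are hypothetical; they are not part of ExaBGP or prometheus_client.

```python
from collections import defaultdict


class RouteTracker:
    """Track announced prefixes per (peer, afi, safi) key (hypothetical helper)."""

    def __init__(self):
        self._prefixes = defaultdict(set)

    def announce(self, peer, afi, safi, prefixes):
        """Record announced prefixes and return the current count for the key."""
        key = (peer, afi, safi)
        self._prefixes[key].update(prefixes)
        return len(self._prefixes[key])

    def withdraw(self, peer, afi, safi, prefixes):
        """Remove withdrawn prefixes and return the current count for the key."""
        key = (peer, afi, safi)
        self._prefixes[key].difference_update(prefixes)
        return len(self._prefixes[key])

    def clear_peer(self, peer):
        """Forget all prefixes for a peer, e.g. when its session goes down."""
        for key in [k for k in self._prefixes if k[0] == peer]:
            del self._prefixes[key]


tracker = RouteTracker()
tracker.announce("192.168.1.1", "ipv4", "unicast", ["203.0.113.1/32"])
# Re-announcing the same prefix does not inflate the count.
count = tracker.announce("192.168.1.1", "ipv4", "unicast", ["203.0.113.1/32"])
print(count)
```

In the exporter, the returned count would be passed to `routes_announced.labels(...).set(count)` instead of calling `inc()` or `dec()`.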
# Add to prometheus_exporter.py
import requests  # third-party HTTP client; not imported by the script above

# Custom health check integration
def check_service_health(service_name, endpoint):
    """Check service health and export metric"""
    try:
        response = requests.get(endpoint, timeout=2)
        healthy = response.status_code == 200
    except requests.RequestException:
        healthy = False
    health_check_status.labels(service=service_name).set(1 if healthy else 0)

# Schedule health checks
def health_check_loop():
    while True:
        check_service_health('web', 'http://localhost:80/health')
        check_service_health('api', 'http://localhost:8080/health')
        time.sleep(10)

# Start in a background thread, before process_exabgp_messages() blocks in main()
Thread(target=health_check_loop, daemon=True).start()
Use node_exporter for system-level metrics. The process-level series below are exposed by the Python client library itself (on Linux) and describe the exporter process:
# CPU usage
process_cpu_seconds_total{job="exabgp"}
# Memory usage
process_resident_memory_bytes{job="exabgp"}
# Open file descriptors
process_open_fds{job="exabgp"}
# Process uptime
process_start_time_seconds{job="exabgp"}
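Because the `process_*` series on the exporter's port come from the client library's default process collector, they describe the exporter process rather than ExaBGP. A minimal sketch of reading ExaBGP's own resident memory follows. It assumes Linux and that the exporter is a direct child of the ExaBGP process, so that `os.getppid()` returns ExaBGP's PID. The function names are hypothetical.

```python
import os


def parse_vm_rss(status_text):
    """Return resident memory in bytes from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:      1234 kB"
            fields = line.split()
            return int(fields[1]) * 1024
    return None


def parent_rss_bytes():
    """Read the parent's RSS; assumes Linux and that the parent is ExaBGP."""
    path = f"/proc/{os.getppid()}/status"
    try:
        with open(path) as handle:
            return parse_vm_rss(handle.read())
    except OSError:
        return None


example = "Name:\texabgp\nVmRSS:\t   51200 kB\nThreads:\t1\n"
print(parse_vm_rss(example))  # 52428800
```

The result could back an additional gauge in the exporter, refreshed from a background thread.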
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # ExaBGP metrics
  - job_name: 'exabgp'
    static_configs:
      - targets:
          - 'localhost:9576'
        labels:
          instance: 'exabgp-node1'
          environment: 'production'
  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'
  # Additional ExaBGP instances
  - job_name: 'exabgp-cluster'
    static_configs:
      - targets:
          - 'exabgp-node1:9576'
          - 'exabgp-node2:9576'
          - 'exabgp-node3:9576'
        labels:
          cluster: 'primary'
scrape_configs:
  - job_name: 'exabgp-k8s'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - kube-system
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port from annotation (combine the pod address with the annotated port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
scrape_configs:
  - job_name: 'exabgp-consul'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        services:
          - exabgp
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node
# Install
pip install exabgp-exporter
# Run
exabgp-exporter --exabgp-host localhost --exabgp-port 5000 --listen-port 9576
Docker:
docker run -d \
  --name exabgp-exporter \
  -p 9576:9576 \
  -e EXABGP_HOST=localhost \
  -e EXABGP_PORT=5000 \
  lusis/exabgp_exporter
# docker-compose.yml
services:
  exabgp:
    image: exabgp/exabgp:latest
    network_mode: host
    volumes:
      - ./exabgp.conf:/etc/exabgp/exabgp.conf:ro
  exporter:
    image: myorg/exabgp-prometheus-exporter:latest
    network_mode: host
    environment:
      - EXABGP_API_HOST=localhost
      - EXABGP_API_PORT=5000
      - LISTEN_PORT=9576
    depends_on:
      - exabgp
For complete visibility, also export BGP metrics from routers:
scrape_configs:
  # SNMP exporter for BGP on routers
  - job_name: 'bgp-routers'
    static_configs:
      - targets:
          - router1.example.com
          - router2.example.com
    metrics_path: /snmp
    params:
      module: [bgp4]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116
{
"dashboard": {
"title": "ExaBGP Monitoring",
"panels": [
{
"title": "BGP Session State",
"type": "stat",
"targets": [
{
"expr": "exabgp_bgp_session_state",
"legendFormat": "{{peer}}"
}
],
"fieldConfig": {
"defaults": {
"mappings": [
{"type": "value", "value": "0", "text": "Down", "color": "red"},
{"type": "value", "value": "1", "text": "Up", "color": "green"}
]
}
}
},
{
"title": "Routes Announced",
"type": "graph",
"targets": [
{
"expr": "exabgp_routes_announced",
"legendFormat": "{{peer}} - {{afi}}/{{safi}}"
}
]
},
{
"title": "BGP Messages Rate",
"type": "graph",
"targets": [
{
"expr": "rate(exabgp_bgp_messages_received_total[5m])",
"legendFormat": "RX {{peer}} - {{message_type}}"
},
{
"expr": "rate(exabgp_bgp_messages_sent_total[5m])",
"legendFormat": "TX {{peer}} - {{message_type}}"
}
]
},
{
"title": "Health Check Status",
"type": "stat",
"targets": [
{
"expr": "exabgp_health_check_status",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "process_resident_memory_bytes{job='exabgp'}",
"legendFormat": "{{instance}}"
}
]
}
]
}
}
1. BGP Session Overview
# Session state
exabgp_bgp_session_state
# Exporter process uptime (process_start_time_seconds is not the BGP session start)
time() - process_start_time_seconds{job="exabgp"}
# Active sessions count
count(exabgp_bgp_session_state == 1)
2. Route Statistics
# Total routes announced
sum(exabgp_routes_announced)
# Routes by peer
sum by (peer) (exabgp_routes_announced)
# Route changes over 5 minutes (the metric is a gauge, so use changes() rather than rate())
changes(exabgp_routes_announced[5m])
3. BGP Protocol Health
# Keepalive messages (should be steady)
rate(exabgp_bgp_messages_received_total{message_type="keepalive"}[5m])
# Update messages
rate(exabgp_bgp_messages_received_total{message_type="update"}[5m])
# Notifications (errors)
rate(exabgp_bgp_messages_received_total{message_type="notification"}[1h])
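Whether the keepalive rate is "steady" depends on the negotiated timers. As a worked example, assuming keepalives are sent at one third of the hold time (the ratio RFC 4271 suggests as a reasonable maximum interval), the expected per-second rate can be computed and compared against the query result. The function name is hypothetical.

```python
def expected_keepalive_rate(hold_time_seconds):
    """Expected keepalives per second, assuming keepalive interval = hold time / 3."""
    if hold_time_seconds <= 0:
        # A hold time of 0 means no keepalives are exchanged.
        return 0.0
    keepalive_interval = hold_time_seconds / 3
    return 1.0 / keepalive_interval


rate = expected_keepalive_rate(90)  # 90 s hold time gives a 30 s keepalive interval
print(round(rate, 4))
```

A measured rate well below this value over several scrape intervals would suggest missed keepalives.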
4. Application Health
# Service health status
exabgp_health_check_status
# Unhealthy services count
count(exabgp_health_check_status == 0)
5. Performance Metrics
# CPU usage
rate(process_cpu_seconds_total{job="exabgp"}[5m])
# Memory usage
process_resident_memory_bytes{job="exabgp"}
# API message processing rate
rate(exabgp_api_messages_processed_total[5m])
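The queries above can also be run outside Grafana through the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below builds the request URL and extracts values from a response. The helper names and the base URL are assumptions, and `run_query` is defined but not called because it needs a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, expr):
    """Build an instant-query URL with the expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def extract_values(payload):
    """Map each series' label set to its float value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]
        results[labels] = float(value)
    return results


def run_query(base_url, expr, timeout=5):
    """Execute the query against a live server (not called in this sketch)."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as response:
        return extract_values(json.load(response))


url = build_query_url("http://localhost:9090", "sum(exabgp_routes_announced)")
print(url)
```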
Save dashboard JSON and import:
# Via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @exabgp-dashboard.json
# Via Grafana UI
# Settings → Data Sources → Add Prometheus → Import Dashboard
# /etc/prometheus/rules/exabgp.yml
groups:
  - name: exabgp_alerts
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down on {{ $labels.instance }}"
          description: "BGP session to {{ $labels.peer }} is down for more than 2 minutes"
      # No routes announced (aggregate by instance so the label is available in annotations)
      - alert: NoRoutesAnnounced
        expr: sum by (instance) (exabgp_routes_announced) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced by ExaBGP on {{ $labels.instance }}"
          description: "ExaBGP is not announcing any routes"
      # Route flapping (the metric is a gauge, so count value changes instead of using rate())
      - alert: RouteFlapping
        expr: changes(exabgp_routes_announced[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Route flapping detected on {{ $labels.instance }}"
          description: "High rate of route changes: {{ $value }} changes in the last 5 minutes"
      # BGP notification messages (errors)
      - alert: BGPNotifications
        expr: rate(exabgp_bgp_messages_received_total{message_type="notification"}[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "BGP notification messages received on {{ $labels.instance }}"
          description: "Receiving BGP error notifications from {{ $labels.peer }}"
      # Service unhealthy
      - alert: ServiceUnhealthy
        expr: exabgp_health_check_status == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} unhealthy on {{ $labels.instance }}"
          description: "Health check failing for {{ $labels.service }}"
      # ExaBGP process down
      - alert: ExaBGPDown
        expr: up{job="exabgp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ExaBGP process down on {{ $labels.instance }}"
          description: "ExaBGP is not responding to scrape requests"
      # High memory usage
      - alert: ExaBGPHighMemory
        expr: process_resident_memory_bytes{job="exabgp"} > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ExaBGP high memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize }}B"
      # No keepalive messages
      - alert: NoKeepalives
        expr: rate(exabgp_bgp_messages_received_total{message_type="keepalive"}[5m]) == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "No BGP keepalives from {{ $labels.peer }}"
          description: "BGP session may be frozen"
Load rules:
# prometheus.yml
rule_files:
  - '/etc/prometheus/rules/*.yml'
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
    # Warnings go to Slack
    - match:
        severity: warning
      receiver: slack
receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#alerts'
        title: 'ExaBGP Alert'
        text: '{{ .CommonAnnotations.description }}'
# Add to prometheus_exporter.py
# Prefix-specific metrics
prefix_announced = Gauge(
    'exabgp_prefix_announced',
    'Whether specific prefix is announced',
    ['peer', 'prefix']
)
# FlowSpec rules
flowspec_rules_active = Gauge(
    'exabgp_flowspec_rules_active',
    'Number of active FlowSpec rules',
    ['peer']
)
# Session establishment time
bgp_session_established_timestamp = Gauge(
    'exabgp_bgp_session_established_timestamp',
    'Timestamp when BGP session was established',
    ['peer']
)
# Route attributes
route_med = Gauge(
    'exabgp_route_med',
    'MED attribute of announced routes',
    ['peer', 'prefix']
)
# Community tracking
routes_with_community = Gauge(
    'exabgp_routes_with_community',
    'Routes with specific community',
    ['peer', 'community']
)
# Session duration
time() - exabgp_bgp_session_established_timestamp
# Route churn (gauge value changes over 5 minutes; the exporter above defines no separate withdrawn metric)
changes(exabgp_routes_announced[5m])
# Average routes per peer
avg(exabgp_routes_announced) by (peer)
# Session flaps (state changes)
changes(exabgp_bgp_session_state[1h])
# Percentage of healthy services
(sum(exabgp_health_check_status) / count(exabgp_health_check_status)) * 100
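The session-duration and session-flap queries above rely on the exporter recording when each session came up. A minimal sketch of that bookkeeping, driven by the `up`/`down` state events the exporter already receives, is shown below. The `SessionTracker` class is hypothetical; the clock is injectable only to make the logic easy to exercise.

```python
import time


class SessionTracker:
    """Track establishment time and flap count per peer (hypothetical helper)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._established_at = {}
        self._state = {}
        self.flaps = {}

    def record_state(self, peer, state):
        """Record a state event; count a flap whenever the up/down state changes."""
        is_up = state == "up"
        previous = self._state.get(peer)
        if previous is not None and previous != is_up:
            self.flaps[peer] = self.flaps.get(peer, 0) + 1
        self._state[peer] = is_up
        if is_up and previous is not True:
            # Newly established: remember when the session came up.
            self._established_at[peer] = self._clock()
        elif not is_up:
            self._established_at.pop(peer, None)

    def established_timestamp(self, peer):
        """Return the establishment time, or None if the session is down."""
        return self._established_at.get(peer)

    def session_duration(self, peer):
        """Return seconds since establishment, or None if the session is down."""
        started = self._established_at.get(peer)
        return None if started is None else self._clock() - started


tracker = SessionTracker()
tracker.record_state("192.168.1.1", "up")
print(tracker.established_timestamp("192.168.1.1") is not None)
```

In the exporter, `established_timestamp()` would feed `bgp_session_established_timestamp.labels(peer=...).set(...)` each time a session comes up.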
version: '3.8'
services:
  # ExaBGP
  exabgp:
    image: exabgp/exabgp:5.0.0
    network_mode: host
    volumes:
      - ./config/exabgp.conf:/etc/exabgp/exabgp.conf:ro
      - ./scripts:/opt/scripts:ro
    restart: unless-stopped
  # Node exporter
  node-exporter:
    image: prom/node-exporter:latest
    network_mode: host
    pid: host
    restart: unless-stopped
  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped
  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    restart: unless-stopped
    depends_on:
      - prometheus
volumes:
  prometheus-data:
  grafana-data:
# Check if exporter is running
curl http://localhost:9576/metrics
# Verify the exporter process is running under ExaBGP
docker exec exabgp ps aux | grep prometheus_exporter
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq
# Enable debug logging in exporter
import logging
logging.basicConfig(level=logging.DEBUG)
# Log all received messages
print(f"DEBUG: Received message: {msg}", file=sys.stderr)
# Check Prometheus logs
docker logs prometheus
# Verify network connectivity
docker exec prometheus wget -O- http://exabgp:9576/metrics
# Check firewall
iptables -L -n | grep 9576
# Test Prometheus data source (quote the URL so the shell does not interpret the "?")
curl 'http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up'
# Check Grafana logs
docker logs grafana
Match scrape interval to BGP timers:
# If BGP keepalive is 30s, use 15s scrape
scrape_interval: 15s
Balance storage vs. history:
# Keep 30 days of data
--storage.tsdb.retention.time=30d
Run Prometheus in HA mode:
# Two Prometheus instances scraping same targets
# Use Thanos or Cortex for long-term storage
Tune alert thresholds:
# Require 2 minutes of downtime before alerting
for: 2m
Avoid high-cardinality labels:
# Bad: label per prefix (thousands of prefixes)
prefix_metric.labels(prefix="1.2.3.4/32")
# Good: aggregate by peer
routes_by_peer.labels(peer="192.168.1.1")
Protect metrics endpoint:
# Basic auth on exporter
from werkzeug.security import check_password_hash
Document custom metrics:
routes_announced = Gauge(
    'exabgp_routes_announced',
    'Number of routes currently announced via BGP',  # Clear description
    ['peer', 'afi', 'safi']
)
- Monitoring Operations - General monitoring guide
- Debugging - Troubleshooting ExaBGP
- Docker Integration - Docker deployment
- Kubernetes Integration - Kubernetes monitoring
- High Availability - HA patterns
👻 Ghost written by Claude (Anthropic AI)