Prometheus

Thomas Mangin edited this page Nov 13, 2025 · 4 revisions

Prometheus Monitoring

Overview

Monitoring ExaBGP with Prometheus provides visibility into BGP session health, route announcements, and operational metrics. This guide covers metrics export, Prometheus configuration, Grafana dashboards, and alerting.

Monitoring Strategy:

  • Session Health: BGP session state and uptime
  • Route Metrics: Announced/withdrawn routes, routing table size
  • Performance: Message processing time, API latency
  • Health Checks: Application health status via ExaBGP
  • Alerts: Proactive notification of issues

Remember: ExaBGP does NOT manipulate routing tables. It only announces routes via BGP. Monitor the actual route installation on network devices separately.

Metrics Export

ExaBGP Native Metrics

ExaBGP doesn't include built-in Prometheus metrics, but you can export metrics via:

  1. Custom API Process: Python script that exports metrics
  2. External Exporter: Standalone exporter that reads ExaBGP state
  3. Log Parsing: Parse logs for metrics (not recommended)
  4. Process Metrics: Generic process metrics (CPU, memory)

Custom Metrics Process

Create a Python process that exposes metrics:

#!/usr/bin/env python3
# prometheus_exporter.py

"""
ExaBGP process that exports Prometheus metrics
"""

import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram
from threading import Thread

# Define metrics
bgp_session_state = Gauge(
    'exabgp_bgp_session_state',
    'BGP session state (1=up, 0=down)',
    ['peer', 'local_as', 'peer_as']
)

bgp_messages_received = Counter(
    'exabgp_bgp_messages_received_total',
    'Total BGP messages received',
    ['peer', 'message_type']
)

bgp_messages_sent = Counter(
    'exabgp_bgp_messages_sent_total',
    'Total BGP messages sent',
    ['peer', 'message_type']
)

routes_announced = Gauge(
    'exabgp_routes_announced',
    'Number of routes currently announced',
    ['peer', 'afi', 'safi']
)

routes_received = Gauge(
    'exabgp_routes_received',
    'Number of routes received',
    ['peer', 'afi', 'safi']
)

api_messages_processed = Counter(
    'exabgp_api_messages_processed_total',
    'Total API messages processed',
    ['message_type']
)

health_check_status = Gauge(
    'exabgp_health_check_status',
    'Health check status (1=healthy, 0=unhealthy)',
    ['service']
)

def process_exabgp_messages():
    """Read JSON messages from ExaBGP and update metrics"""
    for line in sys.stdin:
        try:
            msg = json.loads(line)
            msg_type = msg.get('type')

            if msg_type == 'state':
                # BGP session state change
                peer = msg.get('neighbor', {}).get('address', {}).get('peer')
                state = msg.get('neighbor', {}).get('state')

                # Gauge is 1 while the session is up, 0 for any other state
                neighbor_asn = msg.get('neighbor', {}).get('asn', {})
                bgp_session_state.labels(
                    peer=peer,
                    local_as=neighbor_asn.get('local'),
                    peer_as=neighbor_asn.get('peer')
                ).set(1 if state == 'up' else 0)

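            elif msg_type == 'update':
                # Route update (announce or withdraw). The JSON encoder nests the
                # payload under neighbor -> message -> update and keys it by
                # address family, for example "ipv4 unicast".
                neighbor = msg.get('neighbor', {})
                peer = neighbor.get('address', {}).get('peer')
                update = neighbor.get('message', {}).get('update', {})

                # Updates sent by ExaBGP count as announced, all others as received
                if neighbor.get('direction') == 'send':
                    gauge = routes_announced
                else:
                    gauge = routes_received

                for family, by_nexthop in update.get('announce', {}).items():
                    afi, _, safi = family.partition(' ')
                    # announce maps each next-hop to its collection of NLRIs
                    route_count = sum(len(nlris) for nlris in by_nexthop.values())
                    gauge.labels(peer=peer, afi=afi, safi=safi).inc(route_count)

                for family, nlris in update.get('withdraw', {}).items():
                    afi, _, safi = family.partition(' ')
                    gauge.labels(peer=peer, afi=afi, safi=safi).dec(len(nlris))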

            elif msg_type in ['open', 'keepalive', 'notification']:
                # BGP message counters
                peer = msg.get('neighbor', {}).get('address', {}).get('peer')
                # The direction field sits inside the neighbor object
                direction = msg.get('neighbor', {}).get('direction', 'unknown')

                if direction == 'receive':
                    bgp_messages_received.labels(
                        peer=peer,
                        message_type=msg_type
                    ).inc()
                elif direction == 'send':
                    bgp_messages_sent.labels(
                        peer=peer,
                        message_type=msg_type
                    ).inc()

            # Track API message processing
            api_messages_processed.labels(
                message_type=msg_type
            ).inc()

        except json.JSONDecodeError:
            pass
        except Exception as e:
            print(f"Error processing message: {e}", file=sys.stderr)

def main():
    # Start Prometheus HTTP server
    start_http_server(9576)
    print("Prometheus metrics server started on :9576", file=sys.stderr)

    # Process ExaBGP messages in main thread
    process_exabgp_messages()

if __name__ == '__main__':
    main()
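
To check the parsing logic without a live BGP session, the state handling above can be exercised against a hand-written message. The sketch below uses only the standard library. The sample message is an assumption modeled on the fields the exporter reads (`neighbor.address.peer`, `neighbor.asn`, `neighbor.state`), and `session_labels_and_value` is a hypothetical helper, not part of ExaBGP.

```python
import json

def session_labels_and_value(line):
    """Return (labels, value) for a 'state' message, or None for anything else."""
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed line: skip it, as the exporter does
    if msg.get('type') != 'state':
        return None
    neighbor = msg.get('neighbor', {})
    labels = {
        'peer': neighbor.get('address', {}).get('peer'),
        'local_as': neighbor.get('asn', {}).get('local'),
        'peer_as': neighbor.get('asn', {}).get('peer'),
    }
    # 1 while the session is up, 0 for any other state
    return labels, 1 if neighbor.get('state') == 'up' else 0

# Hand-written sample (assumption); real messages carry more fields
sample = json.dumps({
    'type': 'state',
    'neighbor': {
        'address': {'local': '10.0.0.2', 'peer': '192.168.1.1'},
        'asn': {'local': 65000, 'peer': 65001},
        'state': 'up',
    },
})

labels, value = session_labels_and_value(sample)
print(labels['peer'], value)
```

Keeping the parsing in a pure function like this makes it possible to unit test the exporter separately from the metrics server.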

ExaBGP Configuration with Metrics

# exabgp.conf

process metrics-exporter {
    run /opt/scripts/prometheus_exporter.py;
    encoder json;
}

neighbor 192.168.1.1 {
    router-id 10.0.0.1;
    local-address 10.0.0.2;
    local-as 65000;
    peer-as 65001;

    family {
        ipv4 unicast;
        ipv6 unicast;
    }

    api {
        processes [ metrics-exporter ];
        receive {
            parsed;
            update;
            neighbor-changes;
        }
    }

    static {
        route 203.0.113.1/32 next-hop self;
    }
}

ExaBGP Metrics

Core BGP Metrics

# Session state (1=up, 0=down)
exabgp_bgp_session_state{peer="192.168.1.1", local_as="65000", peer_as="65001"} 1

# BGP messages
exabgp_bgp_messages_received_total{peer="192.168.1.1", message_type="update"} 1523
exabgp_bgp_messages_received_total{peer="192.168.1.1", message_type="keepalive"} 8492
exabgp_bgp_messages_sent_total{peer="192.168.1.1", message_type="update"} 42
exabgp_bgp_messages_sent_total{peer="192.168.1.1", message_type="keepalive"} 8490

# Routes
exabgp_routes_announced{peer="192.168.1.1", afi="ipv4", safi="unicast"} 5
exabgp_routes_received{peer="192.168.1.1", afi="ipv4", safi="unicast"} 1523

Application Health Metrics

# Add to prometheus_exporter.py

import requests  # third-party: pip install requests

# Custom health check integration
def check_service_health(service_name, endpoint):
    """Check service health and export metric"""
    try:
        response = requests.get(endpoint, timeout=2)
        if response.status_code == 200:
            health_check_status.labels(service=service_name).set(1)
        else:
            health_check_status.labels(service=service_name).set(0)
    except requests.RequestException:
        # Connection errors and timeouts count as unhealthy
        health_check_status.labels(service=service_name).set(0)

# Schedule health checks
def health_check_loop():
    while True:
        check_service_health('web', 'http://localhost:80/health')
        check_service_health('api', 'http://localhost:8080/health')
        time.sleep(10)

# Start in background thread (call before process_exabgp_messages() blocks on stdin)
Thread(target=health_check_loop, daemon=True).start()

Process Metrics

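On Linux, prometheus_client automatically exports the process_* metrics below. They describe the Python exporter process, not the ExaBGP daemon or the host; use node_exporter for system-level metrics.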

# CPU usage
process_cpu_seconds_total{job="exabgp"}

# Memory usage
process_resident_memory_bytes{job="exabgp"}

# Open file descriptors
process_open_fds{job="exabgp"}

# Process uptime
process_start_time_seconds{job="exabgp"}

Prometheus Configuration

Basic Scrape Config

# prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # ExaBGP metrics
  - job_name: 'exabgp'
    static_configs:
      - targets:
          - 'localhost:9576'
        labels:
          instance: 'exabgp-node1'
          environment: 'production'

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'

  # Additional ExaBGP instances
  - job_name: 'exabgp-cluster'
    static_configs:
      - targets:
          - 'exabgp-node1:9576'
          - 'exabgp-node2:9576'
          - 'exabgp-node3:9576'
        labels:
          cluster: 'primary'

Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'exabgp-k8s'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - kube-system

    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Use custom port from annotation (keep the pod IP, swap in the annotated port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
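
Relabel rules are easier to debug when the `replace` action can be tried offline. The sketch below emulates it in Python under stated assumptions: source label values are joined with `;`, the regex must match the whole joined string, and `$1`-style references are expanded. `relabel_replace` is a hypothetical helper, Python's `re` is only an approximation of the RE2 syntax Prometheus uses, and the address-plus-port pattern shown is a commonly used rule rather than anything specific to ExaBGP.

```python
import re

def relabel_replace(source_values, regex, replacement, separator=";"):
    """Emulate a `replace` relabel action; return None when the regex does not match."""
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)  # relabel regexes are anchored at both ends
    if match is None:
        return None
    # Translate $1 / ${1} references into Python's \g<1> form
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)

# Pod address discovered by Kubernetes plus the port annotation value
new_address = relabel_replace(
    ["10.0.0.5:8080", "9576"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
)
print(new_address)
```

Feeding the function the exact source labels of a rule shows quickly whether the resulting `__address__` is a valid host:port pair.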

Service Discovery (Consul)

scrape_configs:
  - job_name: 'exabgp-consul'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        services:
          - exabgp

    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node

Exporters

Third-Party ExaBGP Exporters

1. exabgp_exporter (Community)

# Install
pip install exabgp-exporter

# Run
exabgp-exporter --exabgp-host localhost --exabgp-port 5000 --listen-port 9576

Docker:

docker run -d \
    --name exabgp-exporter \
    -p 9576:9576 \
    -e EXABGP_HOST=localhost \
    -e EXABGP_PORT=5000 \
    lusis/exabgp_exporter

2. Custom Sidecar Exporter

# docker-compose.yml
services:
  exabgp:
    image: exabgp/exabgp:latest
    network_mode: host
    volumes:
      - ./exabgp.conf:/etc/exabgp/exabgp.conf:ro

  exporter:
    image: myorg/exabgp-prometheus-exporter:latest
    network_mode: host
    environment:
      - EXABGP_API_HOST=localhost
      - EXABGP_API_PORT=5000
      - LISTEN_PORT=9576
    depends_on:
      - exabgp

BGP Metrics from Network Devices

For complete visibility, also export BGP metrics from routers:

scrape_configs:
  # SNMP exporter for BGP on routers
  - job_name: 'bgp-routers'
    static_configs:
      - targets:
          - router1.example.com
          - router2.example.com
    metrics_path: /snmp
    params:
      module: [bgp4]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

Grafana Dashboards

ExaBGP Dashboard JSON

{
  "dashboard": {
    "title": "ExaBGP Monitoring",
    "panels": [
      {
        "title": "BGP Session State",
        "type": "stat",
        "targets": [
          {
            "expr": "exabgp_bgp_session_state",
            "legendFormat": "{{peer}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {"type": "value", "value": "0", "text": "Down", "color": "red"},
              {"type": "value", "value": "1", "text": "Up", "color": "green"}
            ]
          }
        }
      },
      {
        "title": "Routes Announced",
        "type": "graph",
        "targets": [
          {
            "expr": "exabgp_routes_announced",
            "legendFormat": "{{peer}} - {{afi}}/{{safi}}"
          }
        ]
      },
      {
        "title": "BGP Messages Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(exabgp_bgp_messages_received_total[5m])",
            "legendFormat": "RX {{peer}} - {{message_type}}"
          },
          {
            "expr": "rate(exabgp_bgp_messages_sent_total[5m])",
            "legendFormat": "TX {{peer}} - {{message_type}}"
          }
        ]
      },
      {
        "title": "Health Check Status",
        "type": "stat",
        "targets": [
          {
            "expr": "exabgp_health_check_status",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_resident_memory_bytes{job='exabgp'}",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}

Dashboard Panels

1. BGP Session Overview

# Session state
exabgp_bgp_session_state

# Exporter process uptime (not the BGP session uptime)
time() - process_start_time_seconds{job="exabgp"}

# Active sessions count
count(exabgp_bgp_session_state == 1)

2. Route Statistics

# Total routes announced
sum(exabgp_routes_announced)

# Routes by peer
sum by (peer) (exabgp_routes_announced)

# Route changes (the metric is a gauge, so use changes() rather than rate())
changes(exabgp_routes_announced[5m])

3. BGP Protocol Health

# Keepalive messages (should be steady)
rate(exabgp_bgp_messages_received_total{message_type="keepalive"}[5m])

# Update messages
rate(exabgp_bgp_messages_received_total{message_type="update"}[5m])

# Notifications (errors)
rate(exabgp_bgp_messages_received_total{message_type="notification"}[1h])

4. Application Health

# Service health status
exabgp_health_check_status

# Unhealthy services count
count(exabgp_health_check_status == 0)

5. Performance Metrics

# CPU usage
rate(process_cpu_seconds_total{job="exabgp"}[5m])

# Memory usage
process_resident_memory_bytes{job="exabgp"}

# API message processing rate
rate(exabgp_api_messages_processed_total[5m])

Import Dashboard

Save dashboard JSON and import:

# Via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d @exabgp-dashboard.json

# Via Grafana UI
# Settings → Data Sources → Add Prometheus → Import Dashboard
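
When the dashboard is maintained in code rather than edited by hand, the JSON file posted above can be generated. This is a minimal sketch assuming the `dashboard` envelope shown earlier plus Grafana's `overwrite` flag; `stat_panel` and `dashboard_payload` are hypothetical helpers, and a real dashboard would usually need more panel options.

```python
import json

def stat_panel(title, expr, legend):
    """Build a minimal stat panel with a single Prometheus query."""
    return {
        'title': title,
        'type': 'stat',
        'targets': [{'expr': expr, 'legendFormat': legend}],
    }

def dashboard_payload(title, panels, overwrite=True):
    """Wrap panels in the envelope posted to /api/dashboards/db."""
    return {
        # id None asks Grafana to create the dashboard if it does not exist
        'dashboard': {'id': None, 'title': title, 'panels': panels},
        'overwrite': overwrite,
    }

payload = dashboard_payload('ExaBGP Monitoring', [
    stat_panel('BGP Session State', 'exabgp_bgp_session_state', '{{peer}}'),
    stat_panel('Health Check Status', 'exabgp_health_check_status', '{{service}}'),
])

# Redirect this output to exabgp-dashboard.json before running curl
print(json.dumps(payload, indent=2))
```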

Alert Rules

Prometheus Alert Rules

# /etc/prometheus/rules/exabgp.yml

groups:
  - name: exabgp_alerts
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down on {{ $labels.instance }}"
          description: "BGP session to {{ $labels.peer }} is down for more than 2 minutes"

      # No routes announced
      - alert: NoRoutesAnnounced
        expr: sum by (instance) (exabgp_routes_announced) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced by ExaBGP on {{ $labels.instance }}"
          description: "ExaBGP is not announcing any routes"

      # Route flapping
      - alert: RouteFlapping
        expr: changes(exabgp_routes_announced[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Route flapping detected on {{ $labels.instance }}"
          description: "Announced route count changed {{ $value }} times in the last 5 minutes"

      # BGP notification messages (errors)
      - alert: BGPNotifications
        expr: rate(exabgp_bgp_messages_received_total{message_type="notification"}[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "BGP notification messages received on {{ $labels.instance }}"
          description: "Receiving BGP error notifications from {{ $labels.peer }}"

      # Service unhealthy
      - alert: ServiceUnhealthy
        expr: exabgp_health_check_status == 0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} unhealthy on {{ $labels.instance }}"
          description: "Health check failing for {{ $labels.service }}"

      # ExaBGP process down
      - alert: ExaBGPDown
        expr: up{job="exabgp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ExaBGP process down on {{ $labels.instance }}"
          description: "ExaBGP is not responding to scrape requests"

      # High memory usage
      - alert: ExaBGPHighMemory
        expr: process_resident_memory_bytes{job="exabgp"} > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ExaBGP high memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize }}B"

      # No keepalive messages
      - alert: NoKeepalives
        expr: rate(exabgp_bgp_messages_received_total{message_type="keepalive"}[5m]) == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "No BGP keepalives from {{ $labels.peer }}"
          description: "BGP session may be frozen"

Load rules:

# prometheus.yml
rule_files:
  - '/etc/prometheus/rules/*.yml'

Alertmanager Configuration

# alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty

    # Warnings go to Slack
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#alerts'
        title: 'ExaBGP Alert'
        text: '{{ .CommonAnnotations.description }}'
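
The routing tree above can be reasoned about as "first matching child route wins, otherwise the default receiver". The sketch below models only that behavior for the two `match` routes in this configuration. It is a simplification: it ignores `continue`, nested routes, and regex matchers, and `pick_receiver` is a hypothetical helper.

```python
def pick_receiver(labels, routes, default_receiver):
    """Return the receiver of the first route whose match labels all equal the alert's labels."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route['match'].items()):
            return route['receiver']
    return default_receiver

# Mirrors the child routes in alertmanager.yml above
routes = [
    {'match': {'severity': 'critical'}, 'receiver': 'pagerduty'},
    {'match': {'severity': 'warning'}, 'receiver': 'slack'},
]

for alert in (
    {'alertname': 'BGPSessionDown', 'severity': 'critical'},
    {'alertname': 'ServiceUnhealthy', 'severity': 'warning'},
    {'alertname': 'Custom', 'severity': 'info'},
):
    print(alert['alertname'], '->', pick_receiver(alert, routes, 'default'))
```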

BGP-Specific Metrics

Advanced BGP Metrics

# Add to prometheus_exporter.py

# Prefix-specific metrics
prefix_announced = Gauge(
    'exabgp_prefix_announced',
    'Whether specific prefix is announced',
    ['peer', 'prefix']
)

# FlowSpec rules
flowspec_rules_active = Gauge(
    'exabgp_flowspec_rules_active',
    'Number of active FlowSpec rules',
    ['peer']
)

# Session establishment time
bgp_session_established_timestamp = Gauge(
    'exabgp_bgp_session_established_timestamp',
    'Timestamp when BGP session was established',
    ['peer']
)

# Route attributes
route_med = Gauge(
    'exabgp_route_med',
    'MED attribute of announced routes',
    ['peer', 'prefix']
)

# Community tracking
routes_with_community = Gauge(
    'exabgp_routes_with_community',
    'Routes with specific community',
    ['peer', 'community']
)

PromQL Queries for BGP

# Session duration
time() - exabgp_bgp_session_established_timestamp

# Route churn (announces and withdraws both change the gauge)
changes(exabgp_routes_announced[5m])

# Average routes per peer
avg(exabgp_routes_announced) by (peer)

# Session flaps (state changes)
changes(exabgp_bgp_session_state[1h])

# Percentage of healthy services
(sum(exabgp_health_check_status) / count(exabgp_health_check_status)) * 100
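
Two of the queries above reduce to simple arithmetic that can be checked by hand. The sketch below mirrors them in plain Python on made-up sample data: `count_changes` approximates what `changes()` returns for one series over a range, and `healthy_percentage` mirrors the sum/count ratio. Both function names and the sample values are illustrative assumptions.

```python
def count_changes(samples):
    """Count how often consecutive samples differ, like changes() over one range."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)

def healthy_percentage(statuses):
    """Share of services reporting 1, as a percentage."""
    values = list(statuses.values())
    return 100.0 * sum(values) / len(values)

# Session state scraped six times: up, up, down, down, up, up
flaps = count_changes([1, 1, 0, 0, 1, 1])

# One healthy and one unhealthy service
percent = healthy_percentage({'web': 1, 'api': 0})

print(flaps, percent)
```

A single down/up cycle therefore shows up as two changes, which is worth keeping in mind when choosing a flap threshold.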

Complete Setup

Docker Compose with Monitoring Stack

version: '3.8'

services:
  # ExaBGP
  exabgp:
    image: exabgp/exabgp:5.0.0
    network_mode: host
    volumes:
      - ./config/exabgp.conf:/etc/exabgp/exabgp.conf:ro
      - ./scripts:/opt/scripts:ro
    restart: unless-stopped

  # Node exporter
  node-exporter:
    image: prom/node-exporter:latest
    network_mode: host
    pid: host
    restart: unless-stopped

  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    restart: unless-stopped
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:

Troubleshooting

Metrics Not Appearing

# Check if exporter is running
curl http://localhost:9576/metrics

# Verify that ExaBGP started the exporter process
docker exec exabgp ps aux | grep prometheus_exporter

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq
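
The targets response is verbose, so it can help to reduce it to the targets that are failing. The sketch below works on an already parsed response and assumes the usual shape of `/api/v1/targets` (`data.activeTargets` entries with `labels`, `scrapeUrl`, `health`, and `lastError`). The sample payload is trimmed and invented for illustration, and `unhealthy_targets` is a hypothetical helper.

```python
import json

def unhealthy_targets(payload):
    """List (job, scrape URL, last error) for active targets whose health is not 'up'."""
    active = payload.get('data', {}).get('activeTargets', [])
    return [
        (t.get('labels', {}).get('job'), t.get('scrapeUrl'), t.get('lastError'))
        for t in active
        if t.get('health') != 'up'
    ]

# Trimmed, made-up response for illustration
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "exabgp"}, "scrapeUrl": "http://localhost:9576/metrics",
       "health": "down", "lastError": "connection refused"},
      {"labels": {"job": "node"}, "scrapeUrl": "http://localhost:9100/metrics",
       "health": "up", "lastError": ""}
    ]
  }
}
""")

for job, url, error in unhealthy_targets(sample):
    print(f"{job}: {url} ({error})")
```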

Wrong Metric Values

# Enable debug logging in exporter
import logging
logging.basicConfig(level=logging.DEBUG)

# Log all received messages
print(f"DEBUG: Received message: {msg}", file=sys.stderr)

Prometheus Not Scraping

# Check Prometheus logs
docker logs prometheus

# Verify network connectivity
docker exec prometheus wget -O- http://exabgp:9576/metrics

# Check firewall
iptables -L -n | grep 9576

Grafana Dashboard Issues

# Test Prometheus data source
curl 'http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up'

# Check Grafana logs
docker logs grafana

Best Practices

1. Scrape Interval

Match scrape interval to BGP timers:

# If BGP keepalive is 30s, use 15s scrape
scrape_interval: 15s

2. Metric Retention

Balance storage vs. history:

# Keep 30 days of data
--storage.tsdb.retention.time=30d
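
A rough disk estimate helps when picking the retention period. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The series count and the 2-byte figure are assumptions chosen for illustration, not measurements from an ExaBGP deployment.

```python
def estimated_disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Estimate TSDB size as retention seconds x samples per second x bytes per sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample

# Assumed workload: 500 series scraped every 15s and kept for 30 days
size = estimated_disk_bytes(series=500, scrape_interval_s=15, retention_days=30)
print(f"{size / 1e6:.1f} MB")
```

For an exporter of this size the retention period is unlikely to be the limiting factor; high-cardinality labels grow the series count much faster.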

3. High Availability

Run Prometheus in HA mode:

# Two Prometheus instances scraping same targets
# Use Thanos or Cortex for long-term storage

4. Alert Fatigue

Tune alert thresholds:

# Require 2 minutes of downtime before alerting
for: 2m
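
The effect of `for:` is easier to see with a small simulation. The sketch below assumes one rule evaluation per sample and models only the pending-to-firing transition: the alert fires once the condition has held continuously for the configured duration, and any false evaluation resets the timer. `firing_time` is a hypothetical helper, not Prometheus code.

```python
def firing_time(samples, hold_seconds):
    """Return the timestamp at which an alert with `for: hold_seconds` would fire, or None.

    samples: (timestamp_seconds, condition_is_true) pairs in time order.
    """
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None  # condition cleared: the pending timer resets
        elif pending_since is None:
            pending_since = ts    # condition just became true: start the timer
        if pending_since is not None and ts - pending_since >= hold_seconds:
            return ts
    return None

# A short blip never fires with for: 2m
blip = [(0, True), (30, True), (60, False), (90, True), (120, True)]
# A sustained outage fires after two minutes
outage = [(0, True), (30, True), (60, True), (90, True), (120, True)]

print(firing_time(blip, 120), firing_time(outage, 120))
```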

5. Cardinality

Avoid high-cardinality labels:

# Bad: label per prefix (thousands of prefixes)
prefix_metric.labels(prefix="1.2.3.4/32")

# Good: aggregate by peer
routes_by_peer.labels(peer="192.168.1.1")
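
The cost of a label can be estimated before adding it: in the worst case, a metric produces one time series per combination of label values. The sketch below multiplies the number of distinct values per label. The peer and prefix counts are made-up assumptions, and `series_count` is a hypothetical helper giving an upper bound, since only combinations that actually occur create series.

```python
from math import prod

def series_count(label_value_counts):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

# Per-prefix labelling: 4 peers x 5000 prefixes
per_prefix = series_count({'peer': 4, 'prefix': 5000})

# Aggregated by peer and address family: 4 peers x 2 AFIs x 1 SAFI
per_peer = series_count({'peer': 4, 'afi': 2, 'safi': 1})

print(per_prefix, per_peer)
```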

6. Security

Protect metrics endpoint:

# Basic auth on exporter
from werkzeug.security import check_password_hash
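
One way to add Basic auth without extra dependencies is a small WSGI wrapper. The sketch below uses only the standard library and a stand-in `metrics_app`. In the exporter, the wrapped app would presumably be the one returned by `prometheus_client.make_wsgi_app()`. The user name, password, and function names are placeholders, and this is an illustration rather than a hardened implementation.

```python
import base64
import hmac

def basic_auth(app, username, password):
    """Wrap a WSGI app so requests without matching Basic credentials get a 401."""
    expected = 'Basic ' + base64.b64encode(f'{username}:{password}'.encode()).decode()

    def guarded(environ, start_response):
        supplied = environ.get('HTTP_AUTHORIZATION', '')
        # compare_digest avoids leaking how much of the header matched
        if hmac.compare_digest(supplied.encode(), expected.encode()):
            return app(environ, start_response)
        start_response('401 Unauthorized', [
            ('WWW-Authenticate', 'Basic realm="metrics"'),
            ('Content-Type', 'text/plain'),
        ])
        return [b'unauthorized\n']

    return guarded

def metrics_app(environ, start_response):
    # Stand-in for the real metrics app (placeholder output)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'exabgp_bgp_session_state 1\n']

# Placeholder credentials; load real ones from the environment or a secret store
protected = basic_auth(metrics_app, 'prometheus', 'change-me')
```

The wrapped app can be served with `wsgiref.simple_server.make_server`, and the scrape job then needs matching `basic_auth` credentials. Basic auth sends the password with every request, so it should sit behind TLS.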

7. Documentation

Document custom metrics:

routes_announced = Gauge(
    'exabgp_routes_announced',
    'Number of routes currently announced via BGP',  # Clear description
    ['peer', 'afi', 'safi']
)

👻 Ghost written by Claude (Anthropic AI)
