
Alerting Guide for FFmpeg RTMP

Complete guide for setting up and managing alerts in production FFmpeg RTMP deployments.

Table of Contents

  1. Overview
  2. Alert Categories
  3. Setup Instructions
  4. Alert Configuration
  5. Notification Channels
  6. Testing Alerts
  7. Incident Response
  8. Maintenance and Tuning
  9. Troubleshooting

Overview

The FFmpeg RTMP system includes comprehensive alerting rules for:

  • Critical incidents requiring immediate action (page on-call)
  • Warnings that need investigation (notify team)
  • Performance degradation for proactive monitoring (info/ticket)

Alert Severity Levels

Severity   Response             Notification          Example
critical   Immediate (page)     PagerDuty + Slack     Master down, all workers offline
warning    Within 1 hour        Slack + Email         Single worker down, high queue
info       Next business day    Slack (monitoring)    Performance degradation, low utilization

Alert Categories

Critical Alerts (Page Immediately)

FFmpegMasterNodeDown

  • Trigger: Master unreachable for 2+ minutes
  • Impact: Complete system outage
  • Action: Check service, logs, restart if needed
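
The corresponding rule in docs/prometheus/ffmpeg-rtmp-alerts.yml has roughly this shape; the exact expression lives in that file, and the metric and job names below are illustrative assumptions:

- alert: FFmpegMasterNodeDown
  expr: up{job="ffmpeg-master"} == 0
  for: 2m
  labels:
    severity: critical
    component: master
  annotations:
    summary: "FFmpeg master node is down"
    description: "The master has been unreachable for more than 2 minutes."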

FFmpegAllWorkersDown

  • Trigger: Zero workers available for 5+ minutes
  • Impact: No processing capacity
  • Action: Check worker services, network connectivity

FFmpegCriticalFailureRate

  • Trigger: >50% jobs failing for 10+ minutes
  • Impact: System critically degraded
  • Action: Check worker resources, logs, input files

FFmpegQueueCritical

  • Trigger: Queue > 2000 jobs for 15+ minutes
  • Impact: Severe delays, SLA violations
  • Action: Add emergency capacity, check workers

FFmpegMasterDiskCritical

  • Trigger: Master disk < 5% free for 5+ minutes
  • Impact: Database writes failing
  • Action: Clean logs, remove backups, expand disk

Warning Alerts (Investigate)

FFmpegWorkerNodeDown

  • Trigger: Individual worker down 5+ minutes
  • Impact: Reduced capacity
  • Action: Check service on worker node

FFmpegHighFailureRate

  • Trigger: >10% jobs failing for 10+ minutes
  • Impact: Elevated error rate
  • Action: Investigate failed jobs, check resources

FFmpegQueueWarning

  • Trigger: Queue > 500 jobs for 15+ minutes
  • Impact: Increasing delays
  • Action: Monitor growth, plan capacity

FFmpegWorkerCapacityHigh

  • Trigger: Worker > 85% capacity for 30+ minutes
  • Impact: Limited headroom
  • Action: Consider adding capacity

FFmpegHighJobLatency

  • Trigger: P95 queue wait > 5 minutes for 15+ minutes
  • Impact: Slow job processing
  • Action: Add workers or increase concurrency
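
Latency alerts like this one are typically written against a histogram; a sketch, assuming the master exports a job_queue_wait_seconds histogram (the metric name is an assumption, not necessarily what the shipped rules use):

- alert: FFmpegHighJobLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(job_queue_wait_seconds_bucket{job="ffmpeg-master"}[5m]))) > 300
  for: 15m
  labels:
    severity: warning
    component: queue
  annotations:
    summary: "P95 queue wait time above 5 minutes"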

FFmpegWorkerDiskWarning

  • Trigger: Worker disk < 20% free for 10+ minutes
  • Impact: May reject jobs soon
  • Action: Clean temp files

FFmpegMasterDiskWarning

  • Trigger: Master disk < 15% free for 10+ minutes
  • Impact: Approaching critical
  • Action: Clean logs, plan expansion

Performance Alerts (Informational)

FFmpegSlowJobExecution

  • Trigger: Jobs 50% slower than baseline
  • Impact: Degraded performance
  • Action: Investigate CPU, disk, network

FFmpegLowWorkerUtilization

  • Trigger: Worker < 20% utilized for 2+ hours
  • Impact: Inefficient resource use
  • Action: Consider cost optimization

Setup Instructions

Prerequisites

  • Prometheus installed and configured
  • Alertmanager installed
  • Notification channels configured (Slack, PagerDuty, email)

Step 1: Deploy Alert Rules

Option A: Docker Compose

# Copy alert rules
cp docs/prometheus/ffmpeg-rtmp-alerts.yml deployment/prometheus/rules/

# Mount in docker-compose.yml
volumes:
  - ./deployment/prometheus/rules:/etc/prometheus/rules:ro

# Update prometheus.yml
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Restart Prometheus
docker-compose restart prometheus

Option B: System Installation

# Copy rules to Prometheus directory
sudo cp docs/prometheus/ffmpeg-rtmp-alerts.yml /etc/prometheus/rules/

# Fix permissions
sudo chown prometheus:prometheus /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml

# Validate rules
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml

# Update /etc/prometheus/prometheus.yml
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Reload Prometheus
sudo systemctl reload prometheus
# OR
curl -X POST http://localhost:9090/-/reload
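
After the reload, you can confirm the rule groups were actually loaded via the Prometheus HTTP API (jq is assumed to be installed):

# List the names of all loaded rule groups
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name'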

Step 2: Deploy Alertmanager Configuration

Docker Compose:

# Copy config
cp docs/prometheus/alertmanager.yml deployment/prometheus/

# Update docker-compose.yml
alertmanager:
  image: prom/alertmanager:latest
  volumes:
    - ./deployment/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"

# Start Alertmanager
docker-compose up -d alertmanager

System Installation:

# Install Alertmanager
sudo apt-get install prometheus-alertmanager

# Copy config
sudo cp docs/prometheus/alertmanager.yml /etc/alertmanager/

# Set environment variables
echo 'SMTP_PASSWORD=your-password' | sudo tee -a /etc/default/alertmanager
echo 'PAGERDUTY_SERVICE_KEY=your-key' | sudo tee -a /etc/default/alertmanager

# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml

# Start service
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

Step 3: Configure Prometheus to Use Alertmanager

Add to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'  # or alertmanager:9093 for Docker

Reload Prometheus:

# Docker
docker-compose restart prometheus

# System
sudo systemctl reload prometheus
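
To verify that Prometheus has discovered the Alertmanager, query the alertmanagers endpoint; your instance should appear under activeAlertmanagers (jq assumed installed):

curl -s http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'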

Alert Configuration

Customizing Thresholds

Edit docs/prometheus/ffmpeg-rtmp-alerts.yml:

# Example: Change queue critical threshold
- alert: FFmpegQueueCritical
  expr: jobs_queued_total{job="ffmpeg-master"} > 3000  # Changed from 2000
  for: 10m  # Changed from 15m

Adding Custom Alerts

- name: custom_alerts
  interval: 1m
  rules:
  - alert: CustomAlert
    expr: your_metric > threshold
    for: 5m  # replace with how long the condition must hold before firing
    labels:
      severity: warning
      component: custom
    annotations:
      summary: "Alert summary"
      description: "Detailed description"

Alert Label Best Practices

Always include these labels:

  • severity: critical, warning, or info
  • component: master, worker, queue, database, etc.
  • team: Team responsible for response
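
In a rule definition that looks like the following (the team value here is just an example):

labels:
  severity: warning
  component: worker
  team: video-platform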

Notification Channels

Slack Setup

  1. Create Slack App:

    https://api.slack.com/apps → Create New App
    
  2. Enable Incoming Webhooks:

    Features → Incoming Webhooks → Activate
    
  3. Add to Workspace:

    Add New Webhook to Workspace
    Select channel: #ffmpeg-alerts-critical
    Copy webhook URL
    
  4. Configure Alertmanager:

    # Save webhook URL
    echo "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" | \
      sudo tee /etc/alertmanager/slack_webhook_url
    
    # Or set in alertmanager.yml
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

Channels to Create:

  • #ffmpeg-alerts-critical - Critical alerts (page team)
  • #ffmpeg-alerts - Warning alerts
  • #ffmpeg-ops - Operations team alerts
  • #ffmpeg-monitoring - Performance/info alerts
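
The shipped alertmanager.yml already routes alerts by severity; if you are writing your own config, a minimal sketch that maps these channels looks like this (receiver names are assumptions, and the matchers syntax requires Alertmanager 0.22+):

route:
  receiver: slack-warnings
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-critical

receivers:
  - name: slack-critical
    slack_configs:
      - channel: '#ffmpeg-alerts-critical'   # api_url comes from the global slack_api_url set above
  - name: slack-warnings
    slack_configs:
      - channel: '#ffmpeg-alerts'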

PagerDuty Setup

  1. Create Service:

    PagerDuty → Services → New Service
    Name: FFmpeg RTMP Production
    
  2. Add Integration:

    Integrations → Add Integration
    Type: Prometheus
    Copy Integration Key
    
  3. Configure Alertmanager:

    export PAGERDUTY_SERVICE_KEY="your-integration-key"
    
    # Add to /etc/default/alertmanager or docker-compose.yml
  4. Set Escalation Policy:

    Escalation Policies → Create Policy
    Level 1: On-call engineer (immediate)
    Level 2: Team lead (after 15 min)
    Level 3: Manager (after 30 min)
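
On the Alertmanager side, the critical receiver using that key might look like the sketch below (the receiver name is an assumption; note that Alertmanager does not expand environment variables itself, so the key must be written into the file or templated in at deploy time):

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: 'your-integration-key'  # Events API v1 / Prometheus integration key
        severity: critical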
    

Email Setup

Gmail Example:

global:
  smtp_from: 'alerts@company.com'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'app-specific-password'  # Generate in Gmail settings
  smtp_require_tls: true

receivers:
  - name: 'email-ops'
    email_configs:
      - to: 'ops-team@company.com'
        headers:
          Subject: '[FFmpeg RTMP] {{ .GroupLabels.alertname }}'

SendGrid Example:

global:
  smtp_smarthost: 'smtp.sendgrid.net:587'
  smtp_auth_username: 'apikey'
  smtp_auth_password: '${SENDGRID_API_KEY}'

Testing Alerts

Test Alert Rules

# Validate syntax
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml

# Test specific query
promtool query instant http://localhost:9090 \
  'up{job="ffmpeg-master"} == 0'

Test Alertmanager Config

# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml

# Test routing
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --tree \
  severity=critical component=master

Send Test Alert

Manual test alert:

curl -X POST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "component": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing the alerting pipeline"
    },
    "startsAt": "'$(date -Iseconds)'",
    "endsAt": "'$(date -Iseconds -d '+5 minutes')'"
  }
]'

Trigger real alert (temporary):

# Stop master to trigger alert
sudo systemctl stop ffmpeg-master

# Wait 2+ minutes for alert to fire
# Check Prometheus: http://localhost:9090/alerts
# Check Alertmanager: http://localhost:9093

# Restart master
sudo systemctl start ffmpeg-master

Verify Alert Delivery

  1. Check Prometheus Alerts:

    http://localhost:9090/alerts
    
  2. Check Alertmanager:

    http://localhost:9093/#/alerts
    
  3. Check Notification Channels:

    • Slack: Look for test message
    • PagerDuty: Check incidents page
    • Email: Check inbox

Incident Response

Alert Response Workflow

  1. Receive Alert

    • Critical: PagerDuty page
    • Warning: Slack notification
    • Info: Email/ticket
  2. Acknowledge

    • PagerDuty: Acknowledge incident
    • Slack: React with 👀 emoji
    • Update team on status
  3. Investigate

    • Click dashboard link in alert
    • Check master/worker logs
    • Review recent changes
  4. Remediate

    • Follow runbook for alert type
    • Apply fix
    • Monitor for resolution
  5. Close

    • Verify alert resolved in Alertmanager
    • Update incident notes
    • Schedule postmortem if needed
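
During the investigate and close steps, amtool is handy for checking alert state from the command line:

# List everything currently firing
amtool alert query --alertmanager.url=http://localhost:9093

# Confirm a specific alert has cleared
amtool alert query alertname=FFmpegMasterNodeDown --alertmanager.url=http://localhost:9093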

Runbooks

See docs/INCIDENT_PLAYBOOKS.md for detailed response procedures:

  • Master Down
  • All Workers Down
  • High Failure Rate
  • Queue Overload
  • Disk Space Issues
  • Performance Degradation

Maintenance and Tuning

Regular Maintenance

Weekly:

  • Review alert statistics
  • Check for noisy alerts (tune thresholds)
  • Verify notification delivery
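
A quick way to spot noisy alerts is to rank them by how many evaluation intervals they spent firing over the past week, using Prometheus's built-in ALERTS series:

sort_desc(sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))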

Monthly:

  • Review and update alert thresholds
  • Update runbook documentation
  • Test disaster recovery procedures

Silence Alerts During Maintenance

Web UI:

http://localhost:9093/#/silences
Click "New Silence"
Set matchers: alertname=FFmpegWorkerNodeDown, instance=worker1
Duration: 2 hours
Comment: Scheduled maintenance

CLI:

amtool silence add \
  alertname=FFmpegWorkerNodeDown \
  instance=worker1 \
  --duration=2h \
  --comment="Scheduled maintenance"

API:

curl -XPOST http://localhost:9093/api/v1/silences -d '{
  "matchers": [
    {"name": "alertname", "value": "FFmpegWorkerNodeDown"},
    {"name": "instance", "value": "worker1"}
  ],
  "startsAt": "'$(date -Iseconds)'",
  "endsAt": "'$(date -Iseconds -d '+2 hours)'",
  "createdBy": "operator@company.com",
  "comment": "Scheduled maintenance"
}'

Alert Tuning

Reduce false positives:

  • Increase for duration
  • Add more specific label matchers
  • Adjust thresholds based on baseline

Reduce alert fatigue:

  • Use inhibition rules
  • Group related alerts
  • Increase repeat_interval
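
For example, a typical inhibition rule (Alertmanager 0.22+ matcher syntax) suppresses warning-level noise on a component while a related critical alert is already firing:

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['component', 'instance']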

Example tuning:

# Before: Too sensitive
- alert: FFmpegHighFailureRate
  expr: rate(jobs_failed_total[5m]) > 0.1
  for: 5m

# After: More reasonable
- alert: FFmpegHighFailureRate
  expr: |
    (rate(jobs_failed_total[5m]) / 
     (rate(jobs_completed_total[5m]) + rate(jobs_failed_total[5m]))) > 0.1
  for: 10m  # Increased duration

Metrics for Alert Health

Alert firing rate:

rate(alertmanager_alerts_received_total[1h])

Notification latency (p95):

histogram_quantile(0.95, 
  rate(alertmanager_notification_latency_seconds_bucket[1h]))

Failed notifications:

rate(alertmanager_notifications_failed_total[1h])

Troubleshooting

Alerts Not Firing

  1. Check Prometheus targets:

    http://localhost:9090/targets
    Ensure ffmpeg-master and ffmpeg-worker are UP
    
  2. Test alert expression:

    http://localhost:9090/graph
    Run the expr from the alert rule
    It should return data when the condition is met
    
  3. Check alert state:

    http://localhost:9090/alerts
    State should be: inactive → pending → firing
    

Alerts Not Notifying

  1. Check Alertmanager received alert:

    http://localhost:9093/#/alerts
    
  2. Check routing:

    amtool config routes show
  3. Check notification logs:

    # Docker
    docker-compose logs alertmanager | grep -i notification
    
    # System
    journalctl -u alertmanager -f
  4. Test receiver manually:

    # Test Slack webhook
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
      -d '{"text":"Test message"}'

Common Issues

Issue: Slack notifications not arriving

  • Check webhook URL is correct
  • Verify Slack app has permissions
  • Check Alertmanager logs for errors

Issue: PagerDuty not creating incidents

  • Verify integration key is correct
  • Check PagerDuty service is active
  • Ensure escalation policy configured

Issue: Too many alerts

  • Review inhibition rules
  • Increase alert thresholds
  • Add grouping/aggregation

Version: 1.0
Last Updated: 2026-01-05
Status: Production Ready