Complete guide for setting up and managing alerts in production FFmpeg RTMP deployments.
- Overview
- Alert Categories
- Setup Instructions
- Alert Configuration
- Notification Channels
- Testing Alerts
- Incident Response
- Maintenance and Tuning
The FFmpeg RTMP system includes comprehensive alerting rules for:
- Critical incidents requiring immediate action (page on-call)
- Warnings that need investigation (notify team)
- Performance degradation for proactive monitoring (info/ticket)
| Severity | Response | Notification | Example |
|---|---|---|---|
| critical | Immediate (page) | PagerDuty + Slack | Master down, all workers offline |
| warning | Within 1 hour | Slack + Email | Single worker down, high queue |
| info | Next business day | Slack (monitoring) | Performance degradation, low utilization |
FFmpegMasterNodeDown
- Trigger: Master unreachable for 2+ minutes
- Impact: Complete system outage
- Action: Check service, logs, restart if needed
FFmpegAllWorkersDown
- Trigger: Zero workers available for 5+ minutes
- Impact: No processing capacity
- Action: Check worker services, network connectivity
FFmpegCriticalFailureRate
- Trigger: >50% jobs failing for 10+ minutes
- Impact: System critically degraded
- Action: Check worker resources, logs, input files
FFmpegQueueCritical
- Trigger: Queue > 2000 jobs for 15+ minutes
- Impact: Severe delays, SLA violations
- Action: Add emergency capacity, check workers
FFmpegMasterDiskCritical
- Trigger: Master disk < 5% free for 5+ minutes
- Impact: Database writes failing
- Action: Clean logs, remove backups, expand disk
FFmpegWorkerNodeDown
- Trigger: Individual worker down 5+ minutes
- Impact: Reduced capacity
- Action: Check service on worker node
FFmpegHighFailureRate
- Trigger: >10% jobs failing for 10+ minutes
- Impact: Elevated error rate
- Action: Investigate failed jobs, check resources
FFmpegQueueWarning
- Trigger: Queue > 500 jobs for 15+ minutes
- Impact: Increasing delays
- Action: Monitor growth, plan capacity
FFmpegWorkerCapacityHigh
- Trigger: Worker > 85% capacity for 30+ minutes
- Impact: Limited headroom
- Action: Consider adding capacity
FFmpegHighJobLatency
- Trigger: P95 queue wait > 5 minutes for 15+ minutes
- Impact: Slow job processing
- Action: Add workers or increase concurrency
FFmpegWorkerDiskWarning
- Trigger: Worker disk < 20% free for 10+ minutes
- Impact: May reject jobs soon
- Action: Clean temp files
FFmpegMasterDiskWarning
- Trigger: Master disk < 15% free for 10+ minutes
- Impact: Approaching critical
- Action: Clean logs, plan expansion
FFmpegSlowJobExecution
- Trigger: Jobs 50% slower than baseline
- Impact: Degraded performance
- Action: Investigate CPU, disk, network
FFmpegLowWorkerUtilization
- Trigger: Worker < 20% utilized for 2+ hours
- Impact: Inefficient resource use
- Action: Consider cost optimization
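Each catalog entry above maps to one Prometheus rule. As a sketch of how a single entry translates into rule syntax (the annotation wording here is illustrative; the shipped `docs/prometheus/ffmpeg-rtmp-alerts.yml` is authoritative):

```yaml
groups:
  - name: ffmpeg_critical
    rules:
      - alert: FFmpegMasterNodeDown
        # up{} comes from the Prometheus scrape; 2m matches the trigger above
        expr: up{job="ffmpeg-master"} == 0
        for: 2m
        labels:
          severity: critical
          component: master
        annotations:
          summary: "FFmpeg master node is down"
          description: "Master {{ $labels.instance }} has been unreachable for 2+ minutes."
```

The `for: 2m` clause is what separates a transient scrape miss from a real outage: the expression must hold continuously for the full duration before the alert fires.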
- Prometheus installed and configured
- Alertmanager installed
- Notification channels configured (Slack, PagerDuty, email)
Option A: Docker Compose
```bash
# Copy alert rules
cp docs/prometheus/ffmpeg-rtmp-alerts.yml deployment/prometheus/rules/
```

Mount the rules directory in `docker-compose.yml`:

```yaml
volumes:
  - ./deployment/prometheus/rules:/etc/prometheus/rules:ro
```

Update `prometheus.yml`:

```yaml
rule_files:
  - '/etc/prometheus/rules/*.yml'
```

Restart Prometheus:

```bash
docker-compose restart prometheus
```

Option B: System Installation
```bash
# Copy rules to Prometheus directory
sudo cp docs/prometheus/ffmpeg-rtmp-alerts.yml /etc/prometheus/rules/

# Fix permissions
sudo chown prometheus:prometheus /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml

# Validate rules
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml
```

Update `/etc/prometheus/prometheus.yml`:

```yaml
rule_files:
  - '/etc/prometheus/rules/*.yml'
```

Reload Prometheus:

```bash
sudo systemctl reload prometheus
# OR
curl -X POST http://localhost:9090/-/reload
```

Docker Compose:
```bash
# Copy config
cp docs/prometheus/alertmanager.yml deployment/prometheus/
```

Add the service to `docker-compose.yml`:

```yaml
alertmanager:
  image: prom/alertmanager:latest
  volumes:
    - ./deployment/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"
```

Start Alertmanager:

```bash
docker-compose up -d alertmanager
```

System Installation:
```bash
# Install Alertmanager
sudo apt-get install prometheus-alertmanager

# Copy config
sudo cp docs/prometheus/alertmanager.yml /etc/alertmanager/

# Set environment variables
echo 'SMTP_PASSWORD=your-password' | sudo tee -a /etc/default/alertmanager
echo 'PAGERDUTY_SERVICE_KEY=your-key' | sudo tee -a /etc/default/alertmanager

# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml

# Start service
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
```

Add to `prometheus.yml`:
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'  # or alertmanager:9093 for Docker
```

Reload Prometheus:

```bash
# Docker
docker-compose restart prometheus

# System
sudo systemctl reload prometheus
```

Edit `docs/prometheus/ffmpeg-rtmp-alerts.yml`:
```yaml
# Example: Change queue critical threshold
- alert: FFmpegQueueCritical
  expr: jobs_queued_total{job="ffmpeg-master"} > 3000  # Changed from 2000
  for: 10m  # Changed from 15m
```

To add a custom alert, define a new rule group:

```yaml
- name: custom_alerts
  interval: 1m
  rules:
    - alert: CustomAlert
      expr: your_metric > threshold
      for: duration
      labels:
        severity: warning
        component: custom
      annotations:
        summary: "Alert summary"
        description: "Detailed description"
```

Always include these labels:

- `severity`: critical, warning, or info
- `component`: master, worker, queue, database, etc.
- `team`: team responsible for response
1. Create Slack App: https://api.slack.com/apps → Create New App
2. Enable Incoming Webhooks: Features → Incoming Webhooks → Activate
3. Add to Workspace: Add New Webhook to Workspace → select channel `#ffmpeg-alerts-critical` → copy the webhook URL
4. Configure Alertmanager:

```bash
# Save webhook URL
echo "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" | \
  sudo tee /etc/alertmanager/slack_webhook_url
```

Or set it in `alertmanager.yml`:

```yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```
Channels to Create:

- `#ffmpeg-alerts-critical` - critical alerts (page team)
- `#ffmpeg-alerts` - warning alerts
- `#ffmpeg-ops` - operations team alerts
- `#ffmpeg-monitoring` - performance/info alerts
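One way to wire severities to these channels is a severity-keyed route tree in `alertmanager.yml`. A minimal sketch (receiver names are assumptions — align them with your existing config):

```yaml
route:
  receiver: 'slack-warnings'   # default for unmatched alerts
  group_by: ['alertname', 'component']
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: info
      receiver: 'slack-monitoring'

receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#ffmpeg-alerts-critical'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#ffmpeg-alerts'
  - name: 'slack-monitoring'
    slack_configs:
      - channel: '#ffmpeg-monitoring'
```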
1. Create Service: PagerDuty → Services → New Service; name it "FFmpeg RTMP Production"
2. Add Integration: Integrations → Add Integration; type: Prometheus; copy the Integration Key
3. Configure Alertmanager:

```bash
export PAGERDUTY_SERVICE_KEY="your-integration-key"
# Add to /etc/default/alertmanager or docker-compose.yml
```

4. Set Escalation Policy: Escalation Policies → Create Policy
   - Level 1: On-call engineer (immediate)
   - Level 2: Team lead (after 15 min)
   - Level 3: Manager (after 30 min)
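On the Alertmanager side, the integration key is consumed by a `pagerduty_configs` receiver. A minimal sketch (the receiver name is an assumption):

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # service_key matches the "Prometheus" (Events API v1) integration type;
      # Events API v2 integrations use routing_key instead
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
```

Route only `severity: critical` alerts to this receiver so warnings and info alerts stay in Slack/email.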
Gmail Example:

```yaml
global:
  smtp_from: 'alerts@company.com'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'app-specific-password'  # Generate in Gmail settings
  smtp_require_tls: true

receivers:
  - name: 'email-ops'
    email_configs:
      - to: 'ops-team@company.com'
        headers:
          Subject: '[FFmpeg RTMP] {{ .GroupLabels.alertname }}'
```

SendGrid Example:
```yaml
global:
  smtp_smarthost: 'smtp.sendgrid.net:587'
  smtp_auth_username: 'apikey'
  smtp_auth_password: '${SENDGRID_API_KEY}'
```

```bash
# Validate syntax
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml

# Test specific query
promtool query instant http://localhost:9090 \
  'up{job="ffmpeg-master"} == 0'
```

```bash
# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml

# Test routing
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --tree \
  severity=critical component=master
```

Manual test alert:
```bash
curl -X POST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "component": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing the alerting pipeline"
    },
    "startsAt": "'$(date -Iseconds)'",
    "endsAt": "'$(date -Iseconds -d '+5 minutes')'"
  }
]'
```

Trigger real alert (temporary):
```bash
# Stop master to trigger alert
sudo systemctl stop ffmpeg-master

# Wait 2+ minutes for alert to fire
# Check Prometheus: http://localhost:9090/alerts
# Check Alertmanager: http://localhost:9093

# Restart master
sudo systemctl start ffmpeg-master
```

1. Check Prometheus Alerts: http://localhost:9090/alerts
2. Check Alertmanager: http://localhost:9093/#/alerts
3. Check Notification Channels:
   - Slack: look for the test message
   - PagerDuty: check the incidents page
   - Email: check the inbox
1. Receive Alert
   - Critical: PagerDuty page
   - Warning: Slack notification
   - Info: Email/ticket
2. Acknowledge
   - PagerDuty: acknowledge the incident
   - Slack: react with 👀 emoji
   - Update the team on status
3. Investigate
   - Click the dashboard link in the alert
   - Check master/worker logs
   - Review recent changes
4. Remediate
   - Follow the runbook for the alert type
   - Apply the fix
   - Monitor for resolution
5. Close
   - Verify the alert is resolved in Alertmanager
   - Update incident notes
   - Schedule a postmortem if needed
See docs/INCIDENT_PLAYBOOKS.md for detailed response procedures:
- Master Down
- All Workers Down
- High Failure Rate
- Queue Overload
- Disk Space Issues
- Performance Degradation
Weekly:
- Review alert statistics
- Check for noisy alerts (tune thresholds)
- Verify notification delivery
Monthly:
- Review and update alert thresholds
- Update runbook documentation
- Test disaster recovery procedures
Web UI:

1. Open http://localhost:9093/#/silences
2. Click "New Silence"
3. Set matchers: `alertname=FFmpegWorkerNodeDown`, `instance=worker1`
4. Duration: 2 hours
5. Comment: Scheduled maintenance

CLI:

```bash
amtool silence add \
  alertname=FFmpegWorkerNodeDown \
  instance=worker1 \
  --duration=2h \
  --comment="Scheduled maintenance"
```

API:
```bash
curl -XPOST http://localhost:9093/api/v1/silences -d '{
  "matchers": [
    {"name": "alertname", "value": "FFmpegWorkerNodeDown"},
    {"name": "instance", "value": "worker1"}
  ],
  "startsAt": "'$(date -Iseconds)'",
  "endsAt": "'$(date -Iseconds -d '+2 hours')'",
  "createdBy": "operator@company.com",
  "comment": "Scheduled maintenance"
}'
```

Reduce false positives:

- Increase the `for` duration
- Add more specific label matchers
- Adjust thresholds based on baseline
Reduce alert fatigue:
- Use inhibition rules
- Group related alerts
- Increase repeat_interval
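The inhibition rules mentioned above suppress downstream noise while a broader outage is already firing. A sketch for `alertmanager.yml`, assuming the `component` labels used in this guide (when the master is down, per-worker and queue alerts add nothing):

```yaml
inhibit_rules:
  # While the master-down alert fires, mute worker/queue alerts
  - source_match:
      alertname: FFmpegMasterNodeDown
    target_match_re:
      component: 'worker|queue'
```

Without an `equal` clause this inhibits matching alerts cluster-wide; add `equal: [<label>]` if you run multiple independent clusters and only want same-cluster suppression.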
Example tuning:
```yaml
# Before: Too sensitive
- alert: FFmpegHighFailureRate
  expr: rate(jobs_failed_total[5m]) > 0.1
  for: 5m

# After: More reasonable
- alert: FFmpegHighFailureRate
  expr: |
    (rate(jobs_failed_total[5m]) /
     (rate(jobs_completed_total[5m]) + rate(jobs_failed_total[5m]))) > 0.1
  for: 10m  # Increased duration
```

Alert firing rate:
```promql
rate(alertmanager_alerts_received_total[1h])
```

Notification latency (P95):

```promql
histogram_quantile(0.95,
  rate(alertmanager_notification_latency_seconds_bucket[1h]))
```

Failed notifications:

```promql
rate(alertmanager_notifications_failed_total[1h])
```
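These `alertmanager_*` series come from Alertmanager's own `/metrics` endpoint, so Prometheus must scrape Alertmanager itself for the queries above to return data. A minimal scrape-job sketch for `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']  # or alertmanager:9093 under Docker Compose
```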
1. Check Prometheus targets: http://localhost:9090/targets — ensure `ffmpeg-master` and `ffmpeg-worker` are UP
2. Test the alert expression: at http://localhost:9090/graph, run the `expr` from the alert rule; it should return data when the condition is met
3. Check alert state: http://localhost:9090/alerts — state should progress inactive → pending → firing
1. Check that Alertmanager received the alert: http://localhost:9093/#/alerts
2. Check routing:

   ```bash
   amtool config routes show
   ```

3. Check notification logs:

   ```bash
   # Docker
   docker-compose logs alertmanager | grep -i notification

   # System
   journalctl -u alertmanager -f
   ```

4. Test the receiver manually:

   ```bash
   # Test Slack webhook
   curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
     -d '{"text":"Test message"}'
   ```
Issue: Slack notifications not arriving
- Check webhook URL is correct
- Verify Slack app has permissions
- Check Alertmanager logs for errors
Issue: PagerDuty not creating incidents
- Verify integration key is correct
- Check PagerDuty service is active
- Ensure escalation policy configured
Issue: Too many alerts
- Review inhibition rules
- Increase alert thresholds
- Add grouping/aggregation
- Prometheus Alerting Documentation
- Alertmanager Configuration
- PromQL Guide
- Incident Playbooks
- Production Operations
Version: 1.0
Last Updated: 2026-01-05
Status: Production Ready