- Overview
- Deployment Guide
- Component Architecture
- Networking
- Job Queueing System
- Node Health Tracking
- Data Storage and Retrieval
- Runtime Model
The FFmpeg RTMP Power Monitoring system is a distributed system for running, monitoring, and analyzing video transcoding workloads. It supports two operational modes:
- Standalone Mode: All components run on a single machine using Docker Compose
- Distributed Mode: A master node coordinates workloads across multiple compute nodes
This document describes the complete runtime model, data flows, and operational characteristics of the system.
All Deployments:
- Docker 20.10+ and Docker Compose 2.0+
- Linux host with kernel 4.15+ (for RAPL power monitoring)
- Python 3.10+ (for test scripts)
- FFmpeg with appropriate codec support
Distributed Mode Additional Requirements:
- Go 1.21+ (for building master/agent binaries)
- Network connectivity between master and compute nodes
- Shared or synchronized time across nodes (NTP recommended)
Standalone mode runs all services on a single machine for development or small-scale testing.
Step 1: Clone and Initialize
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
mkdir -p test_results
Step 2: Build and Start Services
# Start all services
make up-build
# Or with GPU support (requires NVIDIA Docker runtime)
make nvidia-up-build
Step 3: Verify Deployment
# Check all containers are running
docker compose ps
# Verify services are healthy
curl http://localhost:8428/health # VictoriaMetrics
curl http://localhost:3000 # Grafana
curl http://localhost:9093/-/healthy # Alertmanager
Step 4: Access Dashboards
- Grafana: http://localhost:3000 (admin/admin)
- VictoriaMetrics: http://localhost:8428
- Alertmanager: http://localhost:9093
Step 5: Run Test Workload
python3 scripts/run_tests.py single \
  --name "deployment_test" \
  --bitrate 2000k \
  --duration 60
Distributed mode separates the master coordination node from compute worker nodes.
Master Node Deployment
Step 1: Build Master Binary
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
make build-master
Step 2: Start Master Service
# Basic startup
./bin/master --port 8080
# Production startup with logging
./bin/master --port 8080 > master.log 2>&1 &
# Verify master is running
curl http://localhost:8080/health
Step 3: Start Monitoring Stack
# Start VictoriaMetrics and Grafana on master
make vm-up-build
Compute Node Deployment
Step 1: Build Agent Binary
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
make build-agent
Step 2: Register and Start Agent
# Replace MASTER_IP with your master node's IP address
./bin/agent --register --master http://MASTER_IP:8080
# Agent will:
# 1. Detect hardware capabilities
# 2. Register with master
# 3. Begin polling for jobs
# 4. Send periodic heartbeats
Step 3: Verify Registration
# On master node, check registered nodes
curl http://MASTER_IP:8080/nodes | jq
Network Configuration
For production deployments, configure firewall rules:
# Master node - allow inbound
ufw allow 8080/tcp # Master API
ufw allow 3000/tcp # Grafana
ufw allow 8428/tcp # VictoriaMetrics
# Compute nodes - no inbound ports required (outbound only)
Resource Limits
Adjust Docker resource limits in docker-compose.yml:
services:
  victoriametrics:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          memory: 4G
Storage Configuration
Set custom data retention:
victoriametrics:
  command:
    - "--retentionPeriod=90d"  # Default is 30d
TLS/Security Configuration
For production, enable HTTPS and authentication:
# Generate certificates (not implemented in v1.0)
# Configure nginx reverse proxy
# Enable Grafana authentication
# Set strong admin passwords
Master Node Components:
- Master Service (Go HTTP API) - Port 8080
- VictoriaMetrics (TSDB) - Port 8428
- Grafana (Dashboards) - Port 3000
- In-memory data structures (nodes, jobs, results)
Compute Node Components:
- Agent Service (Go binary)
- FFmpeg processes (when executing jobs)
- Local exporters (temporary, per job)
- Results aggregation and upload
Standalone Mode Components (all on one machine):
- All master components
- All exporters (12+ services)
- Nginx RTMP server
- Alertmanager
See the distributed architecture documentation for detailed component descriptions.
All communication is agent-initiated (pull model):
- Node Registration: Agent → Master (POST /nodes/register)
- Heartbeats: Agent → Master every 30s (POST /nodes/{id}/heartbeat)
- Job Polling: Agent → Master every 10s (GET /jobs/next?node_id=X)
- Result Submission: Agent → Master (POST /results)
Protocol: JSON over HTTP (HTTPS recommended for production)
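As an illustration of the wire protocol, the sketch below sends one heartbeat the way an agent would. The endpoint and interval come from the list above; the payload fields and names are assumptions for this example only, not the actual wire format:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// heartbeat is an assumed payload shape; the real agent may send
// different fields.
type heartbeat struct {
	NodeID    string    `json:"node_id"`
	Timestamp time.Time `json:"timestamp"`
}

func sendHeartbeat(master, nodeID string) error {
	body, err := json.Marshal(heartbeat{NodeID: nodeID, Timestamp: time.Now().UTC()})
	if err != nil {
		return err
	}
	// POST /nodes/{id}/heartbeat, JSON over HTTP.
	resp, err := http.Post(
		fmt.Sprintf("%s/nodes/%s/heartbeat", master, nodeID),
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("heartbeat rejected: %s", resp.Status)
	}
	return nil
}

func main() {
	for {
		if err := sendHeartbeat("http://MASTER_IP:8080", "node-1"); err != nil {
			fmt.Println("heartbeat error:", err)
		}
		time.Sleep(30 * time.Second) // heartbeat interval
	}
}
```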
Network Requirements:
- Master node: Publicly accessible on port 8080
- Compute nodes: Outbound connectivity only (no inbound ports)
- Latency: <100ms recommended
- Bandwidth: ~1-10 KB/s per node
VictoriaMetrics scrapes exporters every 5 seconds:
VictoriaMetrics ──[HTTP GET /metrics]──> Exporters
  (Port 8428)                            (Ports 9500+)
Network: Docker bridge network streaming-net
Service Discovery: Docker DNS
Data Structure: Simple FIFO (First-In-First-Out) slice in memory
Location: Master node RAM
Persistence: None (jobs lost on restart in v1.0)
Created → Queued → Assigned → Running → Completed
- Created: User submits via POST /jobs
- Queued: Job added to FIFO queue on master
- Assigned: Agent polls and receives job
- Running: Agent executes FFmpeg workload
- Completed: Agent submits results to master
Example job object:
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "scenario": "1080p-h264-nvenc",
  "confidence": "auto",
  "parameters": {
    "duration": 300,
    "bitrate": "5000k",
    "resolution": "1920x1080",
    "encoder": "h264_nvenc",
    "preset": "medium"
  },
  "status": "queued",
  "created_at": "2025-12-30T10:00:00Z"
}
Current (v1.0): First-available agent gets next job in queue
No resource matching: Any agent can get any job
Future: Resource-aware scheduling, priority queues, retry logic
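The v1.0 dispatch logic above fits in a few lines. The following is a minimal sketch of a mutex-guarded FIFO slice; names are illustrative, not the actual code in pkg/store/memory.go:

```go
package store

import "sync"

// Job mirrors the queued job object shown above (fields trimmed).
type Job struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

// Queue is an illustrative in-memory FIFO; the real implementation
// may differ.
type Queue struct {
	mu   sync.Mutex
	jobs []Job
}

// Enqueue appends a job to the tail (POST /jobs).
func (q *Queue) Enqueue(j Job) {
	q.mu.Lock()
	defer q.mu.Unlock()
	j.Status = "queued"
	q.jobs = append(q.jobs, j)
}

// Next hands the head of the queue to whichever agent polls first
// (GET /jobs/next): no resource matching in v1.0.
func (q *Queue) Next() (Job, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.jobs) == 0 {
		return Job{}, false
	}
	j := q.jobs[0]
	q.jobs = q.jobs[1:]
	j.Status = "assigned"
	return j, true
}
```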
Agent sends heartbeat every 30 seconds via POST /nodes/{id}/heartbeat
Master updates node.LastSeen timestamp
Node status:
- available: Last heartbeat within 60 seconds
- offline: No heartbeat for >60 seconds
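This status rule is a single comparison against the LastSeen timestamp the master updates on each heartbeat; a sketch, with the helper name being illustrative:

```go
package main

import (
	"fmt"
	"time"
)

const heartbeatTimeout = 60 * time.Second

// nodeStatus derives availability from the LastSeen timestamp the
// master updates on every heartbeat.
func nodeStatus(lastSeen time.Time) string {
	if time.Since(lastSeen) <= heartbeatTimeout {
		return "available"
	}
	return "offline"
}

func main() {
	fmt.Println(nodeStatus(time.Now()))                       // available
	fmt.Println(nodeStatus(time.Now().Add(-2 * time.Minute))) // offline
}
```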
exporter-health-checker service polls all exporters every 30 seconds
Exposes metric: exporter_health_status{service="X"} = 1 (up) or 0 (down)
Monitored by: VictoriaMetrics, Grafana dashboards, Alertmanager
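For reference, a gauge of exactly this shape can be exposed with the Prometheus Go client; this sketch shows the metric's form and is not the health-checker's actual implementation:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Mirrors the documented metric:
//   exporter_health_status{service="X"} = 1 (up) or 0 (down)
var exporterHealth = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "exporter_health_status",
		Help: "1 if the exporter answered its last health probe, 0 otherwise.",
	},
	[]string{"service"},
)

func main() {
	prometheus.MustRegister(exporterHealth)
	// "rapl-exporter" is an illustrative service name.
	exporterHealth.WithLabelValues("rapl-exporter").Set(1)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9500", nil))
}
```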
All containers have health check directives:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 10s
  timeout: 5s
  retries: 3
Query health: docker compose ps
Location: Docker volume victoriametrics-data
Retention: 30 days (configurable)
Size: ~50-100 MB per day
Query API:
curl 'http://localhost:8428/api/v1/query?query=rapl_power_watts'
Location: ./test_results/test_results_YYYYMMDD_HHMMSS.json
Format: JSON with test metadata and scenario configurations
Used by: results-exporter, analyze_results.py
Location: ./models/<hardware_id>/
Files:
- `power_predictor_latest.pkl`
- `multivariate_predictor_latest.pkl`
- `metadata.json`
Master in-memory state:
- Node registry: `map[string]*Node`
- Job queue: `[]Job`
- Results: `map[string]*Result`
Persistence: None (lost on restart in v1.0)
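Taken together, the master's entire state is three guarded structures. An illustrative sketch of that layout (pkg/store/memory.go holds the real implementation, which may differ):

```go
package store

import (
	"sync"
	"time"
)

// Minimal stand-ins for the types in pkg/models/.
type Node struct {
	ID       string
	LastSeen time.Time
}
type Job struct{ ID, Status string }
type Result struct{ JobID string }

// MemoryStore keeps all master state in RAM; as noted above, nothing
// survives a restart in v1.0.
type MemoryStore struct {
	mu      sync.RWMutex
	nodes   map[string]*Node   // node registry
	jobs    []Job              // FIFO job queue (see sketch earlier)
	results map[string]*Result // results, keyed by job ID
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{
		nodes:   make(map[string]*Node),
		results: make(map[string]*Result),
	}
}
```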
Configuration files:
- `docker-compose.yml` - Service definitions
- `victoriametrics.yml` - Metrics scraping
- `nginx.conf` - RTMP server
- `pricing_config.json` - Cost analysis
- `grafana/provisioning/` - Dashboards
Metrics: Query VictoriaMetrics HTTP API with PromQL
Test Results: Read JSON files or query results-exporter metrics
Job Status: Query master REST API (GET /jobs, GET /nodes)
Grafana: Browse dashboards at http://localhost:3000
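Since all of these surfaces are plain HTTP, any client works. As an example, here is the PromQL query from the storage section issued from Go (URL and metric name as documented above):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Same query as the curl example above: current RAPL power draw.
	q := url.Values{"query": {"rapl_power_watts"}}
	resp, err := http.Get("http://localhost:8428/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // Prometheus-compatible JSON result
}
```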
1. make up-build
2. Docker builds images
3. Creates network: streaming-net
4. Creates volumes: victoriametrics-data, grafana-data
5. Starts containers (30-60 seconds)
6. Health checks begin
7. VictoriaMetrics starts scraping
8. Grafana provisions dashboards
9. System ready
Master:
1. make build-master
2. ./bin/master --port 8080
3. Master initializes (<5 seconds)
4. make vm-up-build (monitoring stack)
5. System ready
Agent:
1. make build-agent
2. ./bin/agent --register --master http://MASTER_IP:8080
3. Agent detects hardware
4. Agent registers with master
5. Agent starts heartbeat and polling loops
6. Agent ready (<5 seconds)
1. User: python3 scripts/run_tests.py single --bitrate 2000k --duration 60
2. Test runner generates run_id
3. Stabilization period (10s)
4. FFmpeg starts streaming to nginx-rtmp
5. Exporters collect metrics
6. VictoriaMetrics scrapes metrics every 5s
7. Test runs for specified duration
8. FFmpeg terminates
9. Cooldown period (10s)
10. Test runner saves metadata to test_results/*.json
11. Results exporter exposes metrics
Total time: ~80 seconds for 60s test
1. User submits job: POST /jobs to master
2. Job queued on master
3. Agent polls: GET /jobs/next?node_id=X
4. Master returns job to agent
5. Agent starts local exporters
6. Agent executes FFmpeg workload
7. Agent collects metrics locally
8. Job completes
9. Agent runs analyzer
10. Agent sends results: POST /results to master
11. Master stores results
12. Agent resumes polling
Total time: ~5-10 minutes for 300s workload
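Steps 3 through 12 form the agent's poll-execute-report loop. The sketch below compresses it; the response handling and Job fields are assumptions of this example, not the actual client code in pkg/agent/client.go:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Job carries only the fields this sketch needs.
type Job struct {
	ID       string `json:"id"`
	Scenario string `json:"scenario"`
}

// fetchNextJob polls GET /jobs/next?node_id=X. Treating any non-200
// response as "no work queued" is an assumption of this sketch.
func fetchNextJob(master, nodeID string) (*Job, error) {
	resp, err := http.Get(fmt.Sprintf("%s/jobs/next?node_id=%s", master, nodeID))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, nil
	}
	var j Job
	if err := json.NewDecoder(resp.Body).Decode(&j); err != nil {
		return nil, err
	}
	return &j, nil
}

func main() {
	for {
		job, err := fetchNextJob("http://MASTER_IP:8080", "node-1")
		if err != nil || job == nil {
			time.Sleep(10 * time.Second) // poll interval (step 3)
			continue
		}
		// Steps 5-10: start exporters, run FFmpeg, analyze results,
		// then POST /results back to the master (omitted here).
		fmt.Println("executing job", job.ID, job.Scenario)
	}
}
```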
Standalone Mode (idle):
- Memory: ~1-2 GB
- CPU: ~5-10%
- Disk: ~3-6 GB + 50-110 MB/day
Distributed Master:
- Memory: ~500-1000 MB
- CPU: <1%
- Disk: ~3-6 GB + 50-110 MB/day
Distributed Agent (idle):
- Memory: ~10-30 MB
- CPU: <1%
- Disk: ~10-30 MB
During Active Test:
- FFmpeg: 50-400% CPU, 100-500 MB RAM
- Exporters: 2-5% CPU, 120-600 MB RAM
Deployment:
- `docker-compose.yml` - Service definitions
- `Makefile` - Common commands
- `victoriametrics.yml` - Metrics scraping config
Master/Agent:
- `cmd/master/main.go` - Master service entrypoint
- `cmd/agent/main.go` - Agent service entrypoint
- `pkg/models/` - Data models (Job, Node, Result)
- `pkg/api/master.go` - Master REST API handlers
- `pkg/agent/client.go` - Agent HTTP client
- `pkg/store/memory.go` - In-memory storage implementation
Test Execution:
- `scripts/run_tests.py` - Main test runner
- `scripts/analyze_results.py` - Results analyzer
- `scripts/recommend_test.py` - Hardware-aware recommendations
Configuration:
- `nginx.conf` - Nginx RTMP server config
- `alertmanager/alertmanager.yml` - Alert routing
- `pricing_config.json` - Cost analysis pricing
- `grafana/provisioning/` - Grafana datasources and dashboards
Data Storage:
- `test_results/` - Test metadata JSON files
- `models/` - Trained ML models
- Docker volumes: `victoriametrics-data`, `grafana-data`
- RAPL: Running Average Power Limit - Intel CPU power monitoring interface
- TSDB: Time-Series Database - Optimized for timestamped metrics
- FIFO: First-In-First-Out - Queue ordering strategy
- Exporter: Service that exposes metrics in Prometheus format
- Scraping: Periodic pulling of metrics from exporters
- PromQL: Prometheus Query Language - Used to query metrics
- mTLS: Mutual TLS - Both client and server authenticate each other
- NVENC: NVIDIA hardware video encoder
- DCGM: Data Center GPU Manager - NVIDIA GPU monitoring toolkit
- cAdvisor: Container Advisor - Google's container metrics tool
Version: 1.0
Last Updated: 2025-12-30
System Version: v2.1+
Maintainer: See CONTRIBUTING.md