This document provides a comprehensive overview of how each exporter in the system receives its data, the complete setup process, data flow architecture, and troubleshooting guidance.
The system consists of multiple exporters that collect metrics from different sources and expose them in Prometheus format. Prometheus scrapes these exporters every 5 seconds (configurable in prometheus.yml).
- **System Metrics Exporters**: Collect OS and hardware metrics
  - rapl-exporter (power consumption)
  - node-exporter (system metrics)
  - cadvisor (container metrics)
- **Application Metrics Exporters**: Monitor running services
  - nginx-rtmp-exporter (RTMP streaming)
  - docker-stats-exporter (Docker overhead)
- **Analysis Exporters**: Process test results and calculate derived metrics
  - results-exporter (test results analysis)
  - qoe-exporter (Quality of Experience)
  - cost-exporter (cost analysis)
- **Health Monitoring**:
  - exporter-health-checker (monitors all exporters)
## rapl-exporter

- Port: 9500
- Data Source: Linux kernel RAPL interface (`/sys/class/powercap`)

How it collects data:

- **Direct Kernel Access**: Reads from `/sys/class/powercap/intel-rapl:*`
- **Hardware Counters**: Intel CPUs expose power consumption through RAPL registers
- **Privileged Access**: Requires privileged container mode or root access
Data flow:

```
Intel CPU RAPL Registers
    ↓
/sys/class/powercap/intel-rapl:*/energy_uj
    ↓
rapl_exporter.py reads files
    ↓
Calculates power (watts) from energy delta
    ↓
Exposes metrics at :9500/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
rapl_power_watts{package="package-0", zone="package-0"} 45.5
rapl_power_watts{package="package-0", zone="core"} 30.2
rapl_energy_joules_total{package="package-0", zone="package-0"} 1234567.89
```
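The energy-delta step above can be sketched in Python. This is a minimal illustration, assuming a 32-bit wrap point for the cumulative counter (on real systems the limit should be read from `max_energy_range_uj`); `rapl_exporter.py` itself may differ:

```python
MAX_ENERGY_UJ = 2**32  # assumed wrap point; read max_energy_range_uj on real hardware

def power_watts(prev_uj: int, curr_uj: int, dt_seconds: float) -> float:
    """Average power (W) between two energy_uj samples taken dt_seconds apart."""
    delta_uj = curr_uj - prev_uj
    if delta_uj < 0:  # the cumulative counter wrapped around its maximum
        delta_uj += MAX_ENERGY_UJ
    return (delta_uj / 1_000_000) / dt_seconds  # microjoules -> joules -> watts

# 227.5 J consumed over a 5 s scrape interval -> 45.5 W
print(power_watts(1_000_000_000, 1_227_500_000, 5.0))  # → 45.5
```

The wrap check matters because `energy_uj` is a monotonically increasing counter that resets to zero at its maximum; without it, a wrap would produce a large negative power reading.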
Setup requirements:

- Intel CPU with RAPL support (most Intel CPUs since Sandy Bridge)
- Privileged container with `/sys/class/powercap` mounted read-only
- Host kernel must expose RAPL counters
```yaml
# docker-compose.yml excerpt
rapl-exporter:
  privileged: true
  volumes:
    - /sys/class/powercap:/sys/class/powercap:ro
    - /sys/devices:/sys/devices:ro
```

Troubleshooting:

| Problem | Solution |
|---|---|
| No metrics | Check if RAPL is available: `ls /sys/class/powercap/intel-rapl:*` |
| Permission denied | Run the container as privileged or with CAP_SYS_RAWIO |
| Zero values | Some systems disable RAPL; check BIOS settings |
## docker-stats-exporter

- Port: 9501
- Data Source: Docker daemon API via `/var/run/docker.sock`

How it collects data:

- **Docker Socket**: Connects to the Docker daemon through a Unix socket
- **Container Stats**: Uses the `docker stats` command via subprocess
- **Process Stats**: Reads Docker engine process stats from `/proc`
Data flow:

```
Docker Daemon (dockerd)
    ↓
/var/run/docker.sock (Unix socket)
    ↓
docker_stats_exporter.py executes:
  - docker stats --no-stream
  - ps aux | grep dockerd
    ↓
Parses output and converts to metrics
    ↓
Exposes metrics at :9501/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
docker_engine_cpu_percent 2.5
docker_engine_memory_percent 1.2
docker_container_cpu_percent{name="nginx-rtmp"} 15.3
docker_container_memory_percent{name="nginx-rtmp"} 2.1
```
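The subprocess-based collection can be sketched as follows. This is a hedged illustration assuming a tab-separated `docker stats --format` template; the exporter's actual parsing may differ:

```python
import subprocess

def parse_stats(output: str) -> dict[str, tuple[float, float]]:
    """Map container name -> (cpu_percent, mem_percent) from formatted output."""
    stats = {}
    for line in output.strip().splitlines():
        name, cpu, mem = line.split("\t")
        stats[name] = (float(cpu.rstrip("%")), float(mem.rstrip("%")))
    return stats

def collect_container_stats() -> dict[str, tuple[float, float]]:
    """Run `docker stats` once (no streaming) and parse its output."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_stats(out)

# A line such as "nginx-rtmp\t15.30%\t2.10%" becomes {"nginx-rtmp": (15.3, 2.1)}
```

Using `--no-stream` makes each collection a single snapshot, which fits a pull-based exporter better than the default continuously refreshing output.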
Setup requirements:

- Docker socket mounted into the container
- Host /proc filesystem mounted for process stats

```yaml
# docker-compose.yml excerpt
docker-stats-exporter:
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - /proc:/host/proc:ro
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Cannot connect to Docker | Check socket mount: `ls -la /var/run/docker.sock` |
| Permission denied | Add the user to the docker group or run as root |
| No container stats | Ensure Docker containers are running |
## node-exporter

- Port: 9100
- Data Source: Host system metrics from `/proc` and `/sys`

How it collects data:

- **Filesystem Access**: Reads from `/proc`, `/sys`, and the root filesystem
- **System Interfaces**: Accesses kernel-exposed metrics
- **Standard Linux Monitoring**: Uses standard OS interfaces
Data flow:

```
Host System Kernel
    ↓
/proc/* (CPU, memory, network)
/sys/* (hardware info)
    ↓
node_exporter reads pseudo-files
    ↓
Converts to Prometheus metrics
    ↓
Exposes metrics at :9100/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
node_memory_MemAvailable_bytes 8589934592
node_network_receive_bytes_total{device="eth0"} 123456789
node_disk_io_time_seconds_total{device="sda"} 456.78
```
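Counters like `node_cpu_seconds_total` are cumulative, so useful numbers come from deltas between two samples. Here is a small, purely illustrative sketch of the same arithmetic that PromQL's `rate()` performs for CPU utilization; it is not part of node-exporter itself:

```python
def cpu_utilization_percent(idle_prev: float, idle_curr: float,
                            total_prev: float, total_curr: float) -> float:
    """Busy CPU share over a window, from cumulative idle and all-mode counters."""
    idle_delta = idle_curr - idle_prev       # idle seconds accrued in the window
    total_delta = total_curr - total_prev    # all-mode CPU seconds in the window
    return 100.0 * (1.0 - idle_delta / total_delta)

# 5 idle seconds out of 10 total CPU-seconds in the window -> 50% busy
print(cpu_utilization_percent(100.0, 105.0, 200.0, 210.0))  # → 50.0
```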
Setup requirements:

- Host filesystem mounted into the container
- Proper path remapping via command-line flags

```yaml
# docker-compose.yml excerpt
node-exporter:
  pid: host
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - "--path.procfs=/host/proc"
    - "--path.sysfs=/host/sys"
    - "--path.rootfs=/rootfs"
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Missing metrics | Check the volume mounts are correct |
| No CPU metrics | Verify /proc is mounted at /host/proc |
| No network metrics | Check /sys/class/net is accessible |
## cadvisor

- Port: 8080
- Data Source: Docker daemon + cgroup filesystem

How it collects data:

- **cgroups**: Reads container resource usage from `/sys/fs/cgroup`
- **Docker API**: Gets container metadata from Docker
- **Filesystem Metrics**: Monitors container filesystem usage
Data flow:

```
Linux cgroups (/sys/fs/cgroup/*)
    +
Docker daemon (container metadata)
    +
Container filesystem (/var/lib/docker)
    ↓
cadvisor aggregates data
    ↓
Exposes metrics at :8080/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
container_cpu_usage_seconds_total{name="nginx-rtmp"} 123.45
container_memory_usage_bytes{name="nginx-rtmp"} 134217728
container_network_receive_bytes_total{name="nginx-rtmp"} 1234567
container_fs_usage_bytes{name="nginx-rtmp"} 52428800
```
Setup requirements:

- Privileged mode for full container visibility
- Multiple volume mounts for complete access

```yaml
# docker-compose.yml excerpt
cadvisor:
  privileged: true
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Missing containers | Check the /var/lib/docker mount |
| No metrics | Verify privileged mode is enabled |
| High CPU usage | Normal for cAdvisor; limit with `--housekeeping_interval` |
## nginx-rtmp-exporter

- Port: 9728
- Data Source: Nginx RTMP stat endpoint

How it collects data:

- **HTTP Stat Endpoint**: Queries `http://nginx-rtmp/stat` (XML format)
- **Nginx Built-in Stats**: The Nginx RTMP module exposes its internal state
- **Real-time Monitoring**: Gets current connection and stream info
Data flow:

```
Nginx RTMP Server (streaming activity)
    ↓
/stat endpoint (HTTP, XML format)
    ↓
nginx-rtmp-exporter queries endpoint
    ↓
Parses XML and converts to metrics
    ↓
Exposes metrics at :9728/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
nginx_rtmp_connections 5
nginx_rtmp_streams 2
nginx_rtmp_bandwidth_in_bytes 1500000
nginx_rtmp_bandwidth_out_bytes 3000000
```
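The XML-parsing step can be sketched with the standard library. The element names used here (`live`, `stream`, `bw_in`, `bw_out`) follow the nginx-rtmp module's stat schema but are an assumption; verify them against your module version:

```python
import xml.etree.ElementTree as ET

def parse_rtmp_stat(xml_text: str) -> dict[str, int]:
    """Count live streams and sum their bandwidth from rtmp_stat XML."""
    root = ET.fromstring(xml_text)
    streams = root.findall(".//live/stream")
    return {
        "streams": len(streams),
        "bw_in": sum(int(s.findtext("bw_in", "0")) for s in streams),
        "bw_out": sum(int(s.findtext("bw_out", "0")) for s in streams),
    }

# A trimmed-down sample of what /stat returns:
SAMPLE = """<rtmp><server><application><name>live</name><live>
<stream><name>test</name><bw_in>1500000</bw_in><bw_out>3000000</bw_out></stream>
</live></application></server></rtmp>"""
print(parse_rtmp_stat(SAMPLE))  # → {'streams': 1, 'bw_in': 1500000, 'bw_out': 3000000}
```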
Setup requirements:

- Nginx RTMP must be running
- Stat endpoint enabled in nginx.conf
- Network connectivity to the nginx-rtmp container

```nginx
# nginx.conf excerpt
rtmp {
    server {
        listen 1935;
        application live {
            live on;
        }
    }
}

http {
    server {
        listen 80;
        location /stat {
            rtmp_stat all;
            rtmp_stat_stylesheet stat.xsl;
        }
    }
}
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Cannot reach /stat | Check nginx-rtmp is running and /stat is enabled |
| No streams | Start an FFmpeg stream: `ffmpeg -re -i video.mp4 -c copy -f flv rtmp://localhost/live/stream` |
| Connection refused | Verify network connectivity between containers |
## results-exporter

- Port: 9502
- Data Source: Test result JSON files + Prometheus historical data

How it collects data:

- **JSON Files**: Reads from `/results/test_results_*.json` (mounted volume)
- **Prometheus Queries**: Fetches historical metrics for each test scenario
- **Time-windowed Queries**: Uses start_time and end_time from the test results
Data flow:

```
run_tests.py executes FFmpeg tests
    ↓
Writes test_results/test_results_TIMESTAMP.json
    ↓
results_exporter reads latest JSON
    ↓
For each scenario:
  - Queries Prometheus for metrics in [start_time, end_time]
  - Aggregates: avg power, max CPU, etc.
    ↓
Exposes scenario metrics at :9502/metrics
    ↓
Prometheus scrapes every 5s (but the exporter caches for 60s)
```

Example metrics:

```
scenario_duration_seconds{scenario="1_stream_1080p"} 60
scenario_power_mean_watts{scenario="1_stream_1080p"} 48.5
scenario_power_max_watts{scenario="1_stream_1080p"} 52.3
scenario_cpu_mean_percent{scenario="1_stream_1080p"} 25.5
scenario_baseline_diff_power_watts{scenario="1_stream_1080p"} 8.2
```
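The time-windowed query step can be sketched with the standard library. This is a hedged illustration of calling Prometheus's `query_range` API over a scenario's `[start_time, end_time]` window and aggregating the samples; the real results_exporter may structure this differently:

```python
import json
import statistics
import urllib.parse
import urllib.request

def aggregate(samples: list[float]) -> dict[str, float]:
    """Reduce raw samples to the mean/max style values the exporter exposes."""
    return {"mean": statistics.fmean(samples), "max": max(samples)}

def scenario_power(prom_url: str, start: float, end: float) -> dict[str, float]:
    """Fetch total RAPL power over one scenario's [start, end] window."""
    params = urllib.parse.urlencode({
        "query": "sum(rapl_power_watts)",
        "start": start, "end": end, "step": "5s",
    })
    with urllib.request.urlopen(f"{prom_url}/api/v1/query_range?{params}") as resp:
        series = json.load(resp)["data"]["result"]
    samples = [float(v) for s in series for _, v in s["values"]]
    return aggregate(samples)

# aggregate([44.0, 48.0, 52.0]) -> {'mean': 48.0, 'max': 52.0}
```

The `step` of 5s matches the scrape interval, so the aggregation sees roughly one sample per scrape.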
Setup requirements:

- Test results directory mounted
- Prometheus URL configured
- Valid test results with timestamps

```yaml
# docker-compose.yml excerpt
results-exporter:
  volumes:
    - ./test_results:/results
  environment:
    - RESULTS_DIR=/results
    - PROMETHEUS_URL=http://victoriametrics:8428
```

Troubleshooting:

| Problem | Solution |
|---|---|
| No scenarios found | Run tests: `python3 scripts/run_tests.py` |
| No test results | Check the mount: `docker exec results-exporter ls /results` |
| Stale metrics | The results exporter caches for 60s; wait or restart |
| Prometheus errors | Verify PROMETHEUS_URL is correct and Prometheus is accessible |
## qoe-exporter

- Port: 9503
- Data Source: Test result JSON files

How it collects data:

- **JSON Files**: Reads from `/results/test_results_*.json`
- **Quality Metrics**: Extracts VMAF, PSNR, and SSIM from test results
- **Calculation**: Computes QoE scores using the advisor module
Data flow:

```
run_tests.py executes tests with quality analysis
    ↓
Calculates VMAF/PSNR/SSIM during tests
    ↓
Writes results to test_results/test_results_TIMESTAMP.json
    ↓
qoe_exporter reads latest JSON
    ↓
Extracts quality metrics and calculates QoE scores
    ↓
Exposes metrics at :9503/metrics
    ↓
Prometheus scrapes every 5s (cached for 60s)
```

Example metrics:

```
qoe_score{scenario="1_stream_1080p"} 4.2
quality_vmaf{scenario="1_stream_1080p"} 95.5
quality_psnr{scenario="1_stream_1080p"} 42.3
quality_ssim{scenario="1_stream_1080p"} 0.98
```
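For intuition only, here is a toy mapping from VMAF (0-100) onto a 1-5 MOS-style scale. This is not the advisor module's formula, just an illustration of how a quality metric can be condensed into a single QoE score:

```python
def vmaf_to_mos(vmaf: float) -> float:
    """Toy linear mapping of VMAF (0-100) to a 1-5 MOS-like score (illustrative only)."""
    clamped = max(0.0, min(vmaf, 100.0))  # guard against out-of-range inputs
    return round(1.0 + 4.0 * clamped / 100.0, 2)

print(vmaf_to_mos(95.5))  # → 4.82
```

Real QoE models typically weight VMAF together with stalling, startup delay, and bitrate switches, which is why the advisor module's score can differ from a direct VMAF rescaling.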
Setup requirements:

- Test results with quality metrics
- Advisor module available on the Python path

```yaml
# docker-compose.yml excerpt
qoe-exporter:
  volumes:
    - ./test_results:/results:ro
    - ./advisor:/app/advisor:ro
  environment:
    - RESULTS_DIR=/results
```

Troubleshooting:

| Problem | Solution |
|---|---|
| No quality metrics | Ensure tests are run with the `--analyze-quality` flag |
| Import errors | Check the advisor module is mounted correctly |
| Zero values | Quality analysis may have failed during the tests |
## cost-exporter

- Port: 9504
- Data Source: Test result JSON files + Prometheus historical data

How it collects data:

- **JSON Files**: Reads from `/results/test_results_*.json`
- **Prometheus Queries**: Fetches CPU usage and power data for load-aware calculations
- **Time-series Integration**: Sums CPU-seconds and energy (joules) over the test duration
Data flow:

```
run_tests.py executes tests
    ↓
Writes test_results/test_results_TIMESTAMP.json
(includes start_time, end_time, duration)
    ↓
cost_exporter reads latest JSON
    ↓
For each scenario:
  - Queries Prometheus for:
    * rate(container_cpu_usage_seconds_total[30s])
    * sum(rapl_power_watts)
  - Integrates over the time window
  - Calculates costs based on the pricing config
    ↓
Exposes cost metrics at :9504/metrics
    ↓
Prometheus scrapes every 5s (cached for 60s)
```

Example metrics:

```
cost_exporter_alive 1
cost_total_load_aware{scenario="1_stream_1080p",currency="USD"} 0.00234
cost_energy_load_aware{scenario="1_stream_1080p",currency="USD"} 0.00012
cost_compute_load_aware{scenario="1_stream_1080p",currency="USD"} 0.00222
```
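The final pricing arithmetic can be sketched as follows, assuming the integrated totals (joules of energy, CPU core-seconds) are already computed, and using the ENERGY_COST_PER_KWH / CPU_COST_PER_HOUR defaults from the compose file; the exporter's own code may differ:

```python
def scenario_cost(energy_joules: float, cpu_core_seconds: float,
                  kwh_rate: float = 0.12, cpu_hour_rate: float = 0.50) -> dict[str, float]:
    """Convert integrated energy and CPU time into energy/compute/total cost."""
    energy_cost = energy_joules / 3_600_000 * kwh_rate       # 1 kWh = 3.6e6 J
    compute_cost = cpu_core_seconds / 3600 * cpu_hour_rate   # core-seconds -> core-hours
    return {
        "energy": round(energy_cost, 5),
        "compute": round(compute_cost, 5),
        "total": round(energy_cost + compute_cost, 5),
    }

# One kWh of energy plus one core-hour of CPU:
print(scenario_cost(3_600_000, 3600))  # → {'energy': 0.12, 'compute': 0.5, 'total': 0.62}
```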
Setup requirements:

- Test results with timestamps
- Prometheus URL for load-aware mode
- Pricing configuration via environment variables

```yaml
# docker-compose.yml excerpt
cost-exporter:
  volumes:
    - ./test_results:/results:ro
    - ./advisor:/app/advisor:ro
  environment:
    - RESULTS_DIR=/results
    - PROMETHEUS_URL=http://victoriametrics:8428
    - ENERGY_COST_PER_KWH=0.12
    - CPU_COST_PER_HOUR=0.50
    - CURRENCY=USD
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Zero cost values | Check if Prometheus data is available for the test time windows |
| No load-aware data | Verify PROMETHEUS_URL is set and accessible |
| Missing metrics | Enable debug logging: `docker logs cost-exporter --tail 100` |
| Stale prices | Restart the container after changing environment variables |
## dcgm-exporter

- Port: 9400
- Profile: nvidia (optional)
- Data Source: NVIDIA GPU via the DCGM library

How it collects data:

- **NVIDIA DCGM**: Connects to the GPU through the Data Center GPU Manager
- **GPU Telemetry**: Reads power, utilization, temperature, and memory
- **CUDA Runtime**: Requires nvidia-container-runtime
Data flow:

```
NVIDIA GPU Hardware
    ↓
NVIDIA Driver (host)
    ↓
nvidia-container-runtime
    ↓
DCGM library in container
    ↓
dcgm-exporter queries GPU
    ↓
Exposes metrics at :9400/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
DCGM_FI_DEV_GPU_UTIL 75
DCGM_FI_DEV_POWER_USAGE 180.5
DCGM_FI_DEV_GPU_TEMP 65
DCGM_FI_DEV_FB_USED 4096
```
Setup requirements:

- NVIDIA GPU installed
- NVIDIA drivers on the host
- nvidia-container-toolkit installed
- Docker Compose nvidia profile

```bash
# Install nvidia-container-toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Start with the NVIDIA profile
docker compose --profile nvidia up -d
```

```yaml
# docker-compose.yml excerpt
dcgm-exporter:
  profiles:
    - nvidia
  runtime: nvidia
  cap_add:
    - SYS_ADMIN
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Container won't start | Check nvidia-docker: `docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi` |
| No GPU metrics | Verify the GPU is visible: `nvidia-smi` |
| Permission denied | Add the SYS_ADMIN capability |
| Wrong runtime | Ensure docker-compose uses `runtime: nvidia` |
## exporter-health-checker

- Port: 9600
- Data Source: All other exporters' /metrics endpoints

How it collects data:

- **HTTP Requests**: Periodically fetches /metrics from each exporter
- **Metric Validation**: Parses each response and validates the expected metrics
- **Health Assessment**: Checks reachability, metrics presence, and data availability
Data flow:

```
All exporters expose :PORT/metrics
    ↓
exporter-health-checker periodically fetches:
  - nginx-rtmp-exporter:9728/metrics
  - rapl-exporter:9500/metrics
  - docker-stats-exporter:9501/metrics
  - (etc...)
    ↓
For each exporter:
  - Validates HTTP 200 response
  - Parses Prometheus metrics format
  - Checks expected metrics are present
  - Verifies data exists (not just definitions)
    ↓
Exposes health metrics at :9600/metrics
    ↓
Prometheus scrapes every 5s
```

Example metrics:

```
exporter_health_status{exporter="rapl-exporter"} 1
exporter_reachable{exporter="rapl-exporter"} 1
exporter_metric_count{exporter="rapl-exporter"} 12
exporter_sample_count{exporter="rapl-exporter"} 48
exporter_has_data{exporter="rapl-exporter"} 1
```
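The per-exporter validation step can be sketched like this. It is a minimal illustration of fetching a /metrics page and checking that expected metric families actually carry samples; the real health checker likely performs more checks:

```python
import urllib.request

def assess(text: str, expected: set[str]) -> dict[str, int]:
    """Judge one /metrics payload: count samples and check expected names have data."""
    # Sample lines are non-empty and not comments (# HELP / # TYPE).
    samples = [l for l in text.splitlines() if l and not l.startswith("#")]
    names = {l.split("{")[0].split(" ")[0] for l in samples}
    return {
        "reachable": 1,
        "sample_count": len(samples),
        "has_data": int(bool(samples)),
        "healthy": int(expected <= names),
    }

def check_exporter(url: str, expected: set[str], timeout: float = 5.0) -> dict[str, int]:
    """Fetch an exporter's /metrics endpoint; mark it unreachable on any I/O error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return assess(resp.read().decode(), expected)
    except OSError:
        return {"reachable": 0, "sample_count": 0, "has_data": 0, "healthy": 0}
```

Distinguishing "has samples" from "has definitions" matters: an exporter can emit only `# HELP`/`# TYPE` lines while its collector is silently failing.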
Setup requirements:

- Network access to all exporters
- Python 3.11+ runtime

```yaml
# docker-compose.yml excerpt
exporter-health-checker:
  networks:
    - streaming-net
  command: ["--port", "9600"]
```

Troubleshooting:

| Problem | Solution |
|---|---|
| Cannot reach exporters | Check all exporters are running: `docker ps` |
| Wrong URLs | Verify exporter names match docker-compose service names |
| Timeout errors | Increase the timeout: `--timeout 30` |
## Data flow architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Data Sources                                │
├─────────────────────────────────────────────────────────────────────┤
│ • Intel RAPL (/sys/class/powercap)                                  │
│ • Docker Daemon (/var/run/docker.sock)                              │
│ • Linux Kernel (/proc, /sys)                                        │
│ • cgroups (/sys/fs/cgroup)                                          │
│ • Nginx RTMP (HTTP stat endpoint)                                   │
│ • Test Results (JSON files)                                         │
│ • NVIDIA GPU (DCGM API)                                             │
└─────────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────────┐
│                      Exporters (Collectors)                         │
├─────────────────────────────────────────────────────────────────────┤
│ rapl-exporter:9500      │ Reads RAPL counters                       │
│ docker-stats:9501       │ Queries Docker API                        │
│ node-exporter:9100      │ Reads /proc, /sys                         │
│ cadvisor:8080           │ Reads cgroups                             │
│ nginx-exporter:9728     │ Queries Nginx /stat                       │
│ results-exporter:9502   │ Reads JSON + queries Prometheus           │
│ qoe-exporter:9503       │ Reads JSON, calculates QoE                │
│ cost-exporter:9504      │ Reads JSON + queries Prometheus           │
│ dcgm-exporter:9400      │ Queries NVIDIA GPU                        │
│ health-checker:9600     │ Queries all exporters                     │
└─────────────────────────────────────────────────────────────────────┘
                                ↓
                All expose Prometheus metrics at /metrics
                                ↓
┌─────────────────────────────────────────────────────────────────────┐
│                 Prometheus (Time-series Database)                   │
├─────────────────────────────────────────────────────────────────────┤
│ • Scrapes all exporters every 5 seconds                             │
│ • Stores metrics with 7-day retention                               │
│ • Evaluates alert rules every 5 seconds                             │
│ • Provides PromQL query interface                                   │
└─────────────────────────────────────────────────────────────────────┘
        ↓                      ↓                      ↓
┌─────────────┐   ┌──────────────────┐   ┌──────────────────┐
│   Grafana   │   │   Alertmanager   │   │     Analysis     │
│    (Viz)    │   │     (Alerts)     │   │    Exporters     │
│    :3000    │   │      :9093       │   │    (Re-query)    │
└─────────────┘   └──────────────────┘   └──────────────────┘
                                                  ↓
                                      results/qoe/cost exporters
                                      query Prometheus for
                                      historical test data
```
## Test data lifecycle

```
1. User runs: python3 scripts/run_tests.py
2. Test runner:
   ├─ Records start_time = current timestamp
   ├─ Starts FFmpeg streaming to Nginx RTMP
   ├─ Waits for test duration
   ├─ Stops FFmpeg
   ├─ Records end_time = current timestamp
   └─ Writes test_results_TIMESTAMP.json
3. All exporters collect metrics during the test:
   ├─ rapl-exporter → power consumption
   ├─ cadvisor → container CPU/memory
   ├─ node-exporter → system metrics
   └─ nginx-exporter → streaming stats
4. Prometheus stores all metrics with timestamps
5. Analysis exporters process results:
   ├─ results-exporter queries Prometheus for [start_time, end_time]
   ├─ Aggregates power, CPU, etc. for the test window
   ├─ qoe-exporter reads quality metrics from JSON
   └─ cost-exporter queries Prometheus + calculates costs
6. Grafana visualizes:
   ├─ Real-time metrics from Prometheus
   └─ Aggregated scenario metrics from analysis exporters
```
## Setup

Prerequisites:

- Docker & Docker Compose

  ```bash
  # Install Docker
  curl -fsSL https://get.docker.com -o get-docker.sh
  sudo sh get-docker.sh
  sudo usermod -aG docker $USER

  # Install Docker Compose
  sudo apt-get install docker-compose-plugin
  ```

- Python 3 (for the test runner)

  ```bash
  sudo apt-get install python3 python3-pip
  pip3 install -r requirements.txt
  ```

- FFmpeg (for streaming tests)

  ```bash
  sudo apt-get install ffmpeg
  ```
```bash
git clone <repository-url>
cd ffmpeg-rtmp
docker compose up -d --build

# Check all containers are running
docker ps

# Test each exporter
curl http://localhost:9500/metrics | head   # RAPL
curl http://localhost:9501/metrics | head   # Docker Stats
curl http://localhost:9100/metrics | head   # Node Exporter
curl http://localhost:8080/metrics | head   # cAdvisor
curl http://localhost:9728/metrics | head   # Nginx RTMP
curl http://localhost:9502/metrics | head   # Results
curl http://localhost:9503/metrics | head   # QoE
curl http://localhost:9504/metrics | head   # Cost
curl http://localhost:9600/metrics | head   # Health Check

# Open the Prometheus targets page
open http://localhost:8428/targets

# Or check via CLI
curl -s http://localhost:8428/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```

All targets should show `health: "up"`.
```bash
# Run a quick single-stream test
python3 scripts/run_tests.py --name quick --streams 1 --duration 60

# Check results were created
ls -lh test_results/

# Verify the results exporter picked it up
curl http://localhost:9502/metrics | grep scenario_

# Open Grafana
open http://localhost:3000
# Login: admin / admin
```

Navigate to the pre-provisioned dashboards:

- Power Monitoring Dashboard
- Cost Dashboard
- QoE Dashboard
## Troubleshooting

General diagnostics:

- Check container status:

  ```bash
  docker ps -a
  docker logs <container-name>
  ```

- Check exporter health:

  ```bash
  # Run the health checker
  python3 check_exporters_health.py
  # Or check via the container
  docker exec exporter-health-checker python3 /app/check_exporters_health.py
  ```

- Check Prometheus scrape errors:

  ```bash
  # View Prometheus logs
  docker logs prometheus
  # Check the targets page
  curl http://localhost:8428/api/v1/targets
  ```

- Enable debug logging:

  ```bash
  # For cost-exporter
  docker logs cost-exporter --follow
  # Restart with debug logging
  docker compose stop cost-exporter
  docker compose up cost-exporter
  ```
### Missing RAPL metrics

Symptoms: `rapl_power_watts` not in Prometheus

Solutions:

- Check RAPL availability: `ls -la /sys/class/powercap/intel-rapl:*`
- Verify the container has access: `docker exec rapl-exporter ls /sys/class/powercap`
- Check for an Intel CPU: `cat /proc/cpuinfo | grep "model name"`
- Some systems disable RAPL; check BIOS settings
### Zero cost values

Symptoms: `cost_total_load_aware{...} 0`

Diagnosis:

- Check if test results exist: `docker exec cost-exporter ls /results`
- Enable debug logging: `docker logs cost-exporter 2>&1 | grep DEBUG`
- Check Prometheus connectivity: `docker exec cost-exporter curl -s http://victoriametrics:8428/-/healthy`

Solutions:

- Run tests to generate data: `python3 scripts/run_tests.py`
- Verify the PROMETHEUS_URL environment variable
- Check that the test results have start_time and end_time
- Verify Prometheus has data for the test time window
### No RTMP streams reported

Symptoms: `nginx_rtmp_streams 0` even during active streaming

Solutions:

- Check Nginx is receiving streams: `curl http://localhost:8080/stat`
- Start a test stream:

  ```bash
  ffmpeg -re -f lavfi -i testsrc=duration=60:size=1280x720:rate=30 \
    -f lavfi -i sine=frequency=1000:duration=60 \
    -c:v libx264 -preset veryfast -b:v 1000k \
    -c:a aac -b:a 128k \
    -f flv rtmp://localhost/live/test
  ```

- Check the nginx-exporter logs: `docker logs nginx-exporter`
### Stale scenario metrics

Symptoms: Old scenarios in the metrics

Solutions:

- Check the cache TTL (60 seconds by default)
- Force a refresh by restarting: `docker restart results-exporter`
- Verify new test results exist: `ls -lt test_results/*.json | head -1`
### Exporter up but no metrics

Symptoms: Exporter UP but no metrics in Prometheus

Solutions:

- Check the Prometheus config: `docker exec prometheus cat /etc/prometheus/prometheus.yml`
- Reload the configuration: `curl -X POST http://localhost:8428/-/reload`
- Check scrape errors: `curl http://localhost:8428/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'`
### High resource usage

Symptoms: Very high CPU/memory usage

Solutions:

- cAdvisor: increase the housekeeping interval

  ```yaml
  cadvisor:
    command:
      - --housekeeping_interval=30s
  ```

- Prometheus: reduce the scrape frequency

  ```yaml
  global:
    scrape_interval: 15s  # instead of 5s
  ```

- Reduce retention:

  ```yaml
  prometheus:
    command:
      - "--storage.tsdb.retention.time=3d"  # instead of 7d
  ```
## Diagnostic checklist

Use this checklist for systematic troubleshooting:

- All containers running: `docker ps`
- All exporters responding: `curl localhost:<port>/metrics`
- Prometheus targets UP: http://localhost:8428/targets
- Test results exist: `ls test_results/`
- RAPL available: `ls /sys/class/powercap/`
- Docker socket accessible: `docker ps`
- Prometheus can query exporters: check service discovery
- Grafana datasource connected: check Grafana settings
- No network issues: `docker network ls`
- Sufficient disk space: `df -h`
## If issues persist

- Collect logs:

  ```bash
  docker logs prometheus > prometheus.log
  docker logs cost-exporter > cost-exporter.log
  docker logs rapl-exporter > rapl-exporter.log
  ```

- Run a health check:

  ```bash
  python3 check_exporters_health.py --debug > health-check.log
  ```

- Check system info:

  ```bash
  uname -a > system-info.txt
  docker version >> system-info.txt
  docker compose version >> system-info.txt
  cat /proc/cpuinfo | grep "model name" >> system-info.txt
  ```

- Create a minimal reproduction:

  ```bash
  # Stop everything
  docker compose down
  # Start only the essentials
  docker compose up -d prometheus rapl-exporter
  # Test the minimal setup
  curl http://localhost:9500/metrics
  curl "http://localhost:8428/api/v1/query?query=rapl_power_watts"
  ```