This document describes the bandwidth tracking and monitoring capabilities in the FFmpeg-RTMP distributed transcoding system.
Bandwidth metrics track the data throughput of transcoding jobs, measuring both input file consumption and output file generation. These metrics help you:
- Optimize resource allocation: Understand network and storage I/O requirements
- Plan capacity: Predict bandwidth needs for scaling
- Analyze costs: Track data transfer for cloud deployments
- Monitor performance: Identify bottlenecks in data pipelines
- Track SLAs: Ensure throughput meets service level agreements
These metrics track individual job bandwidth characteristics:
Metric: ffrtmp_job_last_input_bytes
Type: Gauge
Description: Size of input file from the last completed job (in bytes)
Labels: node_id
Example:
ffrtmp_job_last_input_bytes{node_id="worker-1:9091"} 52428800
Metric: ffrtmp_job_last_output_bytes
Type: Gauge
Description: Size of output file from the last completed job (in bytes)
Labels: node_id
Example:
ffrtmp_job_last_output_bytes{node_id="worker-1:9091"} 41943040
Metric: ffrtmp_job_last_bandwidth_mbps
Type: Gauge
Description: Bandwidth utilization for the last completed job, in megabits per second (Mbps)
Calculation: ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)
Labels: node_id
Example:
ffrtmp_job_last_bandwidth_mbps{node_id="worker-1:9091"} 15.32
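As a sanity check, the calculation above can be reproduced in a few lines of Python (the function name is illustrative, not part of the system):

```python
def job_bandwidth_mbps(input_bytes, output_bytes, duration_seconds):
    """Documented formula: total bits transferred, divided by Mi-bits per second."""
    return ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)

# A 50 MiB input and 40 MiB output over a 60-second job:
print(job_bandwidth_mbps(52428800, 41943040, 60))  # → 12.0
```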
These metrics track total bandwidth across all jobs on a worker:
Metric: ffrtmp_job_input_bytes_total
Type: Counter
Description: Total bytes read from input files across all jobs since worker startup
Labels: node_id
Example:
ffrtmp_job_input_bytes_total{node_id="worker-1:9091"} 524288000
Query rate of input data processing:
rate(ffrtmp_job_input_bytes_total[5m])
Metric: ffrtmp_job_output_bytes_total
Type: Counter
Description: Total bytes written to output files across all jobs since worker startup
Labels: node_id
Example:
ffrtmp_job_output_bytes_total{node_id="worker-1:9091"} 419430400
Query rate of output data generation:
rate(ffrtmp_job_output_bytes_total[5m])
Metric: ffrtmp_worker_bandwidth_utilization
Type: Gauge
Description: Worker overall bandwidth utilization as percentage (0-100)
Calculation: Normalized metric based on recent job bandwidth activity
Labels: node_id
Example:
ffrtmp_worker_bandwidth_utilization{node_id="worker-1:9091"} 45.2
Bandwidth metrics are also included in job results returned by the worker:
{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {
    "duration": 62.5,
    "input_file_bytes": 52428800,
    "output_file_bytes": 41943040,
    "bandwidth_mbps": 12.08,
    "input_generation_duration_sec": 2.3,
    "input_file_size_bytes": 52428800
  }
}

- input_file_bytes: Size of input file processed (bytes)
- output_file_bytes: Size of output file generated (bytes)
- bandwidth_mbps: Bandwidth utilization for this job (Mbps)
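A minimal sketch of consuming these fields from a job result, assuming only the JSON shape shown above (no client API is implied):

```python
import json

# Parse a job result payload as returned by the worker
result = json.loads('''{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {"duration": 62.5, "input_file_bytes": 52428800,
              "output_file_bytes": 41943040, "bandwidth_mbps": 12.08}
}''')

m = result["metrics"]
# Bytes saved by transcoding, converted to MiB
saved_mib = (m["input_file_bytes"] - m["output_file_bytes"]) / (1024 * 1024)
print(f"{result['job_id']}: {m['bandwidth_mbps']} Mbps, {saved_mib:.0f} MiB saved")
```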
Total data processed by all workers:
sum(ffrtmp_job_input_bytes_total)
Total data generated by all workers:
sum(ffrtmp_job_output_bytes_total)
Average bandwidth per worker:
avg(ffrtmp_job_last_bandwidth_mbps)
Peak bandwidth utilization:
max(ffrtmp_job_last_bandwidth_mbps)
Input data processing rate (MB/s) over 5 minutes:
rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)
Output data generation rate (MB/s) over 5 minutes:
rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)
Total I/O rate per worker:
(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)
Calculate average compression ratio:
(
sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
) / sum(ffrtmp_job_input_bytes_total) * 100
Per-worker compression ratio:
(
ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
) / ffrtmp_job_input_bytes_total * 100
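The same ratio can be computed offline from raw byte counts; a minimal sketch mirroring the PromQL above:

```python
def compression_ratio_pct(input_bytes, output_bytes):
    """Percentage of input data eliminated by transcoding."""
    return (input_bytes - output_bytes) / input_bytes * 100

# Using the totals from the metric examples above:
print(compression_ratio_pct(524288000, 419430400))  # → 20.0
```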
Predict bandwidth requirements for scaling:
# Average Mbps per active job
avg(ffrtmp_job_last_bandwidth_mbps / ffrtmp_worker_active_jobs)
Bandwidth headroom (assuming 1 Gbps network):
1000 - sum(ffrtmp_job_last_bandwidth_mbps)
Bandwidth trend over 24 hours:
avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])
Peak bandwidth in last hour:
max_over_time(ffrtmp_job_last_bandwidth_mbps[1h])
Total data processed in last 24 hours:
increase(ffrtmp_job_input_bytes_total[24h])
Create a graph panel with these queries:
- expr: rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)
  legend: "Input Rate (MB/s) - {{node_id}}"
- expr: rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)
  legend: "Output Rate (MB/s) - {{node_id}}"
- expr: ffrtmp_job_last_bandwidth_mbps
  legend: "Current Bandwidth (Mbps) - {{node_id}}"

Single stat panel showing compression ratio:
- expr: |
    (
      sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
    ) / sum(ffrtmp_job_input_bytes_total) * 100
  legend: "Compression Ratio (%)"

Stat panels showing cumulative totals:
- expr: sum(ffrtmp_job_input_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Input (GB)"
- expr: sum(ffrtmp_job_output_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Output (GB)"

Table panel with per-worker breakdown:
- expr: |
    sum by (node_id) (
      rate(ffrtmp_job_input_bytes_total[5m]) +
      rate(ffrtmp_job_output_bytes_total[5m])
    ) / (1024 * 1024)
  legend: "{{node_id}}"

Alert when worker bandwidth exceeds 80% of capacity:
- alert: HighBandwidthUtilization
  expr: ffrtmp_worker_bandwidth_utilization > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High bandwidth utilization on {{$labels.node_id}}"
    description: "Worker {{$labels.node_id}} is using {{$value}}% of bandwidth capacity for over 10 minutes"

Alert on sudden bandwidth increases:
- alert: BandwidthSpike
  expr: |
    rate(ffrtmp_job_input_bytes_total[1m]) >
    avg_over_time(rate(ffrtmp_job_input_bytes_total[1m])[5m:1m]) * 2
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Bandwidth spike detected on {{$labels.node_id}}"
    description: "Input bandwidth is 2x above 5-minute average"

Alert when compression ratio drops below expected:
- alert: LowCompressionRatio
  expr: |
    (
      ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
    ) / ffrtmp_job_input_bytes_total * 100 < 10
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Low compression ratio on {{$labels.node_id}}"
    description: "Compression ratio is {{$value}}%, below 10% threshold"

Best Practices:
- Co-locate workers with storage: Minimize network hops between input/output storage and workers
- Use high-bandwidth networks: 10 Gbps recommended for high-throughput workloads
- Monitor network saturation: Track ffrtmp_worker_bandwidth_utilization to avoid bottlenecks
- Consider bandwidth limits: Use resource limits to cap bandwidth per job if needed
Network Requirements by Resolution:
| Resolution | Input Bitrate | Output Bitrate | Total Bandwidth |
|---|---|---|---|
| 720p | 5-10 Mbps | 3-5 Mbps | 8-15 Mbps |
| 1080p | 10-20 Mbps | 5-10 Mbps | 15-30 Mbps |
| 4K | 50-100 Mbps | 25-50 Mbps | 75-150 Mbps |
Optimize file access:
- Use local storage for input/output: Avoid network filesystems (NFS, CIFS) for temporary files
- SSD for temporary files: Fast I/O reduces job latency
- Dedicated mount points: Separate /tmp or /var/transcoding for worker I/O
- Monitor disk bandwidth: Use iostat to track disk utilization
Disk I/O Requirements:
- Sequential read: 500+ MB/s (SSD recommended)
- Sequential write: 500+ MB/s (SSD recommended)
- Random I/O: Less critical for video files (mostly sequential)
FFmpeg and GStreamer benefit from file system caching:
# Check cache usage
free -h
# Tune cache behavior to favor retaining cached file data
echo 3 > /proc/sys/vm/drop_caches  # Clear caches first (clean baseline)
sysctl -w vm.vfs_cache_pressure=50 # Favor cache retention

To calculate required bandwidth for N concurrent jobs:
Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin
Where:
- N = max_concurrent_jobs per worker
- Avg_Job_Bandwidth = typical Mbps per job (from metrics)
- Safety_Margin = 1.5 (50% headroom)
Example:
- 4 concurrent jobs per worker
- Average 20 Mbps per job
- Safety margin: 1.5x
Total = 4 * 20 * 1.5 = 120 Mbps
Recommendation: 1 Gbps network interface (provides 8x headroom)
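The worked example above, expressed as a small helper (the function name is illustrative):

```python
def required_bandwidth_mbps(concurrent_jobs, avg_job_mbps, safety_margin=1.5):
    """Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin."""
    return concurrent_jobs * avg_job_mbps * safety_margin

# 4 concurrent jobs at 20 Mbps each, with 50% headroom:
print(required_bandwidth_mbps(4, 20))  # → 120.0
```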
To calculate storage needs:
Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days
Where:
- Daily_Jobs = jobs processed per day
- Avg_Input_Size = average input file size (GB)
- Retention_Days = how long to keep files
Example:
- 1000 jobs per day
- Average 500 MB input size
- 7 days retention
Storage = 1000 * 0.5 * 7 = 3500 GB (3.5 TB)
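The storage sizing formula, sketched the same way (illustrative helper name):

```python
def required_storage_gb(daily_jobs, avg_input_gb, retention_days):
    """Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days."""
    return daily_jobs * avg_input_gb * retention_days

# 1000 jobs/day, 500 MB average input, 7-day retention:
print(required_storage_gb(1000, 0.5, 7))  # → 3500.0
```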
Symptom: ffrtmp_worker_bandwidth_utilization constantly > 80%
Diagnosis:
# Check which workers are saturated
ffrtmp_worker_bandwidth_utilization > 80
# Check job bandwidth
topk(10, ffrtmp_job_last_bandwidth_mbps)
Solutions:
- Reduce max-concurrent-jobs per worker
- Add more workers to distribute load
- Upgrade network infrastructure
- Use lower bitrate encoding presets
Symptom: Jobs take longer than expected, but bandwidth is low
Diagnosis:
# Check if CPU-bound instead of I/O-bound
ffrtmp_worker_cpu_usage > 90 and ffrtmp_worker_bandwidth_utilization < 50
Solutions:
- CPU bottleneck - add more CPU cores or workers
- Use hardware encoding (NVENC, QSV, VAAPI)
- Adjust encoder presets (faster preset = less CPU)
Symptom: Bandwidth varies wildly between jobs
Diagnosis:
# Check standard deviation
stddev_over_time(ffrtmp_job_last_bandwidth_mbps[1h])
Solutions:
- Normalize input files (consistent bitrate/resolution)
- Use resource limits to cap bandwidth per job
- Investigate storage I/O issues (slow disk)
Symptom: Metrics show 0 bytes for input/output
Causes:
- RTMP streaming mode (no file I/O, only network streaming)
- Input/output paths not specified in job parameters
- Files cleaned up before metrics collection
Verification:
# Check if files exist after job
ls -lh /tmp/*.mp4
# Check job parameters
curl http://master:8080/api/v1/jobs/{job_id}

import requests
import time
PROMETHEUS_URL = "http://localhost:9090"
def get_bandwidth_metrics():
    query = "sum(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)"
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    data = response.json()
    if data["status"] == "success" and data["data"]["result"]:
        mbps = float(data["data"]["result"][0]["value"][1])
        return mbps
    return 0

def monitor_bandwidth(interval=10, threshold=100):
    """Monitor bandwidth and alert if threshold exceeded"""
    while True:
        mbps = get_bandwidth_metrics()
        print(f"Current bandwidth: {mbps:.2f} MB/s")
        if mbps > threshold:
            print(f"WARNING: Bandwidth {mbps:.2f} MB/s exceeds threshold {threshold} MB/s")
        time.sleep(interval)

if __name__ == "__main__":
    monitor_bandwidth(interval=30, threshold=100)

#!/bin/bash
# Check if bandwidth capacity is available
PROMETHEUS_URL="http://localhost:9090"
MAX_BANDWIDTH_MBPS=1000 # 1 Gbps
THRESHOLD=80 # 80% utilization
# Get current bandwidth utilization
QUERY="sum(ffrtmp_worker_bandwidth_utilization)"
CURRENT=$(curl -s "${PROMETHEUS_URL}/api/v1/query?query=${QUERY}" | jq -r '.data.result[0].value[1]')
echo "Current bandwidth utilization: ${CURRENT}%"
if (( $(echo "$CURRENT > $THRESHOLD" | bc -l) )); then
echo " WARNING: Bandwidth utilization above ${THRESHOLD}%"
exit 1
else
echo " Bandwidth capacity available"
exit 0
fi

Before scaling, establish bandwidth baseline:
# Run for 24 hours, then query:
avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])
Define per-job bandwidth limits in job parameters (future feature):
{
  "resource_limits": {
    "max_bandwidth_mbps": 50
  }
}

Track bandwidth trends weekly:
avg_over_time(rate(ffrtmp_job_input_bytes_total[1d])[7d:1d]) / (1024 * 1024)
Calculate peak bandwidth needs:
max_over_time(sum(ffrtmp_job_last_bandwidth_mbps)[7d:5m])
Lower bitrate = lower bandwidth:
{
  "bitrate": "2M",  // 2 Mbps output
  "preset": "fast"  // Less CPU, more bandwidth
}

- Resource Limits Guide - CPU/memory/disk management
- Production Operations - Production deployment best practices
- Alerting Guide - Prometheus alerting configuration
- Worker Deployment - Worker setup and configuration
For issues or questions about bandwidth metrics:
- Check Prometheus metrics endpoint: http://worker:9091/metrics
- Verify job parameters include input and output paths
- Review worker logs for file size detection
- Check GitHub issues for known problems