
Bandwidth Metrics Guide

This document describes the bandwidth tracking and monitoring capabilities in the FFmpeg-RTMP distributed transcoding system.

Overview

Bandwidth metrics track the data throughput of transcoding jobs, measuring both input file consumption and output file generation. These metrics help you:

  • Resource allocation: understand network and storage I/O requirements
  • Capacity planning: predict bandwidth needs for scaling
  • Cost analysis: track data-transfer volumes for cloud deployments
  • Performance monitoring: identify bottlenecks in data pipelines
  • SLA tracking: ensure throughput meets service-level agreements

Available Metrics

Per-Job Metrics

These metrics track individual job bandwidth characteristics:

ffrtmp_job_last_input_bytes

Type: Gauge
Description: Size of input file from the last completed job (in bytes)
Labels: node_id

Example:

ffrtmp_job_last_input_bytes{node_id="worker-1:9091"} 52428800

ffrtmp_job_last_output_bytes

Type: Gauge
Description: Size of output file from the last completed job (in bytes)
Labels: node_id

Example:

ffrtmp_job_last_output_bytes{node_id="worker-1:9091"} 41943040

ffrtmp_job_last_bandwidth_mbps

Type: Gauge
Description: Bandwidth utilization for the last completed job, in megabits per second (Mbps)
Calculation: ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)
Labels: node_id

Example:

ffrtmp_job_last_bandwidth_mbps{node_id="worker-1:9091"} 15.32
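The calculation above can be sketched in Python (an illustrative helper, not part of ffrtmp; the worker computes this internally). Note the documented formula uses binary megabits (1024 * 1024 bits):

```python
def job_bandwidth_mbps(input_bytes: int, output_bytes: int, duration_seconds: float) -> float:
    """Per-job bandwidth in Mbps, per the documented formula:
    ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    total_bits = (input_bytes + output_bytes) * 8
    return total_bits / (duration_seconds * 1024 * 1024)

# A 50 MiB input plus a 40 MiB output over 62.5 seconds:
print(round(job_bandwidth_mbps(52428800, 41943040, 62.5), 2))  # → 11.52
```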

Cumulative Worker Metrics

These metrics track total bandwidth across all jobs on a worker:

ffrtmp_job_input_bytes_total

Type: Counter
Description: Total bytes read from input files across all jobs since worker startup
Labels: node_id

Example:

ffrtmp_job_input_bytes_total{node_id="worker-1:9091"} 524288000

Query rate of input data processing:

rate(ffrtmp_job_input_bytes_total[5m])

ffrtmp_job_output_bytes_total

Type: Counter
Description: Total bytes written to output files across all jobs since worker startup
Labels: node_id

Example:

ffrtmp_job_output_bytes_total{node_id="worker-1:9091"} 419430400

Query rate of output data generation:

rate(ffrtmp_job_output_bytes_total[5m])

ffrtmp_worker_bandwidth_utilization

Type: Gauge
Description: Overall worker bandwidth utilization as a percentage (0-100)
Calculation: Normalized metric based on recent job bandwidth activity
Labels: node_id

Example:

ffrtmp_worker_bandwidth_utilization{node_id="worker-1:9091"} 45.2

Job Result Metrics

Bandwidth metrics are also included in job results returned by the worker:

{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {
    "duration": 62.5,
    "input_file_bytes": 52428800,
    "output_file_bytes": 41943040,
    "bandwidth_mbps": 12.08,
    "input_generation_duration_sec": 2.3,
    "input_file_size_bytes": 52428800
  }
}

Metric Fields

  • input_file_bytes: Size of input file processed (bytes)
  • output_file_bytes: Size of output file generated (bytes)
  • bandwidth_mbps: Bandwidth utilization for this job (Mbps)
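A client consuming the job-result payload can read these fields directly. A minimal sketch using the example result above (the surrounding code is illustrative; only the field names come from the payload):

```python
import json

# The example job result from above, as returned by the worker
result = json.loads("""{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {"duration": 62.5, "input_file_bytes": 52428800,
              "output_file_bytes": 41943040, "bandwidth_mbps": 12.08}
}""")

m = result["metrics"]
mb_in = m["input_file_bytes"] / (1024 * 1024)    # bytes → MiB
mb_out = m["output_file_bytes"] / (1024 * 1024)  # bytes → MiB
print(f"{result['job_id']}: {mb_in:.0f} MiB in, {mb_out:.0f} MiB out, "
      f"{m['bandwidth_mbps']} Mbps")
```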

Prometheus Queries

Basic Queries

Total data processed by all workers:

sum(ffrtmp_job_input_bytes_total)

Total data generated by all workers:

sum(ffrtmp_job_output_bytes_total)

Average bandwidth per worker:

avg(ffrtmp_job_last_bandwidth_mbps)

Peak bandwidth utilization:

max(ffrtmp_job_last_bandwidth_mbps)

Rate Calculations

Input data processing rate (MB/s) over 5 minutes:

rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)

Output data generation rate (MB/s) over 5 minutes:

rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)

Total I/O rate per worker:

(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)

Compression Ratio

Calculate average compression ratio:

(
  sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
) / sum(ffrtmp_job_input_bytes_total) * 100

Per-worker compression ratio:

(
  ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
) / ffrtmp_job_input_bytes_total * 100
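The same ratio can be checked offline against per-job byte counts (illustrative helper; the function name is my own):

```python
def compression_ratio_pct(input_bytes: int, output_bytes: int) -> float:
    """Percentage of input data eliminated by transcoding.
    Mirrors the PromQL: (input - output) / input * 100."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return (input_bytes - output_bytes) / input_bytes * 100

# Using the example job above: 50 MiB in, 40 MiB out
print(round(compression_ratio_pct(52428800, 41943040), 2))  # → 20.0
```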

Network Capacity Planning

Predict bandwidth requirements for scaling:

# Average Mbps per active job
avg(ffrtmp_job_last_bandwidth_mbps / ffrtmp_worker_active_jobs)

Bandwidth headroom (assuming 1 Gbps network):

1000 - sum(ffrtmp_job_last_bandwidth_mbps)

Historical Analysis

Bandwidth trend over 24 hours:

avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])

Peak bandwidth in last hour:

max_over_time(ffrtmp_job_last_bandwidth_mbps[1h])

Total data processed in last 24 hours:

increase(ffrtmp_job_input_bytes_total[24h])

Grafana Dashboards

Bandwidth Overview Panel

Create a graph panel with these queries:

- expr: rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)
  legend: "Input Rate (MB/s) - {{node_id}}"

- expr: rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)
  legend: "Output Rate (MB/s) - {{node_id}}"

- expr: ffrtmp_job_last_bandwidth_mbps
  legend: "Current Bandwidth (Mbps) - {{node_id}}"

Compression Efficiency Panel

Single stat panel showing compression ratio:

- expr: |
    (
      sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
    ) / sum(ffrtmp_job_input_bytes_total) * 100
  legend: "Compression Ratio (%)"

Data Transfer Volume Panel

Stat panels showing cumulative totals:

- expr: sum(ffrtmp_job_input_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Input (GB)"

- expr: sum(ffrtmp_job_output_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Output (GB)"

Worker Bandwidth Utilization Heatmap

Table panel with per-worker breakdown:

- expr: |
    sum by (node_id) (
      rate(ffrtmp_job_input_bytes_total[5m]) + 
      rate(ffrtmp_job_output_bytes_total[5m])
    ) / (1024 * 1024)
  legend: "{{node_id}}"

Alerting Rules

High Bandwidth Utilization

Alert when worker bandwidth exceeds 80% of capacity:

- alert: HighBandwidthUtilization
  expr: ffrtmp_worker_bandwidth_utilization > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High bandwidth utilization on {{$labels.node_id}}"
    description: "Worker {{$labels.node_id}} is using {{$value}}% of bandwidth capacity for over 10 minutes"

Bandwidth Spike

Alert on sudden bandwidth increases:

- alert: BandwidthSpike
  expr: |
    rate(ffrtmp_job_input_bytes_total[1m]) > 
    avg_over_time(rate(ffrtmp_job_input_bytes_total[1m])[5m:1m]) * 2
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Bandwidth spike detected on {{$labels.node_id}}"
    description: "Input bandwidth is 2x above 5-minute average"

Low Compression Efficiency

Alert when compression ratio drops below expected:

- alert: LowCompressionRatio
  expr: |
    (
      ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
    ) / ffrtmp_job_input_bytes_total * 100 < 10
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Low compression ratio on {{$labels.node_id}}"
    description: "Compression ratio is {{$value}}%, below 10% threshold"

Performance Optimization

Network Bandwidth

Best Practices:

  1. Co-locate workers with storage: Minimize network hops between input/output storage and workers
  2. Use high-bandwidth networks: 10 Gbps recommended for high-throughput workloads
  3. Monitor network saturation: Track ffrtmp_worker_bandwidth_utilization to avoid bottlenecks
  4. Consider bandwidth limits: Use resource limits to cap bandwidth per job if needed

Network Requirements by Resolution:

Resolution   Input Bitrate   Output Bitrate   Total Bandwidth
720p         5-10 Mbps       3-5 Mbps         8-15 Mbps
1080p        10-20 Mbps      5-10 Mbps        15-30 Mbps
4K           50-100 Mbps     25-50 Mbps       75-150 Mbps

Storage I/O

Optimize file access:

  1. Use local storage for input/output: Avoid network filesystems (NFS, CIFS) for temporary files
  2. SSD for temporary files: Fast I/O reduces job latency
  3. Dedicated mount points: Separate /tmp or /var/transcoding for worker I/O
  4. Monitor disk bandwidth: Use iostat to track disk utilization

Disk I/O Requirements:

  • Sequential read: 500+ MB/s (SSD recommended)
  • Sequential write: 500+ MB/s (SSD recommended)
  • Random I/O: Less critical for video files (mostly sequential)

Memory Caching

FFmpeg and GStreamer benefit from file system caching:

# Check cache usage
free -h

# Tune cache behavior (run as root; drop_caches discards existing caches)
echo 3 > /proc/sys/vm/drop_caches   # Clear page, dentry, and inode caches
sysctl -w vm.vfs_cache_pressure=50  # Favor cache retention (default: 100)

Capacity Planning

Bandwidth Capacity Formula

To calculate required bandwidth for N concurrent jobs:

Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin

Where:
- N = max_concurrent_jobs per worker
- Avg_Job_Bandwidth = typical Mbps per job (from metrics)
- Safety_Margin = 1.5 (50% headroom)

Example:

  • 4 concurrent jobs per worker
  • Average 20 Mbps per job
  • Safety margin: 1.5x
Total = 4 * 20 * 1.5 = 120 Mbps

Recommendation: 1 Gbps network interface (provides 8x headroom)
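The formula can be wrapped in a small helper for planning scripts (illustrative only; names are my own):

```python
def required_bandwidth_mbps(concurrent_jobs: int, avg_job_mbps: float,
                            safety_margin: float = 1.5) -> float:
    """Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin."""
    return concurrent_jobs * avg_job_mbps * safety_margin

# The worked example: 4 concurrent jobs * 20 Mbps * 1.5 safety margin
print(required_bandwidth_mbps(4, 20))  # → 120.0
```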

Storage Capacity Formula

To calculate storage needs:

Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days

Where:
- Daily_Jobs = jobs processed per day
- Avg_Input_Size = average input file size (GB)
- Retention_Days = how long to keep files

Example:

  • 1000 jobs per day
  • Average 500 MB input size
  • 7 days retention
Storage = 1000 * 0.5 * 7 = 3500 GB (3.5 TB)
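As with bandwidth, this is simple enough to express as a helper (illustrative; names are my own):

```python
def storage_required_gb(daily_jobs: int, avg_input_gb: float,
                        retention_days: int) -> float:
    """Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days."""
    return daily_jobs * avg_input_gb * retention_days

# The worked example: 1000 jobs/day * 0.5 GB * 7 days retention
print(storage_required_gb(1000, 0.5, 7))  # → 3500.0
```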

Troubleshooting

High Bandwidth Usage

Symptom: ffrtmp_worker_bandwidth_utilization constantly > 80%

Diagnosis:

# Check which workers are saturated
ffrtmp_worker_bandwidth_utilization > 80

# Check job bandwidth
topk(10, ffrtmp_job_last_bandwidth_mbps)

Solutions:

  1. Reduce max-concurrent-jobs per worker
  2. Add more workers to distribute load
  3. Upgrade network infrastructure
  4. Use lower bitrate encoding presets

Low Bandwidth (Performance Issue)

Symptom: Jobs take longer than expected, but bandwidth is low

Diagnosis:

# Check if CPU-bound instead of I/O-bound
ffrtmp_worker_cpu_usage > 90 and ffrtmp_worker_bandwidth_utilization < 50

Solutions:

  1. CPU bottleneck - add more CPU cores or workers
  2. Use hardware encoding (NVENC, QSV, VAAPI)
  3. Adjust encoder presets (faster preset = less CPU)

Inconsistent Bandwidth

Symptom: Bandwidth varies wildly between jobs

Diagnosis:

# Check standard deviation
stddev_over_time(ffrtmp_job_last_bandwidth_mbps[1h])

Solutions:

  1. Normalize input files (consistent bitrate/resolution)
  2. Use resource limits to cap bandwidth per job
  3. Investigate storage I/O issues (slow disk)

Zero Bandwidth Metrics

Symptom: Metrics show 0 bytes for input/output

Causes:

  1. RTMP streaming mode (no file I/O, only network streaming)
  2. Input/output paths not specified in job parameters
  3. Files cleaned up before metrics collection

Verification:

# Check if files exist after job
ls -lh /tmp/*.mp4

# Check job parameters
curl http://master:8080/api/v1/jobs/{job_id}

Integration Examples

Python Script for Bandwidth Analysis

import requests
import time

PROMETHEUS_URL = "http://localhost:9090"

def get_bandwidth_metrics():
    """Return total worker I/O rate (MB/s) from Prometheus, or 0.0 if no data."""
    query = "sum(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)"
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    data = response.json()

    if data["status"] == "success" and data["data"]["result"]:
        return float(data["data"]["result"][0]["value"][1])
    return 0.0

def monitor_bandwidth(interval=10, threshold=100):
    """Monitor bandwidth and alert if threshold exceeded"""
    while True:
        mbps = get_bandwidth_metrics()
        print(f"Current bandwidth: {mbps:.2f} MB/s")
        
        if mbps > threshold:
            print(f"WARNING: Bandwidth {mbps:.2f} MB/s exceeds threshold {threshold} MB/s")
        
        time.sleep(interval)

if __name__ == "__main__":
    monitor_bandwidth(interval=30, threshold=100)

Bash Script for Capacity Check

#!/bin/bash
# Check if bandwidth capacity is available

PROMETHEUS_URL="http://localhost:9090"
MAX_BANDWIDTH_MBPS=1000  # 1 Gbps
THRESHOLD=80  # 80% utilization

# Get current bandwidth utilization
QUERY="sum(ffrtmp_worker_bandwidth_utilization)"
CURRENT=$(curl -s -G "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${QUERY}" | jq -r '.data.result[0].value[1] // "0"')

echo "Current bandwidth utilization: ${CURRENT}%"

if (( $(echo "$CURRENT > $THRESHOLD" | bc -l) )); then
    echo "WARNING: Bandwidth utilization above ${THRESHOLD}%"
    exit 1
else
    echo "Bandwidth capacity available"
    exit 0
fi

Best Practices

1. Baseline Your Workload

Before scaling, establish bandwidth baseline:

# Run for 24 hours, then query:
avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])

2. Set Bandwidth Budgets

Define per-job bandwidth limits in job parameters (future feature):

{
  "resource_limits": {
    "max_bandwidth_mbps": 50
  }
}

3. Monitor Trends

Track bandwidth trends weekly:

avg_over_time(rate(ffrtmp_job_input_bytes_total[1d])[7d:1d]) / (1024 * 1024)

4. Plan for Peak Load

Calculate peak bandwidth needs:

max_over_time(sum(ffrtmp_job_last_bandwidth_mbps)[7d:1h])

5. Optimize Encoding Settings

Lower bitrate = lower bandwidth:

{
  "bitrate": "2M",    // 2 Mbps output
  "preset": "fast"    // Less CPU, more bandwidth
}

Related Documentation

Support

For issues or questions about bandwidth metrics:

  1. Check Prometheus metrics endpoint: http://worker:9091/metrics
  2. Verify job parameters include input and output paths
  3. Review worker logs for file size detection
  4. Check GitHub issues for known problems