
Bandwidth Metrics Guide

This document describes the bandwidth tracking and monitoring capabilities in the FFmpeg-RTMP distributed transcoding system.

Overview

Bandwidth metrics track the data throughput of transcoding jobs, measuring both input file consumption and output file generation. These metrics help you:

  • Resource allocation: understand network and storage I/O requirements
  • Capacity planning: predict bandwidth needs for scaling
  • Cost analysis: track data-transfer volumes for cloud deployments
  • Performance monitoring: identify bottlenecks in data pipelines
  • SLA tracking: ensure throughput meets service-level agreements

Available Metrics

Per-Job Metrics

These metrics track individual job bandwidth characteristics:

ffrtmp_job_last_input_bytes

Type: Gauge
Description: Size of input file from the last completed job (in bytes)
Labels: node_id

Example:

ffrtmp_job_last_input_bytes{node_id="worker-1:9091"} 52428800

ffrtmp_job_last_output_bytes

Type: Gauge
Description: Size of output file from the last completed job (in bytes)
Labels: node_id

Example:

ffrtmp_job_last_output_bytes{node_id="worker-1:9091"} 41943040

ffrtmp_job_last_bandwidth_mbps

Type: Gauge
Description: Bandwidth utilization for the last completed job, in megabits per second (Mbps)
Calculation: ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)
Labels: node_id

Example:

ffrtmp_job_last_bandwidth_mbps{node_id="worker-1:9091"} 15.32
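The calculation above can be sketched in Python (an illustrative helper, not part of ffrtmp; the worker computes this internally). Note the documented formula uses binary megabits (1024 * 1024 bits):

```python
def job_bandwidth_mbps(input_bytes: int, output_bytes: int, duration_seconds: float) -> float:
    """Per-job bandwidth in Mbps, per the documented formula:
    ((input_bytes + output_bytes) * 8) / (duration_seconds * 1024 * 1024)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    total_bits = (input_bytes + output_bytes) * 8
    return total_bits / (duration_seconds * 1024 * 1024)

# A 50 MiB input plus a 40 MiB output over 62.5 seconds:
print(round(job_bandwidth_mbps(52428800, 41943040, 62.5), 2))  # → 11.52
```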

Cumulative Worker Metrics

These metrics track total bandwidth across all jobs on a worker:

ffrtmp_job_input_bytes_total

Type: Counter
Description: Total bytes read from input files across all jobs since worker startup
Labels: node_id

Example:

ffrtmp_job_input_bytes_total{node_id="worker-1:9091"} 524288000

Query rate of input data processing:

rate(ffrtmp_job_input_bytes_total[5m])

ffrtmp_job_output_bytes_total

Type: Counter
Description: Total bytes written to output files across all jobs since worker startup
Labels: node_id

Example:

ffrtmp_job_output_bytes_total{node_id="worker-1:9091"} 419430400

Query rate of output data generation:

rate(ffrtmp_job_output_bytes_total[5m])

ffrtmp_worker_bandwidth_utilization

Type: Gauge
Description: Overall worker bandwidth utilization as a percentage (0-100)
Calculation: Normalized metric based on recent job bandwidth activity
Labels: node_id

Example:

ffrtmp_worker_bandwidth_utilization{node_id="worker-1:9091"} 45.2

Job Result Metrics

Bandwidth metrics are also included in job results returned by the worker:

{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {
    "duration": 62.5,
    "input_file_bytes": 52428800,
    "output_file_bytes": 41943040,
    "bandwidth_mbps": 12.08,
    "input_generation_duration_sec": 2.3,
    "input_file_size_bytes": 52428800
  }
}

Metric Fields

  • input_file_bytes: Size of input file processed (bytes)
  • output_file_bytes: Size of output file generated (bytes)
  • bandwidth_mbps: Bandwidth utilization for this job (Mbps)
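A client consuming the job-result payload can read these fields directly. A minimal sketch using the example result above (the surrounding code is illustrative; only the field names come from the payload):

```python
import json

# The example job result from above, as returned by the worker
result = json.loads("""{
  "job_id": "job-123",
  "status": "completed",
  "metrics": {"duration": 62.5, "input_file_bytes": 52428800,
              "output_file_bytes": 41943040, "bandwidth_mbps": 12.08}
}""")

m = result["metrics"]
mb_in = m["input_file_bytes"] / (1024 * 1024)    # bytes → MiB
mb_out = m["output_file_bytes"] / (1024 * 1024)  # bytes → MiB
print(f"{result['job_id']}: {mb_in:.0f} MiB in, {mb_out:.0f} MiB out, "
      f"{m['bandwidth_mbps']} Mbps")
```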

Prometheus Queries

Basic Queries

Total data processed by all workers:

sum(ffrtmp_job_input_bytes_total)

Total data generated by all workers:

sum(ffrtmp_job_output_bytes_total)

Average bandwidth per worker:

avg(ffrtmp_job_last_bandwidth_mbps)

Peak bandwidth utilization:

max(ffrtmp_job_last_bandwidth_mbps)

Rate Calculations

Input data processing rate (MB/s) over 5 minutes:

rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)

Output data generation rate (MB/s) over 5 minutes:

rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)

Total I/O rate per worker:

(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)

Compression Ratio

Calculate average compression ratio:

(
  sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
) / sum(ffrtmp_job_input_bytes_total) * 100

Per-worker compression ratio:

(
  ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
) / ffrtmp_job_input_bytes_total * 100
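The same ratio can be checked offline against per-job byte counts (illustrative helper; the function name is my own):

```python
def compression_ratio_pct(input_bytes: int, output_bytes: int) -> float:
    """Percentage of input data eliminated by transcoding.
    Mirrors the PromQL: (input - output) / input * 100."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return (input_bytes - output_bytes) / input_bytes * 100

# Using the example job above: 50 MiB in, 40 MiB out
print(round(compression_ratio_pct(52428800, 41943040), 2))  # → 20.0
```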

Network Capacity Planning

Predict bandwidth requirements for scaling:

# Average Mbps per active job
avg(ffrtmp_job_last_bandwidth_mbps / ffrtmp_worker_active_jobs)

Bandwidth headroom (assuming 1 Gbps network):

1000 - sum(ffrtmp_job_last_bandwidth_mbps)

Historical Analysis

Bandwidth trend over 24 hours:

avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])

Peak bandwidth in last hour:

max_over_time(ffrtmp_job_last_bandwidth_mbps[1h])

Total data processed in last 24 hours:

increase(ffrtmp_job_input_bytes_total[24h])

Grafana Dashboards

Bandwidth Overview Panel

Create a graph panel with these queries:

- expr: rate(ffrtmp_job_input_bytes_total[5m]) / (1024 * 1024)
  legend: "Input Rate (MB/s) - {{node_id}}"

- expr: rate(ffrtmp_job_output_bytes_total[5m]) / (1024 * 1024)
  legend: "Output Rate (MB/s) - {{node_id}}"

- expr: ffrtmp_job_last_bandwidth_mbps
  legend: "Current Bandwidth (Mbps) - {{node_id}}"

Compression Efficiency Panel

Single stat panel showing compression ratio:

- expr: |
    (
      sum(ffrtmp_job_input_bytes_total) - sum(ffrtmp_job_output_bytes_total)
    ) / sum(ffrtmp_job_input_bytes_total) * 100
  legend: "Compression Ratio (%)"

Data Transfer Volume Panel

Stat panels showing cumulative totals:

- expr: sum(ffrtmp_job_input_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Input (GB)"

- expr: sum(ffrtmp_job_output_bytes_total) / (1024 * 1024 * 1024)
  legend: "Total Output (GB)"

Worker Bandwidth Utilization Heatmap

Table panel with per-worker breakdown:

- expr: |
    sum by (node_id) (
      rate(ffrtmp_job_input_bytes_total[5m]) + 
      rate(ffrtmp_job_output_bytes_total[5m])
    ) / (1024 * 1024)
  legend: "{{node_id}}"

Alerting Rules

High Bandwidth Utilization

Alert when worker bandwidth exceeds 80% of capacity:

- alert: HighBandwidthUtilization
  expr: ffrtmp_worker_bandwidth_utilization > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High bandwidth utilization on {{$labels.node_id}}"
    description: "Worker {{$labels.node_id}} is using {{$value}}% of bandwidth capacity for over 10 minutes"

Bandwidth Spike

Alert on sudden bandwidth increases:

- alert: BandwidthSpike
  expr: |
    rate(ffrtmp_job_input_bytes_total[1m]) > 
    avg_over_time(rate(ffrtmp_job_input_bytes_total[1m])[5m:1m]) * 2
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Bandwidth spike detected on {{$labels.node_id}}"
    description: "Input bandwidth is 2x above 5-minute average"

Low Compression Efficiency

Alert when compression ratio drops below expected:

- alert: LowCompressionRatio
  expr: |
    (
      ffrtmp_job_input_bytes_total - ffrtmp_job_output_bytes_total
    ) / ffrtmp_job_input_bytes_total * 100 < 10
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Low compression ratio on {{$labels.node_id}}"
    description: "Compression ratio is {{$value}}%, below 10% threshold"

Performance Optimization

Network Bandwidth

Best Practices:

  1. Co-locate workers with storage: Minimize network hops between input/output storage and workers
  2. Use high-bandwidth networks: 10 Gbps recommended for high-throughput workloads
  3. Monitor network saturation: Track ffrtmp_worker_bandwidth_utilization to avoid bottlenecks
  4. Consider bandwidth limits: Use resource limits to cap bandwidth per job if needed

Network Requirements by Resolution:

Resolution   Input Bitrate   Output Bitrate   Total Bandwidth
720p         5-10 Mbps       3-5 Mbps         8-15 Mbps
1080p        10-20 Mbps      5-10 Mbps        15-30 Mbps
4K           50-100 Mbps     25-50 Mbps       75-150 Mbps

Storage I/O

Optimize file access:

  1. Use local storage for input/output: Avoid network filesystems (NFS, CIFS) for temporary files
  2. SSD for temporary files: Fast I/O reduces job latency
  3. Dedicated mount points: Separate /tmp or /var/transcoding for worker I/O
  4. Monitor disk bandwidth: Use iostat to track disk utilization

Disk I/O Requirements:

  • Sequential read: 500+ MB/s (SSD recommended)
  • Sequential write: 500+ MB/s (SSD recommended)
  • Random I/O: Less critical for video files (mostly sequential)

Memory Caching

FFmpeg and GStreamer benefit from file system caching:

# Check cache usage
free -h

# Tune cache behavior (run as root; drop_caches discards existing caches)
echo 3 > /proc/sys/vm/drop_caches   # Clear page, dentry, and inode caches
sysctl -w vm.vfs_cache_pressure=50  # Favor cache retention (default: 100)

Capacity Planning

Bandwidth Capacity Formula

To calculate required bandwidth for N concurrent jobs:

Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin

Where:
- N = max_concurrent_jobs per worker
- Avg_Job_Bandwidth = typical Mbps per job (from metrics)
- Safety_Margin = 1.5 (50% headroom)

Example:

  • 4 concurrent jobs per worker
  • Average 20 Mbps per job
  • Safety margin: 1.5x
Total = 4 * 20 * 1.5 = 120 Mbps

Recommendation: 1 Gbps network interface (provides 8x headroom)
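The formula can be wrapped in a small helper for planning scripts (illustrative only; names are my own):

```python
def required_bandwidth_mbps(concurrent_jobs: int, avg_job_mbps: float,
                            safety_margin: float = 1.5) -> float:
    """Total_Bandwidth = N * Avg_Job_Bandwidth * Safety_Margin."""
    return concurrent_jobs * avg_job_mbps * safety_margin

# The worked example: 4 concurrent jobs * 20 Mbps * 1.5 safety margin
print(required_bandwidth_mbps(4, 20))  # → 120.0
```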

Storage Capacity Formula

To calculate storage needs:

Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days

Where:
- Daily_Jobs = jobs processed per day
- Avg_Input_Size = average input file size (GB)
- Retention_Days = how long to keep files

Example:

  • 1000 jobs per day
  • Average 500 MB input size
  • 7 days retention
Storage = 1000 * 0.5 * 7 = 3500 GB (3.5 TB)
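As with bandwidth, this is simple enough to express as a helper (illustrative; names are my own):

```python
def storage_required_gb(daily_jobs: int, avg_input_gb: float,
                        retention_days: int) -> float:
    """Storage_Required = Daily_Jobs * Avg_Input_Size * Retention_Days."""
    return daily_jobs * avg_input_gb * retention_days

# The worked example: 1000 jobs/day * 0.5 GB * 7 days retention
print(storage_required_gb(1000, 0.5, 7))  # → 3500.0
```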

Troubleshooting

High Bandwidth Usage

Symptom: ffrtmp_worker_bandwidth_utilization constantly > 80%

Diagnosis:

# Check which workers are saturated
ffrtmp_worker_bandwidth_utilization > 80

# Check job bandwidth
topk(10, ffrtmp_job_last_bandwidth_mbps)

Solutions:

  1. Reduce max-concurrent-jobs per worker
  2. Add more workers to distribute load
  3. Upgrade network infrastructure
  4. Use lower bitrate encoding presets

Low Bandwidth (Performance Issue)

Symptom: Jobs take longer than expected, but bandwidth is low

Diagnosis:

# Check if CPU-bound instead of I/O-bound
ffrtmp_worker_cpu_usage > 90 and ffrtmp_worker_bandwidth_utilization < 50

Solutions:

  1. CPU bottleneck - add more CPU cores or workers
  2. Use hardware encoding (NVENC, QSV, VAAPI)
  3. Adjust encoder presets (faster preset = less CPU)

Inconsistent Bandwidth

Symptom: Bandwidth varies wildly between jobs

Diagnosis:

# Check standard deviation
stddev_over_time(ffrtmp_job_last_bandwidth_mbps[1h])

Solutions:

  1. Normalize input files (consistent bitrate/resolution)
  2. Use resource limits to cap bandwidth per job
  3. Investigate storage I/O issues (slow disk)

Zero Bandwidth Metrics

Symptom: Metrics show 0 bytes for input/output

Causes:

  1. RTMP streaming mode (no file I/O, only network streaming)
  2. Input/output paths not specified in job parameters
  3. Files cleaned up before metrics collection

Verification:

# Check if files exist after job
ls -lh /tmp/*.mp4

# Check job parameters
curl http://master:8080/api/v1/jobs/{job_id}

Integration Examples

Python Script for Bandwidth Analysis

import requests
import time

PROMETHEUS_URL = "http://localhost:9090"

def get_bandwidth_metrics():
    """Return total worker I/O rate (MB/s) from Prometheus, or 0.0 if no data."""
    query = "sum(rate(ffrtmp_job_input_bytes_total[5m]) + rate(ffrtmp_job_output_bytes_total[5m])) / (1024 * 1024)"
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    data = response.json()

    if data["status"] == "success" and data["data"]["result"]:
        return float(data["data"]["result"][0]["value"][1])
    return 0.0

def monitor_bandwidth(interval=10, threshold=100):
    """Monitor bandwidth and alert if threshold exceeded"""
    while True:
        mbps = get_bandwidth_metrics()
        print(f"Current bandwidth: {mbps:.2f} MB/s")
        
        if mbps > threshold:
            print(f"WARNING: Bandwidth {mbps:.2f} MB/s exceeds threshold {threshold} MB/s")
        
        time.sleep(interval)

if __name__ == "__main__":
    monitor_bandwidth(interval=30, threshold=100)

Bash Script for Capacity Check

#!/bin/bash
# Check if bandwidth capacity is available

PROMETHEUS_URL="http://localhost:9090"
MAX_BANDWIDTH_MBPS=1000  # 1 Gbps
THRESHOLD=80  # 80% utilization

# Get current bandwidth utilization
QUERY="sum(ffrtmp_worker_bandwidth_utilization)"
CURRENT=$(curl -s -G "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${QUERY}" | jq -r '.data.result[0].value[1] // "0"')

echo "Current bandwidth utilization: ${CURRENT}%"

if (( $(echo "$CURRENT > $THRESHOLD" | bc -l) )); then
    echo "WARNING: Bandwidth utilization above ${THRESHOLD}%"
    exit 1
else
    echo "Bandwidth capacity available"
    exit 0
fi

Best Practices

1. Baseline Your Workload

Before scaling, establish bandwidth baseline:

# Run for 24 hours, then query:
avg_over_time(ffrtmp_job_last_bandwidth_mbps[24h])

2. Set Bandwidth Budgets

Define per-job bandwidth limits in job parameters (future feature):

{
  "resource_limits": {
    "max_bandwidth_mbps": 50
  }
}

3. Monitor Trends

Track bandwidth trends weekly:

avg_over_time(rate(ffrtmp_job_input_bytes_total[1d])[7d:1d]) / (1024 * 1024)

4. Plan for Peak Load

Calculate peak bandwidth needs:

max_over_time(sum(ffrtmp_job_last_bandwidth_mbps)[7d:1h])

5. Optimize Encoding Settings

Lower bitrate = lower bandwidth:

{
  "bitrate": "2M",    // 2 Mbps output
  "preset": "fast"    // Less CPU, more bandwidth
}

Related Documentation

Support

For issues or questions about bandwidth metrics:

  1. Check Prometheus metrics endpoint: http://worker:9091/metrics
  2. Verify job parameters include input and output paths
  3. Review worker logs for file size detection
  4. Check GitHub issues for known problems