Status: Complete
Date: 2026-01-06
Architecture: Swedish principles - boring, correct, non-reactive
This document describes the production-grade features added to the FFmpeg RTMP distributed transcoding system. All features follow strict architectural principles:
- Retries apply to messages, not work (transport only, never workload)
- Graceful shutdown (let jobs finish, no killing)
- Minimal, correct visibility (derived, not driving)
- Centralized logging (/var/log/ffrtmp structure)
Files Modified:
- shared/pkg/agent/client.go (added retry.Config and retry.Do wrapping)
Scope:
- SendHeartbeat() - Retry heartbeat delivery to master
- GetNextJob() - Retry job polling from master
- SendResults() - Retry result delivery to master
- NEVER retry job execution
- NEVER retry wrapper actions
- NEVER retry FFmpeg workloads
retry.Config{
    MaxRetries:     3,
    InitialBackoff: 1 * time.Second,
    MaxBackoff:     30 * time.Second,
    Multiplier:     2.0,
}
Retried errors:
- Connection refused
- Connection timeout
- HTTP 502, 503, 504 (transient server errors)
- EOF, broken pipe
- Context cancellation stops retries immediately
"Retries apply to messages, not work"
If you can phrase an operation as "sending a message" or "checking for messages", retries are allowed. If it's "doing work" or "executing a task", NO retries.
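As a rough illustration of the rule, the sketch below retries a "send a message" function with exponential backoff and gives up as soon as the context is cancelled. It is not the project's retry package: `Config` only mirrors the fields listed above, and `Do`, `IsTransient`, and `HTTPStatusError` are hypothetical names chosen for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"syscall"
	"time"
)

// Config mirrors the fields shown above; the type is a stand-in, not the real retry.Config.
type Config struct {
	MaxRetries     int
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
	Multiplier     float64
}

// HTTPStatusError is a hypothetical error type carrying an HTTP response status.
type HTTPStatusError struct{ Code int }

func (e *HTTPStatusError) Error() string { return fmt.Sprintf("http status %d", e.Code) }

// IsTransient reports whether err looks like a transport failure worth retrying.
func IsTransient(err error) bool {
	var statusErr *HTTPStatusError
	if errors.As(err, &statusErr) {
		return statusErr.Code == 502 || statusErr.Code == 503 || statusErr.Code == 504
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return errors.Is(err, syscall.ECONNREFUSED) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF)
}

// Do calls send up to MaxRetries+1 times, sleeping between attempts with
// exponential backoff capped at MaxBackoff. It must only wrap message
// delivery, never job execution.
func Do(ctx context.Context, cfg Config, send func(context.Context) error) error {
	backoff := cfg.InitialBackoff
	var err error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		if err = send(ctx); err == nil {
			return nil
		}
		if !IsTransient(err) || attempt == cfg.MaxRetries {
			return err
		}
		select {
		case <-ctx.Done(): // cancellation stops retries immediately
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff = time.Duration(float64(backoff) * cfg.Multiplier)
		if backoff > cfg.MaxBackoff {
			backoff = cfg.MaxBackoff
		}
	}
	return err
}

// demo simulates a heartbeat that fails twice with 503 and then succeeds.
func demo() (int, error) {
	attempts := 0
	cfg := Config{MaxRetries: 3, InitialBackoff: time.Millisecond, MaxBackoff: 10 * time.Millisecond, Multiplier: 2.0}
	err := Do(context.Background(), cfg, func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			return &HTTPStatusError{Code: 503}
		}
		return nil
	})
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The demo uses millisecond backoffs so it finishes quickly; with the configuration listed above the waits would be 1s, 2s, and 4s.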
Files Modified:
- shared/pkg/shutdown/shutdown.go (enhanced with Done() channel)
- master/cmd/master/main.go (LIFO shutdown order)
- worker/cmd/agent/main.go (wait for jobs, bounded timeout)
Worker shutdown sequence:
- Receive SIGTERM/SIGINT
- Close Done() channel
- Stop accepting new jobs (break out of polling loop)
- Wait for active jobs to complete (30-second timeout)
- Stop heartbeat loop
- Execute shutdown handlers (metrics server → logger)
- Exit cleanly
Master shutdown sequence:
- Receive SIGTERM/SIGINT
- Close Done() channel
- Execute shutdown handlers in LIFO order:
  - Close logger
  - Stop HTTP server (30s graceful)
  - Stop metrics server
  - Stop scheduler
  - Stop cleanup manager
  - Close database connection
- Exit cleanly
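A LIFO handler registry with a Done() channel could look roughly like the sketch below. `Manager`, `Register`, and `Shutdown` are hypothetical names for illustration and are not taken from shared/pkg/shutdown; the sketch also assumes Shutdown is called exactly once.

```go
package main

import "fmt"

// handler pairs a name with a cleanup function.
type handler struct {
	name string
	fn   func() error
}

// Manager is a hypothetical stand-in for the project's shutdown package.
type Manager struct {
	done     chan struct{}
	handlers []handler
}

func NewManager() *Manager { return &Manager{done: make(chan struct{})} }

// Done returns a channel that is closed once shutdown begins, so loops can
// select on it and stop taking new work.
func (m *Manager) Done() <-chan struct{} { return m.done }

// Register adds a cleanup handler. Handlers registered later run earlier.
func (m *Manager) Register(name string, fn func() error) {
	m.handlers = append(m.handlers, handler{name: name, fn: fn})
}

// Shutdown closes Done and then runs the handlers in reverse registration
// order. A failing handler is reported but does not stop the remaining ones.
// It returns the handler names in the order they were executed.
// Calling it twice would panic on the second close; a real version might
// guard this with sync.Once.
func (m *Manager) Shutdown() []string {
	close(m.done)
	order := make([]string, 0, len(m.handlers))
	for i := len(m.handlers) - 1; i >= 0; i-- {
		h := m.handlers[i]
		if err := h.fn(); err != nil {
			fmt.Printf("shutdown handler %q failed: %v\n", h.name, err)
		}
		order = append(order, h.name)
	}
	return order
}

func main() {
	m := NewManager()
	noop := func() error { return nil }
	// Registered in startup order, so teardown happens in the opposite order.
	m.Register("database", noop)
	m.Register("scheduler", noop)
	m.Register("http server", noop)
	fmt.Println(m.Shutdown())
}
```

Registering each component as it starts gives the reverse teardown order for free, which is why the database, started first, is closed last.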
**Allowed:**
- Stop accepting new jobs
- Let running jobs finish
- Bounded wait (30 seconds)
- Emit final JobResult
**NEVER:**
- Kill workloads to speed up shutdown
- Change workload behavior
- Force-terminate FFmpeg processes
- Interrupt wrapper execution
Workloads are owned by the OS, not by the wrapper or agent. We govern exit, not execution.
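The bounded wait can be sketched with a sync.WaitGroup and a timer. This is a minimal illustration, not the agent's code: `waitForJobs` and `simulateShutdown` are hypothetical names, signal handling is left out, and the demo uses short durations instead of the 30-second limit.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitForJobs blocks until every job tracked by wg has finished or the
// timeout elapses. It returns true when all jobs completed in time.
// On timeout nothing is killed: the caller simply stops waiting.
func waitForJobs(wg *sync.WaitGroup, timeout time.Duration) bool {
	finished := make(chan struct{})
	go func() {
		wg.Wait()
		close(finished)
	}()
	select {
	case <-finished:
		return true
	case <-time.After(timeout):
		return false
	}
}

// simulateShutdown starts one fake job lasting jobMillis and then waits for
// it with a limit of timeoutMillis.
func simulateShutdown(jobMillis, timeoutMillis int) bool {
	var active sync.WaitGroup
	active.Add(1)
	go func() {
		defer active.Done()
		time.Sleep(time.Duration(jobMillis) * time.Millisecond)
	}()
	return waitForJobs(&active, time.Duration(timeoutMillis)*time.Millisecond)
}

func main() {
	fmt.Println("short job drained:", simulateShutdown(10, 1000))
	fmt.Println("long job drained:", simulateShutdown(1000, 50))
}
```

When the wait times out, the function only reports it. The job keeps running under the OS, which matches the rule that shutdown governs exit, not execution.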
Files Modified:
- worker/cmd/agent/main.go (enhanced /ready endpoint)
The /ready endpoint returns HTTP 200 if all checks pass, 503 otherwise:
exec.LookPath("ffmpeg")
- Returns "available" or "not_found"
resources.CheckDiskSpace("/tmp")
- Requires at least 10% free space
- Returns: "ok: X% used, Y MB available"
- Or: "low: X% used" (fails if >90% used)
client.SendHeartbeat() // with 5-second timeout
- Tests connectivity to master
- Returns "reachable", "unreachable", or "timeout"
- Gracefully degrades if not registered yet ("not_registered")
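One way to combine such checks into a single handler is sketched below. Only the FFmpeg check is real here; the other checks are stubbed, and the `Check` type, `readyHandler`, `probe`, and the "not_ready" status string are assumptions made for this example rather than the agent's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"time"
)

// Check returns a human-readable status string and whether the check passed.
type Check func() (string, bool)

type readyResponse struct {
	Status    string            `json:"status"`
	Checks    map[string]string `json:"checks"`
	Timestamp string            `json:"timestamp"`
}

// ffmpegCheck looks for an ffmpeg binary on PATH.
func ffmpegCheck() (string, bool) {
	if _, err := exec.LookPath("ffmpeg"); err != nil {
		return "not_found", false
	}
	return "available", true
}

// readyHandler runs every check and answers 200 only if all of them pass,
// otherwise 503. The per-check strings are always included in the body.
func readyHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := readyResponse{
			Status:    "ready",
			Checks:    map[string]string{},
			Timestamp: time.Now().UTC().Format(time.RFC3339),
		}
		code := http.StatusOK
		for name, check := range checks {
			msg, ok := check()
			resp.Checks[name] = msg
			if !ok {
				resp.Status = "not_ready" // assumed wording for the failing case
				code = http.StatusServiceUnavailable
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(resp)
	}
}

// probe runs the handler once in-process and returns the HTTP status code.
func probe(checks map[string]Check) int {
	rec := httptest.NewRecorder()
	readyHandler(checks)(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	checks := map[string]Check{
		"ffmpeg": ffmpegCheck,
		"master": func() (string, bool) { return "reachable", true }, // stubbed
	}
	fmt.Println("status code:", probe(checks))
}
```

Keeping each check as a plain function makes it easy to stub the master and disk checks in tests, as the `probe` helper does.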
Kubernetes Probes:
livenessProbe:
  httpGet:
    path: /health
    port: 9091
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 9091
  initialDelaySeconds: 5
  periodSeconds: 10
Response Format:
{
  "status": "ready",
  "checks": {
    "ffmpeg": "available",
    "disk_space": "ok: 45.2% used, 25600 MB available",
    "master": "reachable"
  },
  "timestamp": "2026-01-06T12:00:00Z"
}
Files Modified:
- master/cmd/master/main.go (migrated 56 log calls)
- worker/cmd/agent/main.go (already using logger)
- shared/pkg/logging/logger.go (NewFileLogger)
/var/log/ffrtmp/
├── master/
│   └── master.log      # Master server logs
├── worker/
│   └── agent.log       # Worker agent logs
└── wrapper/
    └── wrapper.log     # Wrapper execution logs
# Fallback if /var/log not writable:
./logs/
├── master/master.log
├── worker/agent.log
└── wrapper/wrapper.log
- Multi-writer: Logs to both file AND stdout
- Stdout captured by systemd: journald integration
- Auto-rotation: Manual rotation with RotateIfNeeded()
- Logrotate configs: 14-day retention, daily rotation
- Log levels: debug, info, warn, error, fatal
- Master: 56/56 log calls migrated to logger.Info/Error/Fatal
- Worker: Already using logger
- Wrapper: Uses report.LogSummary (correct - not reactive)
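The file-plus-stdout behavior with a fallback directory can be approximated with the standard library alone. The sketch below is not the project's NewFileLogger: `openLogFile`, `newLogger`, and `demo` are hypothetical helpers, and the demo runs in a temporary directory instead of touching /var/log/ffrtmp.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

// openLogFile tries each base directory in order and returns the first log
// file it manages to open for appending, e.g. /var/log/ffrtmp and then ./logs.
func openLogFile(component, name string, baseDirs ...string) (*os.File, error) {
	lastErr := fmt.Errorf("no base directories given")
	for _, base := range baseDirs {
		dir := filepath.Join(base, component)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			lastErr = err
			continue
		}
		f, err := os.OpenFile(filepath.Join(dir, name), os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			lastErr = err
			continue
		}
		return f, nil
	}
	return nil, fmt.Errorf("no writable log directory: %w", lastErr)
}

// newLogger writes every line to both the file and stdout, so systemd can
// still capture the output in the journal.
func newLogger(f *os.File) *log.Logger {
	return log.New(io.MultiWriter(f, os.Stdout), "", log.LstdFlags|log.LUTC)
}

// demo forces the first directory to be unusable and reports whether the
// fallback was chosen, plus what ended up in the log file.
func demo() (bool, string, error) {
	tmp, err := os.MkdirTemp("", "ffrtmp-logs")
	if err != nil {
		return false, "", err
	}
	defer os.RemoveAll(tmp)

	// A regular file cannot contain subdirectories, so this base always fails.
	blocked := filepath.Join(tmp, "blocked")
	if err := os.WriteFile(blocked, nil, 0o644); err != nil {
		return false, "", err
	}
	fallback := filepath.Join(tmp, "logs")

	f, err := openLogFile("master", "master.log", blocked, fallback)
	if err != nil {
		return false, "", err
	}
	newLogger(f).Println("master started")
	f.Close()

	data, err := os.ReadFile(f.Name())
	if err != nil {
		return false, "", err
	}
	usedFallback := filepath.Dir(filepath.Dir(f.Name())) == fallback
	return usedFallback, string(data), nil
}

func main() {
	usedFallback, logged, err := demo()
	fmt.Println("fallback used:", usedFallback, "bytes logged:", len(logged), "err:", err)
}
```

In a real service the call would presumably pass "/var/log/ffrtmp" first and "./logs" second, matching the layout above.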
Master:
- GET /metrics - Prometheus format
- GET /health - Liveness probe
Worker:
- GET /metrics - Worker Prometheus metrics
- GET /health - Liveness probe
- GET /ready - Readiness probe (enhanced)
- GET /wrapper/metrics - Wrapper-specific Prometheus metrics
- GET /violations - SLA violations JSON (last 50)
Layer 1: Immutable Job-Level Truth
type Result struct {
    StartTime   time.Time
    EndTime     time.Time
    ExitCode    int
    PlatformSLA bool // wrapper met its obligations
    Intent      string
}
Written ONCE per job, never updated.
Layer 2: Boring Counters Only
type Metrics struct {
    JobsTotal         atomic.Uint64
    JobsSuccess       atomic.Uint64
    JobsFailed        atomic.Uint64
    PlatformSLAMet    atomic.Uint64
    PlatformSLAFailed atomic.Uint64
}
No histograms, no clever interpretation.
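To show how Layer 2 can be derived from Layer 1 without feeding back into it, here is a small sketch. The `Record` and `WriteProm` methods are hypothetical, treating exit code 0 as success is an assumption, and the metric names are illustrative apart from wrapper_platform_sla_failed, which this document mentions in its alert list.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync/atomic"
	"time"
)

// Result repeats the Layer 1 struct so the example is self-contained.
type Result struct {
	StartTime   time.Time
	EndTime     time.Time
	ExitCode    int
	PlatformSLA bool
	Intent      string
}

// Metrics repeats the Layer 2 counters.
type Metrics struct {
	JobsTotal         atomic.Uint64
	JobsSuccess       atomic.Uint64
	JobsFailed        atomic.Uint64
	PlatformSLAMet    atomic.Uint64
	PlatformSLAFailed atomic.Uint64
}

// Record derives counters from a finished job's Result. Data flows one way:
// nothing in the wrapper reads these counters to make decisions.
func (m *Metrics) Record(r Result) {
	m.JobsTotal.Add(1)
	if r.ExitCode == 0 { // assumption: exit code 0 means the job succeeded
		m.JobsSuccess.Add(1)
	} else {
		m.JobsFailed.Add(1)
	}
	if r.PlatformSLA {
		m.PlatformSLAMet.Add(1)
	} else {
		m.PlatformSLAFailed.Add(1)
	}
}

// WriteProm writes the counters in the Prometheus text exposition format.
func (m *Metrics) WriteProm(w io.Writer) {
	counters := []struct {
		name  string
		value uint64
	}{
		{"wrapper_jobs_total", m.JobsTotal.Load()},
		{"wrapper_jobs_success", m.JobsSuccess.Load()},
		{"wrapper_jobs_failed", m.JobsFailed.Load()},
		{"wrapper_platform_sla_met", m.PlatformSLAMet.Load()},
		{"wrapper_platform_sla_failed", m.PlatformSLAFailed.Load()},
	}
	for _, c := range counters {
		fmt.Fprintf(w, "# TYPE %s counter\n%s %d\n", c.name, c.name, c.value)
	}
}

func main() {
	var m Metrics
	m.Record(Result{ExitCode: 0, PlatformSLA: true})
	m.Record(Result{ExitCode: 1, PlatformSLA: true}) // workload failed, platform did its job
	m.WriteProm(os.Stdout)
}
```

The second record shows the distinction the document draws later: a failed workload still counts as a met platform SLA when the wrapper did its part.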
Layer 3: Human-Readable Logs
func (r *Result) LogSummary() string {
    // For ops to grep at 03:00
}
Killer Feature: Violation Sampling
- Ring buffer of last 50 SLA violations
- Accessed via /violations endpoint
- Newest first, for debugging
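A fixed-size ring buffer that returns its contents newest first can be sketched as follows. The `Violation` fields and the `ViolationBuffer` type are assumptions for illustration; the real record presumably carries more detail.

```go
package main

import (
	"fmt"
	"sync"
)

// Violation is a hypothetical record of one SLA violation.
type Violation struct {
	JobID  string
	Reason string
}

// ViolationBuffer keeps only the most recent violations, overwriting the
// oldest entry once it is full.
type ViolationBuffer struct {
	mu    sync.Mutex
	items []Violation
	next  int // index where the next violation will be written
	count int // number of valid entries, at most len(items)
}

// NewViolationBuffer creates a buffer holding up to capacity entries
// (capacity must be at least 1).
func NewViolationBuffer(capacity int) *ViolationBuffer {
	return &ViolationBuffer{items: make([]Violation, capacity)}
}

// Add stores v, replacing the oldest entry when the buffer is full.
func (b *ViolationBuffer) Add(v Violation) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.items[b.next] = v
	b.next = (b.next + 1) % len(b.items)
	if b.count < len(b.items) {
		b.count++
	}
}

// Recent returns a copy of the stored violations, newest first.
func (b *ViolationBuffer) Recent() []Violation {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]Violation, 0, b.count)
	for i := 1; i <= b.count; i++ {
		idx := (b.next - i + len(b.items)) % len(b.items)
		out = append(out, b.items[idx])
	}
	return out
}

func main() {
	buf := NewViolationBuffer(50)
	for i := 0; i < 60; i++ {
		buf.Add(Violation{JobID: fmt.Sprintf("job-%d", i), Reason: "example"})
	}
	recent := buf.Recent()
	fmt.Println(len(recent), recent[0].JobID, recent[len(recent)-1].JobID)
}
```

Because the buffer has a fixed capacity, memory use stays constant no matter how many violations occur, which fits a debugging aid that should never become a liability.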
1. Production Readiness Test
scripts/test_production_readiness.sh
Validates:
- Retry logic integration (transport only)
- Graceful shutdown (master + worker)
- Enhanced readiness checks
- Logging migration
- No retries on workload execution
2. Metrics Endpoints Test
scripts/test_metrics_endpoints.sh
Tests:
- Master /metrics, /health
- Worker /metrics, /health, /ready
- Wrapper /wrapper/metrics, /violations
3. End-to-End Test Suite
scripts/test_all_end_to_end.sh
Comprehensive 34-test suite covering all wrapper phases.
Test Graceful Shutdown:
# Start master
./bin/master --tls=false --port=8080 --db=""
# In another terminal, send SIGTERM
kill -TERM $(pgrep master)
# Verify logs show clean shutdown
tail -f logs/master/master.log
Test Readiness Checks:
# Start worker
./bin/agent --metrics-port=9091
# Check readiness
curl http://localhost:9091/ready | jq
- Boring on Purpose
  - No clever retries
  - No fancy backoff algorithms beyond exponential
  - No heroics
- Correctness Over Features
  - Retries only where safe (messages)
  - Shutdown doesn't kill workloads
  - Logging is simple file writes
- Non-Reactive Visibility
  - Metrics derive from immutable truth
  - Wrapper never reacts to metrics
  - Flow: Workload → OS → Wrapper observes → Metrics → Humans look
- Governance, Not Management
  - We decide when to start/stop accepting work
  - OS decides when workloads run
  - Wrapper records what happened
Critical Scope Limits:
Retry:
- HTTP requests to master
- Heartbeat delivery
- Job polling
- Result reporting
NO Retry:
- Job execution
- Wrapper run/attach
- FFmpeg execution
- Workload failures
If the operation changes system state beyond just "sending a message", NO RETRIES.
Service Files:
- deployment/systemd/ffrtmp-master.service
- deployment/systemd/ffrtmp-worker.service
Critical Settings:
# Worker service MUST have:
# For cgroup management
Delegate=yes
# Don't kill workloads
KillMode=process
Graceful Shutdown:
# 60 seconds for clean shutdown
TimeoutStopSec=60
# Auto-restart on crash
Restart=on-failure
Configs:
- deployment/logrotate/ffrtmp-master
- deployment/logrotate/ffrtmp-worker
- deployment/logrotate/ffrtmp-wrapper
Settings:
- 14-day retention
- Daily rotation
- Compress old logs
- Create new log with correct permissions
scrape_configs:
  - job_name: 'ffrtmp-master'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
  - job_name: 'ffrtmp-workers'
    static_configs:
      - targets: ['worker1:9091', 'worker2:9091']
    scrape_interval: 15s
Master:
- ffrtmp_jobs_pending > 100 (backlog building)
- ffrtmp_nodes_offline > 0 (worker failure)
Worker:
- worker_heartbeat_failures > 3 (connectivity issues)
- worker_disk_usage_percent > 90 (disk space low)
- wrapper_platform_sla_failed > 10 (wrapper problems)
- Check /violations endpoint for recent failures
- Look for patterns in violation timestamps
- Grep logs for job IDs in violations
- Check if workload failed vs platform failed
Before deploying to production:
- Review retry config (max retries, backoff)
- Test graceful shutdown under load
- Verify log rotation is working
- Set up Prometheus scraping
- Configure alerts for SLA violations
- Test readiness probes with Kubernetes
- Verify disk space monitoring
- Check FFmpeg availability on all workers
- Test master reachability checks
- Review shutdown timeout (30s sufficient?)
All production readiness features are implemented following strict architectural principles:
Retry Logic: Messages only, never work
Graceful Shutdown: Let jobs finish, no killing
Readiness Checks: FFmpeg, disk, master connectivity
Centralized Logging: File + stdout, /var/log/ffrtmp
Metrics Endpoints: Prometheus + health/ready probes
No Broken Principles: Retries scoped correctly, shutdown clean
Total Changes:
- 4 files modified (client.go, shutdown.go, master/main.go, worker/main.go)
- 317 insertions, 151 deletions
- 3 commits
- 2 test scripts
- 100% backward compatible
Architecture Preserved:
- Wrapper still doesn't react to metrics
- OS still owns workload lifecycle
- Job execution has NO retries
- Visibility remains derived, not driving
- LOGGING.md - Centralized logging architecture
- WRAPPER_VISIBILITY.md - 3-layer visibility
- WRAPPER_INTEGRATION.md - Worker integration
- WRAPPER_REPLICATION_GUIDE.md - Complete implementation guide
Test Coverage:
- scripts/test_production_readiness.sh - Feature validation
- scripts/test_metrics_endpoints.sh - Endpoint testing
- scripts/test_all_end_to_end.sh - 34 comprehensive tests