This document provides technical implementation details for the distributed compute feature.
Decision: Use push model (nodes push results to master)
Rationale:
- Reduces master complexity (no need to track scraping endpoints)
- Minimizes network connections (nodes initiate all connections)
- Better for firewall/NAT scenarios (only outbound connections from workers)
- Aligns with existing VictoriaMetrics architecture
Decision: Use in-memory storage for v1.0
Rationale:
- Simple, no external database dependencies
- Fast for development and testing
- Sufficient for initial deployment scale
- Easy to migrate to persistent storage in v1.5+
Trade-offs:
- Jobs/nodes lost on master restart
- No horizontal scaling of master (single instance)
- Memory constrained by available RAM
Decision: Simple FIFO queue for v1.0
Rationale:
- Minimal complexity
- Predictable behavior
- Good baseline for future optimization
Future Enhancements (v1.5+):
- Priority queues
- Resource-aware scheduling
- Affinity rules (GPU jobs → GPU nodes)
- Load balancing
Decision: Use HTTP with JSON payloads
Rationale:
- Universal protocol, works everywhere
- Easy to debug (curl, browser, etc.)
- Well-understood by ops teams
- Simple client implementation
Security Note: for production deployments:
- Add HTTPS/TLS
- Implement mTLS for mutual authentication
- Add API tokens/JWT
Decision: Require explicit flag + confirmation
Rationale:
- Prevents accidental resource contention
- Clear UX with warnings and risks
- Allows development without multiple machines
- Forces conscious decision for production
cmd/
  master/main.go  - Master node entry point
  agent/main.go   - Agent node entry point
pkg/
  models/
    node.go       - Node data structures
    job.go        - Job data structures
  api/
    master.go     - Master HTTP handlers
  store/
    memory.go     - In-memory storage implementation
  agent/
    client.go     - HTTP client for master communication
    hardware.go   - Hardware detection utilities
- MemoryStore uses separate mutexes for nodes, jobs, and queue
- Read-write locks (RWMutex) allow concurrent reads
- Critical section in GetNextJob uses a full lock to prevent races
- Each HTTP request runs in its own goroutine (Go standard)
- Store methods handle all synchronization internally
- No shared state between handlers
- Main goroutine: job polling loop
- Background goroutine: heartbeat loop
- Both communicate via same HTTP client (which is thread-safe)
Not implemented in v1.0 (minimal scope)
test_distributed.sh: End-to-end workflow validation
- Tests all API endpoints
- Validates state transitions
- Verifies error handling
- Master-as-worker warning/confirmation
- Agent registration and hardware detection
- Job execution simulation
- Existing Python workflow compatibility
- Memory: O(N) for nodes + O(M) for jobs
- Typical: ~1KB per node, ~2KB per job
- Can handle 10K nodes + 100K jobs in <1GB RAM (roughly 10MB + 200MB at the typical sizes above)
- CPU: Minimal (HTTP request handling only)
- API is I/O bound, not CPU bound
- JSON encoding/decoding is fast
- Polling: Default 10s interval, configurable
- At 100 nodes, master sees 10 req/sec
- Easily scaled with load balancer
- Heartbeat: Default 30s interval
- Lightweight, just updates timestamp
- Bandwidth: Results batched in single JSON payload
- Typical payload: 1-10KB for metrics + analyzer output
- Not suitable for raw video transfer (nor is it intended to be)
- No job persistence: Jobs lost on master restart
- No retry logic: Failed jobs require manual requeue
- No authentication: Trust-on-first-register only
- No multi-master: Single point of failure
- Simulated execution: Agent doesn't actually run FFmpeg yet
- Nodes can have same hostname (UUID differentiates)
- Jobs don't auto-retry (intentional for v1.0)
- No live log streaming (results are batch uploaded)
- No job cancellation API (can add if needed)
The distributed system is fully backward compatible with single-node workflows:
- Python scripts continue to work
- Docker stack unchanged
- VictoriaMetrics/Grafana unaffected
- Master/Agent are opt-in additions
- HTTP (no encryption)
- No authentication
- Trust-on-first-register
- UUID-based node identity
Risk Level: Development/Testing only
- Transport Security
  - HTTPS/TLS for all endpoints
  - Certificate validation
- Authentication
  - API tokens per node
  - JWT for requests
  - Shared secrets
- Authorization
  - Role-based access control
  - Node capabilities → job restrictions
- Network Isolation
  - Private network for master-worker communication
  - Firewall rules
  - VPN/Wireguard mesh
- Job retry with exponential backoff
- Dead node detection and cleanup
- Job timeout and recovery
- Resource-aware placement (CPU/GPU/RAM matching)
- Affinity and anti-affinity rules
- Priority queues
- mTLS
- API authentication
- Audit logging
- PostgreSQL or SQLite for job/node storage
- Results archival
- Historical analytics
- Master election (Raft/etcd)
- Distributed job queue
- Horizontal scaling
- Real FFmpeg execution in agent
- Kubernetes operator
- Helm charts
- Prometheus metrics from master
- Node registrations
- Job dispatches
- Job completions
- Heartbeat failures
- Master: job queue depth, node count, job completion rate
- Agent: job execution time, resource utilization
- Master down (health check failure)
- No available nodes
- Job queue growing (backlog)
- High job failure rate
make build-distributed
./test_distributed.sh

# Master logs
tail -f /tmp/master.log
# Agent logs
tail -f /tmp/agent.log
# Check state
curl http://localhost:8080/nodes | jq
curl http://localhost:8080/jobs | jq

- Update models in pkg/models/
- Add API handlers in pkg/api/
- Update store in pkg/store/
- Test with integration script
- Update documentation
- Keep it simple: In-memory storage was right choice for v1.0
- Safety first: Master-as-worker warnings prevented confusion
- Test early: Integration tests caught issues before manual testing
- Document risks: Clear docs on limitations prevent surprises
- Race conditions: Code review caught subtle concurrency bug
Implemented following the requirements in the original issue, with additional safety features (master-as-worker warnings) based on common sense and development best practices.