Epoch

A production-grade distributed job scheduler built from scratch.
Priority queues · Leader election · Fault tolerance · Tenant isolation · Real-time dashboard

Quickstart · Architecture · How It Works · Fault Tolerance · Walkthroughs


What is Epoch?

Epoch is a distributed job scheduling system that handles job submission, prioritized scheduling, execution across a pool of workers, automatic retries with exponential backoff, checkpointing for long-running jobs, tenant-aware fair-share scheduling, and leader-elected fault tolerance — all built from the ground up with no external orchestration frameworks.

It's designed to answer the question: "What does it actually take to build a reliable job scheduler that handles failures gracefully?"

Key Features

| Feature | Description |
|---------|-------------|
| 🎯 Multi-Level Priority Queue | CRITICAL > HIGH > NORMAL with automatic priority aging to prevent starvation |
| 👑 Leader-Elected Scheduler | PostgreSQL advisory-lock-based leader election with automatic failover |
| 🔄 Automatic Retries | Exponential backoff with jitter, configurable per job, dead letter queue for permanent failures |
| ⚡ Job Preemption | Higher-priority jobs can preempt lower-priority running jobs |
| 💾 Checkpointing | Long-running jobs save progress periodically — crash recovery resumes from the last checkpoint |
| 🏢 Tenant Isolation | Per-tenant worker quotas and fair-share scheduling prevent noisy neighbors |
| 📊 Real-Time Dashboard | Live monitoring with job states, worker utilization, tenant activity, and full event history |
| 🔍 Immutable Audit Log | Every state transition is recorded — full lifecycle traceability for any job |

📸 Walkthroughs

🖥️ Dashboard Overview

Real-time metrics, job distribution charts, worker utilization, and recent activity feed.

dashboard_walkthrough.mov

📋 Jobs Lifecycle

Submitting jobs, watching state transitions, retries, failures, dead letter flow, and the full event timeline.

all_jobs_walkthrough.mov

👷 Workers

Worker pool management, heartbeats, slot utilization, and drain mode.

workers.mov

🏢 Tenant Management

Multi-tenant configuration, per-tenant quotas, and priority boost.

tenants_demo.mov

🏢 Audit Logs

Immutable event history across all jobs — filter by event type, tenant, job name, and date range for full operational visibility.

audit.logs.mov

📸 Checkpoints

Periodic state snapshots for long-running jobs — crash mid-execution and resume from where you left off.

checkpoints.mov

🚀 Quickstart

Prerequisites

Docker and Docker Compose (the entire stack runs in containers, nothing else needs to be installed locally).

One-Command Setup

git clone https://github.com/your-username/epoch.git
cd epoch
docker compose up --build

That's it. This spins up the entire stack:

| Service | Port | Description |
|---------|------|-------------|
| Frontend | localhost:3000 | Next.js dashboard |
| API | localhost:8000 | FastAPI REST API |
| Scheduler | — | Leader-elected scheduler |
| Scheduler Standby | — | Hot standby (takes over if the leader dies) |
| Worker 1 | — | 4 execution slots |
| Worker 2 | — | 4 execution slots |
| PostgreSQL | 5432 | Source of truth |
| Redis | 6379 | Priority queues & pub/sub |

Submit Your First Job

curl -X POST http://localhost:8000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-first-job",
    "tenant_id": "default",
    "job_type": "jobs.data_processing:DataProcessingJob",
    "payload": {"input_size": 5},
    "priority": "NORMAL",
    "max_retries": 3,
    "timeout_seconds": 60
  }'

Then open localhost:3000 to watch it flow through the system.

API Quick Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/jobs | Submit a new job |
| GET | /api/v1/jobs | List all jobs (filterable by state, tenant) |
| GET | /api/v1/jobs/{id} | Get job details |
| GET | /api/v1/jobs/{id}/events | Full event audit log |
| POST | /api/v1/jobs/{id}/retry | Manually retry a failed/dead-lettered job |
| POST | /api/v1/jobs/{id}/cancel | Cancel a job |
| GET | /api/v1/workers | List all workers and their status |
| POST | /api/v1/admin/workers/{id}/drain | Drain a worker (stop accepting new jobs) |
| GET | /api/v1/admin/scheduler/leader | Current scheduler leader info |
| GET | /api/v1/dashboard/stats | Dashboard metrics |

πŸ— Architecture

System Overview

┌─────────────────┐
│     Client      │
│  (Dashboard /   │
│   REST API)     │
└────────┬────────┘
         │ HTTP
         ▼
┌─────────────────┐
│   API Server    │  ← Stateless, horizontally scalable
│   (FastAPI)     │
└────────┬────────┘
         │ Writes job to DB + pushes to Redis queue
         ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   ┌──────────────┐         ┌──────────────┐               │
│   │  PostgreSQL  │◄───────►│    Redis     │               │
│   │              │         │              │               │
│   │ • Job state  │         │ • Priority   │               │
│   │ • Workers    │         │   queue      │               │
│   │ • Leader     │         │ • Pub/Sub    │               │
│   │   election   │         │ • Job locks  │               │
│   │ • Checkpoints│         │              │               │
│   │ • Audit log  │         │              │               │
│   └──────────────┘         └──────┬───────┘               │
│                                   │                       │
│   ┌───────────────────────────────┼────────────────────┐  │
│   │         Scheduler (Leader-Elected)                 │  │
│   │                               │                    │  │
│   │  ┌─────────────┐    ┌─────────┴─────┐              │  │
│   │  │  Scheduler  │    │  Scheduler    │              │  │
│   │  │  (Active) 👑│    │  (Standby) ⏳ │              │  │
│   │  └──────┬──────┘    └───────────────┘              │  │
│   │         │ Assigns jobs to workers                  │  │
│   └─────────┼──────────────────────────────────────────┘  │
│             ▼                                             │
│   ┌────────────────────────────────────────────────────┐  │
│   │              Worker Pool                           │  │
│   │                                                    │  │
│   │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │  │
│   │  │ Worker 1 │  │ Worker 2 │  │ Worker N │          │  │
│   │  │ (4 slots)│  │ (4 slots)│  │ (4 slots)│          │  │
│   │  └──────────┘  └──────────┘  └──────────┘          │  │
│   └────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘

Tech Stack

| Layer | Technology |
|-------|------------|
| API | FastAPI + Uvicorn |
| Database | PostgreSQL 15 (advisory locks, JSONB) |
| Queue & Coordination | Redis 7 (sorted sets, pub/sub) |
| ORM | SQLAlchemy 2.0 (fully async) |
| Migrations | Alembic |
| Frontend | Next.js 16 + React 19 + Recharts + Tailwind CSS |
| Containerization | Docker + Docker Compose |
| Checkpoints | Local filesystem (S3-compatible interface) |

⚙ How It Works

The Job State Machine

Every job follows a deterministic state machine. Every transition is persisted to PostgreSQL before any action is taken — this is the cornerstone of crash safety.

SUBMITTED ──► QUEUED ──► SCHEDULED ──► RUNNING ──► COMPLETED ✅
                │              │           │
                │              │           ├──► CHECKPOINTED ──► RUNNING (resume)
                │              │           │
                │              │           ├──► FAILED ──► QUEUED (retry with backoff)
                │              │           │         └──► DEAD_LETTER 💀 (max retries exhausted)
                │              │           │
                │              │           └──► TIMED_OUT ──► QUEUED (retry)
                │              │                     └──► DEAD_LETTER 💀
                │              │
                │              └──► PREEMPTED ──► QUEUED (re-queued, same priority)
                │
                └──► CANCELLED 🚫

State definitions:

| State | What it means |
|-------|---------------|
| SUBMITTED | Job received by the API |
| QUEUED | Sitting in the priority queue, waiting to be scheduled |
| SCHEDULED | Assigned to a specific worker, about to execute |
| RUNNING | Currently executing on a worker |
| COMPLETED | Finished successfully |
| FAILED | Execution failed (will be retried or dead-lettered) |
| TIMED_OUT | Exceeded its timeout — treated like a failure |
| CHECKPOINTED | Progress saved mid-execution (long-running jobs) |
| PREEMPTED | Interrupted by a higher-priority job |
| CANCELLED | Manually cancelled by the user |
| DEAD_LETTER | Exhausted all retries — parked for manual inspection |
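
In code, the state machine reduces to a small transition table. A minimal sketch (the actual enums and transition map live in src/constants.py; the names below are illustrative):

VALID_TRANSITIONS = {
    "SUBMITTED":    {"QUEUED"},
    "QUEUED":       {"SCHEDULED", "CANCELLED"},
    "SCHEDULED":    {"RUNNING"},
    "RUNNING":      {"COMPLETED", "FAILED", "TIMED_OUT", "CHECKPOINTED", "PREEMPTED"},
    "CHECKPOINTED": {"RUNNING"},
    "FAILED":       {"QUEUED", "DEAD_LETTER"},
    "TIMED_OUT":    {"QUEUED", "DEAD_LETTER"},
    "PREEMPTED":    {"QUEUED"},
    "DEAD_LETTER":  {"QUEUED"},   # manual retry via the API re-queues the job
}

def assert_transition(current: str, target: str) -> None:
    # Guard every state change before it is committed to PostgreSQL.
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")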

The Scheduling Loop

The scheduler runs a continuous loop (every ~1 second):

┌──────────────────────────────────────────────────────────────┐
│                     SCHEDULER LOOP                           │
│                                                              │
│  1. Am I the leader? (check advisory lock)                   │
│     └─ No → sleep, try again                                 │
│     └─ Yes ↓                                                 │
│                                                              │
│  2. Dequeue highest-priority job from Redis sorted set       │
│                                                              │
│  3. Check tenant quota:                                      │
│     └─ Tenant at max concurrent jobs? → defer, try next job  │
│                                                              │
│  4. Find an available worker:                                │
│     └─ Worker with free slots? → assign job                  │
│     └─ No free workers? → try preemption (if CRITICAL)       │
│     └─ Still no slot? → put job back in queue                │
│                                                              │
│  5. Assign: Job → SCHEDULED, notify worker via pub/sub       │
│                                                              │
│  6. Repeat                                                   │
└──────────────────────────────────────────────────────────────┘

Priority Queue with Aging

Jobs are stored in a Redis sorted set (ZPOPMIN — lowest score dequeued first). The score formula:

score = -(priority_weight × 1000) + enqueue_timestamp

| Priority | Weight | Effective Score Range |
|----------|--------|-----------------------|
| CRITICAL | 3 | ≈ -3000 + timestamp (always first) |
| HIGH | 2 | ≈ -2000 + timestamp |
| NORMAL | 1 | ≈ -1000 + timestamp |

Anti-starvation aging: NORMAL jobs gain +0.1 effective weight per minute in the queue. After ~10 minutes, a NORMAL job's effective priority equals HIGH. After ~20 minutes, it matches CRITICAL. This prevents low-priority jobs from waiting forever — a classic technique borrowed from Multi-Level Feedback Queue scheduling in operating systems.
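
As a sketch, the scoring can be transcribed directly from the formula and aging rule above (the real implementation lives in src/scheduler/priority_queue.py; names here are illustrative):

import time

PRIORITY_WEIGHT = {"CRITICAL": 3, "HIGH": 2, "NORMAL": 1}
AGING_PER_MINUTE = 0.1  # NORMAL reaches HIGH weight after ~10 min, CRITICAL after ~20

def queue_score(priority: str, enqueue_ts: float, now: float | None = None) -> float:
    # Lower score = dequeued first (ZPOPMIN). Waiting raises the effective
    # weight, which lowers the score and moves the job toward the front.
    now = now if now is not None else time.time()
    waited_minutes = (now - enqueue_ts) / 60
    effective_weight = PRIORITY_WEIGHT[priority] + AGING_PER_MINUTE * waited_minutes
    return -(effective_weight * 1000) + enqueue_ts

Because the score depends on wall-clock time, one way to realize aging is to periodically re-score queued members (ZADD with a fresh score) rather than relying on a one-time value.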

Job Preemption

When a CRITICAL job arrives and all workers are full:

1. Find the lowest-priority RUNNING job
2. CRITICAL can preempt HIGH or NORMAL
3. HIGH can preempt NORMAL
4. Preempted job → PREEMPTED → re-queued with same priority
5. CRITICAL job gets the freed worker slot

The preempted job doesn't lose its work — if it supports checkpointing, it resumes from the last checkpoint when rescheduled.
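
Victim selection fits in a few lines. A hedged sketch (the actual logic lives in src/scheduler/preemption.py; the job shape and the tie-break rule here are assumptions):

PRIORITY_RANK = {"CRITICAL": 3, "HIGH": 2, "NORMAL": 1}

def find_preemption_victim(incoming_priority: str, running_jobs: list[dict]):
    # Only jobs strictly below the incoming priority are preemptable.
    candidates = [j for j in running_jobs
                  if PRIORITY_RANK[j["priority"]] < PRIORITY_RANK[incoming_priority]]
    if not candidates:
        return None
    # Take the lowest-priority job; among equals, the most recently started
    # one (assumption: it has the least work to lose).
    return min(candidates,
               key=lambda j: (PRIORITY_RANK[j["priority"]], -j["started_at"]))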


🛑 Fault Tolerance

This is where things get interesting. Distributed systems fail in creative ways. Here's how Epoch handles each failure mode:

1. Worker Crashes

Problem: A worker dies mid-execution. The job is stuck in RUNNING with no one executing it.

Detection: Workers send heartbeats every 5 seconds. The scheduler monitors these. If a worker misses heartbeats for 30 seconds, it's declared dead.

Recovery:

1. Scheduler detects stale heartbeat (> 30s old)
2. Worker marked as OFFLINE
3. All jobs assigned to that worker are re-examined
4. RUNNING jobs → TIMED_OUT → queued for retry (with backoff)
5. If the job has a checkpoint, the retry resumes from there

No job is lost. PostgreSQL is the source of truth — a job in RUNNING state with no live worker will always be detected and recovered.
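
A sketch of the detection pass, assuming the SQLAlchemy models from src/models/ (Worker, Job) with illustrative column and state names:

from datetime import datetime, timedelta, timezone
from sqlalchemy import select

HEARTBEAT_TIMEOUT = timedelta(seconds=30)  # EPOCH_WORKER_HEARTBEAT_TIMEOUT

async def reap_dead_workers(session):
    cutoff = datetime.now(timezone.utc) - HEARTBEAT_TIMEOUT
    stale = await session.execute(
        select(Worker).where(Worker.status == "ONLINE",
                             Worker.last_heartbeat < cutoff))
    for worker in stale.scalars():
        worker.status = "OFFLINE"
        orphans = await session.execute(
            select(Job).where(Job.worker_id == worker.id,
                              Job.state == "RUNNING"))
        for job in orphans.scalars():
            job.state = "TIMED_OUT"  # the retry path re-queues it with backoff
    await session.commit()           # persist before any side effects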

2. Scheduler Crashes

Problem: The scheduler (the brain of the system) dies.

Detection: PostgreSQL advisory locks. The leader holds a session-level advisory lock. If its connection drops (process crash, network partition), PostgreSQL automatically releases the lock.

Recovery:

1. Standby scheduler continuously tries pg_try_advisory_lock()
2. Leader dies → lock released → standby acquires it
3. New leader reads all state from PostgreSQL
4. Rebuilds in-memory queues from DB (QUEUED jobs → Redis)
5. Resumes scheduling — no jobs lost, failover < 15 seconds

This is why we use PostgreSQL advisory locks instead of ZooKeeper or etcd — one fewer infrastructure dependency, and it's equally reliable for single-region deployments.

3. Job Execution Failures

Problem: A job throws an exception during execution.

Handling: Exponential backoff with jitter.

delay = min(base_delay × 2^attempt + random_jitter, max_delay)

Example with base_delay=5s, max_delay=300s:
  Attempt 1: ~5s   wait
  Attempt 2: ~10s  wait
  Attempt 3: ~20s  wait
  Attempt 4: ~40s  wait
  ...
  Attempt N: max 300s wait

After max_retries (configurable per job, default: 3), the job moves to the Dead Letter Queue — a holding area for permanently failed jobs that can be inspected and manually retried.

4. Timeout Detection

Problem: A job hangs — it doesn't fail, doesn't complete, just sits there.

Detection: The scheduler checks for jobs in RUNNING state that have exceeded their timeout_seconds.

Recovery: Same as failure — TIMED_OUT → retry with backoff → dead letter after max retries.

5. Redis Crashes

Problem: Redis (the coordination layer) goes down.

Impact: The priority queue and pub/sub notifications are unavailable. Jobs can't be enqueued or dequeued.

Recovery: Redis is volatile — it's a performance optimization, not the source of truth. When Redis recovers:

1. Scheduler detects Redis reconnection
2. Rebuilds the priority queue from PostgreSQL
3. All QUEUED jobs in DB → re-pushed to Redis sorted set
4. Scheduling resumes normally

No data is lost because PostgreSQL always has the canonical state.
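
A sketch of the rebuild, assuming a redis.asyncio client and reusing the queue_score() sketch from the Priority Queue section (the queue key name is illustrative):

from sqlalchemy import select

async def rebuild_queue(session, redis):
    result = await session.execute(select(Job).where(Job.state == "QUEUED"))
    async with redis.pipeline(transaction=True) as pipe:
        for job in result.scalars():
            # Recompute each score from the DB row; Redis state is disposable.
            score = queue_score(job.priority, job.enqueued_at.timestamp())
            pipe.zadd("epoch:queue", {str(job.id): score})
        await pipe.execute()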


πŸ” Retry System Deep Dive

The Retry Flow

Job executes → FAILS
     │
     ▼
attempt < max_retries?
     │
     ├── YES → compute backoff delay
     │          → Job state: FAILED → QUEUED
     │          → Re-enqueue in Redis with delay
     │          → Will be rescheduled after delay
     │
     └── NO  → Job state: DEAD_LETTER 💀
              → Moved to dead_letter_jobs table
              → Error details preserved
              → Available for manual retry via API or dashboard

Backoff Formula

delay = min(base_delay × 2^attempt + random.uniform(0, base_delay), max_delay)

The jitter prevents the thundering herd problem — if 100 jobs fail at the same time, they don't all retry at exactly the same moment.
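
The whole policy fits in one function. A minimal sketch (defaults mirror the configuration table below):

import random

def backoff_delay(attempt: int, base_delay: float = 5.0,
                  max_delay: float = 300.0) -> float:
    # attempt is 0-indexed: the first retry waits ~base_delay seconds.
    # The uniform jitter spreads simultaneous failures across time.
    return min(base_delay * 2 ** attempt + random.uniform(0, base_delay),
               max_delay)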

Dead Letter Queue

Jobs that exhaust all retries are moved to the dead letter queue with:

  • The final error message
  • Total number of attempts
  • Full event history (every state transition)

From the dashboard or API, you can:

  • Inspect the failure reason
  • Manually retry — resets the attempt counter and re-queues the job

💾 Checkpointing

For long-running jobs (hours, days), losing all progress on a crash is unacceptable.

How It Works

Job starts → executes work → periodically saves checkpoint
     │
     │   Every N seconds (configurable, default: 30s):
     │   ┌──────────────────────────────────────┐
     │   │ 1. Job serializes its current state  │
     │   │ 2. Blob written to checkpoint store  │
     │   │ 3. Metadata saved to PostgreSQL      │
     │   │    (path, sequence #, timestamp)     │
     │   └──────────────────────────────────────┘
     │
     ▼
Worker crashes!
     │
     ▼
Scheduler detects dead worker → re-queues the job
     │
     ▼
New worker picks up the job:
  1. Loads latest checkpoint from store
  2. Deserializes saved state
  3. Resumes execution from where it left off

Jobs implement the BaseJob interface which provides save_checkpoint() and load_checkpoint() hooks:

class MyLongRunningJob(BaseJob):
    async def execute(self, job_id, payload, checkpoint_data=None):
        start = 0
        if checkpoint_data:
            start = checkpoint_data["progress"]  # Resume from checkpoint

        for i in range(start, 1000):
            # Do work...
            if i % 100 == 0:
                await self.save_checkpoint(job_id, {"progress": i})

        return {"success": True, "result": "done"}

🏢 Tenant Isolation

Epoch is designed for multi-tenant environments where different teams or customers share the same scheduler infrastructure.

Fair-Share Scheduling

Each tenant has configurable limits:

{
  "tenant_id": "team-ml",
  "max_concurrent_jobs": 5,
  "max_workers": 3,
  "priority_boost": 0
}

The scheduler enforces these during assignment:

1. Dequeue next job from priority queue
2. Check: "Is tenant X at their max_concurrent_jobs limit?"
   └─ Yes → skip this job, try the next one
   └─ No  → proceed to assign
3. This ensures no single tenant can monopolize all workers
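
A sketch of the quota check from step 2, assuming the Job model from src/models/ and illustrative state names:

from sqlalchemy import select, func

async def within_quota(session, tenant_id: str, max_concurrent_jobs: int) -> bool:
    # Count the tenant's jobs currently holding (or about to hold) a slot.
    active = await session.scalar(
        select(func.count()).select_from(Job).where(
            Job.tenant_id == tenant_id,
            Job.state.in_(["SCHEDULED", "RUNNING"])))
    return active < max_concurrent_jobs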

Why This Matters

Without tenant isolation, a single tenant submitting 10,000 jobs would starve everyone else. Epoch's fair-share model ensures:

  • Each tenant gets their fair share of resources
  • One tenant's failures don't cascade to others
  • Priority boost lets you give premium tenants an edge

👑 Leader Election

Only one scheduler should be assigning jobs at any time. Multiple schedulers assigning the same job would cause duplicate execution.

How It Works

                  ┌─────────────────────────────────┐
                  │       PostgreSQL                │
                  │                                 │
                  │   Advisory Lock #123456789      │
                  │   ┌───────────────────────┐     │
                  │   │  Held by: Scheduler A │     │
                  │   │  Since: 2 minutes ago │     │
                  │   └───────────────────────┘     │
                  │                                 │
                  └───────────────┬─────────────────┘
                                  │
                ┌─────────────────┼──────────────────┐
                │                 │                  │
                ▼                 ▼                  ▼
      ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐
      │ Scheduler A │  │ Scheduler B  │  │  Scheduler C    │
      │  (Leader) 👑│  │ (Standby) ⏳ │  │  (Standby) ⏳   │
      │             │  │              │  │                 │
      │ Acquired ✅ │  │ Try → fail   │  │  Try → fail     │
      │ Scheduling  │  │ Try → fail   │  │  Try → fail     │
      │             │  │ Try → fail   │  │  Try → fail     │
      └─────────────┘  └──────────────┘  └─────────────────┘

  1. pg_try_advisory_lock(123456789) — non-blocking lock attempt
  2. One scheduler wins → becomes leader, writes to scheduler_leader table
  3. Others fail → keep retrying every loop iteration
  4. Leader renews heartbeat every 5 seconds
  5. If leader dies → PostgreSQL releases the lock → standby takes over

No split-brain is possible — PostgreSQL guarantees at most one holder of an advisory lock at any time.
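
A sketch of the competition loop, assuming an async SQLAlchemy engine (the real version lives in src/scheduler/leader.py):

import asyncio
from sqlalchemy import text

LEADER_LOCK_KEY = 123456789  # must be identical across all scheduler instances

async def leader_loop(engine, run_one_cycle):
    # The advisory lock is session-level: it lives on this one connection and
    # PostgreSQL releases it the moment the connection (or process) dies.
    async with engine.connect() as conn:
        while True:
            is_leader = await conn.scalar(
                text("SELECT pg_try_advisory_lock(:key)"),
                {"key": LEADER_LOCK_KEY})
            if is_leader:
                await run_one_cycle()   # we hold the lock: schedule jobs
            await asyncio.sleep(1.0)    # standbys simply retry next iteration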


📊 Event Audit Log

Every state transition for every job is recorded in an immutable job_events table:

Job: data-pipeline-42

CREATED         | Job submitted: data-pipeline-42
QUEUED          | Enqueued for scheduling
SCHEDULED       | Assigned to worker abc123 (host-1)
RUNNING         | Execution started (attempt 1/3)
FAILED          | TimeoutError: connection to database timed out
RETRIED         | Auto-retry: re-queued (attempt 1/3)
SCHEDULED       | Assigned to worker def456 (host-2)
RUNNING         | Execution started (attempt 2/3)
COMPLETED       | Job completed successfully

Each event captures:

  • Timestamp (microsecond precision)
  • Event type (CREATED, QUEUED, SCHEDULED, RUNNING, FAILED, RETRIED, DEAD_LETTER, etc.)
  • Attempt number
  • Worker ID (which worker executed it)
  • Detail text (error messages, assignment info)

This gives you complete lifecycle traceability — you can reconstruct exactly what happened to any job.
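
A sketch of the append-only writer, modeled on the record_event() helper in src/services/event_logger.py (the JobEvent columns shown are illustrative):

from datetime import datetime, timezone

async def record_event(session, job_id, event_type: str, detail: str = "",
                       attempt: int | None = None, worker_id: str | None = None):
    # Insert-only: events are never updated or deleted once written.
    session.add(JobEvent(job_id=job_id,
                         event_type=event_type,
                         detail=detail,
                         attempt=attempt,
                         worker_id=worker_id,
                         created_at=datetime.now(timezone.utc)))
    await session.commit()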


πŸ“ Project Structure

epoch/
├── src/
│   ├── api/                    # FastAPI application
│   │   ├── server.py           #   App factory, middleware, lifespan
│   │   ├── routes/
│   │   │   ├── jobs.py         #   Job CRUD + event history endpoint
│   │   │   ├── workers.py      #   Worker listing
│   │   │   ├── admin.py        #   Admin ops (drain, leader info, tenants)
│   │   │   └── dashboard.py    #   Dashboard stats aggregation
│   │   └── schemas.py          #   Pydantic request/response models
│   │
│   ├── scheduler/              # Core scheduling engine
│   │   ├── scheduler.py        #   Main loop: dequeue → assign → commit
│   │   ├── leader.py           #   PostgreSQL advisory lock leader election
│   │   ├── priority_queue.py   #   Score computation + aging formula
│   │   └── preemption.py       #   Find preemptable jobs + preempt logic
│   │
│   ├── worker/                 # Job execution
│   │   ├── worker.py           #   Worker lifecycle, pub/sub listener
│   │   ├── executor.py         #   Subprocess isolation, timeout enforcement
│   │   └── heartbeat.py        #   Periodic heartbeat sender
│   │
│   ├── checkpoint/             # Checkpoint management
│   │   ├── manager.py          #   Save/load/cleanup checkpoints
│   │   └── store.py            #   Storage backend (local FS, S3-ready)
│   │
│   ├── models/                 # SQLAlchemy ORM models
│   │   ├── job.py              #   Job + DeadLetterJob models
│   │   ├── job_event.py        #   Immutable audit log model
│   │   ├── worker.py           #   Worker model
│   │   ├── checkpoint.py       #   Checkpoint metadata model
│   │   ├── scheduler.py        #   Scheduler leader model
│   │   └── tenant.py           #   Tenant config model
│   │
│   ├── services/
│   │   └── event_logger.py     #   record_event() helper for audit log
│   │
│   ├── queue/
│   │   └── redis_queue.py      #   Redis sorted set + pub/sub operations
│   │
│   ├── db/
│   │   └── session.py          #   Async session factory + init
│   │
│   ├── config.py               #   Pydantic settings (env-driven)
│   └── constants.py            #   State enums, transitions, Redis keys
│
├── jobs/                       # Sample job implementations
│   ├── base.py                 #   BaseJob interface (checkpoint hooks)
│   ├── data_processing.py      #   Simple data processing job
│   └── long_running.py         #   Long-running job with checkpointing
│
├── frontend/                   # Next.js 16 dashboard
│   └── src/
│       ├── app/                #   Pages (dashboard, jobs, workers, tenants)
│       ├── components/         #   React components
│       └── lib/                #   API client, types, constants
│
├── walkthroughs/               # Demo videos
│   ├── dashboard_walkthrough.mov
│   ├── jobs_walkthrough.mov
│   ├── workers.mov
│   └── tenants_demo.mov
│
├── docker-compose.yml          # Full stack: PG + Redis + API + Scheduler(×2) + Worker(×2) + Frontend
├── Dockerfile                  # Python 3.12 multi-service image
├── requirements.txt            # Python dependencies
└── Plan.md                     # Original system design document

βš™οΈ Configuration

All configuration is via environment variables with the EPOCH_ prefix:

| Variable | Default | Description |
|----------|---------|-------------|
| EPOCH_DATABASE_URL | postgresql+asyncpg://epoch:epoch@localhost:5432/epoch | PostgreSQL connection (async) |
| EPOCH_REDIS_URL | redis://localhost:6379/0 | Redis connection |
| EPOCH_SCHEDULER_LOOP_INTERVAL | 1.0 | Scheduler cycle interval (seconds) |
| EPOCH_LEADER_HEARTBEAT_INTERVAL | 5.0 | Leader heartbeat interval (seconds) |
| EPOCH_LEADER_LOCK_TTL | 15.0 | Leader lock timeout (seconds) |
| EPOCH_WORKER_MAX_SLOTS | 4 | Max concurrent jobs per worker |
| EPOCH_WORKER_HEARTBEAT_INTERVAL | 5.0 | Worker heartbeat interval (seconds) |
| EPOCH_WORKER_HEARTBEAT_TIMEOUT | 30.0 | Declare a worker dead after this many seconds |
| EPOCH_DEFAULT_MAX_RETRIES | 3 | Default retry limit |
| EPOCH_RETRY_BASE_DELAY | 5.0 | Base retry delay (seconds) |
| EPOCH_RETRY_MAX_DELAY | 300.0 | Max retry delay (5 minutes) |
| EPOCH_CHECKPOINT_DIR | /tmp/epoch-checkpoints | Checkpoint storage directory |
| EPOCH_CHECKPOINT_INTERVAL | 30.0 | Checkpoint save interval (seconds) |
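
A sketch of the settings pattern behind this table (the real file is src/config.py; only a subset of fields is shown):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # EPOCH_DATABASE_URL in the environment overrides database_url, and so on.
    model_config = SettingsConfigDict(env_prefix="EPOCH_")

    database_url: str = "postgresql+asyncpg://epoch:epoch@localhost:5432/epoch"
    redis_url: str = "redis://localhost:6379/0"
    scheduler_loop_interval: float = 1.0
    worker_max_slots: int = 4
    default_max_retries: int = 3
    retry_base_delay: float = 5.0
    retry_max_delay: float = 300.0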

🧪 Writing Custom Jobs

Create a new job by extending BaseJob:

# jobs/my_job.py
from jobs.base import BaseJob

class MyJob(BaseJob):
    async def execute(self, job_id, payload, checkpoint_data=None):
        # Your job logic here
        items = payload.get("items", [])

        for item in items:
            await process(item)  # process() stands in for your own async logic

        return {"success": True, "processed": len(items)}

Submit it via the API:

curl -X POST http://localhost:8000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-custom-job",
    "tenant_id": "my-team",
    "job_type": "jobs.my_job:MyJob",
    "payload": {"items": [1, 2, 3]},
    "priority": "HIGH",
    "max_retries": 5,
    "timeout_seconds": 120
  }'

The job_type field uses Python's module:ClassName format — the executor dynamically imports and instantiates your job class.
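
Resolving that spec takes only a couple of lines; a sketch of what the executor plausibly does:

import importlib

def load_job_class(job_type: str):
    # "jobs.my_job:MyJob" -> import jobs.my_job, return its MyJob attribute
    module_name, _, class_name = job_type.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

job_cls = load_job_class("jobs.my_job:MyJob")
job = job_cls()  # the executor then awaits job.execute(job_id, payload, ...)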


🔑 Design Principles

  1. PostgreSQL is the source of truth. Redis is a performance optimization. If Redis dies, rebuild from PG. If PG has it, it happened.

  2. State transitions are atomic. Every state change is committed to the database before any side effect. Crash at any point β†’ consistent recovery.

  3. At-least-once execution. A job will be executed to completion, possibly more than once if a worker crashes. Design your jobs to be idempotent.

  4. No external orchestration dependencies. No ZooKeeper, no etcd, no Kubernetes CRDs. Just PostgreSQL + Redis β€” tools every team already runs.

  5. Fail loudly, recover quietly. Every failure is logged, tracked, and visible. Recovery happens automatically in the background.


🌍 Real-World Use Cases

Epoch is designed around patterns seen in production job scheduling systems. Here's how it maps to real scenarios:

🧾 Invoice & Payment Processing

A fintech platform processes thousands of invoices daily. Each invoice is submitted as a job with NORMAL priority, while failed payment retries are resubmitted at HIGH. Exponential backoff prevents hammering payment gateways, and the dead letter queue catches permanently declined transactions for manual review.

Epoch features used: Priority queue, automatic retries with backoff, dead letter queue, tenant isolation (per-merchant)

🤖 ML Model Training Pipelines

A data team queues model training jobs that run for hours. Checkpointing saves training progress every few minutes, so if a worker crashes mid-epoch (the ML kind), training resumes from the last checkpoint instead of restarting from scratch. Critical production model retrains use CRITICAL priority and preempt lower-priority experimental runs.

Epoch features used: Checkpointing, job preemption, long-running job support, priority scheduling

📧 Bulk Notification Delivery

An e-commerce platform sends millions of order confirmation emails, SMS alerts, and push notifications. Each tenant (seller) has isolated worker quotas to prevent a single high-volume seller from starving others. The audit log tracks every delivery attempt for compliance.

Epoch features used: Tenant isolation, fair-share scheduling, immutable audit log, high-throughput processing

📊 ETL & Data Pipeline Orchestration

A data engineering team runs nightly ETL pipelines β€” extract from APIs, transform with Python, load into a data warehouse. Each stage is a job with dependencies. Failed stages retry automatically, and the dashboard shows which pipelines are stuck, running, or completed.

Epoch features used: Automatic retries, real-time dashboard, job state tracking, timeout detection

πŸ–ΌοΈ Media Processing at Scale

A content platform transcodes uploaded videos into multiple resolutions. Video transcoding jobs are CPU-heavy and long-running. Checkpointing tracks progress per resolution, and priority aging ensures older uploads don't starve behind a flood of new ones.

Epoch features used: Checkpointing, priority aging (anti-starvation), worker slot management, timeout handling

🏦 Regulatory Report Generation

A bank generates end-of-day regulatory reports across multiple subsidiaries (tenants). Each subsidiary has dedicated worker capacity. Reports must complete within a deadline β€” timed-out jobs are flagged immediately. The full audit trail satisfies compliance requirements.

Epoch features used: Tenant isolation, timeout detection, audit log, scheduled execution, dead letter alerting

🛒 Order Fulfillment Workflows

An e-commerce backend processes orders through stages: payment validation β†’ inventory reservation β†’ shipping label generation β†’ carrier dispatch. Each stage is a separate job. Failures at any stage trigger retries with backoff, and the event timeline shows exactly where an order got stuck.

Epoch features used: Retry with backoff, event audit timeline, job state machine, worker heartbeats

🔬 Scientific Computing & Simulations

A research lab runs Monte Carlo simulations across a worker pool. Each simulation variant is a job. The multi-level priority queue ensures funded research projects run before exploratory ones. Leader-elected scheduling guarantees exactly-once assignment even when scheduler nodes restart.

Epoch features used: Leader election, priority queue, distributed worker pool, fault tolerance


Built with ☕ and a deep appreciation for distributed systems that don't lose your data.
