A production-grade distributed job scheduler built from scratch.
Priority queues · Leader election · Fault tolerance · Tenant isolation · Real-time dashboard
Quickstart · Architecture · How It Works · Fault Tolerance · Walkthroughs
Epoch is a distributed job scheduling system that handles job submission, prioritized scheduling, execution across a pool of workers, automatic retries with exponential backoff, checkpointing for long-running jobs, tenant-aware fair-share scheduling, and leader-elected fault tolerance, all built from the ground up with no external orchestration frameworks.
It's designed to answer the question: "What does it actually take to build a reliable job scheduler that handles failures gracefully?"
| Feature | Description |
|---|---|
| Multi-Level Priority Queue | CRITICAL > HIGH > NORMAL with automatic priority aging to prevent starvation |
| Leader-Elected Scheduler | PostgreSQL advisory-lock-based leader election with automatic failover |
| Automatic Retries | Exponential backoff with jitter, configurable per job; dead letter queue for permanent failures |
| Job Preemption | Higher-priority jobs can preempt lower-priority running jobs |
| Checkpointing | Long-running jobs save progress periodically; crash recovery resumes from the last checkpoint |
| Tenant Isolation | Per-tenant worker quotas and fair-share scheduling prevent noisy neighbors |
| Real-Time Dashboard | Live monitoring with job states, worker utilization, tenant activity, and full event history |
| Immutable Audit Log | Every state transition is recorded, giving full lifecycle traceability for any job |
Real-time metrics, job distribution charts, worker utilization, and recent activity feed.
dashboard_walkthrough.mov
Submitting jobs, watching state transitions, retries, failures, dead letter flow, and the full event timeline.
all_jobs_walkthrough.mov
Worker pool management, heartbeats, slot utilization, and drain mode.
workers.mov
Multi-tenant configuration, per-tenant quotas, and priority boost.
tenants_demo.mov
Immutable event history across all jobs: filter by event type, tenant, job name, and date range for full operational visibility.
audit.logs.mov
Periodic state snapshots for long-running jobs: crash mid-execution and resume from where you left off.
checkpoints.mov
- Docker & Docker Compose v2+
```bash
git clone https://github.com/your-username/epoch.git
cd epoch
docker compose up --build
```

That's it. This spins up the entire stack:
| Service | Port | Description |
|---|---|---|
| Frontend | localhost:3000 | Next.js dashboard |
| API | localhost:8000 | FastAPI REST API |
| Scheduler | — | Leader-elected scheduler |
| Scheduler Standby | — | Hot standby (takes over if leader dies) |
| Worker 1 | β | 4 execution slots |
| Worker 2 | β | 4 execution slots |
| PostgreSQL | 5432 | Source of truth |
| Redis | 6379 | Priority queues & pub/sub |
```bash
curl -X POST http://localhost:8000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-first-job",
    "tenant_id": "default",
    "job_type": "jobs.data_processing:DataProcessingJob",
    "payload": {"input_size": 5},
    "priority": "NORMAL",
    "max_retries": 3,
    "timeout_seconds": 60
  }'
```

Then open localhost:3000 to watch it flow through the system.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/jobs` | Submit a new job |
| GET | `/api/v1/jobs` | List all jobs (filterable by state, tenant) |
| GET | `/api/v1/jobs/{id}` | Get job details |
| GET | `/api/v1/jobs/{id}/events` | Full event audit log |
| POST | `/api/v1/jobs/{id}/retry` | Manually retry a failed/dead-lettered job |
| POST | `/api/v1/jobs/{id}/cancel` | Cancel a job |
| GET | `/api/v1/workers` | List all workers and their status |
| POST | `/api/v1/admin/workers/{id}/drain` | Drain a worker (stop accepting new jobs) |
| GET | `/api/v1/admin/scheduler/leader` | Current scheduler leader info |
| GET | `/api/v1/dashboard/stats` | Dashboard metrics |
```
┌─────────────────┐
│     Client      │
│  (Dashboard /   │
│   REST API)     │
└────────┬────────┘
         │ HTTP
         ▼
┌─────────────────┐
│   API Server    │  ← Stateless, horizontally scalable
│   (FastAPI)     │
└────────┬────────┘
         │ Writes job to DB + pushes to Redis queue
         ▼
┌──────────────┐          ┌──────────────┐
│  PostgreSQL  │◄────────►│    Redis     │
│              │          │              │
│ • Job state  │          │ • Priority   │
│ • Workers    │          │   queue      │
│ • Leader     │          │ • Pub/Sub    │
│   election   │          │ • Job locks  │
│ • Checkpoints│          │              │
│ • Audit log  │          │              │
└──────────────┘          └──────┬───────┘
                                 │
┌────────────────────────────────┴───────┐
│       Scheduler (Leader-Elected)       │
│                                        │
│  ┌─────────────┐    ┌───────────────┐  │
│  │  Scheduler  │    │   Scheduler   │  │
│  │  (Active)   │    │   (Standby)   │  │
│  └──────┬──────┘    └───────────────┘  │
│         │ Assigns jobs to workers      │
└─────────┼──────────────────────────────┘
          ▼
┌────────────────────────────────────────────┐
│                Worker Pool                 │
│                                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Worker 1 │  │ Worker 2 │  │ Worker N │  │
│  │ (4 slots)│  │ (4 slots)│  │ (4 slots)│  │
│  └──────────┘  └──────────┘  └──────────┘  │
└────────────────────────────────────────────┘
```
| Layer | Technology |
|---|---|
| API | FastAPI + Uvicorn |
| Database | PostgreSQL 15 (advisory locks, JSONB) |
| Queue & Coordination | Redis 7 (sorted sets, pub/sub) |
| ORM | SQLAlchemy 2.0 (fully async) |
| Migrations | Alembic |
| Frontend | Next.js 16 + React 19 + Recharts + Tailwind CSS |
| Containerization | Docker + Docker Compose |
| Checkpoints | Local filesystem (S3-compatible interface) |
Every job follows a deterministic state machine. Every transition is persisted to PostgreSQL before any action is taken; this is the cornerstone of crash safety.
```
SUBMITTED ──► QUEUED ──► SCHEDULED ──► RUNNING ──► COMPLETED
                │            │            │
                │            │            ├──► CHECKPOINTED ──► RUNNING (resume)
                │            │            │
                │            │            ├──► FAILED ──► QUEUED (retry with backoff)
                │            │            │        └──► DEAD_LETTER (max retries exhausted)
                │            │            │
                │            │            └──► TIMED_OUT ──► QUEUED (retry)
                │            │                     └──► DEAD_LETTER
                │            │
                │            └──► PREEMPTED ──► QUEUED (re-queued, same priority)
                │
                └──► CANCELLED
```
State definitions:
| State | What it means |
|---|---|
| `SUBMITTED` | Job received by the API |
| `QUEUED` | Sitting in the priority queue, waiting to be scheduled |
| `SCHEDULED` | Assigned to a specific worker, about to execute |
| `RUNNING` | Currently executing on a worker |
| `COMPLETED` | Finished successfully |
| `FAILED` | Execution failed (will be retried or dead-lettered) |
| `TIMED_OUT` | Exceeded its timeout; treated like a failure |
| `CHECKPOINTED` | Progress saved mid-execution (long-running jobs) |
| `PREEMPTED` | Interrupted by a higher-priority job |
| `CANCELLED` | Manually cancelled by the user |
| `DEAD_LETTER` | Exhausted all retries; parked for manual inspection |
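The transitions in the diagram can be captured as an adjacency map. A minimal sketch (the `ALLOWED` map and `can_transition` helper are illustrative, not Epoch's actual `constants.py`; some edges, such as which states allow cancellation, are assumptions):

```python
# Hypothetical transition map derived from the state diagram above.
ALLOWED = {
    "SUBMITTED": {"QUEUED", "CANCELLED"},
    "QUEUED": {"SCHEDULED", "CANCELLED"},
    "SCHEDULED": {"RUNNING", "PREEMPTED", "CANCELLED"},
    "RUNNING": {"COMPLETED", "FAILED", "TIMED_OUT", "CHECKPOINTED", "PREEMPTED"},
    "CHECKPOINTED": {"RUNNING"},
    "FAILED": {"QUEUED", "DEAD_LETTER"},
    "TIMED_OUT": {"QUEUED", "DEAD_LETTER"},
    "PREEMPTED": {"QUEUED"},
    "DEAD_LETTER": {"QUEUED"},  # manual retry re-queues the job
    "COMPLETED": set(),         # terminal
    "CANCELLED": set(),         # terminal
}

def can_transition(src: str, dst: str) -> bool:
    """Return True if the state machine permits src -> dst."""
    return dst in ALLOWED.get(src, set())
```

Validating every transition against a single table like this is what makes "every transition is persisted before any action" enforceable: an illegal write can be rejected before it reaches the database.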
The scheduler runs a continuous loop (every ~1 second):
```
┌──────────────────────────────────────────────────────────────┐
│                       SCHEDULER LOOP                         │
│                                                              │
│  1. Am I the leader?  (check advisory lock)                  │
│     ├─ No  → sleep, try again                                │
│     └─ Yes ↓                                                 │
│                                                              │
│  2. Dequeue highest-priority job from Redis sorted set       │
│                                                              │
│  3. Check tenant quota:                                      │
│     └─ Tenant at max concurrent jobs? → defer, try next job  │
│                                                              │
│  4. Find an available worker:                                │
│     ├─ Worker with free slots? → assign job                  │
│     ├─ No free workers? → try preemption (if CRITICAL)       │
│     └─ Still no slot? → put job back in queue                │
│                                                              │
│  5. Assign: Job → SCHEDULED, notify worker via pub/sub       │
│                                                              │
│  6. Repeat                                                   │
└──────────────────────────────────────────────────────────────┘
```
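Steps 2–5 of the loop can be sketched as one in-memory scheduling pass. This is a simplified simulation (a heap stands in for the Redis sorted set, and preemption is omitted); all names here are illustrative, not Epoch's actual `scheduler.py`:

```python
import heapq

def schedule_once(queue, workers, tenant_running, tenant_limits):
    """One pass: dequeue -> tenant quota check -> slot search -> assign.

    queue holds (score, job_id, tenant_id) tuples; the lowest score wins.
    workers maps worker_id -> free slots. Returns (job_id, worker_id) or None.
    """
    deferred, assignment = [], None
    while queue and assignment is None:
        score, job_id, tenant = heapq.heappop(queue)
        # Step 3: tenant at max concurrent jobs? Defer and try the next job.
        if tenant_running.get(tenant, 0) >= tenant_limits.get(tenant, float("inf")):
            deferred.append((score, job_id, tenant))
            continue
        # Step 4: find a worker with a free slot.
        free = next((w for w, slots in workers.items() if slots > 0), None)
        if free is None:
            deferred.append((score, job_id, tenant))  # no slot: put it back
            break
        # Step 5: assign (in the real system: Job -> SCHEDULED, pub/sub notify).
        workers[free] -= 1
        tenant_running[tenant] = tenant_running.get(tenant, 0) + 1
        assignment = (job_id, free)
    for item in deferred:  # return skipped jobs to the queue
        heapq.heappush(queue, item)
    return assignment
```

Deferred jobs go back into the heap unchanged, so a quota-blocked tenant's jobs keep their position once capacity frees up.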
Jobs are stored in a Redis sorted set (ZPOPMIN: the lowest score is dequeued first). The score formula:

```
score = -(priority_weight × 1000) + enqueue_timestamp
```
| Priority | Weight | Effective Score Range |
|---|---|---|
| CRITICAL | 3 | ≈ -3000 + timestamp (always first) |
| HIGH | 2 | ≈ -2000 + timestamp |
| NORMAL | 1 | ≈ -1000 + timestamp |
Anti-starvation aging: NORMAL jobs gain +0.1 effective weight per minute in the queue. After ~10 minutes, a NORMAL job's effective priority equals HIGH. After ~20 minutes, it matches CRITICAL. This prevents low-priority jobs from waiting forever, a classic technique borrowed from Multi-Level Feedback Queue scheduling in operating systems.
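The score formula plus aging fits in a few lines. A minimal sketch (illustrative, not Epoch's actual `priority_queue.py`; here aging is applied uniformly to all priorities, with NORMAL gaining the +0.1/minute described above):

```python
import time

WEIGHTS = {"CRITICAL": 3, "HIGH": 2, "NORMAL": 1}
AGING_PER_MINUTE = 0.1  # extra effective weight per minute spent queued

def score(priority, enqueue_ts, now=None):
    """Sorted-set score: lower is dequeued first (ZPOPMIN semantics)."""
    now = time.time() if now is None else now
    minutes_queued = max(0.0, (now - enqueue_ts) / 60.0)
    weight = WEIGHTS[priority] + AGING_PER_MINUTE * minutes_queued
    return -(weight * 1000) + enqueue_ts
```

After 10 minutes a NORMAL job's weight reaches 2, so its score drops into the HIGH band; after 20 minutes it reaches the CRITICAL band.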
When a CRITICAL job arrives and all workers are full:
1. Find the lowest-priority RUNNING job
2. CRITICAL can preempt HIGH or NORMAL
3. HIGH can preempt NORMAL
4. Preempted job → PREEMPTED → re-queued with same priority
5. CRITICAL job gets the freed worker slot
The preempted job doesn't lose its work: if it supports checkpointing, it resumes from the last checkpoint when rescheduled.
This is where things get interesting. Distributed systems fail in creative ways. Here's how Epoch handles each failure mode:
Problem: A worker dies mid-execution. The job is stuck in RUNNING with no one executing it.
Detection: Workers send heartbeats every 5 seconds. The scheduler monitors these. If a worker misses heartbeats for 30 seconds, it's declared dead.
Recovery:
1. Scheduler detects stale heartbeat (> 30s old)
2. Worker marked as OFFLINE
3. All jobs assigned to that worker are re-examined
4. RUNNING jobs → TIMED_OUT → queued for retry (with backoff)
5. If the job has a checkpoint, the retry resumes from there
No job is lost. PostgreSQL is the source of truth: a job in RUNNING state with no live worker will always be detected and recovered.
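The staleness check in step 1 is a timestamp comparison. A minimal sketch (function and parameter names are illustrative; the 30-second default mirrors `EPOCH_WORKER_HEARTBEAT_TIMEOUT`):

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds, per EPOCH_WORKER_HEARTBEAT_TIMEOUT

def dead_workers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return worker ids whose last heartbeat is older than the timeout.

    last_heartbeat: {worker_id: unix_timestamp_of_last_heartbeat}
    """
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout)
```

With 5-second heartbeats and a 30-second timeout, a worker must miss roughly six consecutive heartbeats before being declared dead, which tolerates brief network blips.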
Problem: The scheduler (the brain of the system) dies.
Detection: PostgreSQL advisory locks. The leader holds a session-level advisory lock. If its connection drops (process crash, network partition), PostgreSQL automatically releases the lock.
Recovery:
1. Standby scheduler continuously tries `pg_try_advisory_lock()`
2. Leader dies → lock released → standby acquires it
3. New leader reads all state from PostgreSQL
4. Rebuilds in-memory queues from the DB (QUEUED jobs → Redis)
5. Resumes scheduling; no jobs lost, failover in under 15 seconds
This is why we use PostgreSQL advisory locks instead of ZooKeeper or etcd: one fewer infrastructure dependency, and it's equally reliable for single-region deployments.
Problem: A job throws an exception during execution.
Handling: Exponential backoff with jitter.
```
delay = min(base_delay × 2^attempt + random_jitter, max_delay)
```
Example with base_delay=5s, max_delay=300s:
```
Attempt 1: ~5s wait
Attempt 2: ~10s wait
Attempt 3: ~20s wait
Attempt 4: ~40s wait
...
Attempt N: max 300s wait
```
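The schedule above can be computed with a small helper. A sketch under the stated defaults (illustrative; the exponent convention, `attempt - 1` so that attempt 1 waits ~`base`, is an assumption inferred from the example values):

```python
import random

def backoff_delay(attempt, base=5.0, max_delay=300.0, rng=None):
    """delay = min(base * 2^(attempt-1) + jitter, max_delay), attempt >= 1.

    Jitter is uniform in [0, base), spreading simultaneous retries apart.
    Pass a seeded random.Random as rng for reproducible tests.
    """
    rng = rng or random.Random()
    jitter = rng.uniform(0, base)
    return min(base * 2 ** (attempt - 1) + jitter, max_delay)
```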
After `max_retries` (configurable per job, default: 3), the job moves to the Dead Letter Queue: a holding area for permanently failed jobs that can be inspected and manually retried.
Problem: A job hangs: it doesn't fail, doesn't complete, just sits there.
Detection: The scheduler checks for jobs in RUNNING state that have exceeded their timeout_seconds.
Recovery: Same as a failure: TIMED_OUT → retry with backoff → dead letter after max retries.
Problem: Redis (the coordination layer) goes down.
Impact: The priority queue and pub/sub notifications are unavailable. Jobs can't be enqueued or dequeued.
Recovery: Redis is volatile; it's a performance optimization, not the source of truth. When Redis recovers:
1. Scheduler detects Redis reconnection
2. Rebuilds the priority queue from PostgreSQL
3. All QUEUED jobs in the DB → re-pushed to the Redis sorted set
4. Scheduling resumes normally
No data is lost because PostgreSQL always has the canonical state.
```
Job executes ──► FAILS
                   │
                   ▼
        attempt < max_retries?
                   │
        ┌── YES ──► compute backoff delay
        │           Job state: FAILED → QUEUED
        │           Re-enqueue in Redis with delay
        │           Will be rescheduled after delay
        │
        └── NO ───► Job state: DEAD_LETTER
                    Moved to dead_letter_jobs table
                    Error details preserved
                    Available for manual retry via API or dashboard
```
```
delay = min(base_delay × 2^attempt + random.uniform(0, base_delay), max_delay)
```

The jitter prevents the thundering herd problem: if 100 jobs fail at the same time, they don't all retry at exactly the same moment.
Jobs that exhaust all retries are moved to the dead letter queue with:
- The final error message
- Total number of attempts
- Full event history (every state transition)
From the dashboard or API, you can:
- Inspect the failure reason
- Manually retry β resets the attempt counter and re-queues the job
For long-running jobs (hours, days), losing all progress on a crash is unacceptable.
```
Job starts ──► executes work ──► periodically saves checkpoint
                  │
                  │  Every N seconds (configurable, default: 30s):
                  │  ┌──────────────────────────────────────┐
                  │  │ 1. Job serializes its current state  │
                  │  │ 2. Blob written to checkpoint store  │
                  │  │ 3. Metadata saved to PostgreSQL      │
                  │  │    (path, sequence #, timestamp)     │
                  │  └──────────────────────────────────────┘
                  ▼
           Worker crashes!
                  │
                  ▼
Scheduler detects dead worker ──► re-queues the job
                  │
                  ▼
New worker picks up the job:
  1. Loads latest checkpoint from store
  2. Deserializes saved state
  3. Resumes execution from where it left off
```
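The store side of this flow can be sketched as a tiny local-filesystem backend. This is illustrative, not Epoch's actual `checkpoint/store.py` (which is S3-ready); the class name and file layout are assumptions:

```python
import json
import tempfile
from pathlib import Path

class FileCheckpointStore:
    """Minimal checkpoint store: one JSON blob per (job, sequence #)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, job_id, seq, state):
        """Write the serialized state; the sequence # orders checkpoints."""
        path = self.root / f"{job_id}.{seq:06d}.json"
        path.write_text(json.dumps(state))

    def load_latest(self, job_id):
        """Return the most recent checkpoint's state, or None."""
        files = sorted(self.root.glob(f"{job_id}.*.json"))
        if not files:
            return None
        return json.loads(files[-1].read_text())
```

Zero-padding the sequence number makes lexicographic sort equal numeric sort, so `load_latest` is just "last file in sorted order".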
Jobs implement the `BaseJob` interface, which provides `save_checkpoint()` and `load_checkpoint()` hooks:
```python
class MyLongRunningJob(BaseJob):
    async def execute(self, job_id, payload, checkpoint_data=None):
        start = 0
        if checkpoint_data:
            start = checkpoint_data["progress"]  # Resume from checkpoint
        for i in range(start, 1000):
            # Do work...
            if i % 100 == 0:
                await self.save_checkpoint(job_id, {"progress": i})
        return {"success": True, "result": "done"}
```

Epoch is designed for multi-tenant environments where different teams or customers share the same scheduler infrastructure.
Each tenant has configurable limits:
```json
{
  "tenant_id": "team-ml",
  "max_concurrent_jobs": 5,
  "max_workers": 3,
  "priority_boost": 0
}
```

The scheduler enforces these during assignment:
```
1. Dequeue next job from priority queue
2. Check: "Is tenant X at their max_concurrent_jobs limit?"
   ├─ Yes → skip this job, try the next one
   └─ No  → proceed to assign
3. This ensures no single tenant can monopolize all workers
```
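The admission check in step 2 reduces to a single comparison. A sketch (names and the default limit are illustrative):

```python
def admit(tenant_id, running_counts, limits, default_limit=5):
    """True if the tenant may start another job under max_concurrent_jobs.

    running_counts: {tenant_id: currently running jobs}
    limits: {tenant_id: max_concurrent_jobs from the tenant config}
    """
    limit = limits.get(tenant_id, default_limit)
    return running_counts.get(tenant_id, 0) < limit
```

Because the scheduler skips (rather than drops) a quota-blocked job, the job simply waits until one of that tenant's running jobs finishes.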
Without tenant isolation, a single tenant submitting 10,000 jobs would starve everyone else. Epoch's fair-share model ensures:
- Each tenant gets their fair share of resources
- One tenant's failures don't cascade to others
- Priority boost lets you give premium tenants an edge
Only one scheduler should be assigning jobs at any time. Multiple schedulers assigning the same job would cause duplicate execution.
```
       ┌─────────────────────────────────┐
       │           PostgreSQL            │
       │                                 │
       │    Advisory Lock #123456789     │
       │    ┌───────────────────────┐    │
       │    │  Held by: Scheduler A │    │
       │    │  Since: 2 minutes ago │    │
       │    └───────────────────────┘    │
       │                                 │
       └───────────────┬─────────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Scheduler A │ │ Scheduler B │ │ Scheduler C │
│  (Leader)   │ │  (Standby)  │ │  (Standby)  │
│             │ │             │ │             │
│ Acquired ✓  │ │ Try → fail  │ │ Try → fail  │
│ Scheduling  │ │ Try → fail  │ │ Try → fail  │
└─────────────┘ └─────────────┘ └─────────────┘
```
- Each scheduler calls `pg_try_advisory_lock(123456789)`, a non-blocking lock attempt
- One scheduler wins → becomes leader, writes to the `scheduler_leader` table
- Others fail → keep retrying every loop iteration
- Leader renews its heartbeat every 5 seconds
- If the leader dies → PostgreSQL releases the lock → a standby takes over
No split-brain is possible: PostgreSQL guarantees at most one holder of an advisory lock at any time.
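The try-lock semantics can be illustrated with a toy in-process stand-in. This is a simulation only (the real system uses PostgreSQL's `pg_try_advisory_lock`, which is tied to the database session and released automatically when the connection drops):

```python
class AdvisoryLock:
    """In-process stand-in for pg_try_advisory_lock semantics:
    at most one holder per lock id; releasing frees it for the next
    contender. Not a real distributed lock."""

    def __init__(self):
        self._holders = {}  # lock_id -> session name

    def try_acquire(self, lock_id, session):
        """Non-blocking: True if acquired (or already held by session)."""
        if lock_id in self._holders:
            return self._holders[lock_id] == session
        self._holders[lock_id] = session
        return True

    def release(self, lock_id, session):
        """Only the current holder may release (models session teardown)."""
        if self._holders.get(lock_id) == session:
            del self._holders[lock_id]
            return True
        return False
```

The failover path is then: leader's session dies → lock released → the first standby whose `try_acquire` succeeds becomes the new leader.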
Every state transition for every job is recorded in an immutable `job_events` table:
```
Job: data-pipeline-42

CREATED    | Job submitted: data-pipeline-42
QUEUED     | Enqueued for scheduling
SCHEDULED  | Assigned to worker abc123 (host-1)
RUNNING    | Execution started (attempt 1/3)
FAILED     | TimeoutError: connection to database timed out
RETRIED    | Auto-retry: re-queued (attempt 1/3)
SCHEDULED  | Assigned to worker def456 (host-2)
RUNNING    | Execution started (attempt 2/3)
COMPLETED  | Job completed successfully
```
Each event captures:
- Timestamp (microsecond precision)
- Event type (CREATED, QUEUED, SCHEDULED, RUNNING, FAILED, RETRIED, DEAD_LETTER, etc.)
- Attempt number
- Worker ID (which worker executed it)
- Detail text (error messages, assignment info)
This gives you complete lifecycle traceability: you can reconstruct exactly what happened to any job.
```
epoch/
├── src/
│   ├── api/                    # FastAPI application
│   │   ├── server.py           # App factory, middleware, lifespan
│   │   ├── routes/
│   │   │   ├── jobs.py         # Job CRUD + event history endpoint
│   │   │   ├── workers.py      # Worker listing
│   │   │   ├── admin.py        # Admin ops (drain, leader info, tenants)
│   │   │   └── dashboard.py    # Dashboard stats aggregation
│   │   └── schemas.py          # Pydantic request/response models
│   │
│   ├── scheduler/              # Core scheduling engine
│   │   ├── scheduler.py        # Main loop: dequeue → assign → commit
│   │   ├── leader.py           # PostgreSQL advisory lock leader election
│   │   ├── priority_queue.py   # Score computation + aging formula
│   │   └── preemption.py       # Find preemptable jobs + preempt logic
│   │
│   ├── worker/                 # Job execution
│   │   ├── worker.py           # Worker lifecycle, pub/sub listener
│   │   ├── executor.py         # Subprocess isolation, timeout enforcement
│   │   └── heartbeat.py        # Periodic heartbeat sender
│   │
│   ├── checkpoint/             # Checkpoint management
│   │   ├── manager.py          # Save/load/cleanup checkpoints
│   │   └── store.py            # Storage backend (local FS, S3-ready)
│   │
│   ├── models/                 # SQLAlchemy ORM models
│   │   ├── job.py              # Job + DeadLetterJob models
│   │   ├── job_event.py        # Immutable audit log model
│   │   ├── worker.py           # Worker model
│   │   ├── checkpoint.py       # Checkpoint metadata model
│   │   ├── scheduler.py        # Scheduler leader model
│   │   └── tenant.py           # Tenant config model
│   │
│   ├── services/
│   │   └── event_logger.py     # record_event() helper for audit log
│   │
│   ├── queue/
│   │   └── redis_queue.py      # Redis sorted set + pub/sub operations
│   │
│   ├── db/
│   │   └── session.py          # Async session factory + init
│   │
│   ├── config.py               # Pydantic settings (env-driven)
│   └── constants.py            # State enums, transitions, Redis keys
│
├── jobs/                       # Sample job implementations
│   ├── base.py                 # BaseJob interface (checkpoint hooks)
│   ├── data_processing.py      # Simple data processing job
│   └── long_running.py         # Long-running job with checkpointing
│
├── frontend/                   # Next.js 16 dashboard
│   └── src/
│       ├── app/                # Pages (dashboard, jobs, workers, tenants)
│       ├── components/         # React components
│       └── lib/                # API client, types, constants
│
├── walkthroughs/               # Demo videos
│   ├── dashboard_walkthrough.mov
│   ├── jobs_walkthrough.mov
│   ├── workers.mov
│   └── tenants_demo.mov
│
├── docker-compose.yml          # Full stack: PG + Redis + API + Scheduler(×2) + Worker(×2) + Frontend
├── Dockerfile                  # Python 3.12 multi-service image
├── requirements.txt            # Python dependencies
└── Plan.md                     # Original system design document
```
All configuration is via environment variables with the `EPOCH_` prefix:

| Variable | Default | Description |
|---|---|---|
| `EPOCH_DATABASE_URL` | `postgresql+asyncpg://epoch:epoch@localhost:5432/epoch` | PostgreSQL connection (async) |
| `EPOCH_REDIS_URL` | `redis://localhost:6379/0` | Redis connection |
| `EPOCH_SCHEDULER_LOOP_INTERVAL` | `1.0` | Scheduler cycle interval (seconds) |
| `EPOCH_LEADER_HEARTBEAT_INTERVAL` | `5.0` | Leader heartbeat interval (seconds) |
| `EPOCH_LEADER_LOCK_TTL` | `15.0` | Leader lock timeout (seconds) |
| `EPOCH_WORKER_MAX_SLOTS` | `4` | Max concurrent jobs per worker |
| `EPOCH_WORKER_HEARTBEAT_INTERVAL` | `5.0` | Worker heartbeat interval (seconds) |
| `EPOCH_WORKER_HEARTBEAT_TIMEOUT` | `30.0` | Declare a worker dead after this many seconds |
| `EPOCH_DEFAULT_MAX_RETRIES` | `3` | Default retry limit |
| `EPOCH_RETRY_BASE_DELAY` | `5.0` | Base retry delay (seconds) |
| `EPOCH_RETRY_MAX_DELAY` | `300.0` | Max retry delay (5 minutes) |
| `EPOCH_CHECKPOINT_DIR` | `/tmp/epoch-checkpoints` | Checkpoint storage directory |
| `EPOCH_CHECKPOINT_INTERVAL` | `30.0` | Checkpoint save interval (seconds) |
Create a new job by extending BaseJob:
```python
# jobs/my_job.py
from jobs.base import BaseJob

class MyJob(BaseJob):
    async def execute(self, job_id, payload, checkpoint_data=None):
        # Your job logic here
        items = payload.get("items", [])
        for item in items:
            result = await process(item)
        return {"success": True, "processed": len(items)}
```

Submit it via the API:
```bash
curl -X POST http://localhost:8000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-custom-job",
    "tenant_id": "my-team",
    "job_type": "jobs.my_job:MyJob",
    "payload": {"items": [1, 2, 3]},
    "priority": "HIGH",
    "max_retries": 5,
    "timeout_seconds": 120
  }'
```

The `job_type` field uses Python's `module:ClassName` format: the executor dynamically imports and instantiates your job class.
- **PostgreSQL is the source of truth.** Redis is a performance optimization. If Redis dies, rebuild from PG. If PG has it, it happened.
- **State transitions are atomic.** Every state change is committed to the database before any side effect. Crash at any point → consistent recovery.
- **At-least-once execution.** A job will be executed to completion, possibly more than once if a worker crashes. Design your jobs to be idempotent.
- **No external orchestration dependencies.** No ZooKeeper, no etcd, no Kubernetes CRDs. Just PostgreSQL + Redis: tools every team already runs.
- **Fail loudly, recover quietly.** Every failure is logged, tracked, and visible. Recovery happens automatically in the background.
Epoch is designed around patterns seen in production job scheduling systems. Here's how it maps to real scenarios:
A fintech platform processes thousands of invoices daily. Each invoice is submitted as a job with NORMAL priority, while failed payment retries are resubmitted at HIGH. Exponential backoff prevents hammering payment gateways, and the dead letter queue catches permanently declined transactions for manual review.
Epoch features used: Priority queue, automatic retries with backoff, dead letter queue, tenant isolation (per-merchant)
A data team queues model training jobs that run for hours. Checkpointing saves training progress every few minutes, so if a worker crashes mid-epoch (the ML kind), training resumes from the last checkpoint instead of restarting from scratch. Critical production model retrains use CRITICAL priority and preempt lower-priority experimental runs.
Epoch features used: Checkpointing, job preemption, long-running job support, priority scheduling
An e-commerce platform sends millions of order confirmation emails, SMS alerts, and push notifications. Each tenant (seller) has isolated worker quotas to prevent a single high-volume seller from starving others. The audit log tracks every delivery attempt for compliance.
Epoch features used: Tenant isolation, fair-share scheduling, immutable audit log, high-throughput processing
A data engineering team runs nightly ETL pipelines: extract from APIs, transform with Python, load into a data warehouse. Each stage is a job with dependencies. Failed stages retry automatically, and the dashboard shows which pipelines are stuck, running, or completed.
Epoch features used: Automatic retries, real-time dashboard, job state tracking, timeout detection
A content platform transcodes uploaded videos into multiple resolutions. Video transcoding jobs are CPU-heavy and long-running. Checkpointing tracks progress per resolution, and priority aging ensures older uploads don't starve behind a flood of new ones.
Epoch features used: Checkpointing, priority aging (anti-starvation), worker slot management, timeout handling
A bank generates end-of-day regulatory reports across multiple subsidiaries (tenants). Each subsidiary has dedicated worker capacity. Reports must complete within a deadline; timed-out jobs are flagged immediately. The full audit trail satisfies compliance requirements.
Epoch features used: Tenant isolation, timeout detection, audit log, scheduled execution, dead letter alerting
An e-commerce backend processes orders through stages: payment validation → inventory reservation → shipping label generation → carrier dispatch. Each stage is a separate job. Failures at any stage trigger retries with backoff, and the event timeline shows exactly where an order got stuck.
Epoch features used: Retry with backoff, event audit timeline, job state machine, worker heartbeats
A research lab runs Monte Carlo simulations across a worker pool. Each simulation variant is a job. The multi-level priority queue ensures funded research projects run before exploratory ones. Leader-elected scheduling guarantees exactly-once assignment even when scheduler nodes restart.
Epoch features used: Leader election, priority queue, distributed worker pool, fault tolerance
Built with ❤️ and a deep appreciation for distributed systems that don't lose your data.