
Level 3 Architectural Context & Learning Guide


Purpose: This document explains the "why" behind Level 3 architectural decisions. It's educational material for understanding trade-offs and design patterns.

Note: This is reference material for human learning, not required for AI implementation. The actual step-by-step implementation plans are in LEVEL_3_MVP_PLAN.md and LEVEL_3_PRODUCTION_PLAN.md.


Problem 1: The "Fast Worker" Race Condition and Lost SSE Updates

The Bug Pattern

In a production queue-based system, a common but subtle bug arises from the following sequence:

Timeline (milliseconds):
  0ms:  Client POST /book → API creates job → API returns 202 + jobId
 10ms:  Worker picks up job → Processes in 10ms → Publishes "completed" event
 50ms:  Client receives 202 response → Starts establishing SSE connection
 60ms:  Client connects to GET /api/v1/bookings/status/:jobId
         → Subscribes to Redis channel for updates
  → BUG: The "completed" event was published at 10ms, before subscription at 60ms
         Client waits forever, never receives the result

Impact

  • Client hangs indefinitely waiting for status that already happened
  • User sees infinite loading spinner
  • System resources wasted on dead connections
  • Under load, this creates cascading timeout issues
  • Debugging is difficult: worker logs show success, but client never receives it

Why Traditional Approaches Fail

1. Redis Pub/Sub with Simple Subscription:

// ❌ BROKEN: Subscribes after the worker may have finished
app.get('/status/:jobId', (req, res) => {
  // Worker might have already published here!
  const subscriber = redis.duplicate(); // dedicated ioredis connection for subscribing
  subscriber.subscribe('booking-events');

  subscriber.on('message', (channel, raw) => {
    const message = JSON.parse(raw);
    if (message.jobId === req.params.jobId) {
      res.write(`event: ${message.type}\ndata: ${raw}\n\n`);
    }
  });

  // What if the worker published before we subscribed?
  // → Client waits forever
});

2. Message Persistence (Workaround):

  • Could persist all events to Redis/DB
  • Client connects → queries history → subscribes
  • But: Adds storage overhead, complexity, and latency
  • Not suitable for high-frequency events like booking status updates
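
For contrast, a rough sketch of what that workaround involves, reusing the redis, jobId, and res names from the surrounding examples (the key layout and TTL are illustrative, not part of the actual design):

// Worker side: persist every status event so a late subscriber can replay it
await redis.lpush(`events:${jobId}`, JSON.stringify({ type: 'completed', result }));
await redis.expire(`events:${jobId}`, 3600); // keep history for an hour

// API side: replay history first, then subscribe for anything newer
const history = await redis.lrange(`events:${jobId}`, 0, -1);
for (const raw of history.reverse()) {
  res.write(`data: ${raw}\n\n`);
}
const subscriber = redis.duplicate();
await subscriber.subscribe('booking-events');
// Every event now has to be written, expired, replayed, and de-duplicated —
// the storage overhead and complexity described above.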

Our Solution: Check State BEFORE Subscribe

// ✅ ROBUST: Check state first, then subscribe only if needed
app.get('/status/:jobId', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // 1. Check current job state IMMEDIATELY
  const job = await bookingQueue.getJob(req.params.jobId);

  if (!job) {
    res.write(`event: failed\ndata: ${JSON.stringify({ error: 'Job not found' })}\n\n`);
    return res.end();
  }

  if (job.returnvalue) {
    // Worker already finished! Send result immediately
    res.write(`event: confirmed\ndata: ${JSON.stringify(job.returnvalue)}\n\n`);
    return res.end();
  }

  if (job.failedReason) {
    // Already failed! Send failure immediately
    res.write(`event: failed\ndata: ${JSON.stringify(job.failedReason)}\n\n`);
    return res.end();
  }

  // 2. ONLY subscribe if the job is still waiting or active
  const queueEvents = new QueueEvents('booking');
  queueEvents.on('completed', ({ jobId, returnvalue }) => {
    if (jobId === req.params.jobId) {
      res.write(`event: confirmed\ndata: ${JSON.stringify(returnvalue)}\n\n`);
      res.end();
    }
  });

  // Send the current state (e.g. "waiting" or "active")
  res.write(`event: ${await job.getState()}\ndata: {}\n\n`);

  // Clean up the listener when the client disconnects
  req.on('close', () => queueEvents.close());
});

Benefits

  • ✅ Client always receives final status (even if late)
  • ✅ No hanging connections
  • ✅ No message persistence overhead
  • ✅ Works reliably from 10ms to 10s processing times
  • ✅ Handles network delays, slow clients, and retries

Problem 2: Why Raw Redis Pub/Sub Doesn't Scale (And QueueEvents Does)

The Horizontal Scaling Problem

When you have multiple API instances behind a load balancer:

Load Balancer
    ├─→ API Instance 1 (holds Client A's SSE connection)
    ├─→ API Instance 2 (holds Client B's SSE connection)
    └─→ API Instance 3 (holds Client C's SSE connection)

Worker completes job for Client A:
    → Publishes to Redis channel: "job:123 completed"
    → Which instance should forward to Client A?
    → How do instances know which connections they hold?

Raw Redis Pub/Sub Approach (Complex & Brittle)

// ❌ COMPLEX: Each instance must track its own connections
const activeConnections = new Map(); // jobId → res

// Instance 1
globalRedisSubscriber.on('message', (channel, message) => {
  // Instance 1, 2, and 3 all receive this message!
  const event = JSON.parse(message);

  // But maybe only Instance 2 has the connection for this jobId
  const clientConnection = activeConnections.get(event.jobId);
  if (clientConnection) {
    // Instance 2 forwards to client
    clientConnection.write(`event: ${event.type}...`);
  }
  // Instances 1 & 3 ignore (no connection found)
});

// Problem 1: Connection tracking across instances is manual
// Problem 2: Race condition on instance restart (lost connections)
// Problem 3: Memory leaks if connections not cleaned up
// Problem 4: Each instance processes messages for ALL jobs (inefficient)

Why This Doesn't Scale

1. Connection State Synchronization:

  • Each API instance must maintain a map: jobId → clientResponse
  • No built-in way to sync this state across instances
  • If Instance 1 crashes, all its SSE connections are lost
  • Clients must reconnect, but may hit a different instance → state lost

2. Message Filtering Overhead:

  • Every instance receives EVERY job completion event
  • Instance 3 processes events for jobs it has no connection for
  • Under 10K concurrent bookings → 10K events × 3 instances = 30K messages
  • Wastes CPU and network bandwidth

3. Connection Cleanup Complexity:

// Need to handle:
// - Client disconnects
// - Instance crashes
// - Network timeouts
// - Process restarts
// All must clean up connection state, or memory leaks accumulate

4. Deployment Challenges:

  • Rolling deployments: Old instances shutting down, new ones starting
  • How to migrate SSE connections without dropping updates?
  • Need custom connection migration logic

BullMQ QueueEvents Solution (Built for This)

// ✅ SIMPLE: QueueEvents handles scaling automatically
import { QueueEvents } from 'bullmq';

// Each API instance creates its own QueueEvents listener
const queueEvents = new QueueEvents('booking');

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  // This callback runs in EVERY instance

  // But we only have the connection object in ONE instance
  const clientConnection = activeConnections.get(jobId);

  if (clientConnection) {
    // Only the instance that holds the connection sends the response
    clientConnection.write(`event: confirmed\ndata: ${returnvalue}\n\n`);
  }
  // Other instances just ignore (no connection found)
});

How QueueEvents Solves Scaling

1. Built-in Pub/Sub:

  • Uses Redis Streams (not Pub/Sub) for reliable event delivery
  • Events persisted temporarily in Redis
  • Instances can reconnect and catch up on missed events
  • No manual channel management
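
A minimal sketch of wiring up such a listener with an explicit ioredis connection (the host/port values are placeholders for whatever the deployment uses):

import { QueueEvents } from 'bullmq';
import IORedis from 'ioredis';

// Dedicated connection for the event stream; BullMQ expects
// maxRetriesPerRequest: null on its blocking connections.
const connection = new IORedis({ host: 'redis', port: 6379, maxRetriesPerRequest: null });

const queueEvents = new QueueEvents('booking', { connection });

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  // forward to the local SSE connection for this jobId, if this instance holds it
});

queueEvents.on('failed', ({ jobId, failedReason }) => {
  // forward the failure the same way
});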

2. Efficient Event Routing:

  • Events are lightweight: { jobId, status, returnvalue }
  • No custom serialization needed
  • BullMQ manages the Redis keys and channels automatically

3. Connection State Remains Local:

// ✅ SIMPLE: Each instance only tracks its own connections
const activeConnections = new Map(); // Only this instance's connections

// No need to sync with other instances
// No need for distributed state management
// Each instance is independent

4. Horizontal Scaling Benefits:

# Scale API to 5 instances
docker compose up -d --scale api=5

# Each instance:
# - Receives all events via QueueEvents
# - Checks local connection map
# - Forwards only if connection exists
# - No coordination needed between instances

# Result: Linear scaling, no shared state

Real-World Production Scenario

Before (Raw Redis Pub/Sub):
  - 3 API instances
  - 10,000 concurrent booking requests
  - Each job completion is fanned out to all 3 subscribing instances
  - Total: 10,000 jobs × 3 instances = 30,000 message deliveries
  - Each instance parses and filters 10,000 messages, most for jobs it does not hold
  → ~66% of delivery work wasted, plus manual channel management and connection tracking

After (BullMQ QueueEvents):
  - 3 API instances
  - 10,000 concurrent booking requests
  - Each job completion = 1 lightweight, structured event on a Redis Stream
  - Each instance checks its local connection map; only the holder forwards to the client
  → No manual channels or serialization, reliable delivery, simple local state

Why QueueEvents is the Right Choice

  1. Purpose-Built: Designed specifically for BullMQ job lifecycle events
  2. Type-Safe: Events are typed payloads such as { jobId: string, returnvalue: string, ... }
  3. Reliable: Uses Redis Streams, not Pub/Sub (persistent vs ephemeral)
  4. Efficient: Minimal overhead, no manual serialization
  5. Battle-Tested: Used in production at scale by many companies
  6. Documented: Official BullMQ documentation and examples

When to Use Raw Redis Pub/Sub Instead

Never for this use case. QueueEvents is superior in every way for job status notifications.

The only time to use raw Redis Pub/Sub is for non-BullMQ events (e.g., custom notifications, chat messages) where you need custom channel management.


Technical Decisions & Rationale

Why BullMQ over Bull?

  • BullMQ: Actively maintained, TypeScript support, better performance
  • Bull: Legacy, fewer features, slower development
  • Decision: Use BullMQ for modern architecture

Why ioredis over node-redis?

  • BullMQ Requirement: BullMQ is built specifically on ioredis and requires it for connection handling
  • node-redis: Officially recommended by Redis for new projects, but NOT compatible with BullMQ
  • Decision: Use ioredis because BullMQ mandates it - no choice if we want BullMQ

Why Monorepo with Turborepo?

  • Separation: API and Worker as separate apps
  • Shared code: Database, types, utilities in packages
  • Independent scaling: Deploy API and Worker separately
  • Clear boundaries: Enforces clean architecture
  • Decision: Reorganized from monolith to monorepo (Milestone 0)

Why Server-Sent Events over WebSockets?

  • SSE: HTTP-based, simpler implementation, auto-reconnection
  • WebSockets: Bidirectional (not needed), more complex
  • Decision: SSE fits unidirectional server→client updates, scales better
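
For completeness, a minimal browser-side sketch of consuming those updates (the path follows the earlier endpoint examples; event names match the error-handling matrix below):

// Browser-side sketch: listen for the job's terminal status over SSE
const source = new EventSource(`/api/v1/bookings/status/${jobId}`);

source.addEventListener('confirmed', (e) => {
  console.log('Booking confirmed:', JSON.parse(e.data));
  source.close();
});

source.addEventListener('failed', (e) => {
  console.error('Booking failed:', e.data);
  source.close();
});

// EventSource reconnects automatically on network drops — one reason SSE
// is simpler than WebSockets for one-way updates.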

Why Optimistic Locking over Distributed Locks?

  • Optimistic: Higher throughput, scales better, simpler
  • Distributed Locks: Redlock complexity, lower performance
  • Decision: Optimistic locking sufficient for booking tickets
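
A minimal sketch of the compare-and-swap style update the worker might issue, assuming a node-postgres style client and an illustrative events table with a version column (not the project's actual schema):

// Optimistic locking: the UPDATE only succeeds if nobody else modified the row
// since we read it (same id AND same version).
async function reserveTicket(db, eventId, expectedVersion) {
  const result = await db.query(
    `UPDATE events
        SET available_tickets = available_tickets - 1,
            version = version + 1
      WHERE id = $1
        AND version = $2
        AND available_tickets > 0`,
    [eventId, expectedVersion]
  );

  // 0 rows updated → another worker won the race; caller retries with backoff
  return result.rowCount === 1;
}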

Why Shared Zod Schemas?

  • Runtime Validation: Enforces contract between API and Worker
  • Type Safety: Single source of truth for job data structure
  • Prevents Silent Failures: Invalid data caught at boundaries
  • Decision: Zod schemas in packages/types validate producer and consumer
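
A minimal sketch of what such a shared schema could look like in packages/types (field names are illustrative):

import { z } from 'zod';

// Shared contract: the API validates before queueing, the worker validates
// again before processing.
export const BookingJobSchema = z.object({
  eventId: z.string().uuid(),
  userId: z.string().uuid(),
  quantity: z.number().int().positive(),
});

export type BookingJob = z.infer<typeof BookingJobSchema>;

// Producer (API): BookingJobSchema.parse(req.body) before queue.add(...)
// Consumer (Worker): BookingJobSchema.parse(job.data) before touching the DB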

Retry Strategy

  • Max Attempts: 3 tries for version conflicts
  • Backoff: Exponential (100ms → 200ms → 400ms)
  • Rationale: Balance between success rate and latency
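
In BullMQ this maps directly onto the job options set when the API enqueues the job (the job name and payload variable are illustrative):

// 3 attempts total, exponential backoff starting at 100ms between retries
await bookingQueue.add('create-booking', bookingJob, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 100 },
  removeOnComplete: { age: 3600 }, // keep results long enough for late SSE clients to read them
});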

No Graceful Degradation (Hard Fail)

  • Decision: If Redis unavailable, return 503. No fallback to Level 2.
  • Rationale:
    • Level 2 synchronous transactions would overload database if Redis fails under load
    • Fallback logic adds complexity and new failure modes
    • Monitoring Redis health separately is more reliable
    • Forces infrastructure reliability for Redis
  • Client Handling: Return 503 → client retries with exponential backoff
  • Alternative: Use circuit breaker to return 503 immediately, not hang
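
A minimal sketch of that hard-fail check at the enqueue boundary (a dedicated circuit breaker library would normally wrap this; redisConnection is the shared ioredis instance):

app.post('/api/v1/bookings', async (req, res) => {
  // Hard fail: if Redis is not ready, refuse immediately —
  // no fallback to Level 2 synchronous transactions.
  if (redisConnection.status !== 'ready') {
    return res.status(503).json({ error: 'Service unavailable, retry with exponential backoff' });
  }

  // (Zod validation of req.body omitted for brevity)
  const job = await bookingQueue.add('create-booking', req.body);
  return res.status(202).json({ jobId: job.id });
});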

Error Handling Matrix (NO GRACEFUL DEGRADATION)

| Scenario | Worker Behavior | API Response | Client SSE Event |
|---|---|---|---|
| Valid booking | Success | 202 Accepted | confirmed |
| Event sold out | Fail (no retry) | 202 Accepted | failed (409) |
| Invalid eventId | Fail | 202 Accepted | failed (404) |
| Version conflict | Retry (max env.WORKER_MAX_RETRIES) | 202 Accepted | confirmed |
| Worker crash | Job re-queued | 202 Accepted | processing → confirmed |
| Rate limit exceeded | N/A | 429 Too Many Requests | N/A |
| Queue depth exceeded | N/A | 503 Queue Full | N/A |
| Redis down | Cannot queue job | 503 Service Unavailable (circuit open) | N/A |
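
A sketch of how the worker can express the "retry vs. no retry" distinction in this matrix — BullMQ's UnrecoverableError fails the job without consuming the remaining attempts (tryBooking and its return shape are illustrative; connection is an ioredis instance as in the earlier sketch):

import { Worker, UnrecoverableError } from 'bullmq';

const worker = new Worker('booking', async (job) => {
  const outcome = await tryBooking(job.data); // optimistic-locking attempt

  if (outcome.soldOut) {
    // Business failure: retrying will never succeed, fail immediately (no retry)
    throw new UnrecoverableError('Event sold out');
  }
  if (outcome.versionConflict) {
    // Transient failure: a normal error lets BullMQ retry with backoff
    throw new Error('Version conflict');
  }
  return outcome.confirmation; // becomes job.returnvalue → "confirmed" SSE event
}, { connection });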

IMPORTANT - No Graceful Degradation:

  • If Redis is down, API returns 503 immediately via circuit breaker
  • No fallback to Level 2 synchronous transactions
  • Reason: Prevents database overload during Redis failures
  • Client responsibility: Retry with exponential backoff
  • Monitor Redis health separately

Performance Expectations

Level 2 vs Level 3 Comparison

| Metric | Level 2 | Level 3 MVP | Level 3 Production |
|---|---|---|---|
| API Response | 800-1500ms | <100ms | <100ms |
| Throughput | 200-300 req/s | 2000-5000 req/s | 5000-10000 req/s |
| Timeout Rate | 1-2% | 0% | 0% |
| Scalability | Vertical only | Horizontal (workers) | Horizontal (API + workers) |
| Race Conditions | 0 (pessimistic) | 0 (optimistic) | 0 (optimistic) |
| Real-Time Updates | N/A | Polling | SSE |
| Rate Limiting | No | No | Yes (10 req/min) |
| Circuit Breaker | No | No | Yes |
| Monitoring | Logs only | Logs only | Dashboard + Logs |

Success Metrics & KPIs

| Metric | Level 2 | Level 3 Target | Validation Method |
|---|---|---|---|
| Throughput (req/s) | 200-300 | 5,000-10,000 | Load test (10K req) |
| API Response Time | 800-1500ms | <100ms | Response timing |
| Timeout Rate | 1-2% | 0% | Error counting |
| Race Conditions | 0 | 0 | Database verification |
| Worker Processing | N/A | 200-500ms avg | Job logs |
| Data Integrity | | | Booking count vs tickets |
| Scalability | Vertical | Horizontal | Multiple workers |
| Queue Depth | N/A | <50 avg | BullMQ dashboard |
| Job Failure Rate | N/A | <0.1% | BullMQ metrics |
| Version Conflicts | N/A | <5% | Retry count |
| SSE Reliability | N/A | 100% | State check tests |

Potential Risks & Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Redis becomes bottleneck | Medium | High | Add Redis clustering, increase workers |
| Worker crashes lose jobs | Low | Medium | BullMQ persistence, automatic retry |
| SSE connections overload | Low | Medium | Connection pooling, TTL on jobs |
| Version conflicts too high | Low | Medium | Tune retry strategy, measure conflict rate |
| Increased complexity | High | Medium | Comprehensive testing, documentation |
| Monorepo build issues | Low | Low | Use Turborepo, clear package boundaries |
| BullMQ dashboard security | Medium | High | Add auth middleware to /admin/queues |
| "Fast Worker" race condition | High | High | State check pattern (Milestone 7/8) |
| Type mismatches API→Worker | Medium | High | Zod schema validation (Milestone 3) |
| Redis failure causes downtime | Low | High | Hard fail (503) + monitoring + alerts |

Common Pitfalls to Avoid

During MVP Implementation (M0-M6)

  1. Skipping Milestone 0: Don't add Redis to monolithic structure. Restructure first.
  2. Wrong milestone order: Implement optimistic locking (M5) BEFORE API migration (M6)
  3. Starting workers before M5: Workers will fail to process jobs
  4. Migrating API before workers ready: Jobs created but not processed
  5. Skipping Zod validation: Type mismatches cause silent failures
  6. Not testing version conflicts: Optimistic locking bugs only appear under load
  7. Forgetting to remove Level 2 code: Confusion about which logic is active

During Production Hardening (M7-M10)

  1. Raw Redis Pub/Sub: Use BullMQ QueueEvents for horizontal scaling
  2. No SSE state check: "Fast Worker" race condition causes lost updates
  3. Exposing dashboard publicly: Major security risk
  4. No rate limiting: Queue overflow and abuse vectors
  5. Implementing graceful degradation: Hard fail is safer and simpler for Level 3
  6. Not testing Redis failures: Circuit breaker untested until production incident
  7. Polling instead of SSE: Wastes resources, poor user experience
  8. Hardcoded limits: Production tuning requires code changes
  9. No cleanup on disconnect: Memory leaks from orphaned SSE connections (see the cleanup sketch below)
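
A minimal sketch of that cleanup inside the SSE handler, assuming the completed handler was registered as a named function onCompleted and the connection map from the earlier examples:

// Register cleanup as soon as the SSE response is established
req.on('close', () => {
  activeConnections.delete(jobId);                       // drop the local jobId → res mapping
  queueEvents.removeListener('completed', onCompleted);  // or queueEvents.close() if per-request
  res.end();
});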

Comment Template for Code

Use this template when documenting Level 3 implementations:

/**
 * Level 3 Implementation: Async Queue-Based Processing with SSE
 *
 * Architecture: /apps/api, /apps/worker, /packages/*
 *
 * How it works:
 * 1. API receives POST /book, validates with Zod → returns 202 + jobId
 * 2. BullMQ stores job in Redis, dashboard tracks status
 * 3. Worker pulls job, validates with Zod, uses optimistic locking
 * 4. Version conflict → retry (max 3, exponential backoff)
 * 5. Success/failure published via QueueEvents → SSE notifies client
 * 6. Fast worker handled: state check before subscribe
 *
 * Trade-offs:
 * ✅ Pros: <100ms API, 10K+ req/s, horizontal scaling, zero timeouts
 * ⚠️ Cons: Eventual consistency, added complexity, Redis dependency
 *
 * Why monorepo: Clean separation, independent scaling, shared packages
 * Why Zod schemas: Runtime validation prevents API-Worker contract breaches
 * Why QueueEvents: Built-in horizontal scaling, replaces raw Redis Pub/Sub
 * Why no degradation: Hard fail safer than database overload fallback
 * Why BullMQ dashboard: Real-time queue monitoring essential at scale
 *
 * Service Boundary: Zod schema is the contract. Always validate.
 */

Interview Talking Points

When discussing this project in interviews, emphasize:

Technical Depth

  • Problem: High-concurrency ticket booking with race conditions
  • Solution Evolution: Pessimistic locking (L2) → Async queue + Optimistic locking (L3)
  • Trade-offs Understood: Latency vs throughput, consistency models, complexity vs scalability

Architectural Decisions

  • Monorepo: Separation of concerns, independent scaling
  • BullMQ QueueEvents: Horizontal scaling without custom Pub/Sub complexity
  • SSE over WebSockets: Simpler for unidirectional updates
  • Hard Fail over Graceful Degradation: Predictable failure modes

Edge Cases & Resilience

  • "Fast Worker" Race: State check before subscribe (shows attention to detail)
  • Circuit Breaker: Fail fast, protect dependencies
  • Rate Limiting: Prevent abuse and overload

Production Readiness

  • 10K Load Testing: Validated under extreme conditions
  • Zero Data Integrity Issues: No overbookings across all tests
  • Monitoring: Separate dashboard service with security considerations
  • Documentation: Comprehensive plans and architectural context

This document is for learning and reference. For implementation, see:

  • LEVEL_3_MVP_PLAN.md - Step-by-step MVP implementation
  • LEVEL_3_PRODUCTION_PLAN.md - Production hardening steps
  • LEVEL_3_COMPLETE_PLAN.md - Combined reference (full context)
