
Level 3 Architectural Context & Learning Guide


Purpose: This document explains the "why" behind Level 3 architectural decisions. It's educational material for understanding trade-offs and design patterns.

Note: This is reference material for human learning, not required for AI implementation. The actual step-by-step implementation plans are in LEVEL_3_MVP_PLAN.md and LEVEL_3_PRODUCTION_PLAN.md.


Problem 1: The "Fast Worker" Race Condition and Lost SSE Updates

The Bug Pattern

In a production queue-based system, a common but subtle bug arises from the following sequence:

Timeline (milliseconds):
  0ms:  Client POST /book → API creates job → API returns 202 + jobId
 10ms:  Worker picks up job → Processes in 10ms → Publishes "completed" event
 50ms:  Client receives 202 response → Starts establishing SSE connection
 60ms:  Client connects to GET /api/v1/bookings/status/:jobId
         → Subscribes to Redis channel for updates
  → BUG: The "completed" event was published at 10ms, before subscription at 60ms
         Client waits forever, never receives the result

Impact

  • Client hangs indefinitely waiting for status that already happened
  • User sees infinite loading spinner
  • System resources wasted on dead connections
  • Under load, this creates cascading timeout issues
  • Debugging is difficult: worker logs show success, but client never receives it

Why Traditional Approaches Fail

1. Redis Pub/Sub with Simple Subscription:

// ❌ BROKEN: Subscribes after the worker may have finished
app.get('/status/:jobId', (req, res) => {
  // Worker might have already published here!
  const subscriber = redis.duplicate(); // dedicated ioredis connection for subscribing
  subscriber.subscribe('booking-events');

  subscriber.on('message', (channel, raw) => {
    const message = JSON.parse(raw);
    if (message.jobId === req.params.jobId) {
      res.write(`event: ${message.type}\ndata: ${raw}\n\n`);
    }
  });

  // What if the worker published before we subscribed?
  // → Client waits forever
});

2. Message Persistence (Workaround):

  • Could persist all events to Redis/DB
  • Client connects → queries history → subscribes
  • But: Adds storage overhead, complexity, and latency
  • Not suitable for high-frequency events like booking status updates
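
For contrast, a rough sketch of what that workaround involves, reusing the redis, jobId, and res names from the surrounding examples (the key layout and TTL are illustrative, not part of the actual design):

// Worker side: persist every status event so a late subscriber can replay it
await redis.lpush(`events:${jobId}`, JSON.stringify({ type: 'completed', result }));
await redis.expire(`events:${jobId}`, 3600); // keep history for an hour

// API side: replay history first, then subscribe for anything newer
const history = await redis.lrange(`events:${jobId}`, 0, -1);
for (const raw of history.reverse()) {
  res.write(`data: ${raw}\n\n`);
}
const subscriber = redis.duplicate();
await subscriber.subscribe('booking-events');
// Every event now has to be written, expired, replayed, and de-duplicated —
// the storage overhead and complexity described above.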

Our Solution: Check State BEFORE Subscribe

// ✅ ROBUST: Check state first, then subscribe only if needed
app.get('/status/:jobId', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // 1. Check current job state IMMEDIATELY
  const job = await bookingQueue.getJob(req.params.jobId);

  if (!job) {
    res.write(`event: failed\ndata: ${JSON.stringify({ error: 'Job not found' })}\n\n`);
    return res.end();
  }

  if (job.returnvalue) {
    // Worker already finished! Send result immediately
    res.write(`event: confirmed\ndata: ${JSON.stringify(job.returnvalue)}\n\n`);
    return res.end();
  }

  if (job.failedReason) {
    // Already failed! Send failure immediately
    res.write(`event: failed\ndata: ${JSON.stringify(job.failedReason)}\n\n`);
    return res.end();
  }

  // 2. ONLY subscribe if the job is still waiting or active
  const queueEvents = new QueueEvents('booking');
  queueEvents.on('completed', ({ jobId, returnvalue }) => {
    if (jobId === req.params.jobId) {
      res.write(`event: confirmed\ndata: ${JSON.stringify(returnvalue)}\n\n`);
      res.end();
    }
  });

  // Send the current state (e.g. "waiting" or "active")
  res.write(`event: ${await job.getState()}\ndata: {}\n\n`);

  // Clean up the listener when the client disconnects
  req.on('close', () => queueEvents.close());
});

Benefits

  • ✅ Client always receives final status (even if late)
  • ✅ No hanging connections
  • ✅ No message persistence overhead
  • ✅ Works reliably from 10ms to 10s processing times
  • ✅ Handles network delays, slow clients, and retries

Problem 2: Why Raw Redis Pub/Sub Doesn't Scale (And QueueEvents Does)

The Horizontal Scaling Problem

When you have multiple API instances behind a load balancer:

Load Balancer
    ├─→ API Instance 1 (holds Client A's SSE connection)
    ├─→ API Instance 2 (holds Client B's SSE connection)
    └─→ API Instance 3 (holds Client C's SSE connection)

Worker completes job for Client A:
    → Publishes to Redis channel: "job:123 completed"
    → Which instance should forward to Client A?
    → How do instances know which connections they hold?

Raw Redis Pub/Sub Approach (Complex & Brittle)

// ❌ COMPLEX: Each instance must track its own connections
const activeConnections = new Map(); // jobId → res

// Instance 1
globalRedisSubscriber.on('message', (channel, message) => {
  // Instance 1, 2, and 3 all receive this message!
  const event = JSON.parse(message);

  // But maybe only Instance 2 has the connection for this jobId
  const clientConnection = activeConnections.get(event.jobId);
  if (clientConnection) {
    // Instance 2 forwards to client
    clientConnection.write(`event: ${event.type}...`);
  }
  // Instances 1 & 3 ignore (no connection found)
});

// Problem 1: Connection tracking across instances is manual
// Problem 2: Race condition on instance restart (lost connections)
// Problem 3: Memory leaks if connections not cleaned up
// Problem 4: Each instance processes messages for ALL jobs (inefficient)

Why This Doesn't Scale

1. Connection State Synchronization:

  • Each API instance must maintain a map: jobId → clientResponse
  • No built-in way to sync this state across instances
  • If Instance 1 crashes, all its SSE connections are lost
  • Clients must reconnect, but may hit a different instance → state lost

2. Message Filtering Overhead:

  • Every instance receives EVERY job completion event
  • Instance 3 processes events for jobs it has no connection for
  • Under 10K concurrent bookings → 10K events × 3 instances = 30K messages
  • Wastes CPU and network bandwidth

3. Connection Cleanup Complexity:

// Need to handle:
// - Client disconnects
// - Instance crashes
// - Network timeouts
// - Process restarts
// All must clean up connection state, or memory leaks accumulate

4. Deployment Challenges:

  • Rolling deployments: Old instances shutting down, new ones starting
  • How to migrate SSE connections without dropping updates?
  • Need custom connection migration logic

BullMQ QueueEvents Solution (Built for This)

// ✅ SIMPLE: QueueEvents handles scaling automatically
import { QueueEvents } from 'bullmq';

// Each API instance creates its own QueueEvents listener
const queueEvents = new QueueEvents('booking');

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  // This callback runs in EVERY instance

  // But we only have the connection object in ONE instance
  const clientConnection = activeConnections.get(jobId);

  if (clientConnection) {
    // Only the instance that holds the connection sends the response
    clientConnection.write(`event: confirmed\ndata: ${returnvalue}\n\n`);
  }
  // Other instances just ignore (no connection found)
});

How QueueEvents Solves Scaling

1. Built-in Pub/Sub:

  • Uses Redis Streams (not Pub/Sub) for reliable event delivery
  • Events persisted temporarily in Redis
  • Instances can reconnect and catch up on missed events
  • No manual channel management
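
A minimal sketch of wiring up such a listener with an explicit ioredis connection (the host/port values are placeholders for whatever the deployment uses):

import { QueueEvents } from 'bullmq';
import IORedis from 'ioredis';

// Dedicated connection for the event stream; BullMQ expects
// maxRetriesPerRequest: null on its blocking connections.
const connection = new IORedis({ host: 'redis', port: 6379, maxRetriesPerRequest: null });

const queueEvents = new QueueEvents('booking', { connection });

queueEvents.on('completed', ({ jobId, returnvalue }) => {
  // forward to the local SSE connection for this jobId, if this instance holds it
});

queueEvents.on('failed', ({ jobId, failedReason }) => {
  // forward the failure the same way
});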

2. Efficient Event Routing:

  • Events are lightweight: { jobId, status, returnvalue }
  • No custom serialization needed
  • BullMQ manages the Redis keys and channels automatically

3. Connection State Remains Local:

// ✅ SIMPLE: Each instance only tracks its own connections
const activeConnections = new Map(); // Only this instance's connections

// No need to sync with other instances
// No need for distributed state management
// Each instance is independent

4. Horizontal Scaling Benefits:

# Scale API to 5 instances
docker compose up -d --scale api=5

# Each instance:
# - Receives all events via QueueEvents
# - Checks local connection map
# - Forwards only if connection exists
# - No coordination needed between instances

# Result: Linear scaling, no shared state

Real-World Production Scenario

Before (Raw Redis Pub/Sub):
  - 3 API instances
  - 10,000 concurrent booking requests
  - Each job completion is fanned out to all 3 subscribing instances
  - Total: 10,000 jobs × 3 instances = 30,000 message deliveries
  - Each instance parses and filters 10,000 messages, most for jobs it does not hold
  → ~66% of delivery work wasted, plus manual channel management and connection tracking

After (BullMQ QueueEvents):
  - 3 API instances
  - 10,000 concurrent booking requests
  - Each job completion = 1 lightweight, structured event on a Redis Stream
  - Each instance checks its local connection map; only the holder forwards to the client
  → No manual channels or serialization, reliable delivery, simple local state

Why QueueEvents is the Right Choice

  1. Purpose-Built: Designed specifically for BullMQ job lifecycle events
  2. Type-Safe: Events are typed payloads such as { jobId: string, returnvalue: string, ... }
  3. Reliable: Uses Redis Streams, not Pub/Sub (persistent vs ephemeral)
  4. Efficient: Minimal overhead, no manual serialization
  5. Battle-Tested: Used in production at scale by many companies
  6. Documented: Official BullMQ documentation and examples

When to Use Raw Redis Pub/Sub Instead

Never for this use case. QueueEvents is superior in every way for job status notifications.

The only time to use raw Redis Pub/Sub is for non-BullMQ events (e.g., custom notifications, chat messages) where you need custom channel management.


Technical Decisions & Rationale

Why BullMQ over Bull?

  • BullMQ: Actively maintained, TypeScript support, better performance
  • Bull: Legacy, fewer features, slower development
  • Decision: Use BullMQ for modern architecture

Why ioredis over node-redis?

  • BullMQ Requirement: BullMQ is built specifically on ioredis and requires it for connection handling
  • node-redis: Officially recommended by Redis for new projects, but NOT compatible with BullMQ
  • Decision: Use ioredis because BullMQ mandates it - no choice if we want BullMQ

Why Monorepo with Turborepo?

  • Separation: API and Worker as separate apps
  • Shared code: Database, types, utilities in packages
  • Independent scaling: Deploy API and Worker separately
  • Clear boundaries: Enforces clean architecture
  • Decision: Reorganized from monolith to monorepo (Milestone 0)

Why Server-Sent Events over WebSockets?

  • SSE: HTTP-based, simpler implementation, auto-reconnection
  • WebSockets: Bidirectional (not needed), more complex
  • Decision: SSE fits unidirectional server→client updates, scales better
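
For completeness, a minimal browser-side sketch of consuming those updates (the path follows the earlier endpoint examples; event names match the error-handling matrix below):

// Browser-side sketch: listen for the job's terminal status over SSE
const source = new EventSource(`/api/v1/bookings/status/${jobId}`);

source.addEventListener('confirmed', (e) => {
  console.log('Booking confirmed:', JSON.parse(e.data));
  source.close();
});

source.addEventListener('failed', (e) => {
  console.error('Booking failed:', e.data);
  source.close();
});

// EventSource reconnects automatically on network drops — one reason SSE
// is simpler than WebSockets for one-way updates.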

Why Optimistic Locking over Distributed Locks?

  • Optimistic: Higher throughput, scales better, simpler
  • Distributed Locks: Redlock complexity, lower performance
  • Decision: Optimistic locking sufficient for booking tickets
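
A minimal sketch of the compare-and-swap style update the worker might issue, assuming a node-postgres style client and an illustrative events table with a version column (not the project's actual schema):

// Optimistic locking: the UPDATE only succeeds if nobody else modified the row
// since we read it (same id AND same version).
async function reserveTicket(db, eventId, expectedVersion) {
  const result = await db.query(
    `UPDATE events
        SET available_tickets = available_tickets - 1,
            version = version + 1
      WHERE id = $1
        AND version = $2
        AND available_tickets > 0`,
    [eventId, expectedVersion]
  );

  // 0 rows updated → another worker won the race; caller retries with backoff
  return result.rowCount === 1;
}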

Why Shared Zod Schemas?

  • Runtime Validation: Enforces contract between API and Worker
  • Type Safety: Single source of truth for job data structure
  • Prevents Silent Failures: Invalid data caught at boundaries
  • Decision: Zod schemas in packages/types validate producer and consumer
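
A minimal sketch of what such a shared schema could look like in packages/types (field names are illustrative):

import { z } from 'zod';

// Shared contract: the API validates before queueing, the worker validates
// again before processing.
export const BookingJobSchema = z.object({
  eventId: z.string().uuid(),
  userId: z.string().uuid(),
  quantity: z.number().int().positive(),
});

export type BookingJob = z.infer<typeof BookingJobSchema>;

// Producer (API): BookingJobSchema.parse(req.body) before queue.add(...)
// Consumer (Worker): BookingJobSchema.parse(job.data) before touching the DB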

Retry Strategy

  • Max Attempts: 3 tries for version conflicts
  • Backoff: Exponential (100ms → 200ms → 400ms)
  • Rationale: Balance between success rate and latency
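
In BullMQ this maps directly onto the job options set when the API enqueues the job (the job name and payload variable are illustrative):

// 3 attempts total, exponential backoff starting at 100ms between retries
await bookingQueue.add('create-booking', bookingJob, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 100 },
  removeOnComplete: { age: 3600 }, // keep results long enough for late SSE clients to read them
});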

No Graceful Degradation (Hard Fail)

  • Decision: If Redis unavailable, return 503. No fallback to Level 2.
  • Rationale:
    • Level 2 synchronous transactions would overload database if Redis fails under load
    • Fallback logic adds complexity and new failure modes
    • Monitoring Redis health separately is more reliable
    • Forces infrastructure reliability for Redis
  • Client Handling: Return 503 → client retries with exponential backoff
  • Alternative: Use circuit breaker to return 503 immediately, not hang
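
A minimal sketch of that hard-fail check at the enqueue boundary (a dedicated circuit breaker library would normally wrap this; redisConnection is the shared ioredis instance):

app.post('/api/v1/bookings', async (req, res) => {
  // Hard fail: if Redis is not ready, refuse immediately —
  // no fallback to Level 2 synchronous transactions.
  if (redisConnection.status !== 'ready') {
    return res.status(503).json({ error: 'Service unavailable, retry with exponential backoff' });
  }

  // (Zod validation of req.body omitted for brevity)
  const job = await bookingQueue.add('create-booking', req.body);
  return res.status(202).json({ jobId: job.id });
});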

Error Handling Matrix (NO GRACEFUL DEGRADATION)

| Scenario | Worker Behavior | API Response | Client SSE Event |
|---|---|---|---|
| Valid booking | Success | 202 Accepted | confirmed |
| Event sold out | Fail (no retry) | 202 Accepted | failed (409) |
| Invalid eventId | Fail | 202 Accepted | failed (404) |
| Version conflict | Retry (max env.WORKER_MAX_RETRIES) | 202 Accepted | confirmed |
| Worker crash | Job re-queued | 202 Accepted | processing → confirmed |
| Rate limit exceeded | N/A | 429 Too Many Requests | N/A |
| Queue depth exceeded | N/A | 503 Queue Full | N/A |
| Redis down | Cannot queue job | 503 Service Unavailable (circuit open) | N/A |
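
A sketch of how the worker can express the "retry vs. no retry" distinction in this matrix — BullMQ's UnrecoverableError fails the job without consuming the remaining attempts (tryBooking and its return shape are illustrative; connection is an ioredis instance as in the earlier sketch):

import { Worker, UnrecoverableError } from 'bullmq';

const worker = new Worker('booking', async (job) => {
  const outcome = await tryBooking(job.data); // optimistic-locking attempt

  if (outcome.soldOut) {
    // Business failure: retrying will never succeed, fail immediately (no retry)
    throw new UnrecoverableError('Event sold out');
  }
  if (outcome.versionConflict) {
    // Transient failure: a normal error lets BullMQ retry with backoff
    throw new Error('Version conflict');
  }
  return outcome.confirmation; // becomes job.returnvalue → "confirmed" SSE event
}, { connection });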

IMPORTANT - No Graceful Degradation:

  • If Redis is down, API returns 503 immediately via circuit breaker
  • No fallback to Level 2 synchronous transactions
  • Reason: Prevents database overload during Redis failures
  • Client responsibility: Retry with exponential backoff
  • Monitor Redis health separately

Performance Expectations

Level 2 vs Level 3 Comparison

| Metric | Level 2 | Level 3 MVP | Level 3 Production |
|---|---|---|---|
| API Response | 800-1500ms | <100ms | <100ms |
| Throughput | 200-300 req/s | 2000-5000 req/s | 5000-10000 req/s |
| Timeout Rate | 1-2% | 0% | 0% |
| Scalability | Vertical only | Horizontal (workers) | Horizontal (API + workers) |
| Race Conditions | 0 (pessimistic) | 0 (optimistic) | 0 (optimistic) |
| Real-Time Updates | N/A | Polling | SSE |
| Rate Limiting | No | No | Yes (10 req/min) |
| Circuit Breaker | No | No | Yes |
| Monitoring | Logs only | Logs only | Dashboard + Logs |

Success Metrics & KPIs

| Metric | Level 2 | Level 3 Target | Validation Method |
|---|---|---|---|
| Throughput (req/s) | 200-300 | 5,000-10,000 | Load test (10K req) |
| API Response Time | 800-1500ms | <100ms | Response timing |
| Timeout Rate | 1-2% | 0% | Error counting |
| Race Conditions | 0 | 0 | Database verification |
| Worker Processing | N/A | 200-500ms avg | Job logs |
| Data Integrity | | | Booking count vs tickets |
| Scalability | Vertical | Horizontal | Multiple workers |
| Queue Depth | N/A | <50 avg | BullMQ dashboard |
| Job Failure Rate | N/A | <0.1% | BullMQ metrics |
| Version Conflicts | N/A | <5% | Retry count |
| SSE Reliability | N/A | 100% | State check tests |

Potential Risks & Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Redis becomes bottleneck | Medium | High | Add Redis clustering, increase workers |
| Worker crashes lose jobs | Low | Medium | BullMQ persistence, automatic retry |
| SSE connections overload | Low | Medium | Connection pooling, TTL on jobs |
| Version conflicts too high | Low | Medium | Tune retry strategy, measure conflict rate |
| Increased complexity | High | Medium | Comprehensive testing, documentation |
| Monorepo build issues | Low | Low | Use Turborepo, clear package boundaries |
| BullMQ dashboard security | Medium | High | Add auth middleware to /admin/queues |
| "Fast Worker" race condition | High | High | State check pattern (Milestone 7/8) |
| Type mismatches API→Worker | Medium | High | Zod schema validation (Milestone 3) |
| Redis failure causes downtime | Low | High | Hard fail (503) + monitoring + alerts |

Common Pitfalls to Avoid

During MVP Implementation (M0-M6)

  1. Skipping Milestone 0: Don't add Redis to monolithic structure. Restructure first.
  2. Wrong milestone order: Implement optimistic locking (M5) BEFORE API migration (M6)
  3. Starting workers before M5: Workers will fail to process jobs
  4. Migrating API before workers ready: Jobs created but not processed
  5. Skipping Zod validation: Type mismatches cause silent failures
  6. Not testing version conflicts: Optimistic locking bugs only appear under load
  7. Forgetting to remove Level 2 code: Confusion about which logic is active

During Production Hardening (M7-M10)

  1. Raw Redis Pub/Sub: Use BullMQ QueueEvents for horizontal scaling
  2. No SSE state check: "Fast Worker" race condition causes lost updates
  3. Exposing dashboard publicly: Major security risk
  4. No rate limiting: Queue overflow and abuse vectors
  5. Implementing graceful degradation: Hard fail is safer and simpler for Level 3
  6. Not testing Redis failures: Circuit breaker untested until production incident
  7. Polling instead of SSE: Wastes resources, poor user experience
  8. Hardcoded limits: Production tuning requires code changes
  9. No cleanup on disconnect: Memory leaks from orphaned SSE connections (see the cleanup sketch below)
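
A minimal sketch of that cleanup inside the SSE handler, assuming the completed handler was registered as a named function onCompleted and the connection map from the earlier examples:

// Register cleanup as soon as the SSE response is established
req.on('close', () => {
  activeConnections.delete(jobId);                       // drop the local jobId → res mapping
  queueEvents.removeListener('completed', onCompleted);  // or queueEvents.close() if per-request
  res.end();
});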

Comment Template for Code

Use this template when documenting Level 3 implementations:

/**
 * Level 3 Implementation: Async Queue-Based Processing with SSE
 *
 * Architecture: /apps/api, /apps/worker, /packages/*
 *
 * How it works:
 * 1. API receives POST /book, validates with Zod → returns 202 + jobId
 * 2. BullMQ stores job in Redis, dashboard tracks status
 * 3. Worker pulls job, validates with Zod, uses optimistic locking
 * 4. Version conflict → retry (max 3, exponential backoff)
 * 5. Success/failure published via QueueEvents → SSE notifies client
 * 6. Fast worker handled: state check before subscribe
 *
 * Trade-offs:
 * ✅ Pros: <100ms API, 10K+ req/s, horizontal scaling, zero timeouts
 * ⚠️ Cons: Eventual consistency, added complexity, Redis dependency
 *
 * Why monorepo: Clean separation, independent scaling, shared packages
 * Why Zod schemas: Runtime validation prevents API-Worker contract breaches
 * Why QueueEvents: Built-in horizontal scaling, replaces raw Redis Pub/Sub
 * Why no degradation: Hard fail safer than database overload fallback
 * Why BullMQ dashboard: Real-time queue monitoring essential at scale
 *
 * Service Boundary: Zod schema is the contract. Always validate.
 */

Interview Talking Points

When discussing this project in interviews, emphasize:

Technical Depth

  • Problem: High-concurrency ticket booking with race conditions
  • Solution Evolution: Pessimistic locking (L2) → Async queue + Optimistic locking (L3)
  • Trade-offs Understood: Latency vs throughput, consistency models, complexity vs scalability

Architectural Decisions

  • Monorepo: Separation of concerns, independent scaling
  • BullMQ QueueEvents: Horizontal scaling without custom Pub/Sub complexity
  • SSE over WebSockets: Simpler for unidirectional updates
  • Hard Fail over Graceful Degradation: Predictable failure modes

Edge Cases & Resilience

  • "Fast Worker" Race: State check before subscribe (shows attention to detail)
  • Circuit Breaker: Fail fast, protect dependencies
  • Rate Limiting: Prevent abuse and overload

Production Readiness

  • 10K Load Testing: Validated under extreme conditions
  • Zero Data Integrity Issues: No overbookings across all tests
  • Monitoring: Separate dashboard service with security considerations
  • Documentation: Comprehensive plans and architectural context

This document is for learning and reference. For implementation, see:

  • LEVEL_3_MVP_PLAN.md - Step-by-step MVP implementation
  • LEVEL_3_PRODUCTION_PLAN.md - Production hardening steps
  • LEVEL_3_COMPLETE_PLAN.md - Combined reference (full context)
