Level 3 vs Level 2: Scalability & Reliability Improvements
Level 2 (Synchronous) Scaling Limits:

All API instances compete for the same database locks:

```
API Server 1 ──┐
API Server 2 ──┤─── Database (single bottleneck)
API Server 3 ──┘
```

- Problem: More API servers = more contention for the same database locks
- Result: Diminishing returns - adding servers doesn't increase throughput
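To make the contention concrete, here is a minimal sketch of a Level 2 style handler, assuming PostgreSQL via node-postgres and an `events` table shaped like the one in the optimistic-locking snippet later on; the `bookings` table, function name, and connection settings are assumptions. The row lock taken by `SELECT ... FOR UPDATE` is held for the entire HTTP request, so every API instance serializes on it:

```js
const { Pool } = require('pg');
const pool = new Pool(); // connection settings assumed to come from PG* env vars

async function bookTicketSync(eventId, userId) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Row-level lock: concurrent requests for the same event block here
    const { rows } = await client.query(
      'SELECT available_tickets FROM events WHERE id = $1 FOR UPDATE',
      [eventId]
    );
    if (rows[0].available_tickets < 1) throw new Error('Sold out');
    await client.query(
      'UPDATE events SET available_tickets = available_tickets - 1 WHERE id = $1',
      [eventId]
    );
    await client.query(
      'INSERT INTO bookings (event_id, user_id) VALUES ($1, $2)',
      [eventId, userId]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```

Every additional API replica adds more requests waiting on the same row lock, which is why throughput flattens out.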
Level 3 (Asynchronous) Scaling:

```
API Server 1 ──┐
API Server 2 ──┤─── Redis Queue ─── Worker 1 ──┐
API Server 3 ──┘                    Worker 2 ──┤─── Database
                                    Worker 3 ──┘
```
- Benefit: Each component scales independently
- API servers: Scale based on HTTP request volume
- Workers: Scale based on queue depth
- Database: Controlled, predictable load
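A minimal sketch of how the two sides decouple, assuming the Bull queue library that the page's snippets appear to use plus an Express app; `processBooking` is a hypothetical function standing in for the booking transaction:

```js
const express = require('express');
const Queue = require('bull');

const app = express();
app.use(express.json());

const bookingQueue = new Queue('bookings', process.env.REDIS_URL || 'redis://127.0.0.1:6379');

// API side: enqueue and respond immediately - no database work on the request path
app.post('/bookings', async (req, res) => {
  const job = await bookingQueue.add({ eventId: req.body.eventId, userId: req.body.userId });
  res.status(202).json({ status: 'queued', jobId: job.id });
});

app.listen(3000);

// Worker side (normally a separate process): concurrency is tuned independently of the API tier
bookingQueue.process(5, async (job) => {
  await processBooking(job.data); // processBooking: hypothetical database transaction
});
```

Because the API only talks to Redis and the workers only talk to the database, each tier can be sized against its own signal (HTTP traffic vs. queue depth).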
Level 2 Problem:
- Traffic spike → Database overwhelmed → System crashes
- No buffering between users and database
Level 3 Solution:

```js
const queue = new Queue('bookings', {
  limiter: {
    max: 100,       // Process at most 100 jobs
    duration: 1000  // per 1000 ms (i.e. per second)
  }
});
```

- Benefit: Database receives a steady, controlled load
- Prevents: Database crashes during flash sales
Level 2 Scaling Constraints:
- API and database scaling are coupled
- Can't scale API without affecting database performance
Level 3 Independent Scaling:

```bash
# Scale API based on HTTP traffic
kubectl scale deployment api --replicas=10

# Scale workers based on queue depth
kubectl scale deployment workers --replicas=5

# Scale database based on worker throughput (predictable)
```
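The "scale workers based on queue depth" step can be driven directly from Bull's job counts. The sketch below is hypothetical: the polling interval, thresholds, replica formula, and deployment name are assumptions, and it simply shells out to the same `kubectl scale` command shown above; `bookingQueue` is the queue from the earlier sketch:

```js
const { execSync } = require('child_process');

// Hypothetical autoscaler: one extra worker per 1,000 waiting jobs, capped at 20 replicas
async function scaleWorkersByQueueDepth(queue) {
  const { waiting } = await queue.getJobCounts();
  const replicas = Math.min(20, 2 + Math.floor(waiting / 1000));
  execSync(`kubectl scale deployment workers --replicas=${replicas}`);
}

setInterval(() => {
  scaleWorkersByQueueDepth(bookingQueue).catch(console.error);
}, 60000);
```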
Level 2 Single Point of Failure:

```
User → API → [Database timeout] → User gets error
# Entire booking flow fails
```
Level 3 Fault Tolerance:

```
User → API → [Redis accepts job] → ✅ "Booking queued"
# Database can be temporarily unavailable
# Workers will process the backlog when the database recovers
```
Level 2 Failure Mode:
- Database slow → All requests fail
- Binary outcome: Works perfectly or fails completely
Level 3 Degradation:
- API: Still accepts requests instantly
- Queue: Stores jobs reliably
- Workers: Process slower but eventually complete
- Users: Get delayed confirmations but no lost requests
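The "delayed confirmation" path can be exposed with a status endpoint that reads the job back from the queue. A minimal sketch, assuming Bull and reusing `app` and `bookingQueue` from the earlier sketch; the route path and response shape are illustrative:

```js
app.get('/bookings/:jobId/status', async (req, res) => {
  const job = await bookingQueue.getJob(req.params.jobId);
  if (!job) return res.status(404).json({ error: 'Unknown booking request' });

  const state = await job.getState(); // 'waiting' | 'active' | 'completed' | 'failed' | ...
  res.json({
    state,
    result: state === 'completed' ? job.returnvalue : null,
    failedReason: state === 'failed' ? job.failedReason : null
  });
});
```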
Level 2 No Retry:

```js
try {
  await db.transaction(...);
} catch (error) {
  // User gets immediate failure
  return res.status(500).json({ error: 'Booking failed' });
}
```

Level 3 Automatic Retry:
```js
const queue = new Queue('bookings', {
  defaultJobOptions: {
    attempts: 3,            // Try each job up to 3 times
    backoff: {
      type: 'exponential',  // Exponential backoff
      delay: 1000           // Start with 1 second
    }
  }
});
```

Level 2 Problematic Scenarios:
- Invalid data causes transaction to fail
- No way to inspect or recover failed requests
Level 3 Error Handling:

```js
queue.on('failed', async (job, err) => {
  if (job.attemptsMade >= job.opts.attempts) {
    // All retries exhausted - move to a dead letter queue for manual inspection
    await deadLetterQueue.add('failed-booking', job.data);
  }
});
```

Level 2 Performance Characteristics:
- Max throughput: ~50-100 bookings/second
- Latency under load: 2-10 seconds
- Concurrent users: Limited by database lock contention
- Failure mode: All-or-nothing
Level 3 Performance Characteristics:
- API throughput: 1000+ requests/second (instant responses)
- Worker throughput: 200-500 bookings/second (scalable)
- Queue capacity: 10,000+ pending jobs
- Concurrent users: Virtually unlimited (queue absorbs spikes)
- Failure mode: Graceful degradation
Level 3 Monitoring:

```js
const metrics = {
  queueDepth: await queue.getJobCounts(),
  workerCount: await getWorkerCount(),
  processingRate: calculateProcessingRate(),
  errorRate: calculateErrorRate()
};

// Alert if queue grows too large
if (metrics.queueDepth.waiting > 5000) {
  sendAlert('Queue depth critical - scale workers');
}
```

Level 3 Circuit Breaker:

```js
const circuitBreaker = {
  state: 'CLOSED',
  failureCount: 0,
  recordFailure: function () {
    // Threshold of 5 consecutive failures is illustrative
    if (++this.failureCount >= 5) this.open();
  },
  open: function () {
    // Temporarily stop processing if the database is struggling
    this.state = 'OPEN';
    setTimeout(() => { this.state = 'HALF_OPEN'; }, 30000);
  }
};
```
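One plausible way to wire the breaker into the worker (a sketch, not the page's own code): fail fast while the breaker is OPEN so the retry/backoff settings above reschedule the job instead of hammering a struggling database. `bookingQueue` and `processBooking` are the hypothetical names from the earlier sketches:

```js
bookingQueue.process(async (job) => {
  if (circuitBreaker.state === 'OPEN') {
    // Fail fast; the job is retried later via attempts/backoff
    throw new Error('Database circuit open - retry later');
  }
  try {
    await processBooking(job.data); // hypothetical booking transaction
    circuitBreaker.failureCount = 0; // reset on success
  } catch (err) {
    circuitBreaker.recordFailure();
    throw err;
  }
});
```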
Level 3 Data Integrity (optimistic locking):

```js
// Even with multiple workers, data integrity is maintained
await db.query(`
  UPDATE events
  SET available_tickets = available_tickets - 1,
      version = version + 1
  WHERE id = $1 AND version = $2
`, [eventId, currentVersion]);

// If 0 rows were affected, someone else booked first - retry with fresh data
```
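A sketch of the retry loop that the comment implies, assuming the same node-postgres-style `db.query` interface as the snippet above; `maxRetries`, the sold-out check, and the surrounding function name are assumptions:

```js
// Re-read the current version and retry the conditional UPDATE if another worker won the race
async function bookWithOptimisticLock(db, eventId, userId, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const { rows } = await db.query(
      'SELECT version, available_tickets FROM events WHERE id = $1',
      [eventId]
    );
    if (rows[0].available_tickets < 1) throw new Error('Sold out');

    const result = await db.query(`
      UPDATE events
      SET available_tickets = available_tickets - 1,
          version = version + 1
      WHERE id = $1 AND version = $2
    `, [eventId, rows[0].version]);

    if (result.rowCount === 1) return true; // Our update won the race
    // 0 rows affected: someone else booked in between - loop and read fresh data
  }
  throw new Error('Could not book after retries');
}
```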
Worked example: 10,000 users try to book at the same moment during a flash sale.

Level 2 Outcome:
- First 100 users get through (slowly)
- Next 9,900 users get timeout errors
- Many users never even submit requests
- System becomes unresponsive
Level 3 Outcome:
- All 10,000 users get instant "Request queued" responses
- System processes requests in order
- No lost requests due to system overload
- Clear feedback on success/failure for every user
- System remains responsive throughout
The reliability and scalability improvements are architectural, not just cosmetic:
- Scalability: Move from coupled, contention-based scaling to independent, queue-based scaling
- Reliability: Move from brittle, all-or-nothing failure modes to resilient, graceful degradation
- Operational: Move from reactive firefighting to proactive monitoring and auto-scaling
The queue acts as a shock absorber that transforms unpredictable user traffic patterns into predictable, manageable database workloads.