Implements intelligent per-worker load shedding to prevent high-volume
services from overwhelming workers during traffic spikes, protecting
low-volume services during the autoscaling window (~1 minute).
Architecture:
- In-memory request tracking using 60-second sliding window with deques
- Per-worker load monitoring via concurrent request counter
- Contribution-based throttling (>=20% of volume or 10x median)
- Comprehensive metrics for observability (gauge + counter)
- No external dependencies (Redis-free)
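The deque-based sliding window described above might look roughly like this sketch (class and method names mirror the description, but the exact interface is an assumption):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # matches the 60-second sliding window above

class ServiceVolumeTracker:
    """Tracks per-service request timestamps over a sliding window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self._timestamps = defaultdict(deque)  # service_id -> deque of times

    def record_request(self, service_id, now=None):
        now = time.monotonic() if now is None else now
        self._prune(service_id, now)  # lazy cleanup keeps memory bounded
        self._timestamps[service_id].append(now)

    def get_counts(self, now=None):
        """Return {service_id: request_count} for the current window."""
        now = time.monotonic() if now is None else now
        counts = {}
        for service_id in list(self._timestamps):
            self._prune(service_id, now)
            if self._timestamps[service_id]:
                counts[service_id] = len(self._timestamps[service_id])
            else:
                del self._timestamps[service_id]  # drop idle services entirely
        return counts

    def _prune(self, service_id, now):
        dq = self._timestamps[service_id]
        while dq and dq[0] <= now - self.window:
            dq.popleft()
```

Because pruning happens lazily on access, memory stays proportional to requests seen in the last window rather than to all-time traffic.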
Throttling Logic:
- Only activates when worker exceeds HIGH_WATER_MARK (80% capacity)
- Throttles services that meet either condition:
* Contributing >=20% of total request volume (catches single spammers)
* Volume >=10x median (catches outliers in multi-service scenarios)
- Returns 429 with Retry-After: 5 header
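The two throttling conditions can be sketched as a single decision function (the function name and signature are illustrative; the real thresholds come from THROTTLE_CONTRIBUTION_PCT and THROTTLE_VOLUME_MEDIAN_MULTIPLE):

```python
from statistics import median

def should_throttle(service_id, counts, contribution_pct=20, median_multiple=10):
    """Decide whether a service should be throttled given window counts.

    counts: {service_id: request_count} over the sliding window.
    """
    total = sum(counts.values())
    volume = counts.get(service_id, 0)
    if total == 0 or volume == 0:
        return False
    # Condition 1: the service contributes a large share of all traffic
    # (catches a single spammer).
    if volume * 100 >= contribution_pct * total:
        return True
    # Condition 2: the service is an outlier relative to its peers
    # (catches outliers in multi-service scenarios).
    return volume >= median_multiple * median(counts.values())
```

Either condition alone is enough to throttle; a service is never throttled while the worker is below HIGH_WATER_MARK, since the check only runs once the worker is overloaded.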
Components Added:
- app/load_shedding.py: ServiceVolumeTracker with deque-based tracking
- ServiceUnavailableError: Custom 429 exception for throttled requests
- ConcurrentRequestCounter: Per-worker load tracking
- Integration in validators.check_rate_limiting()
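The ServiceUnavailableError and ConcurrentRequestCounter components might look roughly like this (the attribute names and the context-manager interface are assumptions, not the actual implementation):

```python
from contextlib import contextmanager

class ServiceUnavailableError(Exception):
    """Illustrative 429 exception; the real class likely integrates
    with the web framework's error handling."""
    status_code = 429
    headers = {"Retry-After": "5"}

class ConcurrentRequestCounter:
    """Per-worker count of in-flight requests. With Eventlet's
    cooperative scheduling, plain increments need no lock."""

    def __init__(self):
        self.current = 0

    @contextmanager
    def track(self):
        self.current += 1
        try:
            yield self.current  # current load including this request
        finally:
            self.current -= 1
```

Wrapping each request in `counter.track()` keeps `current` accurate even when a handler raises, since the decrement runs in `finally`.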
Configuration (disabled by default):
- LOAD_SHEDDING_ENABLED: false (feature flag, opt-in)
- HIGH_WATER_MARK: 26 (80% of 32 concurrent capacity per worker)
- THROTTLE_CONTRIBUTION_PCT: 20 (throttle if >=20% of volume)
- THROTTLE_VOLUME_MEDIAN_MULTIPLE: 10 (throttle if >=10x median)
Observability Metrics:
- worker_load_shedding_active (Gauge): Current state per worker (1=active, 0=inactive).
  Summing across workers shows the total number of workers currently load shedding.
- load_shedding_activations_total (Counter): Activation events per worker.
  Incremented once per healthy -> overloaded transition.
  Useful for detecting flapping, historical analysis, and events missed between scrapes.
- load_shedding.throttled.{service_id} (Counter): Per-service throttle count
Logging:
- ACTIVATED/DEACTIVATED events, logged with the current concurrent-request count vs HIGH_WATER_MARK
- Per-service throttling decisions with volume metrics
- Contribution percentages and median comparisons
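The log lines might look something like this sketch (the exact message format is an assumption):

```python
import logging

log = logging.getLogger("load_shedding")

def activation_message(current, high_water_mark):
    """Log (and return) the ACTIVATED event with current load vs threshold."""
    msg = f"load shedding ACTIVATED: concurrent={current} high_water_mark={high_water_mark}"
    log.warning(msg)
    return msg

def throttle_message(service_id, volume, total, median_volume):
    """Log (and return) a per-service throttling decision with its metrics."""
    pct = 100.0 * volume / total if total else 0.0
    msg = (f"throttling {service_id}: volume={volume} "
           f"({pct:.1f}% of {total}), median={median_volume}")
    log.info(msg)
    return msg
```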
Implementation Details:
- ServiceVolumeTracker: Deque-based sliding window
- Lazy cleanup of expired timestamps to keep memory bounded
- State tracking for activation/deactivation transitions
- No locks needed (Eventlet cooperative concurrency)
- TYPE_CHECKING pattern for clean type hints
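The activation/deactivation state tracking can be sketched as simple edge detection (class and attribute names are illustrative; `activations` stands in for the load_shedding_activations_total counter, which the description says is incremented once per healthy -> overloaded transition):

```python
class LoadSheddingState:
    """Tracks whether this worker is load shedding and counts transitions."""

    def __init__(self, high_water_mark=6):
        self.high_water_mark = high_water_mark
        self.active = False
        self.activations = 0  # stand-in for the Prometheus counter

    def update(self, concurrent_requests):
        """Re-evaluate state from the current concurrent request count."""
        overloaded = concurrent_requests > self.high_water_mark
        if overloaded and not self.active:
            self.activations += 1  # count each healthy -> overloaded edge once
        self.active = overloaded  # gauge would be set to 1 or 0 here
        return self.active
```

Counting only the transition (rather than every overloaded check) is what makes the counter useful for spotting flapping between scrapes.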
Each eventlet worker can handle 8 concurrent connections, not 32. The previous HIGH_WATER_MARK of 26 would never trigger, since workers max out at 8 concurrent requests.
Changes:
- Update HIGH_WATER_MARK default from 26 to 6 (75% of 8 connections)
- Fix fallback value in is_worker_overloaded() from 26 to 6
- Update test mocks to use realistic values (3-8 instead of 10-30)
Trello
Note
This change has not been tested under load in an environment yet.
The change should be considered a proposal until further testing has taken place.
What
Implements selective request throttling to protect worker capacity during traffic spikes. When workers become overloaded, the system identifies and throttles high-volume services, preventing resource exhaustion while maintaining service for other users.
Why
Workers have fixed capacity (8 concurrent requests per worker, 32 total across 4 workers). When a few high-volume services consume most available slots, low-volume services experience degraded performance or failures due to queue buildup. Manual intervention is currently required to identify and mitigate these scenarios.
Solution
Contribution-based throttling with statistical safeguards:
Memory-efficient design:
Configuration (opt-in)
Observability
Prometheus metrics:
- worker_load_shedding_active (Gauge): Current state per worker (1=active, 0=inactive)
- load_shedding_activations_total (Counter): Total activation events across workers
Logging: