| 1 | +# Graceful Shutdown |
| 2 | + |
| 3 | +This document explains how Playwright Grid (Hub and Worker) behaves during shutdown, how to configure it, and how to integrate it with container orchestrators for zero‑surprise rollouts. |
| 4 | + |
| 5 | +Last updated: 2025-09-01 |
| 6 | + |
| 7 | +## Summary |
| 8 | +- Hub stops accepting new borrows as soon as shutdown begins and reports not-ready on readiness checks. |
| 9 | +- Worker denies new borrows, drains active client WebSocket sessions up to a configurable timeout, cleans up Redis state, and force-terminates sidecars only if sessions remain after the timeout. |
| 10 | +- Both components surface clear readiness signals (HTTP 503) to allow load balancers/orchestrators to stop sending traffic before processes exit. |
| 11 | + |
| 12 | +## Hub behavior |
| 13 | +When ASP.NET Core triggers ApplicationStopping (e.g., SIGTERM), the Hub: |
| 14 | +- Immediately stops accepting new borrow requests. |
| 15 | + - POST /session/borrow responds with 503 Service Unavailable. |
| 16 | + - The response includes a `Retry-After: 30` header, hinting that clients should retry after 30 seconds. |
| 17 | +- Readiness endpoint reflects shutdown: |
| 18 | + - GET /health/ready returns 503 so containers are removed from load balancers. |
| 19 | +- Existing sessions are unaffected at Hub level; Hub is stateless for live WS proxying (the Worker owns the WebSocket lifecycle). |
| 20 | + |
| 21 | +Relevant code paths: |
| 22 | +- hub/Infrastructure/Web/EndpointMappingExtensions.cs |
| 23 | + - Internal `_acceptingBorrows` flag flips to false on `ApplicationStopping`. |
| 24 | + - `/session/borrow` returns 503 when not accepting borrows. |
| 25 | + - `/health/ready` returns 503 when not accepting borrows. |
| 26 | + |
| 27 | +## Worker behavior |
| 28 | +When the Worker receives shutdown (ApplicationStopping): |
| 29 | +- Sets `_acceptingBorrows = false` to deny new borrows at `/borrow/{labelKey}` with 503 and `Retry-After: 30`. |
| 30 | +- Begins graceful drain of active WebSocket sessions: |
| 31 | + - Waits up to `WORKER_DRAIN_TIMEOUT_SECONDS` (default 30s) for all active client WS connections to close. |
| 32 | + - During this period, no new borrows are accepted. |
| 33 | +- After waiting: |
| 34 | + - Cleans up Redis lists/keys for this node and removes the node from the `nodes` set. |
| 35 | + - If any sessions are still active, logs a warning and force-kills remaining sidecar processes to ensure timely shutdown. |
| 36 | +- Readiness reflects shutdown while draining: |
| 37 | + - GET `/health/ready` returns 503, signaling the orchestrator to stop routing new traffic. |
| 38 | + |
| 39 | +Relevant code paths: |
| 40 | +- worker/Services/WebServerHost.cs |
| 41 | + - Graceful drain, denying borrows, readiness 503 during shutdown. |
| 42 | +- worker/Services/PoolManager.cs |
| 43 | + - Tracks active WS connections per browserId and exposes `HasAnyActiveConnections()` for drain logic. |
| 44 | + - Cleanup of Redis state and optional force-kill of sidecars. |
| 45 | + |
| 46 | +## HTTP status codes and headers |
| 47 | +- New borrows denied during shutdown: |
| 48 | + - Hub: POST `/session/borrow` → 503 Service Unavailable, `Retry-After: 30`. |
| 49 | + - Worker: POST `/borrow/{labelKey}` → 503 Service Unavailable, `Retry-After: 30`. |
| 50 | +- Readiness while shutting down: |
| 51 | + - Hub: GET `/health/ready` → 503. |
| 52 | + - Worker: GET `/health/ready` → 503. |
| 53 | + |
| 54 | +## Configuration |
| 55 | +Environment variables impacting shutdown behavior: |
| 56 | +- WORKER_DRAIN_TIMEOUT_SECONDS |
| 57 | + - Default: 30. |
| 58 | + - How long the Worker waits for all active WS sessions to close before force-killing sidecars. |
| 59 | +- REDIS_* timeouts (Hub and Worker) |
| 60 | + - Control health ping timings; not shutdown-specific but influence `/health/ready` responsiveness. |
| 61 | + |
| 62 | +Defaults and safety: |
| 63 | +- If `WORKER_DRAIN_TIMEOUT_SECONDS` is unset or invalid, the default of 30s is used. |
| 64 | +- If drain times out, sidecars are force-terminated; this prevents hung shutdowns on orchestrators with hard SIGKILL deadlines. |
| 65 | + |
| 66 | +## Orchestrator integration |
| 67 | + |
| 68 | +### Docker / docker-compose |
| 69 | +- The readiness endpoints return 503 during shutdown, so a Compose healthcheck marks the container unhealthy, and an external LB that polls `/health/ready` or watches container health can stop routing requests to it. |
| 70 | +- Example healthcheck in docker-compose.yml: |
| 71 | + |
| 72 | +```yaml |
| 73 | +healthcheck: |
| 74 | +  test: ["CMD", "curl", "-fsS", "http://localhost:5000/health/ready"] |
| 75 | +  interval: 5s |
| 76 | +  timeout: 2s |
| 77 | +  retries: 3 |
| 78 | +  start_period: 10s |
| 79 | +``` |
| 80 | + |
| 81 | +Set a drain timeout: |
| 82 | + |
| 83 | +```yaml |
| 84 | +environment: |
| 85 | + - WORKER_DRAIN_TIMEOUT_SECONDS=45 |
| 86 | +``` |
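Docker sends SIGKILL once its stop timeout expires, and that timeout defaults to 10 seconds for both `docker stop` and Compose, which is shorter than the default drain. A minimal sketch of raising it with Compose's `stop_grace_period` follows; the service name `worker` is an assumption, not taken from the project's compose file.

```yaml
services:
  worker:                      # hypothetical service name
    environment:
      - WORKER_DRAIN_TIMEOUT_SECONDS=45
    # Should exceed the drain timeout so the Worker can finish cleanup
    # before Docker escalates from SIGTERM to SIGKILL (default: 10s).
    stop_grace_period: 60s
```

The 15-second margin over the drain timeout is illustrative; size it to cover Redis cleanup and process exit in your environment.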
| 87 | + |
| 88 | +### Kubernetes |
| 89 | +Use readiness probes and preStop hooks to ensure in-flight sessions drain: |
| 90 | + |
| 91 | +```yaml |
| 92 | +readinessProbe: |
| 93 | +  httpGet: |
| 94 | +    path: /health/ready |
| 95 | +    port: 5000 |
| 96 | +  periodSeconds: 5 |
| 97 | +  timeoutSeconds: 2 |
| 98 | +  failureThreshold: 1 |
| 99 | + |
| 100 | +lifecycle: |
| 101 | +  preStop: |
| 102 | +    exec: |
| 103 | +      command: ["/bin/sh", "-c", "sleep 40"] |
| 104 | +``` |
| 105 | + |
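| 106 | +- Set `terminationGracePeriodSeconds` to at least the preStop sleep plus `WORKER_DRAIN_TIMEOUT_SECONDS` plus a small buffer, because the grace period starts counting before the preStop hook runs. With the 40s sleep above and the default 30s drain, use 75 or more; a value of 60 would let the kubelet send SIGKILL mid-drain. |
| 107 | +- Kubernetes sends SIGTERM only after the preStop hook finishes, so the app flips readiness to 503 at that point, not during the sleep. The sleep gives LBs time to observe the pod's removal from Service endpoints, which begins when the pod is marked Terminating, before the drain starts. |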
| 108 | + |
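To make the timing budget visible, the settings above can be placed side by side in one pod spec. This is a sketch only: the container name and image are placeholders, and the 75-second value assumes a 40s preStop sleep, the default 30s drain, and a 5s buffer, all of which count against the same grace period.

```yaml
# Hypothetical pod spec fragment; container name and image are placeholders.
spec:
  terminationGracePeriodSeconds: 75   # 40s preStop + 30s drain + 5s buffer
  containers:
    - name: worker
      image: example/playwright-grid-worker:latest
      env:
        - name: WORKER_DRAIN_TIMEOUT_SECONDS
          value: "30"
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 5000
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 1
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 40"]
```

If you lengthen the drain timeout or the preStop sleep, raise `terminationGracePeriodSeconds` by the same amount.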
| 109 | +## Observability |
| 110 | +- Logs |
| 111 | + - Hub: "[hub] ApplicationStopping: stop accepting new borrows". |
| 112 | + - Worker: "[worker] ApplicationStopping: initiating graceful drain" and possible timeout message. |
| 113 | +- Metrics |
| 114 | + - Standard HTTP/ASP.NET metrics are exposed (Prometheus). During shutdown, expect: |
| 115 | + - Increased 503 counts on borrow endpoints. |
| 116 | + - `/health/ready` 503 rate until container exits. |
| 117 | +- Dashboard |
| 118 | + - Ongoing sessions should continue; new borrows will fail fast with 503 until workers come back. |
| 119 | + |
| 120 | +## Verification steps |
| 121 | +- Local manual test |
| 122 | + 1) Start the stack (`docker compose up --build`). |
| 123 | + 2) Borrow a session and connect a client. |
| 124 | + 3) Send SIGTERM to a worker container: `docker kill --signal=TERM <worker_container>`. |
| 125 | + 4) Observe logs: drain starts; `/health/ready` returns 503; connection persists until closed or timeout. |
| 126 | +- Automated tests |
| 127 | + - Unit tests remain green. Integration tests could be extended in the future to simulate shutdown; the current grid tests rely on Testcontainers bootstrap and are compatible with this behavior. |
| 128 | + |
| 129 | +## Compatibility and client expectations |
| 130 | +- Clients should handle 503 responses on borrow and respect the `Retry-After` header. |
| 131 | +- Existing WebSocket sessions can continue until user closes them or the drain timeout ends. |
| 132 | +- No API changes were introduced; the feature is backward compatible. |
| 133 | + |
| 134 | +## FAQ |
| 135 | +- Q: Will shutdown interrupt a running Playwright session? |
| 136 | + - A: Not immediately. The worker attempts a graceful drain. If the session exceeds the configured drain timeout, the sidecar is force-terminated to allow shutdown to complete. |
| 137 | +- Q: Do I need to change probes? |
| 138 | + - A: Ensure you’re using `/health/ready` for readiness. Liveness can stay on `/health`. |
| 139 | +- Q: Can I make drain longer than my platform’s termination grace period? |
| 140 | + - A: You can, but the platform may send SIGKILL before the drain ends. Align `terminationGracePeriodSeconds` (K8s) or the stop timeout (Docker, 10s by default) with your drain setting. |
| 141 | + |
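For the probe question above, a liveness probe pointed at `/health` might look like the sketch below. The port matches the readiness example; the timing values are illustrative assumptions, and the sketch assumes `/health` keeps returning success while the Worker drains, which this document does not state.

```yaml
# Illustrative values; tune the period and threshold for your cluster.
livenessProbe:
  httpGet:
    path: /health        # liveness stays separate from /health/ready
    port: 5000
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
```

Keeping liveness off `/health/ready` matters because a failing liveness probe makes the kubelet restart the container, which would cut the drain short.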
| 142 | +## References |
| 143 | +- Source files: |
| 144 | + - hub/Infrastructure/Web/EndpointMappingExtensions.cs |
| 145 | + - worker/Services/WebServerHost.cs |
| 146 | + - worker/Services/PoolManager.cs |
| 147 | +- Related docs: |
| 148 | + - Node Liveness and Sweeper (node TTLs and cleanup) |
| 149 | + - Borrow TTL & Session Persistence |