Commit 1a9fb41

Merge pull request #16 from agenixframework/feature/implement-gracefull-shutdown
Implement graceful shutdown, enhance Redis resilience, update docs, a…
2 parents db883ed + 346cf93 commit 1a9fb41

12 files changed: +365 -36 lines changed
.github/workflows/docs.yml

Lines changed: 8 additions & 0 deletions
@@ -68,4 +68,12 @@ jobs:
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
           publish_branch: gh-pages
+          # MkDocs outputs to ./site — deployed to the root of gh-pages.
+          # Configure GitHub Pages: Settings → Pages → Deploy from a branch → Branch: gh-pages, Folder: /
           publish_dir: ./site
+          # Keep gh-pages as a clean, artifact-only branch with no shared history from main.
+          # Avoids leftover files (e.g., a README) and keeps history small and deterministic.
+          force_orphan: true
+          # Create .nojekyll so GitHub Pages doesn't process the output with Jekyll.
+          # Prevents Jekyll from ignoring directories starting with underscores and from interfering with MkDocs assets.
+          enable_jekyll: false

docs/Graceful-Shutdown.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+# Graceful Shutdown
+
+This document explains how Playwright Grid (Hub and Worker) behaves during shutdown, how to configure it, and how to integrate it with container orchestrators for zero‑surprise rollouts.
+
+Last updated: 2025-09-01
+
+## Summary
+- Hub stops accepting new borrows as soon as shutdown begins and reports not-ready on readiness checks.
+- Worker denies new borrows, drains active client WebSocket sessions up to a configurable timeout, cleans up Redis state, and force-terminates sidecars only if sessions remain after the timeout.
+- Both components surface clear readiness signals (HTTP 503) to allow load balancers/orchestrators to stop sending traffic before processes exit.
+
+## Hub behavior
+When ASP.NET Core triggers ApplicationStopping (e.g., SIGTERM), the Hub:
+- Immediately stops accepting new borrow requests.
+  - POST /session/borrow responds with 503 Service Unavailable.
+  - Response includes `Retry-After: 30` header to hint clients to retry later.
+- Readiness endpoint reflects shutdown:
+  - GET /health/ready returns 503 so containers are removed from load balancers.
+- Existing sessions are unaffected at Hub level; Hub is stateless for live WS proxying (the Worker owns the WebSocket lifecycle).
+
+Relevant code paths:
+- hub/Infrastructure/Web/EndpointMappingExtensions.cs
+  - Internal `_acceptingBorrows` flag flips to false on `ApplicationStopping`.
+  - `/session/borrow` returns 503 when not accepting borrows.
+  - `/health/ready` returns 503 when not accepting borrows.
+
+## Worker behavior
+When the Worker receives shutdown (ApplicationStopping):
+- Sets `_acceptingBorrows = false` to deny new borrows at `/borrow/{labelKey}` with 503 and `Retry-After: 30`.
+- Begins graceful drain of active WebSocket sessions:
+  - Waits up to `WORKER_DRAIN_TIMEOUT_SECONDS` (default 30s) for all active client WS connections to close.
+  - During this period, no new borrows are accepted.
+- After waiting:
+  - Performs cleanup of Redis lists/keys for this node and removes itself from `nodes` set.
+  - If any sessions are still active, logs a warning and force-kills remaining sidecar processes to ensure timely shutdown.
+- Readiness reflects shutdown while draining:
+  - GET `/health/ready` returns 503, signaling the orchestrator to stop routing new traffic.
+
+Relevant code paths:
+- worker/Services/WebServerHost.cs
+  - Graceful drain, denying borrows, readiness 503 during shutdown.
+- worker/Services/PoolManager.cs
+  - Tracks active WS connections per browserId and exposes `HasAnyActiveConnections()` for drain logic.
+  - Cleanup of Redis state and optional force-kill of sidecars.
+
+## HTTP status codes and headers
+- New borrows denied during shutdown:
+  - Hub: POST `/session/borrow` → 503 Service Unavailable, `Retry-After: 30`.
+  - Worker: POST `/borrow/{labelKey}` → 503 Service Unavailable, `Retry-After: 30`.
+- Readiness while shutting down:
+  - Hub: GET `/health/ready` → 503.
+  - Worker: GET `/health/ready` → 503.
+
+## Configuration
+Environment variables impacting shutdown behavior:
+- WORKER_DRAIN_TIMEOUT_SECONDS
+  - Default: 30.
+  - How long the Worker waits for all active WS sessions to close before force-killing sidecars.
+- REDIS_* timeouts (Hub and Worker)
+  - Control health ping timings; not shutdown-specific but influence `/health/ready` responsiveness.
+
+Defaults and safety:
+- If WORKER_DRAIN_TIMEOUT_SECONDS is not set or invalid, default 30s is used.
+- If drain times out, sidecars are force-terminated; this prevents hung shutdowns on orchestrators with hard SIGKILL deadlines.
+
+## Orchestrator integration
+
+### Docker / docker-compose
+- The built-in readiness endpoints and 503 responses during shutdown let a Compose healthcheck or an external LB detect the shutdown and stop routing requests to the container.
+- Example healthcheck in docker-compose.yml:
+
+```yaml
+healthcheck:
+  test: ["CMD", "curl", "-fsS", "http://localhost:5000/health/ready"]
+  interval: 5s
+  timeout: 2s
+  retries: 3
+  start_period: 10s
+```
+
+Set a drain timeout:
+
+```yaml
+environment:
+  - WORKER_DRAIN_TIMEOUT_SECONDS=45
+```
+
+### Kubernetes
+Use readiness probes and preStop hooks to ensure in-flight sessions drain:
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /health/ready
+    port: 5000
+  periodSeconds: 5
+  timeoutSeconds: 2
+  failureThreshold: 1
+
+lifecycle:
+  preStop:
+    exec:
+      command: ["/bin/sh", "-c", "sleep 40"]
+```
+
+- Set `terminationGracePeriodSeconds` to be >= preStop sleep + WORKER_DRAIN_TIMEOUT_SECONDS + probe buffer, because the grace period also covers the time spent in the preStop hook. Example: 80 with the 40s preStop sleep above and the default 30s drain timeout.
+- The app flips readiness to 503 on shutdown automatically; the preStop sleep delays SIGTERM so LBs have time to stop routing to the pod before the drain begins.
+
+## Observability
+- Logs
+  - Hub: "[hub] ApplicationStopping: stop accepting new borrows".
+  - Worker: "[worker] ApplicationStopping: initiating graceful drain" and possible timeout message.
+- Metrics
+  - Standard HTTP/ASP.NET metrics are exposed (Prometheus). During shutdown, expect:
+    - Increased 503 counts on borrow endpoints.
+    - `/health/ready` 503 rate until container exits.
+- Dashboard
+  - Ongoing sessions should continue; new borrows will fail fast with 503 until workers come back.
+
+## Verification steps
+- Local manual test
+  1) Start the stack (docker compose up --build).
+  2) Borrow a session and connect a client.
+  3) Send SIGTERM to a worker container: `docker kill --signal=TERM <worker_container>`.
+  4) Observe logs: drain starts; `/health/ready` returns 503; connection persists until closed or timeout.
+- Automated tests
+  - Unit tests remain green. Integration tests can be extended in future to simulate shutdown; current grid tests rely on Testcontainers bootstrap and are compatible with the behavior.
+
+## Compatibility and client expectations
+- Clients should handle 503 responses on borrow and respect `Retry-After` header.
+- Existing WebSocket sessions can continue until user closes them or the drain timeout ends.
+- No API changes were introduced; the feature is backward compatible.
+
+## FAQ
+- Q: Will shutdown interrupt a running Playwright session?
+  - A: Not immediately. The worker attempts a graceful drain. If the session exceeds the configured drain timeout, the sidecar is force-terminated to allow shutdown to complete.
+- Q: Do I need to change probes?
+  - A: Ensure you’re using `/health/ready` for readiness. Liveness can stay on `/health`.
+- Q: Can I make drain longer than my platform’s termination grace period?
+  - A: You can, but the platform may send SIGKILL before drain ends. Align `terminationGracePeriodSeconds` (K8s) or stop timeout (Docker) with your drain setting.
+
+## References
+- Source files:
+  - hub/Infrastructure/Web/EndpointMappingExtensions.cs
+  - worker/Services/WebServerHost.cs
+  - worker/Services/PoolManager.cs
+- Related docs:
+  - Node Liveness and Sweeper (node TTLs and cleanup)
+  - Borrow TTL & Session Persistence
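
The new doc asks clients to handle 503 on borrow and to respect `Retry-After`. A minimal client-side sketch of that expectation follows. `GetRetryDelay` is a hypothetical helper, not part of this commit, and the 5-second fallback is an arbitrary value chosen for illustration.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper: decide how long to wait before retrying a borrow
// that was rejected during shutdown. Returns null when no retry is implied.
static TimeSpan? GetRetryDelay(HttpResponseMessage response, TimeSpan fallback)
{
    // Only 503 is treated as "retry later"; other statuses are left to the caller.
    if (response.StatusCode != HttpStatusCode.ServiceUnavailable)
        return null;

    var retryAfter = response.Headers.RetryAfter;
    if (retryAfter?.Delta is TimeSpan delta)
        return delta;                        // seconds form, e.g. "Retry-After: 30"
    if (retryAfter?.Date is DateTimeOffset date)
    {
        var wait = date - DateTimeOffset.UtcNow;
        return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
    }
    return fallback;                         // header missing
}

// Usage with a locally built response (no network involved).
var rejected = new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
rejected.Headers.RetryAfter = new RetryConditionHeaderValue(TimeSpan.FromSeconds(30));
Console.WriteLine(GetRetryDelay(rejected, TimeSpan.FromSeconds(5))); // prints 00:00:30
```

A real client would probably also cap the number of retries, which this sketch leaves out.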

docs/assets/styles.css

Lines changed: 18 additions & 26 deletions
@@ -1,35 +1,27 @@
-/* Minimal overrides for Material theme used by Playwright Grid docs */
-:root {
-  /* Brand-ish colors (tweak as needed) */
-  --md-primary-fg-color: #2563eb; /* blue-600 */
-  --md-accent-fg-color: #22c55e; /* green-500 */
-}
+/* Minimal tweaks on top of Material theme */
 
-/* Typography */
-.md-typeset {
-  -webkit-font-smoothing: antialiased;
+/* Improve code block readability */
+:root {
+  --code-bg: #0b1220;
 }
-.md-typeset h1, .md-typeset h2, .md-typeset h3 {
-  letter-spacing: 0.2px;
+.md-typeset pre>code {
+  background-color: var(--code-bg);
+  color: #e6edf3;
 }
+
+/* Slightly tighter paragraphs */
 .md-typeset p {
-  margin: 0.5em 0 0.9em;
+  line-height: 1.55;
 }
 
-/* Inline code */
-.md-typeset :not(pre) > code {
-  padding: 0.1em 0.25em;
-  border-radius: 0.25rem;
-  white-space: break-spaces;
+/* Make tables scroll on small screens */
+.md-typeset table {
+  display: block;
+  overflow-x: auto;
+  white-space: nowrap;
 }
 
-/* Code blocks */
-.md-typeset pre {
-  margin: 0.75rem 0;
-}
-.md-typeset pre > code {
-  display: block;
-  padding: 0.75rem 1rem;
-  border-radius: 0.5rem;
-  white-space: pre-wrap; /* wrap long lines */
+/* Subtle link underline on hover */
+.md-typeset a:hover {
+  text-decoration: underline;
 }

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Use the links below to get started and dive into specific topics.
 - Capacity Queue (pending borrows, fairness, timeouts): Capacity-Queue.md
 - Borrow TTL & Session Persistence: Borrow-TTL-and-Session-Persistence.md
 - Node Liveness & Sweeper: Node-Liveness-and-Sweeper.md
+- Graceful Shutdown: Graceful-Shutdown.md
 - Session Distribution across workers: distribution.md
 
 ## Observability

docs/tasks.md

Lines changed: 2 additions & 2 deletions
@@ -16,12 +16,12 @@ The following is an ordered, actionable checklist covering architectural and cod
 10. [X] Introduce a capacity queue in Hub for pending borrows with timeout and fairness (per-label and per-run caps) to reduce thundering herd.
 11. [X] Implement node heartbeat/liveness tracker with configurable timeout; evict stale nodes and reclaim/expire orphaned sessions.
 12. [X] Add borrow TTL and auto-return on timeout; persist session state to Redis to survive Hub restarts.
-13. [ ] Harden Redis usage: resilience (timeouts, retries with jitter, circuit breaker), connection settings, and health checks integrated into readiness.
+13. [X] Harden Redis usage: resilience (timeouts, retries with jitter, circuit breaker), connection settings, and health checks integrated into readiness.
 14. [ ] Support secret rotation: accept multiple HUB_RUNNER_SECRET/HUB_NODE_SECRET values (comma-separated) and log deprecation windows.
 15. [ ] Redact secrets and PII in logs; ensure headers and sensitive values never appear in structured logs.
 16. [ ] Add rate limiting (per IP and per runner id) on Hub borrow/return to protect from abuse; return 429 with Retry-After.
 17. [ ] Add optional IP allowlist or token-based auth (e.g., PAT via header) for Hub API alongside shared secrets.
-18. [ ] Implement graceful shutdown: Hub stops accepting new borrows; Worker drains sessions and returns cleanly on SIGTERM.
+18. [X] Implement graceful shutdown: Hub stops accepting new borrows; Worker drains sessions and returns cleanly on SIGTERM.
 19. [ ] Enforce maximum WebSocket message size and idle timeouts in Worker; send periodic pings and close dead connections.
 20. [ ] Add backpressure controls in Worker WS proxy (bounded channels, drop policy, and metrics for drops).
 21. [ ] Strengthen Worker sidecar management: sidecar health endpoint, restart/backoff strategy, and clear error surfacing to Hub.

hub/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -23,5 +23,5 @@ COPY --from=build /app/publish .
 ENV ASPNETCORE_URLS=http://0.0.0.0:5000 \
     DOTNET_RUNNING_IN_CONTAINER=true
 EXPOSE 5000
-HEALTHCHECK --interval=10s --timeout=3s --retries=5 --start-period=20s CMD curl -fsSL http://127.0.0.1:5000/health || exit 1
+HEALTHCHECK --interval=10s --timeout=3s --retries=5 --start-period=20s CMD curl -fsSL http://127.0.0.1:5000/health/ready || exit 1
 ENTRYPOINT ["dotnet","PlaywrightHub.dll"]

hub/Infrastructure/Web/EndpointMappingExtensions.cs

Lines changed: 76 additions & 0 deletions
@@ -190,6 +190,10 @@ private sealed class Waiter
 
 public static class EndpointMappingExtensions
 {
+    private static long _redisBreakerUntilTicks;
+    private static int _redisConsecutiveFailures;
+    private static volatile bool _acceptingBorrows = true;
+
     public static void MapHubEndpoints(this WebApplication app)
     {
         var config = app.Configuration;
@@ -212,6 +216,13 @@ public static void MapHubEndpoints(this WebApplication app)
         var resultsStore = services.GetRequiredService<IResultsStore>();
         var resultsHubCtx = services.GetRequiredService<IHubContext<ResultsHub, IResultsClient>>();
 
+        // Graceful shutdown: stop accepting new borrow requests when application is stopping
+        app.Lifetime.ApplicationStopping.Register(() =>
+        {
+            _acceptingBorrows = false;
+            try { Console.WriteLine("[hub] ApplicationStopping: stop accepting new borrows"); } catch { }
+        });
+
         var hubRunnerSecret = config["HUB_RUNNER_SECRET"] ?? "runner-secret";
         var hubNodeSecret = config["HUB_NODE_SECRET"] ?? "node-secret";
         var nodeTimeoutSeconds = int.TryParse(config["HUB_NODE_TIMEOUT"], out var t) ? t : 60;
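
The hunk above relies on `ApplicationStopping` being a `CancellationToken` whose registered callbacks run when the host begins to stop. A small standalone sketch of that pattern follows. A `CancellationTokenSource` stands in for the host lifetime here, and `BorrowStatus` is a hypothetical stand-in for the borrow endpoint.

```csharp
using System;
using System.Threading;

// Stand-in for the host's ApplicationStopping token (assumption for illustration).
using var stopping = new CancellationTokenSource();

// Shared flag, written and read with Volatile so other threads see the change.
var acceptingBorrows = true;
stopping.Token.Register(() => Volatile.Write(ref acceptingBorrows, false));

// Hypothetical endpoint logic: 200 while accepting, 503 once shutdown has begun.
int BorrowStatus() => Volatile.Read(ref acceptingBorrows) ? 200 : 503;

Console.WriteLine(BorrowStatus()); // prints 200
stopping.Cancel();                 // simulates the host signalling shutdown
Console.WriteLine(BorrowStatus()); // prints 503
```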
@@ -420,6 +431,14 @@ public static void MapHubEndpoints(this WebApplication app)
                 return Results.Unauthorized();
             }
 
+            // During graceful shutdown, deny new borrows with 503
+            if (!_acceptingBorrows)
+            {
+                try { req.HttpContext.Response.Headers.Append("Retry-After", "30"); } catch { }
+                try { borrowOutcomes.WithLabels("unknown", "denied").Inc(); } catch { }
+                return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+            }
+
             var body = await req.ReadFromJsonAsync<Dictionary<string, string>>() ??
                        new Dictionary<string, string>();
             if (!body.TryGetValue("labelKey", out var labelKey) || string.IsNullOrEmpty(labelKey))
@@ -2230,6 +2249,63 @@ static bool IsSecret(string key)
 
         app.MapGet("/nodes", () => Results.Ok(db.SetMembers("nodes").Select(x => x.ToString())));
         app.MapGet("/health", () => Results.Ok(new { status = "ok" }));
+        app.MapGet("/health/ready", async () =>
+        {
+            // During graceful shutdown, report not ready
+            if (!_acceptingBorrows)
+            {
+                return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+            }
+            try
+            {
+                var timeoutMs = int.TryParse(config["REDIS_HEALTH_TIMEOUT_MS"], out var ms) ? Math.Max(100, ms) : 1000;
+                var breakerThreshold = int.TryParse(config["REDIS_BREAKER_THRESHOLD"], out var th) ? Math.Max(1, th) : 3;
+                var breakerCooldownMs = int.TryParse(config["REDIS_BREAKER_COOLDOWN_MS"], out var cd) ? Math.Max(500, cd) : 5000;
+
+                // Circuit breaker: if open, fail fast
+                var nowTicks = DateTime.UtcNow.Ticks;
+                if (System.Threading.Interlocked.Read(ref _redisBreakerUntilTicks) > nowTicks)
+                {
+                    return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+                }
+
+                if (!mux.IsConnected)
+                {
+                    var until = nowTicks + (long)breakerCooldownMs * TimeSpan.TicksPerMillisecond;
+                    System.Threading.Interlocked.Exchange(ref _redisBreakerUntilTicks, until);
+                    System.Threading.Interlocked.Exchange(ref _redisConsecutiveFailures, 0);
+                    return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+                }
+
+                var sw = System.Diagnostics.Stopwatch.StartNew();
+                var pingTask = db.PingAsync();
+                var completed = await Task.WhenAny(pingTask, Task.Delay(timeoutMs));
+                if (completed != pingTask)
+                {
+                    return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+                }
+                await pingTask; // propagate any errors
+                sw.Stop();
+
+                // Success: reset failure counters and breaker
+                System.Threading.Interlocked.Exchange(ref _redisConsecutiveFailures, 0);
+                System.Threading.Interlocked.Exchange(ref _redisBreakerUntilTicks, 0);
+                return Results.Ok(new { status = "ready", redis = new { pingMs = sw.ElapsedMilliseconds } });
+            }
+            catch
+            {
+                var threshold = System.Threading.Interlocked.Increment(ref _redisConsecutiveFailures);
+                var breakerThreshold = int.TryParse(config["REDIS_BREAKER_THRESHOLD"], out var th2) ? Math.Max(1, th2) : 3;
+                var breakerCooldownMs = int.TryParse(config["REDIS_BREAKER_COOLDOWN_MS"], out var cd2) ? Math.Max(500, cd2) : 5000;
+                if (threshold >= breakerThreshold)
+                {
+                    var until = DateTime.UtcNow.Ticks + (long)breakerCooldownMs * TimeSpan.TicksPerMillisecond;
+                    System.Threading.Interlocked.Exchange(ref _redisBreakerUntilTicks, until);
+                    System.Threading.Interlocked.Exchange(ref _redisConsecutiveFailures, 0);
+                }
+                return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
+            }
+        });
     }
 
     private static bool CheckSecret(HttpRequest req, string header, string expected)
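
The `/health/ready` handler above combines a consecutive-failure counter with an "open until" timestamp. The sketch below isolates that breaker logic in a single-threaded form so the state transitions are easier to follow. It is an illustration only: it uses plain locals instead of `Interlocked`, and the names `IsOpen` and `Record` are hypothetical. As the hunk reads, only exceptions increment the counter there; the ping-timeout branch returns 503 without counting a failure.

```csharp
using System;

// Values mirroring the defaults in the diff (threshold 3, cooldown 5000 ms).
const int threshold = 3;
var cooldown = TimeSpan.FromMilliseconds(5000);

var consecutiveFailures = 0;
var openUntil = DateTimeOffset.MinValue;

// True while the breaker is open and probes should fail fast.
bool IsOpen(DateTimeOffset now) => openUntil > now;

// Record one probe result; opens the breaker after `threshold` consecutive failures.
void Record(bool success, DateTimeOffset now)
{
    if (success)
    {
        consecutiveFailures = 0;
        openUntil = DateTimeOffset.MinValue;
        return;
    }
    if (++consecutiveFailures >= threshold)
    {
        openUntil = now + cooldown;
        consecutiveFailures = 0;
    }
}

var t0 = DateTimeOffset.UtcNow;
Record(false, t0);
Record(false, t0);
Console.WriteLine(IsOpen(t0));            // False: only two failures so far
Record(false, t0);
Console.WriteLine(IsOpen(t0));            // True: third failure opened the breaker
Console.WriteLine(IsOpen(t0 + cooldown)); // False: cooldown has elapsed
```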

hub/Services/HubServiceRunner.cs

Lines changed: 16 additions & 1 deletion
@@ -107,7 +107,22 @@ public static async Task RunAsync(string[] args)
 
 
         var redisUrl = builder.Configuration["REDIS_URL"] ?? "redis:6379";
-        var mux = await ConnectionMultiplexer.ConnectAsync(redisUrl);
+        // Configure Redis connection with resilience and timeouts
+        var redisOptions = ConfigurationOptions.Parse(redisUrl, true);
+        redisOptions.AbortOnConnectFail = false; // keep retrying
+        redisOptions.ConnectRetry = 3;
+        redisOptions.KeepAlive = 15;
+        int GetInt(string key, int def)
+        {
+            return int.TryParse(builder.Configuration[key], out var v) ? Math.Max(0, v) : def;
+        }
+        redisOptions.ConnectTimeout = GetInt("REDIS_CONNECT_TIMEOUT_MS", 5000);
+        redisOptions.SyncTimeout = GetInt("REDIS_SYNC_TIMEOUT_MS", 5000);
+        redisOptions.AsyncTimeout = GetInt("REDIS_ASYNC_TIMEOUT_MS", 5000);
+        // Exponential reconnect backoff policy (includes jitter internally)
+        redisOptions.ReconnectRetryPolicy = new ExponentialRetry(5000);
+
+        var mux = await ConnectionMultiplexer.ConnectAsync(redisOptions);
         var db = mux.GetDatabase();
 
         // Services for dashboard integration
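
The `GetInt` local function above parses a setting, clamps it at zero, and falls back to a default when the value is missing or unparsable. A standalone sketch of the same shape follows. A dictionary stands in for `builder.Configuration`, and the values are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for builder.Configuration; these values are illustrative only.
var settings = new Dictionary<string, string>
{
    ["REDIS_CONNECT_TIMEOUT_MS"] = "2500",
    ["REDIS_SYNC_TIMEOUT_MS"] = "-10",
    ["REDIS_ASYNC_TIMEOUT_MS"] = "fast",
};

// Same shape as GetInt in the diff: parse, clamp at zero, else use the default.
int GetInt(string key, int def) =>
    settings.TryGetValue(key, out var raw) && int.TryParse(raw, out var v) ? Math.Max(0, v) : def;

Console.WriteLine(GetInt("REDIS_CONNECT_TIMEOUT_MS", 5000)); // 2500: parsed as-is
Console.WriteLine(GetInt("REDIS_SYNC_TIMEOUT_MS", 5000));    // 0: negative value clamped
Console.WriteLine(GetInt("REDIS_ASYNC_TIMEOUT_MS", 5000));   // 5000: unparsable, default used
Console.WriteLine(GetInt("REDIS_MISSING", 5000));            // 5000: key absent, default used
```

Note that a negative value is clamped to 0 rather than replaced by the default, which may or may not be the intended behavior for a timeout.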

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ nav:
   - Capacity Queue: Capacity-Queue.md
   - Borrow TTL & Session Persistence: Borrow-TTL-and-Session-Persistence.md
   - Node Liveness and Sweeper: Node-Liveness-and-Sweeper.md
+  - Graceful Shutdown: Graceful-Shutdown.md
   - "Playwright .NET pw:api": PlaywrightDotNet-pw-api.md
   - Presentation: presentation-why-playwright-grid.md
   - Metrics and Grafana: Metrics-and-Grafana.md

worker/Services/PoolManager.cs

Lines changed: 6 additions & 0 deletions
@@ -60,6 +60,12 @@ public bool HasActiveConnection(string browserId)
         return _activeWs.TryGetValue(browserId, out var v) && v > 0;
     }
 
+    public bool HasAnyActiveConnections()
+    {
+        try { return !_activeWs.IsEmpty; }
+        catch { return _activeWs.Count > 0; }
+    }
+
     private static string NormalizeBrowser(string s)
     {
         return string.IsNullOrWhiteSpace(s) ? "Chromium" : s.Trim();
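
`HasAnyActiveConnections()` is described in the new doc as the input to the Worker's drain logic, but `worker/Services/WebServerHost.cs` is not shown on this page. The following is therefore only a sketch of how a bounded drain could poll such a predicate. `DrainAsync` and its parameters are hypothetical, and the actual implementation may differ.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Poll until no connections remain or the timeout elapses.
// Returns true if the drain completed, false if it timed out.
static async Task<bool> DrainAsync(Func<bool> hasActive, TimeSpan timeout, TimeSpan pollInterval)
{
    var sw = Stopwatch.StartNew();
    while (hasActive())
    {
        if (sw.Elapsed >= timeout) return false; // caller would force-kill sidecars here
        await Task.Delay(pollInterval);
    }
    return true;
}

// Usage: a fake connection counter that "closes" one session per poll.
var remaining = 3;
var drained = await DrainAsync(() => remaining-- > 0, TimeSpan.FromSeconds(2), TimeSpan.FromMilliseconds(10));
Console.WriteLine(drained); // prints True
```

In the Worker, the timeout would correspond to `WORKER_DRAIN_TIMEOUT_SECONDS` and the predicate to `HasAnyActiveConnections()`.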
