Commit 0c350ec

Add Borrow TTL handling, improve node sweeper, and refactor Dockerfiles for caching
- Introduced Borrow TTL sweeper to auto-return expired sessions and persist state to Redis.
- Enhanced NodeSweeperService to prune in-use entries, preventing capacity leaks from orphaned records.
- Updated Dockerfiles across services for better cache optimization.
- Revised documentation: added node liveness, sweeper explanation, and operational tips.
- Adjusted Redis session state persistence for better resilience and recovery.
- Minor refactoring of unused imports and code cleanup.
1 parent c464b1a commit 0c350ec

18 files changed: +338 -44 lines changed

Agenix.PlaywrightGrid.Domain.Tests/LabelKeyTests.cs

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
-using NUnit.Framework;
 using Agenix.PlaywrightGrid.Domain;
+using NUnit.Framework;

 namespace Agenix.PlaywrightGrid.Domain.Tests;

Agenix.PlaywrightGrid.Domain.Tests/LabelMatcherTests.cs

Lines changed: 1 addition & 1 deletion
@@ -1,8 +1,8 @@
 using System;
 using System.Collections.Generic;
 using System.Linq;
-using NUnit.Framework;
 using Agenix.PlaywrightGrid.Domain;
+using NUnit.Framework;

 namespace Agenix.PlaywrightGrid.Domain.Tests;

Agenix.PlaywrightGrid.Domain/Agenix.PlaywrightGrid.Domain.csproj

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 <TargetFramework>net8.0</TargetFramework>
 <ImplicitUsings>enable</ImplicitUsings>
 <Nullable>enable</Nullable>
+<IsPackable>false</IsPackable>
 </PropertyGroup>

 </Project>

Agenix.PlaywrightGrid.HubClient/Agenix.PlaywrightGrid.HubClient.csproj

Lines changed: 0 additions & 3 deletions
@@ -20,7 +20,4 @@
 <PackageReference Include="Microsoft.Extensions.DependencyInjection.Abstractions" Version="9.0.8"/>
 <PackageReference Include="Microsoft.Playwright" Version="1.54.0"/>
 </ItemGroup>
-<ItemGroup>
-<ProjectReference Include="..\Agenix.PlaywrightGrid.Domain\Agenix.PlaywrightGrid.Domain.csproj" />
-</ItemGroup>
 </Project>

Agenix.PlaywrightGrid.HubClient/HubUrlProvider.cs

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-using System;
 using Microsoft.Extensions.Configuration;

 namespace Agenix.PlaywrightGrid.HubClient;

Agenix.PlaywrightGrid.HubClient/PlaywrightEventForwarder.cs

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-using System;
-using System.Threading.Tasks;
 using Microsoft.Playwright;

 namespace Agenix.PlaywrightGrid.HubClient;

dashboard/Dockerfile

Lines changed: 4 additions & 6 deletions
@@ -3,13 +3,11 @@ FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
 ARG BUILD_CONFIGURATION=Release
 WORKDIR /src

-# Restore using only the project file for better layer caching
-COPY dashboard/Dashboard.csproj dashboard/
-COPY Agenix.PlaywrightGrid.Domain/Agenix.PlaywrightGrid.Domain.csproj Agenix.PlaywrightGrid.Domain/
-RUN dotnet restore dashboard/Dashboard.csproj
-
-# Build and publish
+# Copy source
 COPY . .
+# Restore
+RUN dotnet restore dashboard/Dashboard.csproj
+# Publish
 RUN dotnet publish dashboard/Dashboard.csproj -c $BUILD_CONFIGURATION -o /app/publish /p:UseAppHost=false

 # Runtime stage

dashboard/Program.cs

Lines changed: 2 additions & 2 deletions
@@ -11,10 +11,10 @@
 using Microsoft.Extensions.DependencyInjection;
 using Microsoft.Extensions.Hosting;
 using Microsoft.Extensions.Logging;
+using OpenTelemetry.Exporter;
 using OpenTelemetry.Metrics;
-using OpenTelemetry.Trace;
 using OpenTelemetry.Resources;
-using OpenTelemetry.Exporter;
+using OpenTelemetry.Trace;

 const string hubSignalRConfigKey = "HUB_SIGNALR";

docs/Node-Liveness-and-Sweeper.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Node Liveness and Sweeper (Hub)
+
+This document explains how the Hub tracks worker node liveness, the configuration knobs, and what happens when a node becomes stale or disappears. It also outlines how orphaned sessions are reclaimed to avoid capacity leaks.
+
+Overview
+- Workers periodically emit heartbeats to Redis updating:
+  - node:{nodeId} hash fields: LastSeen (ISO-8601 UTC), Labels (JSON), Capacity
+  - nodes set membership (nodeId)
+  - node_alive:{nodeId} key with TTL (default 90s in Worker)
+- The Hub runs a background NodeSweeperService that periodically scans for stale nodes and prunes associated capacity entries. It complements the Worker heartbeats by performing garbage collection when nodes are dead or unreachable.
+
+Configuration (environment variables)
+- HUB_NODE_TIMEOUT: seconds of inactivity before a node is considered stale. Default: 60.
+- HUB_SWEEPER_EXPIRE: if true, the sweeper will actually expire nodes and prune data. If false, it will refresh a short TTL on node_alive:{nodeId} and log what would happen. Default: false (dry-run).
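
The two settings documented above could be read along the following lines. This is a minimal sketch, not the Hub's code from this commit: `SweeperOptions` and `FromEnvironment` are hypothetical names, and falling back to the default on an unparseable or non-positive value is an assumption, since the document only states the defaults.

```csharp
using System;

// Hypothetical options type; only the variable names and defaults come from the document.
public sealed record SweeperOptions(TimeSpan NodeTimeout, bool Expire)
{
    public static SweeperOptions FromEnvironment(Func<string, string?> getEnv)
    {
        // HUB_NODE_TIMEOUT is in seconds. Assumption: unset, non-numeric,
        // or non-positive values fall back to the documented default of 60.
        var seconds = int.TryParse(getEnv("HUB_NODE_TIMEOUT"), out var s) && s > 0 ? s : 60;

        // HUB_SWEEPER_EXPIRE defaults to false (dry-run). Assumption: only a value
        // that bool.TryParse reads as true enables real expiry.
        var expire = bool.TryParse(getEnv("HUB_SWEEPER_EXPIRE"), out var e) && e;

        return new SweeperOptions(TimeSpan.FromSeconds(seconds), expire);
    }
}
```

Passing the lookup as a delegate keeps the parsing testable; in a service it would typically be called as `SweeperOptions.FromEnvironment(Environment.GetEnvironmentVariable)`.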
+
+How the sweeper works
+1) Tick interval: every ~20 seconds the service performs a pass.
+2) For each nodeId in Redis set "nodes":
+  - If node_alive:{nodeId} exists → node is healthy, skip.
+  - Else, parse node:{nodeId} LastSeen (strict ISO-8601 Roundtrip). If missing/invalid or older than HUB_NODE_TIMEOUT → candidate for expiration.
+  - Small tolerance: if LastSeen is in the future by >5s (clock skew), do not expire.
+  - Double check: if node_alive:{nodeId} re-appears during the pass, skip to avoid race with a fresh heartbeat.
+  - If there are still available:* entries that reference this node, we treat the node as alive and refresh node_alive TTL to 30s, skipping expiration for this tick. This avoids evicting a node that is actively serving capacity but briefly missed heartbeat.
+3) When expiring (HUB_SWEEPER_EXPIRE=true):
+  - Remove nodeId from set "nodes" and delete hash key node:{nodeId}.
+  - Prune available:* lists: remove entries containing this nodeId.
+  - Prune inuse:* lists (new): remove entries containing this nodeId, and best-effort delete lightweight mappings browser_run:{browserId} and browser_test:{browserId} if browserId is present. This reclaims capacity that would otherwise be stuck.
+4) Logs include per-tick stats: scanned, expired, errors, and tick duration.
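
The per-node rules in step 2 can be sketched as a pure decision function. This is an illustration under stated assumptions, not the NodeSweeperService implementation: `SweepDecision` and `SweepRules.Decide` are hypothetical names, the inputs are assumed to be values the caller has already read from Redis, and the second `node_alive:{nodeId}` check described above is left to the caller just before it acts on the result.

```csharp
using System;
using System.Globalization;

// Hypothetical outcome names for one node in one tick.
public enum SweepDecision { Healthy, WithinTimeout, SkipClockSkew, RefreshAlive, DryRun, Expire }

public static class SweepRules
{
    public static SweepDecision Decide(
        bool aliveKeyExists,      // node_alive:{nodeId} is present
        string? lastSeen,         // LastSeen field of node:{nodeId}
        bool hasAvailableEntries, // some available:* entry references the node
        TimeSpan nodeTimeout,     // HUB_NODE_TIMEOUT
        bool expireEnabled,       // HUB_SWEEPER_EXPIRE
        DateTime nowUtc)
    {
        if (aliveKeyExists) return SweepDecision.Healthy;

        // Strict round-trip ("o") parse; a missing or invalid value is treated as stale.
        if (DateTime.TryParseExact(lastSeen, "o", CultureInfo.InvariantCulture,
                DateTimeStyles.RoundtripKind, out var parsed))
        {
            var seen = parsed.ToUniversalTime();
            // More than 5 seconds in the future: likely clock skew, so do not expire.
            if (seen - nowUtc > TimeSpan.FromSeconds(5)) return SweepDecision.SkipClockSkew;
            if (nowUtc - seen <= nodeTimeout) return SweepDecision.WithinTimeout;
        }

        // Stale by LastSeen but still offering capacity: refresh node_alive instead.
        if (hasAvailableEntries) return SweepDecision.RefreshAlive;

        return expireEnabled ? SweepDecision.Expire : SweepDecision.DryRun;
    }
}
```

Keeping the decision separate from the Redis reads and writes makes each rule unit-testable with fixed timestamps.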
+
+Why prune inuse:* too?
+Previously, only available:* lists were pruned. If a node died while a browser was borrowed (inuse:*), capacity would remain stuck. The sweeper now removes those orphaned records and clears run/test mappings so new borrows are not blocked by phantom in-use entries.
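
A rough sketch of the in-use pruning described above. The commit does not show how `inuse:*` entries are encoded, so the JSON shape used here (`nodeId` plus an optional `browserId`) is purely an assumption, as are the names `InUsePruner` and `Prune`. Only the `browser_run:{browserId}` and `browser_test:{browserId}` key names come from the document.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class InUsePruner
{
    // ASSUMPTION: each entry is a JSON object such as {"nodeId":"n1","browserId":"b1"}.
    // Returns the entries to keep and the mapping keys to delete on a best-effort basis.
    public static (List<string> Kept, List<string> KeysToDelete) Prune(
        IEnumerable<string> entries, string deadNodeId)
    {
        var kept = new List<string>();
        var keysToDelete = new List<string>();

        foreach (var entry in entries)
        {
            if (!TryReadIds(entry, out var nodeId, out var browserId) || nodeId != deadNodeId)
            {
                kept.Add(entry); // other nodes' entries and unparseable ones are left alone
                continue;
            }

            // Entry belongs to the dead node: drop it and clear its run/test mappings.
            if (browserId is not null)
            {
                keysToDelete.Add($"browser_run:{browserId}");
                keysToDelete.Add($"browser_test:{browserId}");
            }
        }

        return (kept, keysToDelete);
    }

    private static bool TryReadIds(string json, out string? nodeId, out string? browserId)
    {
        nodeId = null;
        browserId = null;
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return false;
            if (root.TryGetProperty("nodeId", out var n) && n.ValueKind == JsonValueKind.String)
                nodeId = n.GetString();
            if (root.TryGetProperty("browserId", out var b) && b.ValueKind == JsonValueKind.String)
                browserId = b.GetString();
            return nodeId is not null;
        }
        catch (JsonException)
        {
            return false;
        }
    }
}
```

The function only computes what to remove; issuing the Redis list and key deletions is left to the caller, which keeps the "best-effort" failure handling in one place.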
+
+Related components
+- Worker HeartbeatService: updates LastSeen and sets node_alive TTL so healthy nodes are never swept.
+- RunCleanupService: a separate hub background service that can auto-return outstanding browsers when runs become inactive or exceed max duration. This operates at run level, whereas NodeSweeperService operates at node level.
+
+Operational tips
+- If you are testing locally and want to observe sweeper behavior quickly:
+  - Set HUB_NODE_TIMEOUT=5 and HUB_SWEEPER_EXPIRE=true on the hub.
+  - Stop a worker to simulate a dead node.
+  - Watch hub logs for "[Sweeper] Expiring node=..." and pruning messages.
+- In CI or during cautious rollouts, set HUB_SWEEPER_EXPIRE=false to dry-run. The sweeper will log and refresh a short node_alive TTL instead of deleting anything.
+
+Metrics
+- While the sweeper itself does not currently expose Prometheus metrics, overall pool gauges (available counts per label) are updated elsewhere. Consider adding sweeper-specific counters if needed for ops visibility.
+
+Security considerations
+- The sweeper only reads/writes keys used by the grid. Keys deleted are specific to the expired node or to browserId mappings captured from the in-use entries.
+
+Version
+- Introduced orphaned in-use pruning in this repository session (2025-08-31).
+
+Interpreting Sweeper logs
+- The service logs a summary at the end of each pass, e.g.: [Sweeper] Tick done: scanned=3 expired=0 errors=0 took=2ms
+- scanned=N: number of nodeIds in the Redis set "nodes" that were evaluated this tick.
+- expired=N: how many nodes were actually expired (removed and pruned) in this tick. This remains 0 when:
+  - Nodes are healthy (node_alive:{nodeId} TTL present), or
+  - LastSeen is within HUB_NODE_TIMEOUT, or
+  - HUB_SWEEPER_EXPIRE=false (dry-run mode), or
+  - The sweeper detected active available:* entries for the node and refreshed a short TTL instead of expiring.
+- errors=N: number of caught exceptions during processing (per-node or loop-level). Non-zero suggests Redis or parsing issues.
+- took=Xms: how long the entire sweep iteration took in milliseconds.
+- If you see scanned>0 with expired=0 consistently, it typically means heartbeats are healthy and no nodes are stale.
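
For ad-hoc log analysis, the per-tick summary line quoted above could be parsed as follows. This is a small sketch that assumes the line keeps exactly the format of the quoted example; `SweeperTick` and `SweeperLog` are hypothetical names and are not part of the Hub.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical holder for the four per-tick stats.
public sealed record SweeperTick(int Scanned, int Expired, int Errors, int TookMs);

public static class SweeperLog
{
    // Matches e.g. "[Sweeper] Tick done: scanned=3 expired=0 errors=0 took=2ms".
    private static readonly Regex TickPattern = new(
        @"\[Sweeper\] Tick done: scanned=([0-9]{1,9}) expired=([0-9]{1,9}) errors=([0-9]{1,9}) took=([0-9]{1,9})ms",
        RegexOptions.Compiled);

    // Returns null when the line is not a tick summary.
    public static SweeperTick? TryParse(string line)
    {
        var m = TickPattern.Match(line);
        if (!m.Success) return null;

        int Group(int i) => int.Parse(m.Groups[i].Value, CultureInfo.InvariantCulture);
        return new SweeperTick(Group(1), Group(2), Group(3), Group(4));
    }
}
```

A non-null result with `Errors > 0` is the case worth alerting on, since the document ties non-zero errors to Redis or parsing issues.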

docs/tasks.md

Lines changed: 2 additions & 2 deletions
@@ -14,8 +14,8 @@ The following is an ordered, actionable checklist covering architectural and cod
 8. [X] Add distributed tracing via OpenTelemetry (traces, metrics, logs) with exporters configurable (OTLP/Prometheus).
 9. [X] Expand Prometheus metrics: borrow latency histogram, borrow outcomes (success/timeout/denied), pool utilization per label, queue length, node heartbeats.
 10. [X] Introduce a capacity queue in Hub for pending borrows with timeout and fairness (per-label and per-run caps) to reduce thundering herd.
-11. [ ] Implement node heartbeat/liveness tracker with configurable timeout; evict stale nodes and reclaim/expire orphaned sessions.
-12. [ ] Add borrow TTL and auto-return on timeout; persist session state to Redis to survive Hub restarts.
+11. [X] Implement node heartbeat/liveness tracker with configurable timeout; evict stale nodes and reclaim/expire orphaned sessions.
+12. [X] Add borrow TTL and auto-return on timeout; persist session state to Redis to survive Hub restarts.
 13. [ ] Harden Redis usage: resilience (timeouts, retries with jitter, circuit breaker), connection settings, and health checks integrated into readiness.
 14. [ ] Support secret rotation: accept multiple HUB_RUNNER_SECRET/HUB_NODE_SECRET values (comma-separated) and log deprecation windows.
 15. [ ] Redact secrets and PII in logs; ensure headers and sensitive values never appear in structured logs.
