# Node Liveness and Sweeper (Hub)

This document explains how the Hub tracks worker node liveness, the configuration knobs, and what happens when a node becomes stale or disappears. It also outlines how orphaned sessions are reclaimed to avoid capacity leaks.

## Overview
- Workers periodically emit heartbeats to Redis, updating:
  - `node:{nodeId}` hash fields: `LastSeen` (ISO-8601 UTC), `Labels` (JSON), `Capacity`
  - membership in the `nodes` set (`nodeId`)
  - the `node_alive:{nodeId}` key with a TTL (default 90s in the Worker)
- The Hub runs a background NodeSweeperService that periodically scans for stale nodes and prunes associated capacity entries. It complements the Worker heartbeats by performing garbage collection when nodes are dead or unreachable.
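The heartbeat writes described above can be sketched as follows. This is a minimal illustration: the key names and TTL match the document, while the `FakeRedis` helper, the `heartbeat` function name, and the field values are assumptions for the sketch, not the actual Worker implementation.

```python
import json
from datetime import datetime, timezone

class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands a heartbeat uses."""
    def __init__(self):
        self.hashes, self.sets, self.ttl_keys = {}, {}, {}
    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)
    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)
    def set(self, key, value, ex=None):
        self.ttl_keys[key] = (value, ex)

def heartbeat(redis, node_id, labels, capacity, alive_ttl=90):
    """One heartbeat pass: refresh the node hash, set membership, and the liveness TTL."""
    now = datetime.now(timezone.utc).isoformat()  # ISO-8601 UTC LastSeen
    redis.hset(f"node:{node_id}", {"LastSeen": now,
                                   "Labels": json.dumps(labels),
                                   "Capacity": capacity})
    redis.sadd("nodes", node_id)                           # keep nodeId in the "nodes" set
    redis.set(f"node_alive:{node_id}", "1", ex=alive_ttl)  # default 90s TTL
```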

## Configuration (environment variables)
- `HUB_NODE_TIMEOUT`: seconds of inactivity before a node is considered stale. Default: 60.
- `HUB_SWEEPER_EXPIRE`: if true, the sweeper actually expires nodes and prunes their data. If false, it refreshes a short TTL on `node_alive:{nodeId}` and logs what would happen. Default: false (dry run).
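Parsing these two knobs with their documented defaults can be sketched as below; the `read_sweeper_config` helper name is an assumption, only the variable names and defaults come from this document.

```python
import os

def read_sweeper_config(env=os.environ):
    """Parse the sweeper's environment knobs, falling back to the documented defaults."""
    timeout_s = int(env.get("HUB_NODE_TIMEOUT", "60"))                 # stale after 60s
    expire = env.get("HUB_SWEEPER_EXPIRE", "false").lower() == "true"  # dry run by default
    return timeout_s, expire
```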

## How the sweeper works
1) Tick interval: the service performs a pass roughly every 20 seconds.
2) For each nodeId in the Redis set `nodes`:
   - If `node_alive:{nodeId}` exists → the node is healthy; skip it.
   - Otherwise, parse the `node:{nodeId}` `LastSeen` field (strict ISO-8601 round-trip format). If it is missing, invalid, or older than HUB_NODE_TIMEOUT → the node is a candidate for expiration.
   - Clock-skew tolerance: if LastSeen is more than 5s in the future, do not expire.
   - Double check: if `node_alive:{nodeId}` reappears during the pass, skip the node to avoid racing a fresh heartbeat.
   - If available:* entries still reference this node, treat it as alive and refresh the `node_alive` TTL to 30s, skipping expiration for this tick. This avoids evicting a node that is actively serving capacity but briefly missed a heartbeat.
3) When expiring (HUB_SWEEPER_EXPIRE=true):
   - Remove nodeId from the `nodes` set and delete the `node:{nodeId}` hash.
   - Prune available:* lists: remove entries referencing this nodeId.
   - Prune inuse:* lists (new): remove entries referencing this nodeId and, best-effort, delete the lightweight `browser_run:{browserId}` and `browser_test:{browserId}` mappings when a browserId is present. This reclaims capacity that would otherwise stay stuck.
4) Logs include per-tick stats: scanned, expired, errors, and tick duration.
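The per-node decision logic in step 2 can be condensed into a single classifier, sketched below. This is a simplified model, not the Hub's code: `classify_node` and its in-memory arguments are assumptions; the timeout, the 5-second skew tolerance, and the skip/refresh/expire outcomes follow the steps above.

```python
from datetime import datetime, timedelta, timezone

CLOCK_SKEW_TOLERANCE = timedelta(seconds=5)

def classify_node(node_id, alive_keys, last_seen_iso, available_refs,
                  timeout_s, now=None):
    """Decide the sweeper's action for one node: 'skip', 'refresh', or 'expire'."""
    now = now or datetime.now(timezone.utc)
    if f"node_alive:{node_id}" in alive_keys:
        return "skip"                                  # healthy heartbeat TTL present
    try:
        last_seen = datetime.fromisoformat(last_seen_iso)
    except (TypeError, ValueError):
        last_seen = None                               # missing/invalid LastSeen → candidate
    if last_seen is not None:
        if last_seen - now > CLOCK_SKEW_TOLERANCE:
            return "skip"                              # future timestamp: clock skew
        if now - last_seen <= timedelta(seconds=timeout_s):
            return "skip"                              # seen recently enough
    if node_id in available_refs:
        return "refresh"                               # still serving capacity: refresh TTL
    return "expire"
```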

## Why prune inuse:* too?
Previously, only available:* lists were pruned. If a node died while a browser was borrowed (inuse:*), that capacity remained stuck. The sweeper now removes those orphaned records and clears the run/test mappings so new borrows are not blocked by phantom in-use entries.

## Related components
- Worker HeartbeatService: updates LastSeen and sets the `node_alive` TTL so healthy nodes are never swept.
- RunCleanupService: a separate Hub background service that can auto-return outstanding browsers when runs become inactive or exceed their maximum duration. It operates at the run level, whereas NodeSweeperService operates at the node level.

## Operational tips
- To observe sweeper behavior quickly when testing locally:
  - Set HUB_NODE_TIMEOUT=5 and HUB_SWEEPER_EXPIRE=true on the Hub.
  - Stop a worker to simulate a dead node.
  - Watch the Hub logs for "[Sweeper] Expiring node=..." and pruning messages.
- In CI or during cautious rollouts, set HUB_SWEEPER_EXPIRE=false to dry-run. The sweeper will log what it would do and refresh a short `node_alive` TTL instead of deleting anything.

## Metrics
- The sweeper itself does not currently expose Prometheus metrics, but the overall pool gauges (available counts per label) are updated elsewhere. Consider adding sweeper-specific counters if ops visibility requires them.

## Security considerations
- The sweeper only reads and writes keys used by the grid. The keys it deletes are specific to the expired node or to browserId mappings captured from the in-use entries.

## Version
- Orphaned in-use pruning was introduced in this repository session (2025-08-31).

## Interpreting Sweeper logs
- The service logs a summary at the end of each pass, e.g.: [Sweeper] Tick done: scanned=3 expired=0 errors=0 took=2ms
  - scanned=N: the number of nodeIds in the Redis set `nodes` evaluated this tick.
  - expired=N: how many nodes were actually expired (removed and pruned) this tick. This stays 0 when:
    - nodes are healthy (a `node_alive:{nodeId}` TTL is present), or
    - LastSeen is within HUB_NODE_TIMEOUT, or
    - HUB_SWEEPER_EXPIRE=false (dry-run mode), or
    - the sweeper found active available:* entries for the node and refreshed a short TTL instead of expiring it.
  - errors=N: the number of caught exceptions during processing (per-node or loop-level). A non-zero value suggests Redis or parsing issues.
  - took=Xms: how long the entire sweep iteration took, in milliseconds.
- If you consistently see scanned>0 with expired=0, heartbeats are typically healthy and no nodes are stale.
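For dashboards or log alerts, the summary line above can be parsed mechanically. A minimal sketch, assuming the exact format shown in this document; `parse_tick_summary` is a hypothetical helper, not part of the Hub.

```python
import re

# Matches the per-tick summary line documented above.
SUMMARY_RE = re.compile(
    r"\[Sweeper\] Tick done: scanned=(\d+) expired=(\d+) errors=(\d+) took=(\d+)ms")

def parse_tick_summary(line):
    """Extract per-tick stats from a sweeper summary log line, or None if it doesn't match."""
    m = SUMMARY_RE.search(line)
    if not m:
        return None
    scanned, expired, errors, took_ms = map(int, m.groups())
    return {"scanned": scanned, "expired": expired,
            "errors": errors, "took_ms": took_ms}
```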