## Connectivity Layer — Overlay Networking for Leader/Followers

### Purpose

Provide reliable, low‑friction connectivity for run‑everywhere and Mutagen between the leader and follower hosts across Linux, macOS, and Windows.

Key properties:
- One overlay per host/VM, not per task. All agent sessions reuse the same connectivity.
- Prefer SSH as the execution transport; Mutagen spawns its agent over SSH.

### Actors (Roles)

- **Coordinator**: The controller that creates sessions, provisions followers, requests connectivity credentials, and orchestrates handshakes (typically the `aw` client or WebUI backend acting on behalf of the user).
- **Leader**: The primary workspace host (Linux preferred) that owns FsSnapshots and initiates fences and run‑everywhere.
- **Followers**: Secondary hosts (Windows/macOS/Linux) that execute commands and validate builds/tests.

### Recommended Options

- Tailscale (default)
  - WireGuard‑based mesh with automatic NAT traversal, MagicDNS, device tags/ACLs, and optional Tailscale SSH.
  - Simple SSO onboarding across operating systems. Suitable for parallel tasks because a single daemon/TUN per host serves all sessions.
  - Userspace mode for locked‑down containers: `tailscaled --tun=userspace-networking --socks5-server=127.0.0.1:1055`, then route SSH/Mutagen via the SOCKS proxy.
  - Self‑hosted control plane: Headscale.

  - Ephemeral nodes for short‑lived sessions:
    - Use ephemeral auth keys (or `--state=mem:`) so devices auto‑remove shortly after going offline; they receive a fresh IP each time.
    - Immediate cleanup: call `tailscale logout` on teardown (see the join/teardown sketch after this list).
    - Scope access via ACL tags (e.g., `tag:session-<id>` ↔ `tag:session-<id>` only).

- ZeroTier (good alternative)
  - L2/L3 virtual network with NAT traversal and a central controller. Easy multi‑OS setup.
  - Record the allocated overlay IPs (or managed DNS names) in `.agents/hosts.json`.

- Raw WireGuard (minimal)
  - Fast and simple, but requires manual key/IP management and NAT traversal setup. Best for small/static topologies or when WireGuard is already in place.

- SSH‑only (fallback)
  - Direct SSH over public/private networks, or reverse SSH tunnels if followers cannot accept inbound connections. More operational overhead, but universally available.

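A minimal join/teardown sketch for an ephemeral Tailscale follower, assuming the coordinator has issued an ephemeral auth key and a `tag:session-1234` ACL tag exists (both values are placeholders):

```sh
# Join the tailnet as an ephemeral, session-tagged node
sudo tailscale up --auth-key "tskey-auth-XXXXXX" \
     --hostname follower-01 --advertise-tags tag:session-1234

# Teardown: remove the node immediately instead of waiting for ephemeral garbage collection
sudo tailscale logout
```
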
### Operational Guidance

- Standardize on SSH
  - Mutagen can run over SSH; run‑everywhere executes remote commands via SSH.
  - Keep follower SSH access non‑root; prefer short‑lived keys or Tailscale SSH.

- Host Catalog
  - Store overlay addresses and metadata in `.agents/hosts.json` (or via REST):
    ```json
    {
      "hosts": [
        { "name": "win-01", "os": "windows", "address": "win-01.tailnet.example", "tags": ["os=windows"], "sshUser": "builder" },
        { "name": "mac-01", "os": "macos", "address": "100.101.102.103", "tags": ["os=macos"], "sshUser": "builder" }
      ]
    }
    ```

- Security
  - Use overlay ACLs (Tailscale/ZeroTier/NetBird) to restrict leader↔follower reachability.
  - Disable password auth on SSH; prefer keys/SSO; limit to non‑privileged users (a minimal `sshd_config` sketch follows after this list).
  - For NetBird: create Setup Keys with `ephemeral=true`, optionally `reusable`, and `auto_groups` to isolate short‑lived peers; peers auto‑purge after inactivity. Use the `DELETE /api/peers/{id}` API for immediate removal.

- Performance
  - Co‑locate followers when possible; validate MTU; monitor sync‑fence latency in CI multi‑OS smoke tests.
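
A minimal follower-side hardening sketch for the SSH guidance above, assuming a dedicated non-privileged `builder` account (the drop-in path and file name are illustrative):

```
# /etc/ssh/sshd_config.d/agents-workflow.conf (illustrative drop-in)
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin no
AllowUsers builder
```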

### NetBird Support (ephemeral peers)

Workflow:

1) Coordinator requests an ephemeral Setup Key from the Agents‑Workflow REST service (see REST section below).
2) Followers run `netbird up --setup-key <KEY>`.
3) Access is scoped via `auto_groups` and policies (only session peers can talk to each other and the leader).
4) Teardown: stop the agent; peers auto‑purge after ~10 minutes of inactivity, or the coordinator calls `DELETE /api/peers/{id}` via the NetBird API for immediate removal (a curl sketch follows after the notes below).

Notes:
- Keys may be reusable to simplify parallel follower startup; groups isolate session scope.
- Ephemeral peers should not be referenced by fixed IPs; use names/groups.
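
A hedged sketch of the corresponding NetBird management API calls (field names reflect the public NetBird API; the base URL, token, group ID, and peer ID are placeholders; verify against your deployment):

```sh
# Create an ephemeral, reusable setup key scoped to a session group
curl -X POST "https://netbird.example/api/setup-keys" \
  -H "Authorization: Bearer $NETBIRD_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"session-1234","type":"reusable","expires_in":3600,"ephemeral":true,"auto_groups":["<group-id>"]}'

# Remove a follower peer immediately instead of waiting for auto-purge
curl -X DELETE "https://netbird.example/api/peers/<peer-id>" \
  -H "Authorization: Bearer $NETBIRD_API_TOKEN"
```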

### Fallback Relay Mode (no overlay available)

When the server or client reports that overlays are unavailable, the coordinator can relay messages/logs between leader and followers.

- Transport: Server‑Sent Events (SSE) for downstream (client subscribe) and HTTP POST for upstream (client publish).
- Channels: Namespaced by `sessionId` and `host`.
- Scope: run‑everywhere control messages, stdout/stderr, exit codes, and minimal Mutagen control signals (not bulk sync).

Relay behavior:
- Followers subscribe to `/relay/{sessionId}/{host}/control` and `/relay/{sessionId}/{host}/stdin`.
- Followers POST logs to `/relay/{sessionId}/{host}/stdout` and `/relay/{sessionId}/{host}/stderr`, and status to `/relay/{sessionId}/{host}/status` (see the sketch after this list).
- Leader/client multiplexes commands to followers and aggregates outputs.
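
A minimal sketch of the channel usage from a follower's perspective, assuming the relay is served by the REST service (the base URL, session ID, host name, and `exitCode` field are placeholders):

```sh
# Subscribe to control messages via SSE (curl -N disables buffering for streaming)
curl -N "https://rest.example/relay/sess-1234/win-01/control"

# Publish a chunk of stdout, then report completion status
curl -X POST "https://rest.example/relay/sess-1234/win-01/stdout" --data-binary @stdout-chunk.txt
curl -X POST "https://rest.example/relay/sess-1234/win-01/status" \
  -H "Content-Type: application/json" -d '{"exitCode":0}'
```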

### Handshake & Sync Confirmation

Goal: Confirm follower connectivity (overlay or relay) before the first run‑everywhere command, with a short timeout.

Sequence (overlay path):
1) Client → REST: `POST /connect/keys` (request ephemeral keys for the preferred providers in priority order: netbird → tailscale).
2) Server → Client: returns the available provider and session‑scoped credentials (e.g., NetBird setup key or Tailscale ephemeral auth key) plus ACL/group tags (illustrative payloads are sketched after this list).
3) Client: distributes credentials to follower instances; followers join the overlay.
4) Client → REST: `POST /connect/handshake` with the list of expected followers.
5) Followers → REST: `POST /connect/handshake/ack` upon successful join (overlay reachability + SSH check).
6) Server → Client: `200 OK` when all acks are received or the timeout is reached; includes per‑host status.
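
Illustrative handshake payloads (only `provider`, `credentials`, `sessionId`, `hosts`, and `statuses` appear in the sequence diagram below; the concrete values and any other field names are assumptions):

```
POST /connect/handshake
{ "sessionId": "sess-1234", "hosts": ["win-01", "mac-01"] }

200 OK  (returned once all acks arrive or the timeout expires)
{ "statuses": { "win-01": "ok", "mac-01": "timeout" } }
```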

Fallback (relay path): Same handshake but without overlay checks; followers instead establish SSE subscriptions; acks include relay stream readiness.

### Sequence Diagram (overlay path)

```mermaid
sequenceDiagram
    participant Client as Coordinator (aw client)
    participant REST as REST Service
    participant F as Followers (per host)

    Client->>REST: POST /connect/keys { providers: [netbird,tailscale] }
    REST-->>Client: { provider, credentials }
    Client->>F: distribute credentials (netbird/tailscale up)
    Client->>REST: POST /connect/handshake { sessionId, hosts }
    F-->>REST: POST /connect/handshake/ack (hostA)
    F-->>REST: POST /connect/handshake/ack (hostB)
    REST-->>Client: 200 OK { statuses }
```

### Fallback Transport (no overlay, no public IPs)

Assumptions: No TUN devices (no Tailscale/NetBird dataplane), no inbound connectivity to the Coordinator (no public IP), and HTTP CONNECT is not available/useful.

Approach A (VM‑hosted SOCKS + client relay hub — default in ad‑hoc fleets):

- Each VM runs the agents‑workflow backend, which exposes a local SOCKS5 proxy on 127.0.0.1 (inside the VM). All software in the VM (SSH, Mutagen) uses this local SOCKS to reach other fleet members (see the configuration sketch after this list).
- The `aw` client maintains persistent control connections to each VM backend and relays per‑connection bytestreams between VMs (leader ↔ follower) over these client↔VM links.
- This requires zero inbound connectivity and no public IPs; only outbound connections from the client to each VM.
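
A hedged configuration sketch on a leader VM, assuming the backend's local SOCKS5 proxy listens on 127.0.0.1:1080 (the port, user, and host pattern are placeholders):

```
# ~/.ssh/config on the leader VM
Host follower-*
  User builder
  ProxyCommand nc -X 5 -x 127.0.0.1:1080 %h %p
```

With this in place, a sync session such as `mutagen sync create --name=workspace ~/project follower-01:/home/builder/project` is transparently relayed through the client hub.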

Mermaid sequence (VM‑local SOCKS, client relay hub):

```mermaid
sequenceDiagram
    participant Proc as VM Process (SSH/Mutagen)
    participant SOCKS as VM Local SOCKS5 (agents backend)
    participant Client as aw relay hub
    participant Other as Other VM Local SOCKS5

    Proc->>SOCKS: CONNECT follower-01:22
    SOCKS->>Client: open stream to follower-01
    Client->>Other: open stream request
    Other-->>Client: stream ready (to 127.0.0.1:22)
    Client-->>SOCKS: stream ready
    Proc-->>SOCKS: SSH bytes
    SOCKS-->>Client: forward bytes
    Client-->>Other: forward bytes
    Other-->>Client: response bytes
    Client-->>SOCKS: response bytes
    SOCKS-->>Proc: response bytes
```

Approach B (server‑hosted): a REST‑hosted, per‑session SOCKS5 rendezvous that tunnels TCP streams over WebSockets.

- **Session SOCKS5 Relay (REST‑hosted)**
  - The REST service exposes a SOCKS5 front‑end that does not reach the public internet. Instead, it maps "destinations" to registered peers (leader/followers) connected via WebSocket.
  - Peers register their local endpoints (e.g., `ssh: 127.0.0.1:22`) over a persistent WebSocket: `WS /api/v1/connect/socks/register?peerId=...&role=leader|follower` (a registration sketch follows after this list).
  - The SOCKS5 server accepts `CONNECT follower-01:22` from the leader’s SSH and forwards bytes over the follower’s WebSocket to its local `127.0.0.1:22`.
  - Similarly, Mutagen using SSH as transport routes through `ProxyCommand` to the session SOCKS5 server.

- **Why this works without public IPs**
  - Both leader and followers initiate outbound WebSocket connections to the REST service. The service stitches the bytestreams, acting as a rendezvous.

- **Client configuration**
  - SSH config on leader (example):
    ```
    Host follower-01
      HostName follower-01
      ProxyCommand nc -X 5 -x socks.rest.example:1080 %h %p
    ```
  - Mutagen: configure SSH to use the same `ProxyCommand`.
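
A hedged sketch of the follower‑side registration, shown with `websocat` for illustration (the real follower agent would hold this connection natively; the base URL is a placeholder, and how the target map such as `ssh: 127.0.0.1:22` is conveyed over the socket is an assumption):

```sh
websocat "wss://rest.example/api/v1/connect/socks/register?peerId=follower-01&role=follower"
```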

Mermaid sequence (server‑hosted SOCKS5 rendezvous):

```mermaid
sequenceDiagram
    participant L as Leader SSH
    participant S as REST Session SOCKS5
    participant WS as REST WS Hub
    participant F as Follower (WS agent)

    F->>WS: WS register peerId=follower-01 targets={ssh:127.0.0.1:22}
    L->>S: SOCKS5 CONNECT follower-01:22
    S->>WS: open stream to peer follower-01 (ssh)
    WS-->>S: stream ready
    S-->>L: SOCKS5 connect succeeded
    L-->>S: SSH bytes
    S-->>WS: forward bytes
    WS-->>F: forward to 127.0.0.1:22
    F-->>WS: SSH response bytes
    WS-->>S: forward bytes
    S-->>L: SSH response bytes
```

Note: In both approaches, HTTP CONNECT is not assumed available and public IPs are not required.

### Userspace VPN SOCKS fallback (when TUN fails)

If the Coordinator has provided ephemeral overlay credentials (NetBird/Tailscale) but VMs fail to create TUN interfaces, each VM launches a userspace VPN daemon that exposes a local SOCKS5 proxy. All VM processes (SSH, Mutagen, etc.) use this local SOCKS as above, and traffic is relayed either by the aw client (ad‑hoc fleets) or by the REST server (server‑hosted rendezvous).
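
A minimal sketch of this fallback on the Tailscale path, assuming the Coordinator has already issued an ephemeral auth key (the key variable and hostname are placeholders):

```sh
# Run the Tailscale dataplane in userspace (no TUN) with a local SOCKS5 proxy
# and in-memory state so the node stays ephemeral.
sudo tailscaled --tun=userspace-networking --socks5-server=127.0.0.1:1055 --state=mem: &
sudo tailscale up --auth-key "$TS_AUTHKEY" --hostname follower-01
```

SSH and Mutagen on the VM then reach overlay peers through `127.0.0.1:1055` using the same `ProxyCommand` pattern shown earlier.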