
Commit 3731df4

docs: more details about the multi OS testing and its networking aspects
1 parent 63908ef commit 3731df4

7 files changed: 324 additions, 26 deletions


docs/agent-time-travel.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -82,7 +82,7 @@ The initial implementation will focus on supporting regular FsSnapshot on copy-o
 - bash: `trap DEBUG` + `PROMPT_COMMAND` pair to delimit commands.
 - fish: `fish_preexec`/`fish_postexec` equivalents.
 - **Runtime Integration**: The runner emits session timeline events (SSE) at milestones; the snapshot manager aligns nearest FsSnapshot ≤ timestamp.
-- **Multi‑OS Sync Fence**: When multi‑OS testing is enabled, each execution cycle performs `fs_snapshot_and_sync` on the leader (create FsSnapshot, then fence Mutagen sessions to followers) before invoking `run_everywhere`. See `docs/multi-os-testing.md`.
+- **Multi‑OS Sync Fence**: When multi‑OS testing is enabled, each execution cycle performs `fs_snapshot_and_sync` on the leader (create FsSnapshot, then fence Mutagen sessions to followers) before invoking `run-everywhere`. See `docs/multi-os-testing.md`.
 - **Advanced (future)**: eBPF capture of PTY I/O and/or FS mutations; rr-based post‑facto reconstruction of session recordings; out of scope for v1 but compatible with this model.

 ### REST API Extensions
```
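For context on the hook pair referenced in the hunk above, here is a minimal bash sketch (illustrative only, not part of this commit; the log path and record format are made up):

```bash
# DEBUG trap fires before each simple command; PROMPT_COMMAND runs just before
# the next prompt, i.e. after the previous command has finished.
__aw_preexec() {
  printf 'BEGIN %s\t%s\n' "$(date +%s)" "$BASH_COMMAND" >> /tmp/aw-timeline.log
}
__aw_postexec() {
  local ec=$?   # exit status of the command that just finished
  printf 'END   %s\texit=%d\n' "$(date +%s)" "$ec" >> /tmp/aw-timeline.log
}
trap '__aw_preexec' DEBUG
PROMPT_COMMAND='__aw_postexec'
# A real implementation would also guard against the DEBUG trap firing for
# PROMPT_COMMAND itself and for shell functions.
```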

docs/cli-spec.md

Lines changed: 10 additions & 1 deletion
```diff
@@ -132,7 +132,16 @@ Mirrors `docs/configuration.md` including provenance, precedence, and Windows be
 
 - `aw followers list` — List configured follower hosts and tags.
 - `aw followers sync-fence [--timeout <sec>] [--tag <k=v>]... [--host <name>]... [--all]` — Perform a synchronization fence, ensuring followers match the leader workspace state.
-- `aw run-everywhere <action> [args...] [--tag <k=v>]... [--host <name>]... [--all]` — Invoke project’s `.agents/run_everywhere` on selected followers.
+- `aw run-everywhere [--tag <k=v>]... [--host <name>]... [--all] [--] <command> [args...]` — Invoke run‑everywhere on selected followers.
+
+#### 11) Connectivity (Overlay/Relay)
+
+- `aw connect keys [--provider netbird|tailscale|auto] [--tag <name>]...` — Request session connectivity credentials.
+- `aw connect handshake --session <id> [--hosts <list>] [--timeout <sec>]` — Initiate and wait for follower acks; prints per‑host status.
+- Relay utilities (fallback):
+  - `aw relay tail --session <id> --host <name> [--stream stdout|stderr|status]`
+  - `aw relay send --session <id> --host <name> --control <json>`
+  - `aw relay socks5 --session <id> --bind 127.0.0.1:1080` — Start a local SOCKS5 relay for this session (client‑hosted rendezvous).
 
 
 - `aw doctor` — Environment diagnostics (snapshot providers, multiplexer availability, docker/devcontainer, git).
```
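Taken together, a typical flow with the commands added above might look like the following sketch (the session id, host names, and the fanned-out command are placeholders):

```bash
# Request overlay credentials, wait for follower acks, then fan the test suite
# out to every follower matching the tags.
aw connect keys --provider auto --tag session-demo
aw connect handshake --session sess-1234 --hosts win-01,mac-01 --timeout 60
aw run-everywhere --tag os=windows --tag os=macos -- cargo test --workspace
```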

docs/connectivity-layer.md

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
## Connectivity Layer — Overlay Networking for Leader/Followers

### Purpose

Provide reliable, low‑friction connectivity for run‑everywhere and Mutagen between the leader and follower hosts across Linux, macOS, and Windows.

Key properties:

- One overlay per host/VM, not per task. All agent sessions reuse the same connectivity.
- Prefer SSH as the execution transport; Mutagen spawns its agent over SSH.

### Actors (Roles)

- **Coordinator**: The controller that creates sessions, provisions followers, requests connectivity credentials, and orchestrates handshakes (typically the `aw` client or WebUI backend acting on behalf of the user).
- **Leader**: The primary workspace host (Linux preferred) that owns FsSnapshots and initiates fences and run‑everywhere.
- **Followers**: Secondary hosts (Windows/macOS/Linux) that execute commands and validate builds/tests.
### Recommended Options

- Tailscale (default)
  - WireGuard‑based mesh with automatic NAT traversal, MagicDNS, device tags/ACLs, and optional Tailscale SSH.
  - Simple SSO onboarding across OSes. Suitable for parallel tasks because a single daemon/TUN per host serves all sessions.
  - Userspace mode for locked‑down containers: `tailscaled --tun=userspace-networking --socks5-server=127.0.0.1:1055` and route SSH/Mutagen via the SOCKS proxy (see the sketch after this list).
  - Self‑hosted control plane: Headscale.
  - Ephemeral nodes for short‑lived sessions:
    - Use ephemeral auth keys (or `--state=mem:`) so devices auto‑remove shortly after going offline; they receive a fresh IP each time.
    - Immediate cleanup: call `tailscale logout` on teardown.
    - Scope access via ACL tags (e.g., allow `tag:session-<id>` → `tag:session-<id>` only).

- ZeroTier (good alternative)
  - L2/L3 virtual network with NAT traversal and a central controller. Easy multi‑OS setup.
  - Use the allocated overlay IPs/DNS names in `.agents/hosts.json`.

- Raw WireGuard (minimal)
  - Fast and simple, but manual key/IP management and NAT traversal setup. Best for small/static topologies or when WG is already in place.

- SSH‑only (fallback)
  - Direct SSH over public/private networks, or reverse SSH tunnels if followers cannot accept inbound connections. More ops overhead but universally available.
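As a concrete illustration of the Tailscale userspace mode noted above, the sketch below runs `tailscaled` without a TUN device and points SSH at the resulting SOCKS5 proxy. The auth key variable, hostname, and target address are placeholders, not part of the spec:

```bash
# Userspace tailscaled: no TUN device, local SOCKS5 on 1055, in-memory state
# so the node is ephemeral (auto-removed after it goes offline).
tailscaled --tun=userspace-networking \
  --socks5-server=127.0.0.1:1055 \
  --state=mem: &
tailscale up --authkey "$TS_AUTHKEY" --hostname follower-01

# Route SSH (and therefore Mutagen, which rides on SSH) through the proxy.
ssh -o ProxyCommand='nc -X 5 -x 127.0.0.1:1055 %h %p' builder@leader.tailnet.example
```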
### Operational Guidance

- Standardize on SSH
  - Mutagen can run over SSH; run‑everywhere executes remote commands via SSH.
  - Keep follower SSH access non‑root; prefer short‑lived keys or Tailscale SSH.

- Host Catalog
  - Store overlay addresses and metadata in `.agents/hosts.json` (or via REST):

    ```json
    {
      "hosts": [
        { "name": "win-01", "os": "windows", "address": "win-01.tailnet.example", "tags": ["os=windows"], "sshUser": "builder" },
        { "name": "mac-01", "os": "macos", "address": "100.101.102.103", "tags": ["os=macos"], "sshUser": "builder" }
      ]
    }
    ```

- Security
  - Use overlay ACLs (Tailscale/ZeroTier) to restrict leader↔follower reachability.
  - Disable password auth on SSH; prefer keys/SSO; limit to non‑privileged users.
  - For NetBird: create Setup Keys with `ephemeral=true`, optional `reusable`, and `auto_groups` to isolate short‑lived peers; peers auto‑purge after inactivity. Use API `DELETE /api/peers/{id}` for immediate removal.

- Performance
  - Co‑locate followers when possible; validate MTU; monitor sync‑fence latency in CI multi‑OS smoke tests.
### NetBird Support (ephemeral peers)

Workflow:

1) Coordinator requests an ephemeral Setup Key from the Agents‑Workflow REST service (see REST section below).
2) Followers run `netbird up --setup-key <KEY>`.
3) Access is scoped via `auto_groups` and policies (only session peers can talk to each other and the leader).
4) Teardown: stop the agent; peers auto‑purge after ~10 minutes of inactivity, or the coordinator calls `DELETE /api/peers/{id}` via the NetBird API for immediate removal.

Notes:

- Keys may be reusable to simplify parallel follower startup; groups isolate session scope.
- Ephemeral peers should not be referenced by fixed IPs; use names/groups.
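A minimal sketch of steps 2 and 4 above. The cloud API endpoint, peer id, and token variables are placeholders; a self‑hosted deployment would substitute its own management URL:

```bash
# Follower: join the session overlay with the ephemeral Setup Key handed out
# by the coordinator.
netbird up --setup-key "$AW_NETBIRD_SETUP_KEY"

# Coordinator: remove the peer immediately instead of waiting for auto-purge.
curl -X DELETE "https://api.netbird.io/api/peers/$PEER_ID" \
  -H "Authorization: Token $NETBIRD_API_TOKEN"
```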
### Fallback Relay Mode (no overlay available)

When the server or client reports overlays are unavailable, the coordinator can relay messages/logs between leader and followers.

- Transport: Server‑Sent Events (SSE) for downstream (client subscribe) and HTTP POST for upstream (client publish).
- Channels: Namespaced by `sessionId` and `host`.
- Scope: run‑everywhere control messages, stdout/stderr, exit codes, and minimal Mutagen control signals (not bulk sync).

Relay behavior:

- Followers subscribe to `/relay/{sessionId}/{host}/control` and `/relay/{sessionId}/{host}/stdin`.
- Followers POST logs to `/relay/{sessionId}/{host}/stdout` and `/relay/{sessionId}/{host}/stderr`, and status to `/relay/{sessionId}/{host}/status`.
- Leader/client multiplexes commands to followers and aggregates outputs.
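For illustration, a follower could exercise these channels with plain curl; the base URL, session id, and host name below are placeholders, and the `/api/v1` prefix is an assumption:

```bash
BASE=https://aw.example.com/api/v1      # assumed REST prefix
SESSION=sess-1234
HOST=win-01

# Downstream: keep an SSE subscription open for control messages.
curl -N "$BASE/relay/$SESSION/$HOST/control"

# Upstream: publish captured stdout and a final status.
curl -X POST "$BASE/relay/$SESSION/$HOST/stdout" --data-binary @stdout.chunk
curl -X POST "$BASE/relay/$SESSION/$HOST/status" \
  -H 'Content-Type: application/json' -d '{"exitCode": 0}'
```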
### Handshake & Sync Confirmation

Goal: Confirm follower connectivity (overlay or relay) before the first run‑everywhere, with a short timeout.

Sequence (overlay path):

1) Client → REST: `POST /connect/keys` (request ephemeral keys for the preferred providers in priority order: netbird → tailscale).
2) Server → Client: returns the available provider and session‑scoped credentials (e.g., NetBird setup key or Tailscale ephemeral auth key) plus ACL/group tags.
3) Client: distributes credentials to follower instances; followers join the overlay.
4) Client → REST: `POST /connect/handshake` with the list of expected followers.
5) Followers → REST: `POST /connect/handshake/ack` upon successful join (overlay reachability + SSH check).
6) Server → Client: `200 OK` when all acks are received or the timeout is reached; includes per‑host status.

Fallback (relay path): Same handshake but without overlay checks; followers instead establish SSE subscriptions; acks include relay stream readiness.
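A hedged example of steps 1 and 4 as curl calls; the field names mirror the sequence diagram below, but the exact request/response schema is an assumption, not a confirmed API contract:

```bash
# 1) Request session-scoped connectivity credentials (providers in priority order).
curl -X POST "$BASE/connect/keys" -H 'Content-Type: application/json' \
  -d '{"sessionId": "sess-1234", "providers": ["netbird", "tailscale"]}'
# → {"provider": "netbird", "credentials": {"setupKey": "…"}, "groups": ["session-sess-1234"]}

# 4) Announce the expected followers and wait for their acks.
curl -X POST "$BASE/connect/handshake" -H 'Content-Type: application/json' \
  -d '{"sessionId": "sess-1234", "hosts": ["win-01", "mac-01"], "timeoutSec": 60}'
# → 200 OK {"statuses": {"win-01": "connected", "mac-01": "connected"}}
```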
### Sequence Diagram (overlay path)

```mermaid
sequenceDiagram
    participant Client as Coordinator (aw client)
    participant REST as REST Service
    participant F as Followers (per host)

    Client->>REST: POST /connect/keys { providers: [netbird,tailscale] }
    REST-->>Client: { provider, credentials }
    Client->>F: distribute credentials (netbird/tailscale up)
    Client->>REST: POST /connect/handshake { sessionId, hosts }
    F-->>REST: POST /connect/handshake/ack (hostA)
    F-->>REST: POST /connect/handshake/ack (hostB)
    REST-->>Client: 200 OK { statuses }
```
### Fallback Transport (no overlay, no public IPs)

Assumptions: No TUN devices (no Tailscale/NetBird dataplane), no inbound connectivity to the Coordinator (no public IP), and HTTP CONNECT is not available/useful.

Approach A (VM‑hosted SOCKS + client relay hub — default in ad‑hoc fleets):

- Each VM runs the agents‑workflow backend, which exposes a local SOCKS5 proxy on 127.0.0.1 (inside the VM). All software in the VM (SSH, Mutagen) uses this local SOCKS to reach other fleet members.
- The `aw` client maintains persistent control connections to each VM backend and relays per‑connection bytestreams between VMs (leader ↔ follower) over these client↔VM links.
- This requires zero inbound connectivity and no public IPs; only outbound connections from the client to each VM.

Mermaid sequence (VM‑local SOCKS, client relay hub):
```mermaid
sequenceDiagram
    participant Proc as VM Process (SSH/Mutagen)
    participant SOCKS as VM Local SOCKS5 (agents backend)
    participant Client as aw relay hub
    participant Other as Other VM Local SOCKS5

    Proc->>SOCKS: CONNECT follower-01:22
    SOCKS->>Client: open stream to follower-01
    Client->>Other: open stream request
    Other-->>Client: stream ready (to 127.0.0.1:22)
    Client-->>SOCKS: stream ready
    Proc-->>SOCKS: SSH bytes
    SOCKS-->>Client: forward bytes
    Client-->>Other: forward bytes
    Other-->>Client: response bytes
    Client-->>SOCKS: response bytes
    SOCKS-->>Proc: response bytes
```
Approach B (server‑hosted): a REST‑hosted, per‑session SOCKS5 rendezvous that tunnels TCP streams over WebSockets.

- **Session SOCKS5 Relay (REST‑hosted)**
  - The REST service exposes a SOCKS5 front‑end that does not reach the public internet. Instead, it maps "destinations" to registered peers (leader/followers) connected via WebSocket.
  - Peers register their local endpoints (e.g., `ssh: 127.0.0.1:22`) over a persistent WebSocket: `WS /api/v1/connect/socks/register?peerId=...&role=leader|follower`.
  - The SOCKS5 server accepts `CONNECT follower-01:22` from the leader’s SSH and forwards bytes over the follower’s WebSocket to its local `127.0.0.1:22`.
  - Similarly, Mutagen using SSH as its transport routes through `ProxyCommand` to the session SOCKS5 server.

- **Why this works without public IPs**
  - Both leader and followers initiate outbound WebSocket connections to the REST service. The service stitches the bytestreams together, acting as a rendezvous.

- **Client configuration**
  - SSH config on the leader (example):

    ```
    Host follower-01
      HostName follower-01
      ProxyCommand nc -X 5 -x socks.rest.example:1080 %h %p
    ```

  - Mutagen: configure SSH to use the same `ProxyCommand` (see the sketch after this list).
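For instance, with the `ProxyCommand` entry above in `~/.ssh/config`, Mutagen’s SSH transport picks it up automatically; the session name and paths below are illustrative:

```bash
# Create a sync session from the leader to follower-01 through the session
# SOCKS5 rendezvous (SSH settings are taken from ~/.ssh/config).
mutagen sync create --name aw-follower-01 \
  ~/work/project builder@follower-01:~/work/project
```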
Mermaid sequence (server‑hosted SOCKS5 rendezvous):

```mermaid
sequenceDiagram
    participant L as Leader SSH
    participant S as REST Session SOCKS5
    participant WS as REST WS Hub
    participant F as Follower (WS agent)

    F->>WS: WS register peerId=follower-01 targets={ssh:127.0.0.1:22}
    L->>S: SOCKS5 CONNECT follower-01:22
    S->>WS: open stream to peer follower-01 (ssh)
    WS-->>S: stream ready
    S-->>L: SOCKS5 connect succeeded
    L-->>S: SSH bytes
    S-->>WS: forward bytes
    WS-->>F: forward to 127.0.0.1:22
    F-->>WS: SSH response bytes
    WS-->>S: forward bytes
    S-->>L: SSH response bytes
```

Note: In both approaches, HTTP CONNECT is not assumed available and public IPs are not required.
### Userspace VPN SOCKS fallback (when TUN fails)

If the Coordinator has provided ephemeral overlay credentials (NetBird/Tailscale) but VMs fail to create TUN interfaces, each VM launches a userspace VPN daemon that exposes a local SOCKS5 proxy. All VM processes (SSH, Mutagen, etc.) use this local SOCKS as above, and traffic is relayed either by the aw client (ad‑hoc fleets) or by the REST server (server‑hosted rendezvous).

docs/lima-vm-images.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Lima VM Setup — Linux Images for macOS Multi-OS Testing

## Summary

Define Lima VM image variants for agents-workflow multi-OS testing on macOS. All variants use Nix for agents-workflow components to ensure consistency across image types.

## VM Image Variants

### Alpine + Nix

- **Base**: Alpine Linux (minimal footprint)
- **Purpose**: Nix-first development environment
- **Package management**: Nix for all development tools and agents-workflow components
- **Target users**: Developers preferring declarative, reproducible environments

### Ubuntu LTS

- **Base**: Ubuntu 22.04/24.04 LTS
- **Purpose**: Maximum compatibility and familiar tooling
- **Package management**: APT for system packages, Nix for agents-workflow components, plus a wide range of pre-installed package managers and language version managers for quickly setting up project-specific dependencies
- **Target users**: General development teams wanting a conventional Linux environment

## Common Requirements

All images include:

- **Agents-workflow tooling**: Installed via Nix for version consistency
- **Filesystem snapshots**: ZFS or Btrfs support for Agent Time-Travel
- **Multi-OS integration**: SSH access, Tailscale/NetBird/overlay networking
- **Development essentials**: Git, build tools, terminal multiplexers

## Build Components

### Shared Infrastructure

- Common provisioning scripts (reused from the Docker container setup)
- Nix flake for agents-workflow tools
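As a quick usage sketch on the macOS host, assuming the image definitions land as Lima templates in this repo (the template path and instance name below are hypothetical):

```bash
# Boot the Ubuntu LTS variant and verify the Nix-provisioned tooling inside it.
limactl start --name aw-ubuntu ./lima/aw-ubuntu-lts.yaml
limactl shell aw-ubuntu nix --version
limactl shell aw-ubuntu git --version
```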
