|
| 1 | +# Federation Multi-Node Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +PeerClaw supports a federated multi-node architecture that enables multiple gateway nodes to forward signaling messages, discover agents from remote peers, and relay WebRTC signals across geographically distributed instances. |
| 6 | + |
| 7 | +## Three-Layer Broker Architecture |
| 8 | + |
| 9 | +Signal distribution uses a layered broker model: |
| 10 | + |
| 11 | +``` |
| 12 | +┌──────────────────────────────────────────────────────┐ |
| 13 | +│ Agent A <-> WebSocket <-> Hub <-> Broker.Publish() │ |
| 14 | +│ | │ |
| 15 | +│ +------------+------------+ │ |
| 16 | +│ v v v │ |
| 17 | +│ LocalBroker RedisBroker FederationBroker |
| 18 | +│ (single) (cluster) (cross-region) |
| 19 | +└──────────────────────────────────────────────────────┘ |
| 20 | +``` |
| 21 | + |
| 22 | +- **LocalBroker** -- In-memory forwarding, single-node development |
| 23 | +- **RedisBroker** -- Redis Pub/Sub, multi-node within same cluster |
| 24 | +- **FederationBroker** -- HTTP forwarding, cross-network/cross-organization relay |
| 25 | + |
| 26 | +### Key Source Files |
| 27 | + |
| 28 | +| File | Purpose | |
| 29 | +|------|---------| |
| 30 | +| `internal/signaling/broker.go` | Broker interface definition | |
| 31 | +| `internal/signaling/local_broker.go` | Single-node in-memory broker | |
| 32 | +| `internal/signaling/redis_broker.go` | Multi-node Redis Pub/Sub broker | |
| 33 | +| `internal/signaling/federation_broker.go` | Hybrid local + federation broker | |
| 34 | +| `internal/signaling/ws.go` | WebSocket hub and agent connection management | |
| 35 | +| `internal/federation/federation.go` | Core federation service, peer management, signal forwarding | |
| 36 | +| `internal/federation/dns.go` | DNS SRV-based peer discovery | |
| 37 | + |
| 38 | +## Signal Forwarding Flow |
| 39 | + |
| 40 | +``` |
| 41 | +Agent A (Node 1, Beijing) sends signal to Agent B (Node 2, Shanghai) |
| 42 | +
|
| 43 | +1. Agent A sends SignalMessage{To: "agent-b"} via WebSocket |
| 44 | +2. Hub.Forward() -> Broker.Publish() |
| 45 | +3. FederationBroker checks: hub.HasAgent("agent-b")? |
| 46 | + +-- YES -> Local WebSocket delivery |
| 47 | + +-- NO -> Iterate all peer nodes |
| 48 | + POST /api/v1/federation/signal -> Node 2 |
| 49 | + Header: Authorization: Bearer <peer-token> |
| 50 | +4. Node 2 receives request: |
| 51 | + - Validates Bearer token (constant-time comparison) |
| 52 | + - Calls HandleIncomingSignal() |
| 53 | + - Hub.DeliverLocal() -> delivers to Agent B via WebSocket |
| 54 | +``` |
| 55 | + |
| 56 | +Automatic fallback: if federation forwarding fails, it falls back to the local broker. |
| 57 | + |
| 58 | +## Peer Discovery |
| 59 | + |
| 60 | +### Method 1: Static Configuration |
| 61 | + |
| 62 | +```yaml |
| 63 | +federation: |
| 64 | + enabled: true |
| 65 | + node_name: "beijing-1" |
| 66 | + auth_token: "${FEDERATION_TOKEN}" |
| 67 | + peers: |
| 68 | + - name: "shanghai-1" |
| 69 | + address: "https://sh.example.com" |
| 70 | + token: "${PEER_SH_TOKEN}" |
| 71 | + - name: "guangzhou-1" |
| 72 | + address: "https://gz.example.com" |
| 73 | + token: "${PEER_GZ_TOKEN}" |
| 74 | +``` |
| 75 | +
|
| 76 | +### Method 2: DNS SRV Auto-Discovery |
| 77 | +
|
| 78 | +```yaml |
| 79 | +federation: |
| 80 | + enabled: true |
| 81 | + dns_enabled: true |
| 82 | + dns_domain: "peerclaw.example.com" |
| 83 | +``` |
| 84 | +
|
| 85 | +Queries `_peerclaw._tcp.peerclaw.example.com` SRV records to automatically discover all node host:port pairs and join the federation. |
| 86 | + |
| 87 | +Implementation in `internal/federation/dns.go`: |
| 88 | +- `DiscoverPeers(domain)` performs `net.LookupSRV("peerclaw", "tcp", domain)` |
| 89 | +- Constructs HTTP URLs from SRV target and port |
| 90 | +- Called during initialization if `dns_enabled=true` |
| 91 | + |
| 92 | +## Redis Multi-Node Mode (Intra-Cluster) |
| 93 | + |
| 94 | +Unlike federation, Redis mode is for horizontal scaling of multiple PeerClaw instances **within the same cluster**: |
| 95 | + |
| 96 | +``` |
| 97 | + +--- Node 1 (peerclawd) ---+ |
| 98 | +Agent A -| |-- Redis Pub/Sub --+ |
| 99 | + +--------------------------+ channel: | |
| 100 | + "peerclaw: | |
| 101 | + +--- Node 2 (peerclawd) ---+ signaling" | |
| 102 | +Agent B -| |-------------------+ |
| 103 | + +--------------------------+ |
| 104 | +``` |
| 105 | + |
| 106 | +Key design decisions: |
| 107 | +- **Echo prevention**: Each node has a unique `NodeID` (UUID). On receiving a message, it skips messages it sent itself (`env.NodeID == b.nodeID`). |
| 108 | +- **256-message buffer**: Non-blocking channel prevents slow consumers from blocking. |
| 109 | +- **Automatic fallback**: If Redis is unavailable, falls back to LocalBroker. |
| 110 | + |
| 111 | +### Redis Envelope Format |
| 112 | + |
| 113 | +```go |
| 114 | +type redisEnvelope struct { |
| 115 | + NodeID string `json:"node_id"` |
| 116 | + Message signaling.SignalMessage `json:"message"` |
| 117 | +} |
| 118 | +``` |
| 119 | + |
| 120 | +Published to Redis channel `peerclaw:signaling`. |
| 121 | + |
| 122 | +## Federation API Endpoints |
| 123 | + |
| 124 | +| Endpoint | Method | Purpose | |
| 125 | +|----------|--------|---------| |
| 126 | +| `/api/v1/federation/signal` | POST | Receive signal forwarding from other nodes | |
| 127 | +| `/api/v1/federation/discover` | POST | Receive agent discovery queries from other nodes | |
| 128 | + |
| 129 | +Handlers are in `internal/server/http.go`: |
| 130 | +- `handleFederationSignal` -- validates Bearer token, parses SignalMessage, calls `HandleIncomingSignal()` |
| 131 | +- `handleFederationDiscover` -- queries local registry, returns matching agents |
| 132 | + |
| 133 | +## Cross-Node Agent Discovery |
| 134 | + |
| 135 | +```go |
| 136 | +// Node 1 wants to find agents with "translation" capability |
| 137 | +agents := fedService.QueryAgents(ctx, []string{"translation"}) |
| 138 | +// -> Concurrently queries all peers' /api/v1/federation/discover |
| 139 | +// -> Aggregates results from all nodes |
| 140 | +``` |
| 141 | + |
| 142 | +Implementation in `federation.go` `QueryAgents()`: |
| 143 | +1. Queries `/api/v1/discover` on each peer |
| 144 | +2. Filters by capabilities |
| 145 | +3. Aggregates results from all peers |
| 146 | +4. Returns combined agent list |
| 147 | + |
| 148 | +## Security Model |
| 149 | + |
| 150 | +| Layer | Mechanism | |
| 151 | +|-------|-----------| |
| 152 | +| Local signaling | Agent signature-based (Ed25519 public key verification) | |
| 153 | +| Federation | Bearer token per peer (`crypto/subtle.ConstantTimeCompare`) | |
| 154 | +| Redis | No authentication (internal trusted network) | |
| 155 | +| Federation HTTP client | TLS 1.2 minimum, 10-second timeout | |
| 156 | + |
| 157 | +## Core Data Structures |
| 158 | + |
| 159 | +### FederationService |
| 160 | + |
| 161 | +```go |
| 162 | +type FederationService struct { |
| 163 | + mu sync.RWMutex |
| 164 | + nodeName string |
| 165 | + peers map[string]*FederationPeer |
| 166 | + authToken string |
| 167 | + logger *slog.Logger |
| 168 | + client *http.Client |
| 169 | + onSignal func(ctx context.Context, msg signaling.SignalMessage) |
| 170 | + stopCh chan struct{} |
| 171 | +} |
| 172 | +``` |
| 173 | + |
| 174 | +### FederationPeer |
| 175 | + |
| 176 | +```go |
| 177 | +type FederationPeer struct { |
| 178 | + Name string |
| 179 | + Address string // URL to peer node |
| 180 | + Token string // Auth token for this peer |
| 181 | + Connected bool |
| 182 | + LastSync time.Time |
| 183 | +} |
| 184 | +``` |
| 185 | + |
| 186 | +### FederationBroker -- Hybrid Routing Logic |
| 187 | + |
| 188 | +```go |
| 189 | +func (fb *FederationBroker) Publish(ctx context.Context, msg SignalMessage) error { |
| 190 | + if fb.hub.HasAgent(msg.To) { |
| 191 | + return fb.local.Publish(ctx, msg) // Local delivery |
| 192 | + } |
| 193 | + if err := fb.federation.ForwardSignal(ctx, msg); err != nil { |
| 194 | + return fb.local.Publish(ctx, msg) // Fallback to local |
| 195 | + } |
| 196 | + return nil |
| 197 | +} |
| 198 | +``` |
| 199 | + |
| 200 | +## Deployment Topologies |
| 201 | + |
| 202 | +### Topology 1: Load-Balanced Cluster (Redis Mode) |
| 203 | + |
| 204 | +``` |
| 205 | + +--- Load Balancer ---+ |
| 206 | + | | |
| 207 | + +-----+-----+ +------+-----+ |
| 208 | + | Node 1 | | Node 2 | |
| 209 | + | peerclawd | | peerclawd | |
| 210 | + +-----+-----+ +------+-----+ |
| 211 | + | Shared Redis | |
| 212 | + +--------+-----------+ |
| 213 | + | |
| 214 | + +-----+-----+ |
| 215 | + | Redis | |
| 216 | + +-----+-----+ |
| 217 | + | |
| 218 | + +-----+------+ |
| 219 | + | PostgreSQL | |
| 220 | + +------------+ |
| 221 | +``` |
| 222 | + |
| 223 | +### Topology 2: Cross-Region Federation |
| 224 | + |
| 225 | +``` |
| 226 | + Beijing DC Shanghai DC |
| 227 | ++-------------+ HTTP/TLS +-------------+ |
| 228 | +| Node BJ-1 |<------------->| Node SH-1 | |
| 229 | +| + Redis | Federation | + Redis | |
| 230 | +| + PG | Signal | + PG | |
| 231 | +| Agents: | | Agents: | |
| 232 | +| A, B, C | | D, E, F | |
| 233 | ++-------------+ +-------------+ |
| 234 | + ^ ^ |
| 235 | + | DNS SRV: _peerclaw._tcp.corp | |
| 236 | + +---------- Auto-discovery ----+ |
| 237 | +``` |
| 238 | + |
| 239 | +### Topology 3: Hybrid (Redis + Federation) |
| 240 | + |
| 241 | +``` |
| 242 | + Cluster A (3 nodes + Redis) Cluster B (2 nodes + Redis) |
| 243 | ++-------------------------+ +-------------------------+ |
| 244 | +| Node1 <-Redis-> Node2 |<--->| Node4 <-Redis-> Node5 | |
| 245 | +| ^ | Fed | | |
| 246 | +| Node3 | +-------------------------+ |
| 247 | ++-------------------------+ |
| 248 | +``` |
| 249 | + |
| 250 | +Intra-cluster uses Redis broadcast, inter-cluster uses Federation HTTP forwarding. |
| 251 | + |
| 252 | +## Initialization Flow |
| 253 | + |
| 254 | +``` |
| 255 | +main.go |
| 256 | ++-- Load config |
| 257 | ++-- Initialize logger |
| 258 | ++-- Initialize OpenTelemetry |
| 259 | ++-- Initialize database |
| 260 | ++-- Initialize signaling hub (if enabled) |
| 261 | +| +-- SetBroker() |
| 262 | +| +-- LocalBroker (if no Redis) |
| 263 | +| +-- RedisBroker (if Redis available) |
| 264 | ++-- Initialize federation (if enabled) |
| 265 | +| +-- Create FederationService |
| 266 | +| +-- Add static peers from config |
| 267 | +| +-- Discover DNS SRV peers (if dns_enabled) |
| 268 | ++-- Create HTTP server |
| 269 | +| +-- SetFederation() |
| 270 | +| +-- Register routes: |
| 271 | +| +-- POST /api/v1/federation/signal |
| 272 | +| +-- POST /api/v1/federation/discover |
| 273 | ++-- Start servers and wait for shutdown |
| 274 | + +-- Clean shutdown: fedService.Close() |
| 275 | +``` |
| 276 | + |
| 277 | +## Comparison |
| 278 | + |
| 279 | +| Feature | LocalBroker | RedisBroker | FederationBroker | |
| 280 | +|---------|------------|-------------|-----------------| |
| 281 | +| Node count | 1 | N (same cluster) | N (cross-network) | |
| 282 | +| Transport | In-memory | Redis Pub/Sub | HTTPS | |
| 283 | +| Latency | ~0 | ~1ms | ~10-100ms | |
| 284 | +| Auth | None | None (trusted LAN) | Bearer Token | |
| 285 | +| Agent discovery | Local | Shared DB | HTTP query | |
| 286 | +| Use case | Development | Production cluster | Multi-region/multi-org | |
| 287 | + |
| 288 | +## Configuration Reference |
| 289 | + |
| 290 | +```yaml |
| 291 | +federation: |
| 292 | + enabled: false # Enable federation |
| 293 | + node_name: "node-1" # This node's unique name |
| 294 | + auth_token: "${FED_TOKEN}" # Auth token (required when enabled) |
| 295 | + dns_enabled: false # Enable DNS SRV discovery |
| 296 | + dns_domain: "" # Domain for SRV lookup |
| 297 | + peers: # Static peer list |
| 298 | + - name: "node-2" |
| 299 | + address: "https://node2.example.com" |
| 300 | + token: "${PEER_TOKEN}" |
| 301 | +``` |
| 302 | +
|
| 303 | +Validation: `federation.auth_token` is required when `federation.enabled = true`. |
| 304 | +Environment variable substitution (`${VAR}`) is supported for all token fields. |
0 commit comments