Add federation proposal for cross-server agent communication #8

khaliqgant · 2025-12-21T09:18:57Z

Comprehensive design document for extending agent-relay to support
federated multi-server deployments while preserving the core
differentiator: automatic message injection via tmux.

Key design decisions:

- Separation of concerns: routing (network) vs injection (local)
- Hybrid topology: optional hub for discovery, direct peer connections
- Progressive enhancement: single-server unchanged, federation opt-in
- WebSocket + TLS for peer-to-peer daemon communication
- Message queuing for resilience during disconnects

Includes:

- Full protocol specification (PEER_HELLO, PEER_ROUTE, etc.)
- Agent discovery and registry design
- Security model (TLS, pre-shared tokens)
- Configuration schema
- CLI interface design
- 5-phase implementation plan (~4-5 weeks)

bd-TBD
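The split between routing (network) and injection (local) can be pictured with a small sketch. Only the message type names PEER_HELLO and PEER_ROUTE come from the proposal; the fields, the `FleetView` class, and the `RouteDecision` shape are assumptions made for illustration, not the specified protocol.

```typescript
// Hypothetical message shapes. Only the type names appear in the proposal;
// every field here is an illustrative assumption.
type PeerHello = { type: "PEER_HELLO"; serverId: string; agents: string[] };
type PeerRoute = { type: "PEER_ROUTE"; from: string; to: string; body: string };

type RouteDecision =
  | { kind: "inject"; agent: string } // local agent: hand off to tmux injection
  | { kind: "forward"; serverId: string } // remote agent, peer link is up
  | { kind: "queue"; serverId: string } // remote agent, peer is disconnected
  | { kind: "unknown"; agent: string }; // no server is known to host the agent

class FleetView {
  private readonly localServerId: string;
  private readonly agentToServer = new Map<string, string>();
  private readonly connected = new Set<string>();

  constructor(localServerId: string, localAgents: string[]) {
    this.localServerId = localServerId;
    for (const agent of localAgents) {
      this.agentToServer.set(agent, localServerId);
    }
  }

  // A PEER_HELLO announces a peer and the agents it hosts.
  onHello(hello: PeerHello): void {
    this.connected.add(hello.serverId);
    for (const agent of hello.agents) {
      this.agentToServer.set(agent, hello.serverId);
    }
  }

  onDisconnect(serverId: string): void {
    this.connected.delete(serverId);
  }

  // Routing only decides where a message goes; injection itself stays local.
  decide(msg: PeerRoute): RouteDecision {
    const serverId = this.agentToServer.get(msg.to);
    if (serverId === undefined) return { kind: "unknown", agent: msg.to };
    if (serverId === this.localServerId) return { kind: "inject", agent: msg.to };
    return this.connected.has(serverId)
      ? { kind: "forward", serverId }
      : { kind: "queue", serverId };
  }
}
```

In this sketch a single-server deployment never sees a PEER_HELLO, so every known agent resolves to `inject`, which matches the "single-server unchanged" goal.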

Identifies major gaps and risks in the federation design:

HIGH SEVERITY:
- No end-to-end delivery guarantee (sender doesn't know if agent received)
- Registry consistency race conditions (split-brain on name collisions)
- Message ordering not guaranteed across servers

MEDIUM SEVERITY:
- Token management doesn't scale (N² tokens for N servers)
- No message-level authentication (spoofing possible)
- No rate limiting (flood attacks possible)
- Debugging distributed failures is hard (no tracing)
- NAT/firewall traversal not addressed
- Timeline underestimated (8-10 weeks realistic vs 4-5 proposed)

Includes:
- Specific failure scenarios for each issue
- Recommendations for fixes
- Alternative approaches (NATS, SSH tunnels)
- Suggested MVP scope to ship faster

bd-TBD
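The token-scaling concern can be checked with simple arithmetic. The sketch below assumes a full mesh in which every pair of servers shares its own pre-shared token (or one token per direction); that deployment shape is an assumption, since the review only states the N² growth. With asymmetric keys, each server needs just one key pair.

```typescript
// Assumption: full mesh, one pre-shared token per unordered pair of servers.
function pairwiseTokens(servers: number): number {
  return (servers * (servers - 1)) / 2;
}

// Assumption: full mesh, one token per direction of each pair.
function directionalTokens(servers: number): number {
  return servers * (servers - 1);
}

// With asymmetric keys, each server holds exactly one key pair.
function keyPairs(servers: number): number {
  return servers;
}
```

For 10 servers this gives 45 pairwise tokens, 90 directional tokens, and 10 key pairs, so either token scheme grows quadratically while key pairs grow linearly.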

Major additions to address identified issues:

DELIVERY CONFIRMATION (Section 7):
- End-to-end ACK so sender knows message was injected
- Detection via capture-pane after send-keys
- Optional confirmation notification to sender

REGISTRY CONSISTENCY (Section 5.3):
- Fleet-wide unique names (no split-brain)
- Quorum-based registration with Lamport timestamps
- Clear error on name collision with suggestions

AUTHENTICATION (Section 8.1-8.3):
- Ed25519 asymmetric keys (scales better than N² tokens)
- Challenge-response handshake
- Per-message signing to prevent spoofing
- TOFU, static config, or CA options

FLOW CONTROL (Section 9):
- Credit-based flow control
- PEER_BUSY/PEER_READY backpressure signals
- Token bucket rate limiting (per-peer, per-agent, fleet-wide)
- Bounded queues with drop policies

TRANSPORT ABSTRACTION (Section 11):
- Pluggable PeerTransport interface
- WebSocket implementation (default)
- NATS JetStream implementation (optional)
- Migration path from WebSocket to NATS

TIMELINE (Section 14.3-14.4):
- Realistic estimate: 8-10 weeks (not 4-5)
- MVP option for 4-week delivery
- Phase 6: Stabilization added

OPEN QUESTIONS (Section 16):
- 10 unresolved questions for discussion
- Recommendations for each
- Clear decision points before implementation

bd-TBD
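As an illustration of the token bucket rate limiting listed under flow control, here is a minimal sketch of a per-peer limiter. The class names, constructor parameters, and the choice to pass the clock in explicitly are assumptions for the example; the proposal's Section 9 may define different semantics.

```typescript
// Minimal token bucket. Time is passed in explicitly so behaviour is
// deterministic; a real daemon would likely read a monotonic clock instead.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends `cost` tokens if the message may be sent now.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// Hypothetical per-peer wrapper: one bucket per peer ID, created on first use.
class PeerRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allow(peerId: string, nowMs: number): boolean {
    let bucket = this.buckets.get(peerId);
    if (bucket === undefined) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, nowMs);
      this.buckets.set(peerId, bucket);
    }
    return bucket.tryConsume(nowMs);
  }
}
```

The same bucket could be keyed by agent name or shared fleet-wide; a rejected message would presumably feed into the bounded queue or a PEER_BUSY signal, though that wiring is not shown here.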

Addresses storage requirements for federated deployments by separating:
- Ephemeral storage (routing): Memory or NATS JetStream for message queues
- Durable storage (trajectories): File/SQLite local + PostgreSQL/S3 central

References the trajectories proposal (PR #3) for detailed format specification. Includes configuration examples and federation impact analysis.

Tasks organized by phase (1-5) and assigned to agent roles:
- Architect: Protocol design, testing, docs (3 tasks)
- Network: PeerConnection, PeerManager, reconnection, flow control (8 tasks)
- Router: FleetRegistry, routing, broadcast, delivery confirmation (8 tasks)
- Security: TLS, Ed25519 authentication (2 tasks)
- Storage: Message queues, trajectory storage (3 tasks)
- CLI: Fleet commands, config, dashboard (4 tasks)

Dependencies mapped to ensure correct build order. See docs/FEDERATION_PROPOSAL.md for full specification.

Changes:
- Add collaborators for cross-boundary tasks (8 tasks now dual-assigned)
- Fix fed-014 dependency (queue can start after fed-004, not fed-013)
- Add fed-026a for PeerTransport interface before NATS adapter
- Add Architect review to Security tasks (fed-019)
- Lower priority on fed-012 (loop prevention can merge with routing)

Dual assignments:
- fed-001: Architect + Network (protocol types)
- fed-011: Router + Network (federated router integration)
- fed-014: Storage + Network (message queue)
- fed-018: Security + Network (TLS)
- fed-019: Security + Architect (Ed25519 crypto review)
- fed-022: Router + Network (delivery confirmation)
- fed-023: Network + Storage (flow control)
- fed-026a: Architect + Network (transport interface)
- fed-027: Architect + Network + Router (integration tests)

Control Plane Tasks (12):
- ctrl-001: Design Control API (REST + WebSocket)
- ctrl-002: Lead Agent orchestration
- ctrl-003: Web dashboard v2 (fleet control)
- ctrl-004: Human authentication (OAuth/magic link)
- ctrl-005: Push notification service (APNs/FCM)
- ctrl-006: iPhone app MVP
- ctrl-007: Slack/Discord bot integration
- ctrl-008: Human escalation queue
- ctrl-009: Agent skills registry
- ctrl-010: Code Graph integration (from ai-maestro)
- ctrl-011: Agent health monitoring (from ai-maestro)
- ctrl-012: Agent portability export/import (from ai-maestro)

Competitive Analysis:
- ai-maestro uses file-based messaging (human relay required)
- agent-relay uses auto-injection (truly autonomous)
- Learn from ai-maestro: Code Graphs, health monitoring, portability
- Our advantage: real-time messaging backbone

- agent-relay owns ephemeral storage (routing queues, ACKs, flow control)
- agent-trajectories owns durable storage (trajectories, knowledge workspace)
- Add event emission interface for agent-relay → agent-trajectories integration
- Remove duplicate trajectory storage details (now in agent-trajectories repo)
- Update summary to reflect separation of concerns

Decision: Use Mem0 (github.com/mem0ai/mem0) as memory layer for agent-trajectories rather than building from scratch.

Why Mem0:
- 25k+ stars, YC-backed, active development
- Multi-LLM support (not just OpenAI)
- MCP integration exists for Claude Code
- Self-hosted option (Apache 2.0)
- +26% accuracy vs OpenAI Memory benchmarks

What we build on top:
- Task-based trajectory grouping
- Inter-agent event capture
- Fleet-wide knowledge workspace
- .trajectory export format

New tasks (mem-001 through mem-005):
- Integrate Mem0 SDK
- Configure MCP for Claude Code
- Build trajectory layer
- Implement knowledge workspace
- Abstract MemoryBackend interface

See docs/MEMORY_STACK_DECISION.md for full rationale.

- Add section explaining MCP-based approach where Claude Code IS the LLM
- Update integration examples to use infer:false (no API needed)
- Add direct Qdrant alternative for simpler implementation
- Document embedding options without paid APIs (Ollama, FastEmbed)
- Update next steps to reflect MCP-first approach

Key insight: With MCP, the agent handles intelligence, Mem0 becomes pure storage + vector search. No Anthropic SDK required.

@memory

- Define pattern namespace system (@relay:, @memory:, @Custom:)
- Add hook lifecycle events (onSessionStart, onOutput, etc.)
- Document HookContext and programmatic API
- Add relay.config.ts configuration format
- Create 6 hook-* tasks for implementation roadmap

Hooks enable:
- Automatic memory prompts at session end
- User-defined pattern handlers
- Integration points for extensions

See docs/HOOKS_API.md for full design.
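A pattern namespace system of this kind could be dispatched roughly as follows. This is a sketch under stated assumptions: the `@namespace: payload` line syntax, case-insensitive namespace matching, and the `PatternRegistry` class are all hypothetical, and docs/HOOKS_API.md remains the reference for the actual design.

```typescript
type ParsedPattern = { namespace: string; payload: string };
type PatternHandler = (payload: string) => string;

// Assumed syntax: a line that starts with "@name:" followed by a payload.
const PATTERN_RE = /^@([A-Za-z][\w-]*):\s*(.*)$/;

function parsePattern(line: string): ParsedPattern | null {
  const match = PATTERN_RE.exec(line.trim());
  if (match === null) return null;
  const [, namespace = "", payload = ""] = match;
  // Assumption: namespaces are matched case-insensitively.
  return { namespace: namespace.toLowerCase(), payload };
}

// Hypothetical registry mapping a namespace to a user-defined handler.
class PatternRegistry {
  private readonly handlers = new Map<string, PatternHandler>();

  register(namespace: string, handler: PatternHandler): void {
    this.handlers.set(namespace.toLowerCase(), handler);
  }

  // Returns the handler's result, or null when the line is not a known pattern.
  dispatch(line: string): string | null {
    const parsed = parsePattern(line);
    if (parsed === null) return null;
    const handler = this.handlers.get(parsed.namespace);
    return handler === undefined ? null : handler(parsed.payload);
  }
}
```

Ordinary agent output that does not begin with a registered namespace falls through with `null`, so unrelated lines are left untouched.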

- Add detailed spec for each lifecycle event:
  - onSessionStart: when, trigger point, code example, use cases
  - onOutput: polling mechanism, handler signature, performance notes
  - onIdle: threshold config, once-per-period firing
  - onMessageReceived: suppress/modify capability
  - onSessionEnd: SIGINT handling, wait for response
- Add HookEmitter class design
- Add Event Summary Table
- Create 7 new granular tasks (hook-007 to hook-013)

Tasks cover: HookEmitter, each lifecycle event, and types.

Hook Context (read-only):
- agentId, agentName, sessionId, workingDir, projectName
- recentOutput (last 50 chunks), recentMessages (last 20)
- Timing: sessionStartTime, lastOutputTime, idleSeconds

Hook Result (allowed actions):
- inject: max 2000 chars, sanitized
- suppress: for onMessageReceived only
- stop: prevent other handlers
- sendMessage: one per invocation, max 5000 chars
- log: audit log entry

Prohibited:
- File system access
- Shell execution
- Network requests
- Env modification
- Full output access

Capability escalation via explicit config grants. Added hook-014 task for sandboxing implementation.
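The limits on hook results could be enforced by a small post-processing step along these lines. The field names and the 2000/5000 character limits come from the description above; the field types, the `sendMessage` shape, truncation (rather than rejection) of oversized values, and the meaning of "sanitized" as control-character stripping are assumptions for this sketch.

```typescript
type HookEvent =
  | "onSessionStart"
  | "onOutput"
  | "onIdle"
  | "onMessageReceived"
  | "onSessionEnd";

// Field names follow the commit message; the types are assumptions.
interface HookResult {
  inject?: string;
  suppress?: boolean;
  stop?: boolean;
  sendMessage?: { to: string; body: string };
  log?: string;
}

const MAX_INJECT_CHARS = 2000;
const MAX_MESSAGE_CHARS = 5000;

// Assumed sanitization: drop control characters but keep tab and newline.
function stripControlChars(text: string): string {
  return text.replace(/[\u0000-\u0008\u000B-\u001F\u007F]/g, "");
}

// Returns a copy of the result with the sandbox limits applied.
function enforceLimits(event: HookEvent, result: HookResult): HookResult {
  const safe: HookResult = {};
  if (result.inject !== undefined) {
    safe.inject = stripControlChars(result.inject).slice(0, MAX_INJECT_CHARS);
  }
  // suppress is only honoured for onMessageReceived.
  if (result.suppress === true && event === "onMessageReceived") {
    safe.suppress = true;
  }
  if (result.stop === true) safe.stop = true;
  if (result.sendMessage !== undefined) {
    safe.sendMessage = {
      to: result.sendMessage.to,
      body: result.sendMessage.body.slice(0, MAX_MESSAGE_CHARS),
    };
  }
  if (result.log !== undefined) safe.log = result.log;
  return safe;
}
```

Because `sendMessage` is a single optional object rather than an array, the "one per invocation" rule is expressed by the type itself in this sketch.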

@ticket

Examples cover:
1. Memory integration - load context, prompt to save
2. Error detection - alert coordinator on failures
3. Message filtering - suppress/highlight by priority
4. Custom pattern - @ticket: handler
5. Coordinator hooks - special behavior for lead agent
6. Minimal config - just session end prompt
7. Debug mode - log all events

Each example demonstrates:
- Using HookContext (read-only)
- Returning HookResult (inject, sendMessage, log, suppress)
- Staying within sandbox limits

onSessionStart (2 examples):
- Inject project context
- Role-based context by agent name

onOutput (3 examples):
- Error detection and alerting
- Progress tracking (test results)
- Security keyword alerting

onIdle (3 examples):
- Escalating idle prompts (30s gentle, 2min urgent)
- Auto-save reminder
- Silent coordinator notification

onMessageReceived (4 examples):
- Custom formatting with priority
- Suppress broadcasts while focused
- Filter by sender whitelist
- Transform task assignments

onSessionEnd (4 examples):
- Memory save prompt
- Notify team of departure with duration
- Request summary before exit
- Silent logging only
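The escalating idle prompt listed under onIdle might look roughly like this. The 30-second and 2-minute thresholds come from the list above; the handler signature, the reduced context and result types, and the prompt wording are assumptions, not the examples from the hooks document.

```typescript
// Reduced, hypothetical versions of the hook types, limited to what this
// handler needs.
type IdleContext = { readonly agentName: string; readonly idleSeconds: number };
type IdleResult = { inject?: string };

const GENTLE_AFTER_SECONDS = 30;
const URGENT_AFTER_SECONDS = 120;

// Returns a gentle nudge after 30s of idleness and an urgent one after 2min.
function onIdle(ctx: IdleContext): IdleResult {
  if (ctx.idleSeconds >= URGENT_AFTER_SECONDS) {
    return { inject: `${ctx.agentName}: idle for over 2 minutes, please report status.` };
  }
  if (ctx.idleSeconds >= GENTLE_AFTER_SECONDS) {
    return { inject: `${ctx.agentName}: still working? Share an update when ready.` };
  }
  return {}; // below the first threshold: do nothing
}
```

Since the handler only reads the context and returns an `inject` string, it stays within the read-only and allowed-action limits described for hooks.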

This document supersedes the original federation proposal with a realistic assessment of what's built today and a detailed roadmap for achieving the N-servers-per-org vision.

Key sections:
- Current state analysis with file references
- Gap analysis comparing PR #8 proposal vs reality
- Target architecture with org-centric model
- 5-phase implementation roadmap (9 weeks total)
- Per-user team pricing model
- Technical specs for P2P protocol and agent registry

Related: #8

Added Appendix B with detailed solutions for distributed systems challenges identified in PR #8's review:

Critical (🔴):
- End-to-end delivery confirmation via capture-pane verification
- Registry consistency using cloud as authoritative source
- Message deduplication with TTL-based seen set

High Priority (🟡):
- Backpressure with PEER_BUSY/PEER_READY and bounded queues
- Distributed tracing with correlation IDs

Medium Priority:
- NAT/firewall traversal with hybrid topology
- Clock skew handling via relative TTLs

Also preserved PR #8's detailed protocol specification (PEER_HELLO, PEER_ROUTE, etc.) and hybrid topology recommendation. The document now serves as the authoritative architecture reference, superseding PR #8 while incorporating its valuable insights.
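Message deduplication with a TTL-based seen set can be sketched as below. The `SeenSet` class, its method names, and the explicit clock parameter are assumptions for illustration; Appendix B may specify different eviction or sizing rules.

```typescript
// Remembers message IDs for a fixed TTL so redelivered messages can be dropped.
// Expiry is computed from the local receive time only, so it does not depend
// on the sender's clock.
class SeenSet {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an ID is seen within the TTL window,
  // false for a duplicate that should be discarded.
  markIfNew(messageId: string, nowMs: number): boolean {
    this.evictExpired(nowMs);
    if (this.expiresAt.has(messageId)) return false;
    this.expiresAt.set(messageId, nowMs + this.ttlMs);
    return true;
  }

  size(): number {
    return this.expiresAt.size;
  }

  // Linear scan for simplicity; a production version might use a time-ordered
  // queue so eviction does not touch every entry.
  private evictExpired(nowMs: number): void {
    this.expiresAt.forEach((expiry, id) => {
      if (expiry <= nowMs) this.expiresAt.delete(id);
    });
  }
}
```

A router would call `markIfNew` before injecting or forwarding, so a message that reaches a server twice (for example after a reconnect and retry) is handled only once while its ID is still remembered.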

khaliqgant · 2026-01-07T06:20:15Z

Closing in favor of #91

claude and others added 23 commits December 21, 2025 07:47

feat: Add detached mode for long-running agent sessions

d540dc1

Add -d/--detach flag to start agents in background, allowing SSH users to disconnect without losing agent sessions. Includes attach/kill commands for session management.

test: Add tests for attach, kill commands and detach flag

5320172

fix(test): Check for .project marker instead of dataDir in listProjec…

5b7d60e

…ts test The test was checking if dataDir exists, but listProjects() requires the .project marker file to be present.

chore: Hide internal --_daemon flag from CLI help

ff51bdb

Merge branch 'claude/analyze-mcp-agent-mail-IXbNF' of github.com:khal…

75b43e9

…iqgant/agent-relay into claude/continue-pr-8-7tWzb

Merge branch 'main' of github.com:khaliqgant/agent-relay into claude/…

ed171e3

…continue-pr-8-7tWzb

Merge pull request #15 from khaliqgant/claude/continue-pr-8-7tWzb

bbcb4e9

feat: Add detached mode for long-running agent sessions

Merge main into feature branch

76cce4e

khaliqgant pushed a commit that referenced this pull request Dec 30, 2025

Add hooks API proposal with trajectory integration

6b01908

Port the hooks API design document from PR #8 with additional trajectory integration examples showing how hooks can work with the PDERO paradigm and trail CLI.

khaliqgant mentioned this pull request Jan 7, 2026

Add comprehensive multi-server architecture document #91

Merged

khaliqgant closed this Jan 7, 2026
