Add federation proposal for cross-server agent communication #8
Closed
Comprehensive design document for extending agent-relay to support
federated multi-server deployments while preserving the core
differentiator: automatic message injection via tmux.

Key design decisions:
- Separation of concerns: routing (network) vs injection (local)
- Hybrid topology: optional hub for discovery, direct peer connections
- Progressive enhancement: single-server unchanged, federation opt-in
- WebSocket + TLS for peer-to-peer daemon communication
- Message queuing for resilience during disconnects

Includes:
- Full protocol specification (PEER_HELLO, PEER_ROUTE, etc.)
- Agent discovery and registry design
- Security model (TLS, pre-shared tokens)
- Configuration schema
- CLI interface design
- 5-phase implementation plan (~4-5 weeks)

bd-TBD
Identifies major gaps and risks in the federation design:

HIGH SEVERITY:
- No end-to-end delivery guarantee (sender doesn't know if agent received)
- Registry consistency race conditions (split-brain on name collisions)
- Message ordering not guaranteed across servers

MEDIUM SEVERITY:
- Token management doesn't scale (N² tokens for N servers)
- No message-level authentication (spoofing possible)
- No rate limiting (flood attacks possible)
- Debugging distributed failures is hard (no tracing)
- NAT/firewall traversal not addressed
- Timeline underestimated (8-10 weeks realistic vs 4-5 proposed)

Includes:
- Specific failure scenarios for each issue
- Recommendations for fixes
- Alternative approaches (NATS, SSH tunnels)
- Suggested MVP scope to ship faster

bd-TBD
Major additions to address identified issues:

DELIVERY CONFIRMATION (Section 7):
- End-to-end ACK so sender knows message was injected
- Detection via capture-pane after send-keys
- Optional confirmation notification to sender

REGISTRY CONSISTENCY (Section 5.3):
- Fleet-wide unique names (no split-brain)
- Quorum-based registration with Lamport timestamps
- Clear error on name collision with suggestions

AUTHENTICATION (Section 8.1-8.3):
- Ed25519 asymmetric keys (scales better than N² tokens)
- Challenge-response handshake
- Per-message signing to prevent spoofing
- TOFU, static config, or CA options

FLOW CONTROL (Section 9):
- Credit-based flow control
- PEER_BUSY/PEER_READY backpressure signals
- Token bucket rate limiting (per-peer, per-agent, fleet-wide)
- Bounded queues with drop policies

TRANSPORT ABSTRACTION (Section 11):
- Pluggable PeerTransport interface
- WebSocket implementation (default)
- NATS JetStream implementation (optional)
- Migration path from WebSocket to NATS

TIMELINE (Section 14.3-14.4):
- Realistic estimate: 8-10 weeks (not 4-5)
- MVP option for 4-week delivery
- Phase 6: Stabilization added

OPEN QUESTIONS (Section 16):
- 10 unresolved questions for discussion
- Recommendations for each
- Clear decision points before implementation

bd-TBD
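The token bucket rate limiting listed under FLOW CONTROL could look roughly like the sketch below. It is not taken from the proposal: the `TokenBucket` class, its parameter names, and the injectable clock are all assumptions made for illustration. One bucket would presumably be kept per peer, per agent, and for the fleet as a whole.

```typescript
// Hypothetical token bucket; every name here is illustrative, not from the proposal.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;         // maximum burst size
    this.refillPerSec = refillPerSec; // sustained messages per second
    this.now = now;                   // injectable clock, useful in tests
    this.tokens = capacity;           // start with a full bucket
    this.lastRefillMs = now();
  }

  // Returns true and consumes one token if a message may be sent now.
  tryConsume(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Adds tokens in proportion to elapsed time, capped at capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
  }
}
```

A caller that gets `false` from `tryConsume()` would queue or drop the message according to the bounded-queue policy; how that interacts with PEER_BUSY/PEER_READY is left to the full specification.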
Addresses storage requirements for federated deployments by separating:
- Ephemeral storage (routing): Memory or NATS JetStream for message queues
- Durable storage (trajectories): File/SQLite local + PostgreSQL/S3 central

References the trajectories proposal (PR #3) for detailed format specification.
Includes configuration examples and federation impact analysis.
Tasks organized by phase (1-5) and assigned to agent roles:
- Architect: Protocol design, testing, docs (3 tasks)
- Network: PeerConnection, PeerManager, reconnection, flow control (8 tasks)
- Router: FleetRegistry, routing, broadcast, delivery confirmation (8 tasks)
- Security: TLS, Ed25519 authentication (2 tasks)
- Storage: Message queues, trajectory storage (3 tasks)
- CLI: Fleet commands, config, dashboard (4 tasks)

Dependencies mapped to ensure correct build order.
See docs/FEDERATION_PROPOSAL.md for full specification.
Changes:
- Add collaborators for cross-boundary tasks (8 tasks now dual-assigned)
- Fix fed-014 dependency (queue can start after fed-004, not fed-013)
- Add fed-026a for PeerTransport interface before NATS adapter
- Add Architect review to Security tasks (fed-019)
- Lower priority on fed-012 (loop prevention can merge with routing)

Dual assignments:
- fed-001: Architect + Network (protocol types)
- fed-011: Router + Network (federated router integration)
- fed-014: Storage + Network (message queue)
- fed-018: Security + Network (TLS)
- fed-019: Security + Architect (Ed25519 crypto review)
- fed-022: Router + Network (delivery confirmation)
- fed-023: Network + Storage (flow control)
- fed-026a: Architect + Network (transport interface)
- fed-027: Architect + Network + Router (integration tests)
Control Plane Tasks (12):
- ctrl-001: Design Control API (REST + WebSocket)
- ctrl-002: Lead Agent orchestration
- ctrl-003: Web dashboard v2 (fleet control)
- ctrl-004: Human authentication (OAuth/magic link)
- ctrl-005: Push notification service (APNs/FCM)
- ctrl-006: iPhone app MVP
- ctrl-007: Slack/Discord bot integration
- ctrl-008: Human escalation queue
- ctrl-009: Agent skills registry
- ctrl-010: Code Graph integration (from ai-maestro)
- ctrl-011: Agent health monitoring (from ai-maestro)
- ctrl-012: Agent portability export/import (from ai-maestro)

Competitive Analysis:
- ai-maestro uses file-based messaging (human relay required)
- agent-relay uses auto-injection (truly autonomous)
- Learn from ai-maestro: Code Graphs, health monitoring, portability
- Our advantage: real-time messaging backbone
- agent-relay owns ephemeral storage (routing queues, ACKs, flow control) - agent-trajectories owns durable storage (trajectories, knowledge workspace) - Add event emission interface for agent-relay → agent-trajectories integration - Remove duplicate trajectory storage details (now in agent-trajectories repo) - Update summary to reflect separation of concerns
Decision: Use Mem0 (github.com/mem0ai/mem0) as memory layer for
agent-trajectories rather than building from scratch.

Why Mem0:
- 25k+ stars, YC-backed, active development
- Multi-LLM support (not just OpenAI)
- MCP integration exists for Claude Code
- Self-hosted option (Apache 2.0)
- +26% accuracy vs OpenAI Memory benchmarks

What we build on top:
- Task-based trajectory grouping
- Inter-agent event capture
- Fleet-wide knowledge workspace
- .trajectory export format

New tasks (mem-001 through mem-005):
- Integrate Mem0 SDK
- Configure MCP for Claude Code
- Build trajectory layer
- Implement knowledge workspace
- Abstract MemoryBackend interface

See docs/MEMORY_STACK_DECISION.md for full rationale.
- Add section explaining MCP-based approach where Claude Code IS the LLM
- Update integration examples to use infer:false (no API needed)
- Add direct Qdrant alternative for simpler implementation
- Document embedding options without paid APIs (Ollama, FastEmbed)
- Update next steps to reflect MCP-first approach

Key insight: With MCP, the agent handles intelligence, Mem0 becomes
pure storage + vector search. No Anthropic SDK required.
- Define pattern namespace system (@relay:, @memory:, @Custom:)
- Add hook lifecycle events (onSessionStart, onOutput, etc.)
- Document HookContext and programmatic API
- Add relay.config.ts configuration format
- Create 6 hook-* tasks for implementation roadmap

Hooks enable:
- Automatic memory prompts at session end
- User-defined pattern handlers
- Integration points for extensions

See docs/HOOKS_API.md for full design.
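To make the pattern namespace idea concrete, here is a minimal parsing sketch. It assumes a pattern is a single output line of the form `@namespace: payload`; the exact syntax is defined in docs/HOOKS_API.md and may differ. `PatternMatch` and `parsePattern` are hypothetical names, not part of the documented API.

```typescript
// Hypothetical shape of a parsed pattern; not taken from docs/HOOKS_API.md.
interface PatternMatch {
  namespace: string; // e.g. "relay" or "memory"
  payload: string;   // everything after the colon, trimmed
}

// Assumed syntax: a line that starts with "@", then a namespace, then ":".
const PATTERN_RE = /^@([A-Za-z][\w-]*):\s*(.*)$/;

// Returns the parsed pattern, or null when the line is ordinary output.
function parsePattern(line: string): PatternMatch | null {
  const m = PATTERN_RE.exec(line.trim());
  if (m === null) {
    return null;
  }
  return { namespace: m[1] ?? "", payload: m[2] ?? "" };
}
```

A dispatcher could then look up a handler by `namespace`, which is one way user-defined pattern handlers might plug in alongside the built-in ones.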
- Add detailed spec for each lifecycle event:
  - onSessionStart: when, trigger point, code example, use cases
  - onOutput: polling mechanism, handler signature, performance notes
  - onIdle: threshold config, once-per-period firing
  - onMessageReceived: suppress/modify capability
  - onSessionEnd: SIGINT handling, wait for response
- Add HookEmitter class design
- Add Event Summary Table
- Create 7 new granular tasks (hook-007 to hook-013)

Tasks cover: HookEmitter, each lifecycle event, and types.
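As a rough illustration of what a HookEmitter might do, the sketch below registers handlers per lifecycle event and runs them in registration order. Only the five event names come from this PR. The method names `on` and `emit`, the `EmitResult` shape, and the rule that `stop: true` halts later handlers are assumptions.

```typescript
// Event names come from the PR; everything else in this sketch is hypothetical.
type LifecycleEvent =
  | "onSessionStart"
  | "onOutput"
  | "onIdle"
  | "onMessageReceived"
  | "onSessionEnd";

interface EmitResult {
  inject?: string; // text a handler wants injected
  stop?: boolean;  // when true, later handlers are skipped (assumed semantics)
}

type Handler = (payload: string) => EmitResult | undefined;

class HookEmitter {
  private readonly handlers = new Map<LifecycleEvent, Handler[]>();

  // Registers a handler; handlers run in registration order.
  on(event: LifecycleEvent, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Runs handlers for the event and collects their results.
  emit(event: LifecycleEvent, payload: string): EmitResult[] {
    const results: EmitResult[] = [];
    for (const handler of this.handlers.get(event) ?? []) {
      const result = handler(payload);
      if (result !== undefined) {
        results.push(result);
        if (result.stop === true) {
          break;
        }
      }
    }
    return results;
  }
}
```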
Hook Context (read-only):
- agentId, agentName, sessionId, workingDir, projectName
- recentOutput (last 50 chunks), recentMessages (last 20)
- Timing: sessionStartTime, lastOutputTime, idleSeconds

Hook Result (allowed actions):
- inject: max 2000 chars, sanitized
- suppress: for onMessageReceived only
- stop: prevent other handlers
- sendMessage: one per invocation, max 5000 chars
- log: audit log entry

Prohibited:
- File system access
- Shell execution
- Network requests
- Env modification
- Full output access

Capability escalation via explicit config grants.
Added hook-014 task for sandboxing implementation.
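The limits above could be enforced after each handler returns, along the lines of this sketch. The field names and the 2000/5000 character limits come from the commit message. The field types, the `enforceLimits` function, truncating rather than rejecting oversized values, and treating "sanitized" as stripping control characters are all assumptions.

```typescript
// Field names follow the commit message; the types are assumed.
interface HookContext {
  readonly agentId: string;
  readonly agentName: string;
  readonly sessionId: string;
  readonly workingDir: string;
  readonly projectName: string;
  readonly recentOutput: readonly string[];   // last 50 chunks
  readonly recentMessages: readonly string[]; // last 20
  readonly sessionStartTime: number;
  readonly lastOutputTime: number;
  readonly idleSeconds: number;
}

interface HookResult {
  inject?: string;
  suppress?: boolean;
  stop?: boolean;
  sendMessage?: { to: string; body: string }; // shape is hypothetical
  log?: string;
}

type HookEvent =
  | "onSessionStart" | "onOutput" | "onIdle" | "onMessageReceived" | "onSessionEnd";

const MAX_INJECT_CHARS = 2000;
const MAX_MESSAGE_CHARS = 5000;

// Returns a copy of the result with the sandbox limits applied (hypothetical helper).
function enforceLimits(event: HookEvent, result: HookResult): HookResult {
  const out: HookResult = { ...result };
  if (out.inject !== undefined) {
    // Assumed sanitization: drop control characters except tab and newline.
    const cleaned = out.inject.replace(/[\u0000-\u0008\u000B-\u001F\u007F]/g, "");
    out.inject = cleaned.slice(0, MAX_INJECT_CHARS);
  }
  if (out.sendMessage !== undefined) {
    out.sendMessage = {
      to: out.sendMessage.to,
      body: out.sendMessage.body.slice(0, MAX_MESSAGE_CHARS),
    };
  }
  if (event !== "onMessageReceived") {
    delete out.suppress; // suppress is only honored for onMessageReceived
  }
  return out;
}
```

Typing `sendMessage` as a single object rather than an array is one way to express the one-message-per-invocation rule in the type itself.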
Examples cover:
1. Memory integration - load context, prompt to save
2. Error detection - alert coordinator on failures
3. Message filtering - suppress/highlight by priority
4. Custom pattern - @ticket: handler
5. Coordinator hooks - special behavior for lead agent
6. Minimal config - just session end prompt
7. Debug mode - log all events

Each example demonstrates:
- Using HookContext (read-only)
- Returning HookResult (inject, sendMessage, log, suppress)
- Staying within sandbox limits
onSessionStart (2 examples):
- Inject project context
- Role-based context by agent name

onOutput (3 examples):
- Error detection and alerting
- Progress tracking (test results)
- Security keyword alerting

onIdle (3 examples):
- Escalating idle prompts (30s gentle, 2min urgent)
- Auto-save reminder
- Silent coordinator notification

onMessageReceived (4 examples):
- Custom formatting with priority
- Suppress broadcasts while focused
- Filter by sender whitelist
- Transform task assignments

onSessionEnd (4 examples):
- Memory save prompt
- Notify team of departure with duration
- Request summary before exit
- Silent logging only
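The escalating idle prompt example (gentle at 30 seconds, urgent at 2 minutes) might be written roughly as follows. The thresholds come from the list above. The prompt wording, the `IdleContext` and `IdleResult` types, and the `escalatingIdlePrompt` name are placeholders, not the examples shipped in docs/HOOKS_API.md.

```typescript
// Minimal stand-ins for the real HookContext / HookResult types (assumed shapes).
interface IdleContext {
  idleSeconds: number;
}

interface IdleResult {
  inject?: string;
}

const GENTLE_AFTER_SEC = 30;
const URGENT_AFTER_SEC = 120;

// Hypothetical onIdle handler: no prompt below 30s, gentle from 30s, urgent from 2min.
function escalatingIdlePrompt(ctx: IdleContext): IdleResult {
  if (ctx.idleSeconds >= URGENT_AFTER_SEC) {
    return { inject: "You have been idle for over 2 minutes. Please report status or continue." };
  }
  if (ctx.idleSeconds >= GENTLE_AFTER_SEC) {
    return { inject: "Still working? Let the team know if you are blocked." };
  }
  return {};
}
```

Checking the larger threshold first keeps the two cases from overlapping; once-per-period firing would be handled by whatever invokes the handler.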
Add -d/--detach flag to start agents in background, allowing SSH users to disconnect without losing agent sessions. Includes attach/kill commands for session management.
…ts test The test was checking if dataDir exists, but listProjects() requires the .project marker file to be present.
…iqgant/agent-relay into claude/continue-pr-8-7tWzb
…continue-pr-8-7tWzb
feat: Add detached mode for long-running agent sessions
khaliqgant pushed a commit that referenced this pull request Dec 30, 2025
Port the hooks API design document from PR #8 with additional trajectory integration examples showing how hooks can work with the PDERO paradigm and trail CLI.
khaliqgant pushed a commit that referenced this pull request Jan 7, 2026
This document supersedes the original federation proposal with a
realistic assessment of what's built today and a detailed roadmap for
achieving the N-servers-per-org vision.

Key sections:
- Current state analysis with file references
- Gap analysis comparing PR #8 proposal vs reality
- Target architecture with org-centric model
- 5-phase implementation roadmap (9 weeks total)
- Per-user team pricing model
- Technical specs for P2P protocol and agent registry

Related: #8
khaliqgant pushed a commit that referenced this pull request Jan 7, 2026
Added Appendix B with detailed solutions for distributed systems
challenges identified in PR #8's review:

Critical (🔴):
- End-to-end delivery confirmation via capture-pane verification
- Registry consistency using cloud as authoritative source
- Message deduplication with TTL-based seen set

High Priority (🟡):
- Backpressure with PEER_BUSY/PEER_READY and bounded queues
- Distributed tracing with correlation IDs

Medium Priority:
- NAT/firewall traversal with hybrid topology
- Clock skew handling via relative TTLs

Also preserved PR #8's detailed protocol specification (PEER_HELLO,
PEER_ROUTE, etc.) and hybrid topology recommendation.

The document now serves as the authoritative architecture reference,
superseding PR #8 while incorporating its valuable insights.
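Message deduplication with a TTL-based seen set can be sketched as below. This is an illustration only, not the Appendix B design: `SeenSet`, `markIfNew`, and the injectable clock are hypothetical names. Entries expire relative to the receiving daemon's own clock, which is one reading of "relative TTLs".

```typescript
// Hypothetical TTL-based seen set for dropping duplicate message IDs.
class SeenSet {
  private readonly expiries = new Map<string, number>(); // message ID -> local expiry time (ms)
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
  }

  // Returns true the first time an ID is seen within the TTL window, false for duplicates.
  markIfNew(messageId: string): boolean {
    const nowMs = this.now();
    this.prune(nowMs);
    if (this.expiries.has(messageId)) {
      return false;
    }
    this.expiries.set(messageId, nowMs + this.ttlMs);
    return true;
  }

  // Drops entries whose TTL has elapsed so the set stays bounded.
  private prune(nowMs: number): void {
    this.expiries.forEach((expiresAt, id) => {
      if (expiresAt <= nowMs) {
        this.expiries.delete(id);
      }
    });
  }
}
```

A router would call `markIfNew(id)` on every inbound routed message and skip injection when it returns `false`. Pruning on every call is linear in the set size; a production version would likely prune on a timer instead.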
Closing in favor of #91