@khaliqgant
Collaborator

Comprehensive design document for extending agent-relay to support
federated multi-server deployments while preserving the core
differentiator: automatic message injection via tmux.

Key design decisions:

  • Separation of concerns: routing (network) vs injection (local)
  • Hybrid topology: optional hub for discovery, direct peer connections
  • Progressive enhancement: single-server unchanged, federation opt-in
  • WebSocket + TLS for peer-to-peer daemon communication
  • Message queuing for resilience during disconnects

Includes:

  • Full protocol specification (PEER_HELLO, PEER_ROUTE, etc.)
  • Agent discovery and registry design
  • Security model (TLS, pre-shared tokens)
  • Configuration schema
  • CLI interface design
  • 5-phase implementation plan (~4-5 weeks)
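
To make the routing-versus-injection split concrete, here is a minimal sketch of how a daemon might classify incoming peer messages. Only the message names PEER_HELLO and PEER_ROUTE come from the description above; every field name, `parsePeerMessage`, and `handlePeerMessage` are hypothetical and not taken from the proposal.

```typescript
// Hypothetical message shapes: only the type tags come from the proposal.
type PeerHello = { type: "PEER_HELLO"; serverId: string; agents: string[] };
type PeerRoute = { type: "PEER_ROUTE"; id: string; from: string; to: string; body: string };
type PeerMessage = PeerHello | PeerRoute;

// Parse a wire frame and reject unknown message types.
// Only the type tag is checked here; a real parser would validate every field.
function parsePeerMessage(raw: string): PeerMessage {
  const msg = JSON.parse(raw) as { type?: unknown };
  if (msg.type !== "PEER_HELLO" && msg.type !== "PEER_ROUTE") {
    throw new Error("unknown peer message type");
  }
  return msg as PeerMessage;
}

// Routing decides where a message goes; injection only happens on the
// server that hosts the target agent.
function handlePeerMessage(
  msg: PeerMessage,
  localAgents: Set<string>
): "register" | "inject" | "forward" {
  switch (msg.type) {
    case "PEER_HELLO":
      return "register";
    case "PEER_ROUTE":
      return localAgents.has(msg.to) ? "inject" : "forward";
  }
}
```

A tagged union keeps the network layer unaware of tmux: only the `"inject"` branch would ever touch a local session.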

bd-TBD

claude and others added 23 commits December 21, 2025 07:47
Identifies major gaps and risks in the federation design:

HIGH SEVERITY:
- No end-to-end delivery guarantee (sender doesn't know if agent received)
- Registry consistency race conditions (split-brain on name collisions)
- Message ordering not guaranteed across servers

MEDIUM SEVERITY:
- Token management doesn't scale (N² tokens for N servers)
- No message-level authentication (spoofing possible)
- No rate limiting (flood attacks possible)
- Debugging distributed failures is hard (no tracing)
- NAT/firewall traversal not addressed
- Timeline underestimated (8-10 weeks realistic vs 4-5 proposed)

Includes:
- Specific failure scenarios for each issue
- Recommendations for fixes
- Alternative approaches (NATS, SSH tunnels)
- Suggested MVP scope to ship faster

bd-TBD
Major additions to address identified issues:

DELIVERY CONFIRMATION (Section 7):
- End-to-end ACK so sender knows message was injected
- Detection via capture-pane after send-keys
- Optional confirmation notification to sender
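
A rough sketch of the detection step, assuming the daemon already holds the text returned by `tmux capture-pane` after the `send-keys` call. The `DeliveryAck` shape and `confirmDelivery` are hypothetical names; the proposal does not specify the matching rule, so the whitespace-stripping comparison below is an assumption.

```typescript
// Hypothetical ACK shape returned to the sending server.
type DeliveryAck = { messageId: string; delivered: boolean };

// Remove all whitespace so that terminal line wrapping, which can split
// a message mid-word, does not defeat the comparison.
const stripWhitespace = (s: string): string => s.replace(/\s+/g, "");

// Compare the injected text against the captured pane contents.
function confirmDelivery(messageId: string, injected: string, paneText: string): DeliveryAck {
  const delivered = stripWhitespace(paneText).includes(stripWhitespace(injected));
  return { messageId, delivered };
}
```

Stripping whitespace can in principle produce false positives on very short messages, so a real implementation might embed a unique marker in the injected text.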

REGISTRY CONSISTENCY (Section 5.3):
- Fleet-wide unique names (no split-brain)
- Quorum-based registration with Lamport timestamps
- Clear error on name collision with suggestions
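
The quorum and Lamport-timestamp idea can be sketched as follows. The `Claim` shape, the tie-break on `serverId`, and the strict-majority rule are assumptions made for illustration; the commit message does not state them.

```typescript
// Hypothetical registration claim for a fleet-wide agent name.
type Claim = { name: string; serverId: string; lamport: number };

// Standard Lamport logical clock.
class LamportClock {
  private time = 0;

  // Advance the clock for a local event and return the new timestamp.
  tick(): number {
    this.time += 1;
    return this.time;
  }

  // Merge a timestamp received from a peer, then advance.
  observe(remote: number): number {
    this.time = Math.max(this.time, remote) + 1;
    return this.time;
  }
}

// Assumed rule: the earlier logical timestamp wins a name collision,
// with serverId as a deterministic tie-break.
function resolveCollision(a: Claim, b: Claim): Claim {
  if (a.lamport !== b.lamport) return a.lamport < b.lamport ? a : b;
  return a.serverId < b.serverId ? a : b;
}

// Assumed rule: a registration succeeds once a strict majority acknowledges it.
function hasQuorum(acks: number, fleetSize: number): boolean {
  return acks >= Math.floor(fleetSize / 2) + 1;
}
```

Because every server applies the same deterministic rule, two servers that see the same pair of claims pick the same winner without further coordination.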

AUTHENTICATION (Section 8.1-8.3):
- Ed25519 asymmetric keys (scales better than N² tokens)
- Challenge-response handshake
- Per-message signing to prevent spoofing
- TOFU, static config, or CA options

FLOW CONTROL (Section 9):
- Credit-based flow control
- PEER_BUSY/PEER_READY backpressure signals
- Token bucket rate limiting (per-peer, per-agent, fleet-wide)
- Bounded queues with drop policies
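
For reference, a token bucket of the kind mentioned above can be sketched in a few lines. The class name, constructor parameters, and the choice to pass the clock in explicitly are all assumptions; the same bucket could be instantiated per peer, per agent, or fleet-wide.

```typescript
// Minimal token bucket: holds up to `capacity` tokens, refilled continuously.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private tokens: number;
  private lastMs: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full
    this.lastMs = nowMs;
  }

  // Returns true and consumes one token if the message may be sent now.
  tryTake(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

Taking the timestamp as an argument rather than calling `Date.now()` inside keeps the limiter deterministic under test.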

TRANSPORT ABSTRACTION (Section 11):
- Pluggable PeerTransport interface
- WebSocket implementation (default)
- NATS JetStream implementation (optional)
- Migration path from WebSocket to NATS
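
A sketch of what a pluggable transport boundary could look like. Only the name `PeerTransport` comes from the commit message; its methods and the `InMemoryTransport` test double are hypothetical, and the real WebSocket and NATS adapters would sit behind the same interface.

```typescript
// Hypothetical method set for the PeerTransport interface.
interface PeerTransport {
  send(peerId: string, payload: string): void;
  onMessage(handler: (fromPeerId: string, payload: string) => void): void;
}

// In-process transport, useful for tests; not part of the proposal.
class InMemoryTransport implements PeerTransport {
  readonly id: string;
  private handlers: Array<(fromPeerId: string, payload: string) => void> = [];
  private peers = new Map<string, InMemoryTransport>();

  constructor(id: string) {
    this.id = id;
  }

  // Link two transports in both directions.
  connect(other: InMemoryTransport): void {
    this.peers.set(other.id, other);
    other.peers.set(this.id, this);
  }

  send(peerId: string, payload: string): void {
    const peer = this.peers.get(peerId);
    if (!peer) throw new Error(`unknown peer: ${peerId}`);
    for (const handler of peer.handlers) handler(this.id, payload);
  }

  onMessage(handler: (fromPeerId: string, payload: string) => void): void {
    this.handlers.push(handler);
  }
}
```

If routing code depends only on `PeerTransport`, a later move from WebSocket to NATS changes which adapter is constructed rather than the callers.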

TIMELINE (Section 14.3-14.4):
- Realistic estimate: 8-10 weeks (not 4-5)
- MVP option for 4-week delivery
- Phase 6: Stabilization added

OPEN QUESTIONS (Section 16):
- 10 unresolved questions for discussion
- Recommendations for each
- Clear decision points before implementation

bd-TBD
Addresses storage requirements for federated deployments by separating:
- Ephemeral storage (routing): Memory or NATS JetStream for message queues
- Durable storage (trajectories): File/SQLite local + PostgreSQL/S3 central

References the trajectories proposal (PR #3) for detailed format specification.
Includes configuration examples and federation impact analysis.
Tasks organized by phase (1-5) and assigned to agent roles:
- Architect: Protocol design, testing, docs (3 tasks)
- Network: PeerConnection, PeerManager, reconnection, flow control (8 tasks)
- Router: FleetRegistry, routing, broadcast, delivery confirmation (8 tasks)
- Security: TLS, Ed25519 authentication (2 tasks)
- Storage: Message queues, trajectory storage (3 tasks)
- CLI: Fleet commands, config, dashboard (4 tasks)

Dependencies mapped to ensure correct build order.
See docs/FEDERATION_PROPOSAL.md for full specification.
Changes:
- Add collaborators for cross-boundary tasks (8 tasks now dual-assigned)
- Fix fed-014 dependency (queue can start after fed-004, not fed-013)
- Add fed-026a for PeerTransport interface before NATS adapter
- Add Architect review to Security tasks (fed-019)
- Lower priority on fed-012 (loop prevention can merge with routing)

Dual assignments:
- fed-001: Architect + Network (protocol types)
- fed-011: Router + Network (federated router integration)
- fed-014: Storage + Network (message queue)
- fed-018: Security + Network (TLS)
- fed-019: Security + Architect (Ed25519 crypto review)
- fed-022: Router + Network (delivery confirmation)
- fed-023: Network + Storage (flow control)
- fed-026a: Architect + Network (transport interface)
- fed-027: Architect + Network + Router (integration tests)
Control Plane Tasks (12):
- ctrl-001: Design Control API (REST + WebSocket)
- ctrl-002: Lead Agent orchestration
- ctrl-003: Web dashboard v2 (fleet control)
- ctrl-004: Human authentication (OAuth/magic link)
- ctrl-005: Push notification service (APNs/FCM)
- ctrl-006: iPhone app MVP
- ctrl-007: Slack/Discord bot integration
- ctrl-008: Human escalation queue
- ctrl-009: Agent skills registry
- ctrl-010: Code Graph integration (from ai-maestro)
- ctrl-011: Agent health monitoring (from ai-maestro)
- ctrl-012: Agent portability export/import (from ai-maestro)

Competitive Analysis:
- ai-maestro uses file-based messaging (human relay required)
- agent-relay uses auto-injection (truly autonomous)
- Learn from ai-maestro: Code Graphs, health monitoring, portability
- Our advantage: real-time messaging backbone
- agent-relay owns ephemeral storage (routing queues, ACKs, flow control)
- agent-trajectories owns durable storage (trajectories, knowledge workspace)
- Add event emission interface for agent-relay → agent-trajectories integration
- Remove duplicate trajectory storage details (now in agent-trajectories repo)
- Update summary to reflect separation of concerns
Decision: Use Mem0 (github.com/mem0ai/mem0) as memory layer for
agent-trajectories rather than building from scratch.

Why Mem0:
- 25k+ stars, YC-backed, active development
- Multi-LLM support (not just OpenAI)
- MCP integration exists for Claude Code
- Self-hosted option (Apache 2.0)
- +26% accuracy vs OpenAI Memory benchmarks

What we build on top:
- Task-based trajectory grouping
- Inter-agent event capture
- Fleet-wide knowledge workspace
- .trajectory export format

New tasks (mem-001 through mem-005):
- Integrate Mem0 SDK
- Configure MCP for Claude Code
- Build trajectory layer
- Implement knowledge workspace
- Abstract MemoryBackend interface

See docs/MEMORY_STACK_DECISION.md for full rationale.
- Add section explaining MCP-based approach where Claude Code IS the LLM
- Update integration examples to use infer:false (no API needed)
- Add direct Qdrant alternative for simpler implementation
- Document embedding options without paid APIs (Ollama, FastEmbed)
- Update next steps to reflect MCP-first approach

Key insight: With MCP, the agent handles intelligence, Mem0 becomes
pure storage + vector search. No Anthropic SDK required.
- Define pattern namespace system (@relay:, @memory:, @custom:)
- Add hook lifecycle events (onSessionStart, onOutput, etc.)
- Document HookContext and programmatic API
- Add relay.config.ts configuration format
- Create 6 hook-* tasks for implementation roadmap

Hooks enable:
- Automatic memory prompts at session end
- User-defined pattern handlers
- Integration points for extensions

See docs/HOOKS_API.md for full design.
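
As an illustration of the pattern namespace idea, the sketch below dispatches a line of agent output to a handler keyed by its `@namespace:` prefix. The `PatternRegistry` class and the assumed syntax (prefix at the start of a line, payload after the colon) are hypothetical; docs/HOOKS_API.md remains the reference for the actual design.

```typescript
// Hypothetical handler signature: receives the text after "@namespace:".
type PatternHandler = (payload: string) => string;

class PatternRegistry {
  private handlers = new Map<string, PatternHandler>();

  // Register a handler for one namespace, e.g. "relay", "memory", "ticket".
  register(namespace: string, handler: PatternHandler): void {
    this.handlers.set(namespace, handler);
  }

  // Returns the handler result, or undefined if the line carries no
  // known pattern.
  dispatch(line: string): string | undefined {
    const match = /^@([A-Za-z][\w-]*):\s*(.*)$/.exec(line.trim());
    if (!match) return undefined;
    const [, namespace = "", payload = ""] = match;
    const handler = this.handlers.get(namespace);
    return handler ? handler(payload) : undefined;
  }
}
```

A user-defined namespace is then just another `register` call alongside the built-in ones.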
- Add detailed spec for each lifecycle event:
  - onSessionStart: when, trigger point, code example, use cases
  - onOutput: polling mechanism, handler signature, performance notes
  - onIdle: threshold config, once-per-period firing
  - onMessageReceived: suppress/modify capability
  - onSessionEnd: SIGINT handling, wait for response
- Add HookEmitter class design
- Add Event Summary Table
- Create 7 new granular tasks (hook-007 to hook-013)

Tasks cover: HookEmitter, each lifecycle event, and types.
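
A minimal sketch of an emitter for the five lifecycle events named above. The event names come from the commit message; the `on`/`emit` methods, the string payload, and the two-field result are assumptions. The `stop` flag mirrors the "stop: prevent other handlers" action listed under Hook Result elsewhere in this thread.

```typescript
type HookEvent =
  | "onSessionStart"
  | "onOutput"
  | "onIdle"
  | "onMessageReceived"
  | "onSessionEnd";

// Trimmed-down result: just enough to show handler chaining.
type HookResult = { inject?: string; stop?: boolean };
type HookHandler = (payload: string) => HookResult;

class HookEmitter {
  private handlers = new Map<HookEvent, HookHandler[]>();

  // Register a handler; handlers run in registration order.
  on(event: HookEvent, handler: HookHandler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Run handlers for an event and collect their results.
  // A result with stop=true prevents later handlers from running.
  emit(event: HookEvent, payload: string): HookResult[] {
    const results: HookResult[] = [];
    for (const handler of this.handlers.get(event) ?? []) {
      const result = handler(payload);
      results.push(result);
      if (result.stop) break;
    }
    return results;
  }
}
```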
Hook Context (read-only):
- agentId, agentName, sessionId, workingDir, projectName
- recentOutput (last 50 chunks), recentMessages (last 20)
- Timing: sessionStartTime, lastOutputTime, idleSeconds

Hook Result (allowed actions):
- inject: max 2000 chars, sanitized
- suppress: for onMessageReceived only
- stop: prevent other handlers
- sendMessage: one per invocation, max 5000 chars
- log: audit log entry

Prohibited:
- File system access
- Shell execution
- Network requests
- Env modification
- Full output access

Capability escalation via explicit config grants.
Added hook-014 task for sandboxing implementation.
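
The limits above can be enforced with a small validator, sketched here. The field names follow the Hook Result list; the `sendMessage` object shape, the constant names, and `validateHookResult` are hypothetical, and sanitization of injected text is not covered.

```typescript
type HookEventName =
  | "onSessionStart"
  | "onOutput"
  | "onIdle"
  | "onMessageReceived"
  | "onSessionEnd";

// Assumed shape; the thread only lists the action names and limits.
interface HookResult {
  inject?: string;
  suppress?: boolean;
  stop?: boolean;
  sendMessage?: { to: string; body: string };
  log?: string;
}

const MAX_INJECT_CHARS = 2000;
const MAX_MESSAGE_CHARS = 5000;

// Returns a list of violations; an empty list means the result is allowed.
function validateHookResult(event: HookEventName, result: HookResult): string[] {
  const errors: string[] = [];
  if (result.inject !== undefined && result.inject.length > MAX_INJECT_CHARS) {
    errors.push(`inject exceeds ${MAX_INJECT_CHARS} chars`);
  }
  if (result.suppress && event !== "onMessageReceived") {
    errors.push("suppress is only valid for onMessageReceived");
  }
  if (result.sendMessage && result.sendMessage.body.length > MAX_MESSAGE_CHARS) {
    errors.push(`sendMessage exceeds ${MAX_MESSAGE_CHARS} chars`);
  }
  return errors;
}
```

Typing `sendMessage` as a single object rather than an array encodes the one-message-per-invocation rule in the type itself.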
Examples cover:
1. Memory integration - load context, prompt to save
2. Error detection - alert coordinator on failures
3. Message filtering - suppress/highlight by priority
4. Custom pattern - @ticket: handler
5. Coordinator hooks - special behavior for lead agent
6. Minimal config - just session end prompt
7. Debug mode - log all events

Each example demonstrates:
- Using HookContext (read-only)
- Returning HookResult (inject, sendMessage, log, suppress)
- Staying within sandbox limits
onSessionStart (2 examples):
- Inject project context
- Role-based context by agent name

onOutput (3 examples):
- Error detection and alerting
- Progress tracking (test results)
- Security keyword alerting

onIdle (3 examples):
- Escalating idle prompts (30s gentle, 2min urgent)
- Auto-save reminder
- Silent coordinator notification

onMessageReceived (4 examples):
- Custom formatting with priority
- Suppress broadcasts while focused
- Filter by sender whitelist
- Transform task assignments

onSessionEnd (4 examples):
- Memory save prompt
- Notify team of departure with duration
- Request summary before exit
- Silent logging only
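
One of the listed examples, escalating idle prompts, might look roughly like this. The 30-second and 2-minute thresholds come from the list above; the context and result shapes are cut down to what the handler needs, and the prompt wording is invented for illustration.

```typescript
// Reduced, hypothetical versions of the hook context and result.
type IdleContext = { agentName: string; idleSeconds: number };
type IdleResult = { inject?: string };

const GENTLE_AFTER_SECONDS = 30;
const URGENT_AFTER_SECONDS = 120;

// Escalate from no prompt, to a gentle nudge, to an urgent one.
function onIdle(ctx: IdleContext): IdleResult {
  if (ctx.idleSeconds >= URGENT_AFTER_SECONDS) {
    return { inject: `${ctx.agentName}: idle for over 2 minutes, please report status.` };
  }
  if (ctx.idleSeconds >= GENTLE_AFTER_SECONDS) {
    return { inject: `${ctx.agentName}: still working? Reply if you need input.` };
  }
  return {};
}
```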
Add -d/--detach flag to start agents in background, allowing SSH users
to disconnect without losing agent sessions. Includes attach/kill commands
for session management.
…ts test

The test was checking if dataDir exists, but listProjects() requires
the .project marker file to be present.
…iqgant/agent-relay into claude/continue-pr-8-7tWzb
feat: Add detached mode for long-running agent sessions
khaliqgant pushed a commit that referenced this pull request Dec 30, 2025
Port the hooks API design document from PR #8 with additional
trajectory integration examples showing how hooks can work with
the PDERO paradigm and trail CLI.
khaliqgant pushed a commit that referenced this pull request Jan 7, 2026
This document supersedes the original federation proposal with a realistic
assessment of what's built today and a detailed roadmap for achieving the
N-servers-per-org vision.

Key sections:
- Current state analysis with file references
- Gap analysis comparing PR #8 proposal vs reality
- Target architecture with org-centric model
- 5-phase implementation roadmap (9 weeks total)
- Per-user team pricing model
- Technical specs for P2P protocol and agent registry

Related: #8
khaliqgant pushed a commit that referenced this pull request Jan 7, 2026
Added Appendix B with detailed solutions for distributed systems challenges
identified in PR #8's review:

Critical (🔴):
- End-to-end delivery confirmation via capture-pane verification
- Registry consistency using cloud as authoritative source
- Message deduplication with TTL-based seen set

High Priority (🟡):
- Backpressure with PEER_BUSY/PEER_READY and bounded queues
- Distributed tracing with correlation IDs

Medium Priority:
- NAT/firewall traversal with hybrid topology
- Clock skew handling via relative TTLs

Also preserved PR #8's detailed protocol specification (PEER_HELLO,
PEER_ROUTE, etc.) and hybrid topology recommendation.

The document now serves as the authoritative architecture reference,
superseding PR #8 while incorporating its valuable insights.
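
The TTL-based seen set for message deduplication listed above can be sketched as follows. `SeenSet` and `markIfNew` are hypothetical names, and measuring expiry against the local clock only is an assumption in the spirit of the relative-TTL point about clock skew.

```typescript
// Remembers message ids for a fixed window so that a message relayed
// along two paths is only injected once.
class SeenSet {
  private readonly ttlMs: number;
  private expiry = new Map<string, number>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an id is seen within the TTL window,
  // false for a duplicate. Expired entries are evicted on each call.
  markIfNew(id: string, nowMs: number): boolean {
    this.expiry.forEach((expiresAt, key) => {
      if (expiresAt <= nowMs) this.expiry.delete(key);
    });
    if (this.expiry.has(id)) return false;
    this.expiry.set(id, nowMs + this.ttlMs);
    return true;
  }
}
```

Evicting on every call is linear in the number of tracked ids; a production version would more likely sweep periodically or keep entries ordered by expiry.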
@khaliqgant
Collaborator Author

Closing in favor of #91

@khaliqgant khaliqgant closed this Jan 7, 2026

3 participants