Distributed mesh LLM: ensemble-of-experts inference engine #54

@jeremymanning

Description

This is the largest remaining feature — the distributed mesh LLM described in the whitepaper and issue #27. The mesh LLM is an inter-model Mixture-of-Experts system where GPU donor nodes each run a small language model, a distributed router selects K-of-N experts per token, and the system self-prompts to improve the cluster.

This issue supersedes #27 and provides the detailed implementation breakdown.

Architecture (from whitepaper)

  • Each GPU donor runs a complete small model (LLaMA-3-8B at 4-bit quantization, ~4-6GB VRAM)
  • Distributed router selects K-of-N expert nodes per output token
  • Each expert returns top-256 (token_id, logit) pairs (~1.5KB), a 99%+ bandwidth reduction compared with sending logits for the full 128,256-token vocabulary
  • Router aggregates sparse logit distributions to produce next token
  • At K=4, 100ms latency: ~3.2 tokens/second (adequate for autonomous agents, not interactive chat)
  • LLaMA-3 tokenizer standardized (128,256 tokens)
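
A minimal sketch of the aggregation step described above, assuming a plain weighted average over the union of reported tokens followed by greedy selection. The names `ExpertReply`, `aggregate`, and `argmax` are hypothetical, and scoring an unreported token at an expert's floor (its smallest reported logit) is an assumption of this sketch, not something the whitepaper specifies.

```rust
use std::collections::HashMap;

/// One expert's reply for a single decoding step (hypothetical type).
pub struct ExpertReply {
    /// Router-assigned weight for this expert; must be positive to count.
    pub weight: f32,
    /// Sparse (token_id, logit) pairs, e.g. the expert's top-256.
    pub top_logits: Vec<(u32, f32)>,
}

/// Weighted average of sparse logits over the union of reported tokens.
/// Assumption: a token an expert did not report is scored at that expert's
/// floor (its smallest reported logit).
pub fn aggregate(replies: &[ExpertReply]) -> HashMap<u32, f32> {
    // Index each usable reply as (weight, token -> logit, floor).
    let indexed: Vec<(f32, HashMap<u32, f32>, f32)> = replies
        .iter()
        .filter(|r| r.weight > 0.0 && !r.top_logits.is_empty())
        .map(|r| {
            let map: HashMap<u32, f32> = r.top_logits.iter().copied().collect();
            let floor = map.values().copied().fold(f32::INFINITY, f32::min);
            (r.weight, map, floor)
        })
        .collect();

    let total: f32 = indexed.iter().map(|(w, _, _)| *w).sum();
    let mut combined = HashMap::new();
    for (_, map, _) in &indexed {
        for &token in map.keys() {
            // Compute each token's average once, the first time it is seen.
            combined.entry(token).or_insert_with(|| {
                let sum: f32 = indexed
                    .iter()
                    .map(|(w, m, floor)| w * m.get(&token).copied().unwrap_or(*floor))
                    .sum();
                sum / total
            });
        }
    }
    combined
}

/// Greedy decoding: the token with the highest aggregated logit, if any.
pub fn argmax(logits: &HashMap<u32, f32>) -> Option<u32> {
    logits
        .iter()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(&id, _)| id)
}
```

Sampling (temperature, top-p) would replace `argmax` in a real aggregator. The floor rule is one of several plausible ways to handle tokens that fall outside an expert's top-256 and should be validated against output quality.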

Three uses (from #27)

  1. Resource: continually growing/improving language model, free to everyone
  2. Self-improvement: guides development and improvements of the network itself
  3. Security: carries out regular security audits and spot checks

Fractal scaling (from #27)

  • Max intelligence: ALL nodes → one "super" LLM
  • Intermediate: resources allocated by problem complexity
  • Minimum: single modest-hardware node as simple model

Components (from spec Phase 9, T111-T119)

  1. Router (src/agent/mesh_llm/router.rs): K-of-N expert selection per token, LLaMA-3 tokenizer
  2. Expert node (src/agent/mesh_llm/expert.rs): registration, health tracking, capacity reporting
  3. Aggregator (src/agent/mesh_llm/aggregator.rs): sparse logit aggregation, weighted average, sampling
  4. Self-prompting loop (src/agent/mesh_llm/self_prompt.rs): autonomous agent generating improvement tasks
  5. Agent subsetting (src/agent/mesh_llm/subset.rs): independent parallel agent subsets for concurrent tasks
  6. Safety system (src/agent/mesh_llm/safety.rs): action tier classification, governance kill switch
  7. gRPC service (proto/mesh_llm.proto): RegisterExpert, GetRouterStatus, SubmitSelfTask, HaltMesh
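
To make the router and expert-node responsibilities concrete, here is a hedged sketch of K-of-N selection over a registry with health and capacity data. `ExpertInfo`, its fields, and the least-loaded policy are assumptions for illustration. The issue does not say how the router scores experts, and a learned or per-token gating score could replace `load` here.

```rust
/// Registry entry for one expert node (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ExpertInfo {
    pub node_id: String,
    /// Result of the most recent health check.
    pub healthy: bool,
    /// Reported fraction of capacity in use, 0.0 to 1.0.
    pub load: f32,
}

/// Pick up to `k` healthy experts, least loaded first.
/// Ties are broken by node id so the selection is deterministic.
pub fn select_experts(registry: &[ExpertInfo], k: usize) -> Vec<&ExpertInfo> {
    let mut candidates: Vec<&ExpertInfo> =
        registry.iter().filter(|e| e.healthy).collect();
    candidates.sort_by(|a, b| {
        a.load
            .total_cmp(&b.load)
            .then_with(|| a.node_id.cmp(&b.node_id))
    });
    candidates.truncate(k);
    candidates
}
```

If fewer than `k` healthy experts are registered, the function returns all of them, and the caller decides whether that is enough to proceed.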

Action tiers (from whitepaper)

| Tier | Examples | Approval |
|---|---|---|
| Read-only | Analyze metrics, generate reports | None |
| Suggest | Draft config changes, governance motions | Human review |
| Sandbox-test | A/B experiment on 1% of traffic | Automated validation |
| Deploy-minor | Update non-critical config | 2-of-3 governance quorum |
| Deploy-major | Change scheduler algorithm | Full governance vote + 24h review |
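
The tier table maps directly onto an ordered enum. The sketch below is one possible encoding for `safety.rs`. The type and variant names are hypothetical, and only the tier-to-approval pairing comes from the table.

```rust
/// Action tiers, ordered from least to most privileged.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ActionTier {
    ReadOnly,
    Suggest,
    SandboxTest,
    DeployMinor,
    DeployMajor,
}

/// Approval that must be obtained before an action of a given tier runs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Approval {
    NotRequired,
    HumanReview,
    AutomatedValidation,
    GovernanceQuorum2Of3,
    FullGovernanceVoteWith24hReview,
}

/// Look up the approval requirement for a tier.
pub fn required_approval(tier: ActionTier) -> Approval {
    match tier {
        ActionTier::ReadOnly => Approval::NotRequired,
        ActionTier::Suggest => Approval::HumanReview,
        ActionTier::SandboxTest => Approval::AutomatedValidation,
        ActionTier::DeployMinor => Approval::GovernanceQuorum2Of3,
        ActionTier::DeployMajor => Approval::FullGovernanceVoteWith24hReview,
    }
}
```

Deriving `Ord` on the enum lets callers compare tiers (`ActionTier::Suggest < ActionTier::DeployMinor`), which is convenient for gating checks.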

Phased rollout

| Phase | Nodes | Capability |
|---|---|---|
| 0-1 | 0-500 | Centralized model; read-only + suggest only |
| 2 | ~280-1,000 | Distributed ensemble; sandbox-test after 30-day stability |
| 3 | ~1,000 | 3-7 parallel domain streams; deploy-minor |
| 4 | ~5,000+ | 37+ parallel streams; deploy-major |
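
One way to read the rollout table is as a cap on the highest action tier each phase may use. The sketch below encodes that reading. It assumes the tier named in each row is the maximum permitted, that Phase 2 stays at suggest until the 30-day stability window has passed, and that unknown phases fail closed to read-only. All names are hypothetical. The phase is an explicit input because the node ranges in the table overlap.

```rust
/// Action tiers, ordered from least to most privileged.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ActionTier {
    ReadOnly,
    Suggest,
    SandboxTest,
    DeployMinor,
    DeployMajor,
}

/// Days of stable Phase 2 operation required before sandbox tests.
pub const SANDBOX_STABILITY_DAYS: u32 = 30;

/// Highest tier the mesh may use in a given rollout phase.
pub fn max_tier(phase: u8, stable_days: u32) -> ActionTier {
    match phase {
        0 | 1 => ActionTier::Suggest,
        2 if stable_days >= SANDBOX_STABILITY_DAYS => ActionTier::SandboxTest,
        2 => ActionTier::Suggest,
        3 => ActionTier::DeployMinor,
        4 => ActionTier::DeployMajor,
        // Assumption: phases not in the table fail closed.
        _ => ActionTier::ReadOnly,
    }
}

/// True if an action of `requested` tier is permitted in this phase.
pub fn is_allowed(requested: ActionTier, phase: u8, stable_days: u32) -> bool {
    requested <= max_tier(phase, stable_days)
}
```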

Requirements

  • Router model with K-of-N expert selection
  • Sparse logit aggregation (top-256 logits per expert)
  • Expert node registration and health monitoring
  • Self-prompting autonomous agent loop (1-24 hour cycle)
  • Action tier classification with safety enforcement
  • Governance kill switch (cannot be overridden by mesh itself)
  • gRPC service for mesh management
  • Support for heterogeneous GPU hardware (different model sizes/fine-tunes, same tokenizer)
  • Graceful degradation below 280 nodes (fall back to centralized model)
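
Two of these requirements, the kill switch and the 280-node fallback, can be sketched together as a mode decision made before each inference request. This is an illustrative design only. `KillSwitch`, `InferenceMode`, and `inference_mode` are hypothetical names. The switch is modeled as a one-way latch with no reset method, so code that only holds a shared reference can trip it but never clear it. Restricting who may call `halt` is a governance concern outside this sketch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimum healthy nodes for the distributed ensemble (from the requirements).
pub const MIN_ENSEMBLE_NODES: usize = 280;

/// One-way latch: once halted, it stays halted for the life of the process.
pub struct KillSwitch {
    halted: AtomicBool,
}

impl KillSwitch {
    pub const fn new() -> Self {
        Self { halted: AtomicBool::new(false) }
    }

    /// Trip the switch. There is deliberately no method to clear it.
    pub fn halt(&self) {
        self.halted.store(true, Ordering::SeqCst);
    }

    pub fn is_halted(&self) -> bool {
        self.halted.load(Ordering::SeqCst)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferenceMode {
    /// Kill switch tripped: serve nothing.
    Halted,
    /// Too few healthy nodes: fall back to the centralized model.
    Centralized,
    /// Enough healthy nodes: use the K-of-N ensemble.
    DistributedEnsemble,
}

/// Decide how the next request should be served. The kill switch wins.
pub fn inference_mode(switch: &KillSwitch, healthy_nodes: usize) -> InferenceMode {
    if switch.is_halted() {
        InferenceMode::Halted
    } else if healthy_nodes < MIN_ENSEMBLE_NODES {
        InferenceMode::Centralized
    } else {
        InferenceMode::DistributedEnsemble
    }
}
```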

Success Criteria

  • Router selects K-of-N experts and dispatches in parallel
  • Sparse logit aggregation produces coherent text
  • Expert registration and health tracking functional
  • Self-prompting loop generates actionable improvement tasks
  • Action tier classification correctly gates operations
  • Governance kill switch immediately halts all inference
  • gRPC service exposes all management operations
  • 3.2+ tokens/second at K=4, 100ms inter-node latency
  • Integration test: multi-node token generation via sparse aggregation

Testing (Principle V)

  • Deploy 4+ GPU nodes with LLaMA-3-8B (4-bit) → verify token generation
  • Measure tokens/second at various K values and latencies
  • Test kill switch → verify immediate halt
  • Test self-prompting loop → verify actionable output
  • Test action tier escalation → verify governance gating
  • Test with heterogeneous models (different sizes, same tokenizer)
  • Test graceful degradation with fewer than 280 nodes
  • Bandwidth measurement: verify <2KB per expert per token
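
The bandwidth and throughput targets can be sanity-checked with simple arithmetic before measuring real traffic. The sketch below assumes a wire encoding of a 32-bit token id plus a 16-bit logit per pair, and a 16-bit dense baseline. Neither encoding is specified in this issue, so the constants are illustrative.

```rust
/// LLaMA-3 vocabulary size (from the architecture section).
pub const VOCAB_SIZE: usize = 128_256;
/// Number of (token_id, logit) pairs each expert returns.
pub const TOP_K: usize = 256;
/// Assumed encoding: 4-byte token id + 2-byte logit.
pub const BYTES_PER_PAIR: usize = 4 + 2;
/// Assumed dense baseline: one 2-byte logit per vocabulary entry.
pub const BYTES_PER_DENSE_LOGIT: usize = 2;

/// Payload size of one expert's sparse reply, excluding framing overhead.
pub fn sparse_payload_bytes() -> usize {
    TOP_K * BYTES_PER_PAIR
}

/// Payload size if an expert sent logits for the whole vocabulary.
pub fn dense_payload_bytes() -> usize {
    VOCAB_SIZE * BYTES_PER_DENSE_LOGIT
}

/// Fraction of bandwidth saved by the sparse reply.
pub fn bandwidth_reduction() -> f64 {
    1.0 - sparse_payload_bytes() as f64 / dense_payload_bytes() as f64
}

/// Wall-clock budget per token, in milliseconds, for a target throughput.
pub fn per_token_budget_ms(tokens_per_second: f64) -> f64 {
    1000.0 / tokens_per_second
}
```

Under these assumptions the sparse payload is 1,536 bytes against 256,512 bytes dense, a reduction of about 99.4%, which is consistent with the ~1.5KB and 99%+ figures above and leaves headroom under the 2KB test threshold for framing. A 3.2 tokens/second target corresponds to roughly 312.5 ms per token end to end.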

Notes

This is a major undertaking that should be broken into sub-tasks during planning. The phased rollout means Phase 0-1 (centralized model, read-only) can ship first, with distributed ensemble features enabled at each phase transition via governance vote.

References:

  • Whitepaper: §Mesh LLM: Distributed Self-Improvement
  • Issue #27 (Explore and implement distributed LLM): parallel_mesh_of_diffusers_whitepaper.pdf
  • research/09-mesh-llm.md
  • research/10-prior-art-distributed-inference.md
