
eval: 5000-turn long horizon results — pre-built DB regression + federated 100-agent OOM #2871

@rysweet

Description


5000-Turn Long Horizon Eval — Distributed Hive Mind

Architecture

All agents share a single DistributedHiveGraph backed by a DHT (consistent hash ring). Facts are sharded across agents with replication factor R=3. Queries route to shard owners, not all agents.
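A minimal sketch of the ring as described (64 virtual nodes per agent, facts hashed to ring positions, replication factor R=3). `HashRing`, `owners`, and the SHA-1 choice are illustrative assumptions, not the actual `DistributedHiveGraph` API:

```python
# Hypothetical consistent-hash ring: 64 virtual nodes per agent,
# facts hashed onto the ring, R=3 distinct replica owners per fact.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, agents, vnodes=64, replicas=3):
        self.replicas = replicas
        self.ring = sorted((_h(f"{a}#{v}"), a) for a in agents for v in range(vnodes))
        self.keys = [pos for pos, _ in self.ring]

    def owners(self, fact_key: str):
        """Walk clockwise from the fact's ring position to R distinct agents."""
        i = bisect.bisect(self.keys, _h(fact_key))
        found = []
        while len(found) < self.replicas:
            agent = self.ring[i % len(self.ring)][1]
            if agent not in found:
                found.append(agent)
            i += 1
        return found

ring = HashRing([f"agent-{n}" for n in range(10)])
owners = ring.owners("sarah chen birthday")
print(owners)  # three distinct agents, deterministic for a given key
```

Because a fact's owners are a pure function of its hash, queries can route directly to the R shard owners instead of fanning out to all agents.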

```mermaid
graph TB
    subgraph "Single DHT Ring — All N Agents"
        direction LR
        R["Consistent Hash Ring<br/>64 virtual nodes per agent<br/>Facts hashed to ring positions"]
    end

    subgraph "Agent Shards (each holds ~F/N facts)"
        A0["Agent 0<br/>Shard: ~50 facts"]
        A1["Agent 1<br/>Shard: ~48 facts"]
        A2["Agent 2<br/>Shard: ~52 facts"]
        AN["Agent N<br/>Shard: ~50 facts"]
    end

    R --> A0
    R --> A1
    R --> A2
    R --> AN
    A0 <-.->|"bloom filter gossip"| A1
    A1 <-.->|gossip| A2
    A2 <-.->|gossip| AN
```
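The "bloom filter gossip" edges above can be sketched as follows. This is an assumed shape, not the shipped implementation: each agent advertises a compact bit array summarizing its shard's fact keys, so peers can test "might this agent hold X?" without shipping the shard.

```python
# Illustrative bloom filter for gossiping shard contents (sizes are guesses).
import hashlib

class Bloom:
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        # False means definitely absent; True means present barring a false positive
        return all(self.bits >> i & 1 for i in self._idx(key))

shard = Bloom()
shard.add("sarah chen birthday")
print(shard.might_contain("sarah chen birthday"))  # True
```

Gossip rounds then exchange these small bit arrays between neighbors instead of full fact lists.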

Query Flow

```mermaid
sequenceDiagram
    participant Q as Question
    participant DHT as DHT Router
    participant S1 as Shard Owner 1
    participant S2 as Shard Owner 2
    participant C as Consensus

    Q->>DHT: "What is Sarah Chen's birthday?"
    DHT->>DHT: hash key terms → find shard owners
    DHT->>S1: search shard (facts found)
    DHT->>S2: search shard (facts found)
    S1-->>C: "March 15"
    S2-->>C: "March 15"
    C->>Q: Consensus answer
```
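The flow above, sketched with stubbed shard lookups. `route`, `SHARDS`, and the simple majority vote are stand-ins for the DHT router and the real consensus step:

```python
# Toy query flow: route to the R=3 shard owners, gather answers, vote.
from collections import Counter

def route(question_key):
    # stand-in for DHT routing: the three shard owners for these key terms
    return ["agent-0", "agent-1", "agent-2"]

SHARDS = {  # toy shards; only two of the three owners hold the fact
    "agent-0": {"sarah chen birthday": "March 15"},
    "agent-1": {"sarah chen birthday": "March 15"},
    "agent-2": {},
}

def query(question_key):
    answers = [SHARDS[o].get(question_key) for o in route(question_key)]
    found = [a for a in answers if a is not None]
    if not found:
        return None
    return Counter(found).most_common(1)[0][0]  # consensus answer

print(query("sarah chen birthday"))  # March 15
```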

Learning Flow

```mermaid
sequenceDiagram
    participant T as Turn Pool (5000)
    participant W as ThreadPool (10 workers)
    participant A as Agent (learns 50 turns)
    participant LLM as LLM
    participant DHT as DHT Ring

    T->>W: Round-robin distribute turns
    W->>A: learn batch (parallel)
    A->>LLM: extract facts
    LLM-->>A: structured facts
    A->>A: Store in local Kuzu DB (256MB)
    A->>DHT: promote_fact → replicate to R=3 agents
    Note over W: 10 agents learn simultaneously (9x speedup)
```
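The distribution step can be sketched like this (assumed shape; the real code is in amplihack-agent-eval PR #17). Turns are dealt round-robin to 100 agents, 50 each, and learned by a 10-worker thread pool; `learn_batch` is a placeholder for the extract/store/promote pipeline above:

```python
# Round-robin turn distribution plus a 10-worker thread pool.
from concurrent.futures import ThreadPoolExecutor

def learn_batch(agent_id, turns):
    # placeholder for: extract facts via LLM, store in the agent's local
    # Kuzu DB, promote_fact to the DHT ring (replication R=3)
    return agent_id, len(turns)

turns = list(range(5000))
n_agents = 100
batches = {a: turns[a::n_agents] for a in range(n_agents)}  # 50 turns each

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda item: learn_batch(*item), batches.items()))

print(len(results), sum(n for _, n in results))  # 100 5000
```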

Eval Results

Single Agent Baseline: 94.1%

| Level | Score | Questions |
| --- | --- | --- |
| L1 direct recall | 97.9% | 31 |
| L2 multi-source synthesis | 100% | 5 |
| L3 temporal reasoning | 83.3% | 3 |
| L4 procedural | 100% | 2 |
| L5 contradiction | 87.5% | 2 |
| L6 incremental update | 100% | 2 |
| L7 teaching | 83.3% | 2 |
| L8 confidence | 89.2% | 2 |
| L9 causal | 100% | 2 |
| L10 counterfactual | 70.8% | 2 |
| L11 novel skill | 100% | 1 |
| L12 far transfer | 62.5% | 2 |

Runtime: 21.7h (21.6h learning, 84s Q&A+grading). Model: claude-sonnet-4-5-20250929.

Federated Hive Progression

| Version | Median | Stddev | Agents | Notes |
| --- | --- | --- | --- | --- |
| v1 (naive, longest-wins) | 40.0% | — | 100 | No routing, query all agents |
| v3 (consensus+routing, broken) | 34.9% | 31.2% | 100 | Empty root hive, random fallback |
| v3 Opus | 3.6% | 15.5% | 100 | + rate limit errors swallowed |
| Single DHT smoke test | 58.8% | 4.3% | 10 | Correct routing, stable |
| Single DHT full (pending) | TBD | TBD | 100 | Running now (~3h remaining) |

Key Bugs Found and Fixed

P0: Empty Root Hive (fixed in PR #18)

Facts were stored in per-group hives during learning, but queries were routed through an empty root hive, so every query fell back to a random agent (hence the 31.2% stddev). Fix: a single DHT ring containing all agents.

P0: Kuzu mmap OOM (fixed in PR #11, #2876)

kuzu.Database() defaults to a buffer pool of 80% of system RAM plus an 8 TB mmap region per database, so 100 agents crashed the host. Fix: buffer pool bounded to 256 MB per agent.
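The fix is essentially configuration. Approximate shape of the PR #11 change (`buffer_pool_size` is the kuzu Python parameter, in bytes; the library change is exposing it through CognitiveMemory, and the path here is illustrative):

```python
# Cap each agent's Kuzu buffer pool instead of the 80%-of-RAM default.
import kuzu  # assumes the kuzu package is installed

BUFFER_POOL_BYTES = 256 * 1024 * 1024  # 256 MB per agent

db = kuzu.Database("agent_0.kuzu", buffer_pool_size=BUFFER_POOL_BYTES)
conn = kuzu.Connection(db)
```

With 100 agents this bounds the aggregate buffer pool at ~25 GB worst case rather than 100 copies of 80% of RAM.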

P1: Sequential Learning (fixed in PR #17)

All 5000 turns were learned one at a time even though 100 agents were available. Fix: ThreadPoolExecutor with parallel batches → 9x speedup (21.6h → 2.4h).

P1: Swallowed Errors

_synthesize_with_llm() catches all exceptions silently, masking rate limits as "internal error". Opus scored 3.6% because most answers failed with masked 429 errors.
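A sketch of the obvious repair: let rate limits surface (or retry with backoff) instead of mapping every exception to "internal error". `RateLimitError` and `call_llm` are hypothetical stand-ins for the provider's 429 exception and the LLM call:

```python
# Retry rate limits with backoff; surface everything else instead of swallowing.
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the LLM provider."""

def synthesize_with_llm(call_llm, prompt, retries=3):
    for attempt in range(retries):
        try:
            return call_llm(prompt)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # surface the 429 instead of masking it
            time.sleep(0.01 * 2 ** attempt)  # backoff (tiny base for the demo)
    # deliberately no bare `except Exception: return "internal error"`

calls = {"n": 0}
def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = synthesize_with_llm(flaky, "What is Sarah Chen's birthday?")
print(result, calls["n"])  # ok 3
```

Had errors surfaced, the Opus run's 3.6% would have shown up as a wall of 429s rather than a silent score collapse.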

P2: Longest-Answer-Wins (fixed in PR #17)

The baseline queried all 100 agents and picked the longest response. Fix: expertise routing via DHT + no-info filtering + Jaccard consensus.
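The no-info filtering and Jaccard consensus might look like this (assumed shape of the PR #17 fix; the `NO_INFO` phrases and word-level tokenization are illustrative):

```python
# Filter "no info" answers, then pick the answer most similar (mean Jaccard
# over word sets) to the other surviving answers.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

NO_INFO = ("i don't know", "no information", "not found")

def consensus(answers):
    useful = [a for a in answers if a and not any(p in a.lower() for p in NO_INFO)]
    if not useful:
        return None
    if len(useful) == 1:
        return useful[0]
    return max(useful, key=lambda a: sum(jaccard(a, b) for b in useful if b is not a))

best = consensus([
    "Sarah Chen's birthday is March 15",
    "March 15",
    "I don't know, sorry, but here is a very long unrelated answer...",
])
print(best)  # Sarah Chen's birthday is March 15
```

Unlike longest-answer-wins, a verbose non-answer scores zero overlap with the agreeing answers and loses.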


PRs

| Repo | PR | Status | Change |
| --- | --- | --- | --- |
| amplihack-memory-lib | #11 | Merged | buffer_pool_size param on CognitiveMemory |
| amplihack-agent-eval | #17 | Merged | DHT, parallel learning, consensus, median-of-3 |
| amplihack-agent-eval | #18 | Open | Single DHT ring fix (routing bug) |
| amplihack | #2876 | Open | DistributedHiveGraph, DHT, bloom, docs |

Release Assets

| Tag | Repo | Contents |
| --- | --- | --- |
| dataset-5000t-seed42-v1.0 | eval | Pre-built single-agent 5000t Kuzu DB |
| federated-100agent-5000t-v1.0 | eval | 100 federated agent Kuzu DBs |

Success Criteria (from #2866)

  • Single agent >75% — 94.1%
  • 100-agent hive ≥ single agent — smoke test 58.8%, full eval pending
  • No OOM with 100 DBs — fixed (12.3s, 4.8GB)
  • Parallel learning speedup — 9x (21.6h → 2.4h)
  • Gossip convergence >90% — pending
  • Variance < 10% stddev — 4.3% (was 31.2%)

Next Steps Issues

  • #2890 — Task delegation between hive agents
  • #2891 — Agent discovery and health monitoring
  • #2892 — Persistent storage (NFS-backed Kuzu)
  • #2893 — Per-fact embedding routing for 100-agent scale

Labels: bug, enhancement