
eval: 5000-turn long horizon results — pre-built DB regression + federated 100-agent OOM #2871

@rysweet

Description


5000-Turn Long Horizon Eval — Distributed Hive Mind

Architecture

All agents share a single DistributedHiveGraph backed by a DHT (consistent hash ring). Facts are sharded across agents with replication factor R=3. Queries route to shard owners, not all agents.
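A minimal sketch of the ring as described (64 virtual nodes per agent, facts hashed to ring positions, replication factor R=3). `HashRing`, `owners`, and the SHA-1 choice are illustrative assumptions, not the actual `DistributedHiveGraph` API:

```python
# Hypothetical consistent-hash ring: 64 virtual nodes per agent,
# facts hashed onto the ring, R=3 distinct replica owners per fact.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, agents, vnodes=64, replicas=3):
        self.replicas = replicas
        self.ring = sorted((_h(f"{a}#{v}"), a) for a in agents for v in range(vnodes))
        self.keys = [pos for pos, _ in self.ring]

    def owners(self, fact_key: str):
        """Walk clockwise from the fact's ring position to R distinct agents."""
        i = bisect.bisect(self.keys, _h(fact_key))
        found = []
        while len(found) < self.replicas:
            agent = self.ring[i % len(self.ring)][1]
            if agent not in found:
                found.append(agent)
            i += 1
        return found

ring = HashRing([f"agent-{n}" for n in range(10)])
owners = ring.owners("sarah chen birthday")
print(owners)  # three distinct agents, deterministic for a given key
```

Because a fact's owners are a pure function of its hash, queries can route directly to the R shard owners instead of fanning out to all agents.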

```mermaid
graph TB
    subgraph "Single DHT Ring — All N Agents"
        direction LR
        R["Consistent Hash Ring<br/>64 virtual nodes per agent<br/>Facts hashed to ring positions"]
    end

    subgraph "Agent Shards (each holds ~F/N facts)"
        A0["Agent 0<br/>Shard: ~50 facts"]
        A1["Agent 1<br/>Shard: ~48 facts"]
        A2["Agent 2<br/>Shard: ~52 facts"]
        AN["Agent N<br/>Shard: ~50 facts"]
    end

    R --> A0
    R --> A1
    R --> A2
    R --> AN
    A0 <-.->|"bloom filter gossip"| A1
    A1 <-.->|gossip| A2
    A2 <-.->|gossip| AN
```
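The "bloom filter gossip" edges above can be sketched as follows. This is an assumed shape, not the shipped implementation: each agent advertises a compact bit array summarizing its shard's fact keys, so peers can test "might this agent hold X?" without shipping the shard.

```python
# Illustrative bloom filter for gossiping shard contents (sizes are guesses).
import hashlib

class Bloom:
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        # False means definitely absent; True means present barring a false positive
        return all(self.bits >> i & 1 for i in self._idx(key))

shard = Bloom()
shard.add("sarah chen birthday")
print(shard.might_contain("sarah chen birthday"))  # True
```

Gossip rounds then exchange these small bit arrays between neighbors instead of full fact lists.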

Query Flow

```mermaid
sequenceDiagram
    participant Q as Question
    participant DHT as DHT Router
    participant S1 as Shard Owner 1
    participant S2 as Shard Owner 2
    participant C as Consensus

    Q->>DHT: "What is Sarah Chen's birthday?"
    DHT->>DHT: hash key terms → find shard owners
    DHT->>S1: search shard (facts found)
    DHT->>S2: search shard (facts found)
    S1-->>C: "March 15"
    S2-->>C: "March 15"
    C->>Q: Consensus answer
```
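The flow above, sketched with stubbed shard lookups. `route`, `SHARDS`, and the simple majority vote are stand-ins for the DHT router and the real consensus step:

```python
# Toy query flow: route to the R=3 shard owners, gather answers, vote.
from collections import Counter

def route(question_key):
    # stand-in for DHT routing: the three shard owners for these key terms
    return ["agent-0", "agent-1", "agent-2"]

SHARDS = {  # toy shards; only two of the three owners hold the fact
    "agent-0": {"sarah chen birthday": "March 15"},
    "agent-1": {"sarah chen birthday": "March 15"},
    "agent-2": {},
}

def query(question_key):
    answers = [SHARDS[o].get(question_key) for o in route(question_key)]
    found = [a for a in answers if a is not None]
    if not found:
        return None
    return Counter(found).most_common(1)[0][0]  # consensus answer

print(query("sarah chen birthday"))  # March 15
```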

Learning Flow

```mermaid
sequenceDiagram
    participant T as Turn Pool (5000)
    participant W as ThreadPool (10 workers)
    participant A as Agent (learns 50 turns)
    participant LLM as LLM
    participant DHT as DHT Ring

    T->>W: Round-robin distribute turns
    W->>A: learn batch (parallel)
    A->>LLM: extract facts
    LLM-->>A: structured facts
    A->>A: Store in local Kuzu DB (256MB)
    A->>DHT: promote_fact → replicate to R=3 agents
    Note over W: 10 agents learn simultaneously (9x speedup)
```
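The distribution step can be sketched like this (assumed shape; the real code is in amplihack-agent-eval PR #17). Turns are dealt round-robin to 100 agents, 50 each, and learned by a 10-worker thread pool; `learn_batch` is a placeholder for the extract/store/promote pipeline above:

```python
# Round-robin turn distribution plus a 10-worker thread pool.
from concurrent.futures import ThreadPoolExecutor

def learn_batch(agent_id, turns):
    # placeholder for: extract facts via LLM, store in the agent's local
    # Kuzu DB, promote_fact to the DHT ring (replication R=3)
    return agent_id, len(turns)

turns = list(range(5000))
n_agents = 100
batches = {a: turns[a::n_agents] for a in range(n_agents)}  # 50 turns each

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda item: learn_batch(*item), batches.items()))

print(len(results), sum(n for _, n in results))  # 100 5000
```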

Eval Results

Single Agent Baseline: 94.1%

| Level | Score | Questions |
| --- | --- | --- |
| L1 direct recall | 97.9% | 31 |
| L2 multi-source synthesis | 100% | 5 |
| L3 temporal reasoning | 83.3% | 3 |
| L4 procedural | 100% | 2 |
| L5 contradiction | 87.5% | 2 |
| L6 incremental update | 100% | 2 |
| L7 teaching | 83.3% | 2 |
| L8 confidence | 89.2% | 2 |
| L9 causal | 100% | 2 |
| L10 counterfactual | 70.8% | 2 |
| L11 novel skill | 100% | 1 |
| L12 far transfer | 62.5% | 2 |

Runtime: 21.7h (21.6h learning, 84s Q&A+grading). Model: claude-sonnet-4-5-20250929.

Federated Hive Progression

| Version | Median | Stddev | Agents | Notes |
| --- | --- | --- | --- | --- |
| v1 (naive, longest-wins) | 40.0% | — | 100 | No routing, query all agents |
| v3 (consensus+routing, broken) | 34.9% | 31.2% | 100 | Empty root hive, random fallback |
| v3 Opus | 3.6% | 15.5% | 100 | + rate limit errors swallowed |
| Single DHT smoke test | 58.8% | 4.3% | 10 | Correct routing, stable |
| Single DHT full (pending) | TBD | TBD | 100 | Running now (~3h remaining) |

Key Bugs Found and Fixed

P0: Empty Root Hive (fixed in PR #18)

Facts were stored in per-group hives during learning, but queries were routed through an empty root hive, so every query fell back to a random agent (hence the 31.2% stddev). Fix: a single DHT ring containing all agents.

P0: Kuzu mmap OOM (fixed in PR #11, #2876)

kuzu.Database() defaults to a buffer pool of 80% of system RAM plus an 8 TB mmap region per database, so 100 agents crashed the host. Fix: buffer pool bounded to 256 MB per agent.
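The fix is essentially configuration. Approximate shape of the PR #11 change (`buffer_pool_size` is the kuzu Python parameter, in bytes; the library change is exposing it through CognitiveMemory, and the path here is illustrative):

```python
# Cap each agent's Kuzu buffer pool instead of the 80%-of-RAM default.
import kuzu  # assumes the kuzu package is installed

BUFFER_POOL_BYTES = 256 * 1024 * 1024  # 256 MB per agent

db = kuzu.Database("agent_0.kuzu", buffer_pool_size=BUFFER_POOL_BYTES)
conn = kuzu.Connection(db)
```

With 100 agents this bounds the aggregate buffer pool at ~25 GB worst case rather than 100 copies of 80% of RAM.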

P1: Sequential Learning (fixed in PR #17)

All 5000 turns were learned one at a time even though 100 agents were available. Fix: ThreadPoolExecutor with parallel batches → 9x speedup (21.6h → 2.4h).

P1: Swallowed Errors

_synthesize_with_llm() catches all exceptions silently, masking rate limits as "internal error". Opus scored 3.6% because most answers failed with masked 429 errors.
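A sketch of the obvious repair: let rate limits surface (or retry with backoff) instead of mapping every exception to "internal error". `RateLimitError` and `call_llm` are hypothetical stand-ins for the provider's 429 exception and the LLM call:

```python
# Retry rate limits with backoff; surface everything else instead of swallowing.
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the LLM provider."""

def synthesize_with_llm(call_llm, prompt, retries=3):
    for attempt in range(retries):
        try:
            return call_llm(prompt)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # surface the 429 instead of masking it
            time.sleep(0.01 * 2 ** attempt)  # backoff (tiny base for the demo)
    # deliberately no bare `except Exception: return "internal error"`

calls = {"n": 0}
def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = synthesize_with_llm(flaky, "What is Sarah Chen's birthday?")
print(result, calls["n"])  # ok 3
```

Had errors surfaced, the Opus run's 3.6% would have shown up as a wall of 429s rather than a silent score collapse.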

P2: Longest-Answer-Wins (fixed in PR #17)

The baseline queried all 100 agents and picked the longest response. Fix: expertise routing via DHT + no-info filtering + Jaccard consensus.
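The no-info filtering and Jaccard consensus might look like this (assumed shape of the PR #17 fix; the `NO_INFO` phrases and word-level tokenization are illustrative):

```python
# Filter "no info" answers, then pick the answer most similar (mean Jaccard
# over word sets) to the other surviving answers.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

NO_INFO = ("i don't know", "no information", "not found")

def consensus(answers):
    useful = [a for a in answers if a and not any(p in a.lower() for p in NO_INFO)]
    if not useful:
        return None
    if len(useful) == 1:
        return useful[0]
    return max(useful, key=lambda a: sum(jaccard(a, b) for b in useful if b is not a))

best = consensus([
    "Sarah Chen's birthday is March 15",
    "March 15",
    "I don't know, sorry, but here is a very long unrelated answer...",
])
print(best)  # Sarah Chen's birthday is March 15
```

Unlike longest-answer-wins, a verbose non-answer scores zero overlap with the agreeing answers and loses.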


PRs

| Repo | PR | Status | Change |
| --- | --- | --- | --- |
| amplihack-memory-lib | #11 | Merged | buffer_pool_size param on CognitiveMemory |
| amplihack-agent-eval | #17 | Merged | DHT, parallel learning, consensus, median-of-3 |
| amplihack-agent-eval | #18 | Open | Single DHT ring fix (routing bug) |
| amplihack | #2876 | Open | DistributedHiveGraph, DHT, bloom, docs |

Release Assets

| Tag | Repo | Contents |
| --- | --- | --- |
| dataset-5000t-seed42-v1.0 | eval | Pre-built single-agent 5000t Kuzu DB |
| federated-100agent-5000t-v1.0 | eval | 100 federated agent Kuzu DBs |

Success Criteria (from #2866)

  • Single agent >75% — 94.1%
  • 100-agent hive ≥ single agent — smoke test 58.8%, full eval pending
  • No OOM with 100 DBs — fixed (12.3s, 4.8GB)
  • Parallel learning speedup — 9x (21.6h → 2.4h)
  • Gossip convergence >90% — pending
  • Variance < 10% stddev — 4.3% (was 31.2%)

Next Steps Issues

  • #2890 — Task delegation between hive agents
  • #2891 — Agent discovery and health monitoring
  • #2892 — Persistent storage (NFS-backed Kuzu)
  • #2893 — Per-fact embedding routing for 100-agent scale

Labels: bug, enhancement