🔬 The Core Discovery: Competition alone produces 94% of specialization; diversity genuinely emerges.
💡 The Practical Implication: Prompt-based specialization achieves the same 100% ceiling as fine-tuning, but at $0 cost, with instant deployment and full reversibility.
"The Darwin's Finches moment for LLM populations."
Overview • Key Results • Theory • Rules • Quick Start • Experiments • Deep Dive • Citation
This is Paper 2 in the Emergent Specialization research series:
| Paper | Focus | Domain | Repository |
|---|---|---|---|
| Paper 1 | Learner Populations | Time Series (Rule-based) | NichePopulation |
| Paper 2 | Preference Specialization | Synthetic Rules (LLM) | This repo |
| Paper 3 | Tool Specialization | Real Tools (LLM) | Emergent-Tool-Specialization |
Title: Emergent Preference Specialization in LLM Agent Populations Through Competitive Selection
Author: Yuhao Li
Institution: University of Pennsylvania
Email: li88@sas.upenn.edu
This repository contains the complete implementation, experiments, and theoretical analysis for research on emergent preference specialization using 8 synthetic rule domains.
Note: This paper uses synthetic rules (not real-world tools) to provide a controlled experimental environment where we can rigorously prove that specialization genuinely emerges from competition, rather than being engineered.
Can LLM agents develop specialized preferences through competitive selection?
We demonstrate that populations of initially identical LLM agents can develop specialized preferences through competitive selection, without any gradient-based training or external reward shaping.
Figure: From identical agents (left) through competitive selection (center) to specialized populations (right).
- First causal demonstration of prompt-based specialization: 70.7% causality rate (95% CI: [68.3%, 73.1%])
- Complete theoretical framework with 3 proven theorems and equilibrium analysis
- Complete specialization validated: Evolved specialists reach the theoretical ceiling (100%) on matched tasks
- Maximum value unlocked: Oracle routing yields a +64.2% ± 2.3% improvement (n=5, 95% CI: [61.3, 67.0]), with break-even after 5-7 tasks
- Cross-LLM validation: Mechanism works across Gemini, GPT-4, and Claude
Figure: All 8 rule specialists emerge and reach Level 3 within 50 generations. Shaded bands show variance across seeds.
| Metric | Value | Interpretation |
|---|---|---|
| Swap Test Pass Rate | 70.7% | Strong causality proven |
| 95% Confidence Interval | [68.3%, 73.1%] | Tight bounds (4.8% width) |
| Cohen's d | 2.66 | Large effect size |
| Seeds | 10 (unified gemini-2.5-flash) | All consistent |
Figure: Prompt swap test heatmap. The diagonal (matched) shows high accuracy (green); the off-diagonal (mismatched) shows low accuracy (purple).
| Condition | Accuracy | Improvement |
|---|---|---|
| NO_PROMPT | 5.0% | -- |
| RANDOM_PROMPT | 15.0% | +10% |
| WRONG_PROMPT | 20.0% | +15% |
| CORRECT_PROMPT | 100.0% | +95% |
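The control logic behind these four conditions is easy to picture in code. Below is a minimal, hypothetical sketch; names like `ask_llm`, `l3_prompts`, and the task structure are illustrative assumptions, not the repo's actual API:

```python
import random

def score_condition(condition, rule, tasks, l3_prompts, ask_llm, seed=0):
    """Accuracy under one control condition; l3_prompts maps rule -> evolved prompt."""
    rng = random.Random(seed)
    correct = 0
    for task in tasks:
        if condition == "NO_PROMPT":
            system = ""                       # bare model, no specialist prompt
        elif condition == "RANDOM_PROMPT":
            system = rng.choice(list(l3_prompts.values()))
        elif condition == "WRONG_PROMPT":
            system = rng.choice([p for r, p in l3_prompts.items() if r != rule])
        else:                                 # CORRECT_PROMPT: the matched L3 prompt
            system = l3_prompts[rule]
        answer = ask_llm(system=system, question=task["question"])
        correct += (answer == task["answer"])
    return correct / len(tasks)
```

Only the matched L3 prompt should recover ceiling accuracy; if a wrong prompt performed just as well, the specialization would live in the model rather than the prompt.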
Figure: The specialization mechanism works across all major LLM providers with consistent performance gaps.
| Model | Provider | Diagonal | Off-Diagonal | Gap |
|---|---|---|---|---|
| gemini-2.5-flash | Google | 0.91 | 0.20 | 70.7% ✅ |
| GPT-4o-mini | OpenAI | 0.90 | 0.37 | 58.6% ✅ |
| Claude 3 Haiku | Anthropic | 0.92 | 0.45 | 50.9% ✅ |
Figure: Specialists with oracle routing achieve 100% accuracy, a +64.2% improvement over generalists.
| Condition | Accuracy | Improvement |
|---|---|---|
| Single Generalist | 35.8% | -- |
| Oracle Routing | 100.0% | +64.2% |
| Confidence Routing | 41.7% | +5.9% |
| Ensemble | 42.5% | +6.7% |
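The two main routing conditions reduce to a few lines each. A minimal sketch, assuming each specialist exposes an `(answer, confidence)` callable interface (these names are illustrative, not `routing.py`'s actual API):

```python
def oracle_route(task, specialists):
    """Oracle routing: a perfect router sends each task to its matched specialist."""
    answer, _confidence = specialists[task["rule"]](task["question"])
    return answer

def confidence_route(task, specialists):
    """Confidence routing: every specialist answers; the most confident one wins."""
    candidates = [spec(task["question"]) for spec in specialists.values()]
    return max(candidates, key=lambda pair: pair[1])[0]
```

The gap between the two rows above (100% vs 41.7%) is the price of imperfect routing: specialists are only as valuable as the router that selects them.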
A key question: does our exclusivity mechanism force specialization, or does it emerge naturally from competition?
| Condition | SCI | Coverage | Super-Agents | Interpretation |
|---|---|---|---|---|
| Full System | 0.818 | 96.2% | 0.0 | Best overall |
| No Exclusivity | 0.818 | 100.0% | 0.7 | Works, but super-agents appear |
| No Fitness Sharing | 0.816 | 91.2% | 0.0 | Works, slightly less diverse |
| Competition Only | 0.773 | 100.0% | 1.2 | ✅ Still strong specialization! |
Key Finding: Competition alone produces SCI = 0.773 (94% of full system). This proves:
- ✅ Specialization is genuinely emergent from competition
- ✅ Exclusivity is a safety net (prevents super-agents), not the source
- ✅ Our core claim is validated: competition is sufficient for diversity
We provide a complete theoretical framework with three proven theorems:
- Theorem 1 (Monotone Improvement): The expected total strategy level E[L(t)] is monotonically non-decreasing.
- Theorem 2 (Specialization Emergence): Under fitness sharing, the system reaches k ≥ ⌊(1-γ)R⌋ distinct L3 specialists within O(N × R × log(1/ε)) generations.
- Theorem 3 (Equilibrium Stability): The stationary distribution satisfies π(S*) ≥ 1-ε for sufficiently large N.
- Equilibrium Characterization: Uniqueness (up to permutation), stability, optimality
- Thompson Sampling Connection: Links to Paper 1's belief-based mechanism
- Carrying Capacity: Optimal N* ≈ 3R (24-32 agents for 8 rules)
This paper directly extends the NichePopulation algorithm from Paper 1 of this research series. Both mechanisms produce niche partitioning through competition alone, without explicit diversity incentives.
| Paper 1: NichePopulation | Paper 2: Prompt Evolution |
|---|---|
| Regimes (environmental states) | Rules (task types) |
| Beta belief distributions | Strategy levels (L0→L3 prompts) |
| Thompson Sampling posteriors | Accumulated prompt knowledge |
| Niche affinity α ∈ Δ^R | Exclusivity (L3 lock) |
| Niche bonus λ | Fitness sharing 1/√n |
| Winner-take-all updates | Winner-take-all updates |
Both papers demonstrate that strict competitive exclusion is a structural necessity:
```
Soft Competition (proportional updates):
  → All agents update proportionally to performance
  → Good strategies propagate to ALL agents
  → Result: HOMOGENIZATION

Winner-Take-All (this work):
  → Only the winner updates beliefs/strategies
  → Winners accumulate expertise in the winning niche
  → Losers stay unchanged and must find other niches
  → Result: DIFFERENTIATION (emergent specialization)
```
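This homogenization-versus-differentiation contrast can be reproduced in a toy model. The sketch below is not the paper's mechanism, just a minimal numeric illustration of the two update rules named above:

```python
import random

def simulate(winner_take_all, n_agents=8, n_rules=8, steps=2000, seed=0):
    rng = random.Random(seed)
    skill = [[1.0] * n_rules for _ in range(n_agents)]   # identical agents at start
    for _ in range(steps):
        r = rng.randrange(n_rules)                       # a task of rule r arrives
        scores = [row[r] for row in skill]
        if winner_take_all:
            # only the (randomly tie-broken) best agent accumulates expertise in r
            best = max(range(n_agents), key=lambda a: (scores[a], rng.random()))
            skill[best][r] += 1.0
        else:
            # soft competition: every agent updates proportionally to performance
            total = sum(scores)
            for a in range(n_agents):
                skill[a][r] += scores[a] / total
    # specialization proxy: average share of each agent's skill in its top rule
    return sum(max(row) / sum(row) for row in skill) / n_agents

print("soft:", round(simulate(False), 2), "| winner-take-all:", round(simulate(True), 2))
```

Under the soft rule all agents remain clones (top-rule share stays near 1/n_rules); under winner-take-all, early wins compound and agents diverge into niches, which is the exclusion dynamic both papers rely on.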
This explains why standard MARL methods (QMIX, MAPPO, IQL) fail to induce specialization: they use shared critics/value functions that drive convergence rather than divergence. Paper 1 shows MARL achieves SI < 0.2 versus our SI = 0.75.
This paper uses 8 synthetic rule domains with cognitive science grounding. Synthetic rules are essential for:
- Controlled experiments: No prior LLM knowledge contaminates results
- Verifiable causality: We can prove prompts cause specialization
- Clean ablations: Isolate competition's effect from other factors
| Category | Rules | Characteristic |
|---|---|---|
| Purely Arbitrary | POSITION, PATTERN, MATH_MOD | No prior knowledge helps |
| Semi-Arbitrary | RHYME, ALPHABET, VOWEL_START | Requires rule application |
| Knowledge-Aided | ANIMATE, INVERSE | Leverages categorical knowledge |
| Rule | Description | Cognitive Source |
|---|---|---|
| POSITION | Answer at position B | Serial Position Effect |
| PATTERN | ABAB alternation | Gestalt Psychology |
| INVERSE | Opposite of obvious | Propositional Logic |
| VOWEL_START | Starts with A,E,I,O,U | Phonemic Awareness |
| RHYME | Rhymes with CAT | Phonological Processing |
| ALPHABET | First letter closest to M | Orthographic Processing |
| MATH_MOD | Length mod 3 = 1 | Number Cognition |
| ANIMATE | Living thing (animal) | Category-Specific Processing |
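To make the rule definitions concrete, here is how three of them might be checked in code. This is our reading of the table above, not the repo's `synthetic_rules.py`:

```python
def vowel_start(word: str) -> bool:
    """VOWEL_START: the answer begins with A, E, I, O, or U."""
    return word[:1].upper() in "AEIOU"

def math_mod(word: str) -> bool:
    """MATH_MOD: the answer's length satisfies len(word) % 3 == 1."""
    return len(word) % 3 == 1

def alphabet_pick(options: list[str]) -> str:
    """ALPHABET: pick the option whose first letter is closest to 'M'."""
    return min(options, key=lambda w: abs(ord(w[0].upper()) - ord("M")))

assert vowel_start("Apple") and math_mod("Echo") and alphabet_pick(["ant", "newt"]) == "newt"
```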
For real-world tool specialization, see Paper 3: Emergent-Tool-Specialization.
```bash
git clone https://github.com/HowardLiYH/Emergent-Prompt-Evolution.git
cd Emergent-Prompt-Evolution
pip install -r requirements.txt

# Set API key in .env file
echo "GOOGLE_API_KEY=your-key" > .env
```

```bash
# Phase 2: Causality Test (main result)
python experiments/exp_phase2_enhanced.py

# 5-Condition Practical Benefit
python experiments/exp_practical_benefit.py

# Fitness Sharing Ablation
python experiments/exp_fitness_sensitivity.py

# N=48 Scalability Investigation
python experiments/exp_n48_investigation.py
```

| Phase | Experiment | Question | File |
|---|---|---|---|
| 0 | Rule Validation | Are rules distinct? | exp_rule_validation.py |
| 1 | Preference Emergence | Do agents specialize? | exp_preference_main.py |
| 2 | Causality Test | Do prompts cause it? | exp_phase2_enhanced.py |
| 3 | Ablation | Which components matter? | exp_preference_ablation.py |
| 4 | MMLU Validation | Transfer to real tasks? | exp_mmlu_validation.py |
| 5 | Practical Benefit | Population vs generalist? | exp_practical_benefit.py |
| 6 | Cost-Benefit | When does it pay off? | exp_cost_benefit.py |
| 7 | Bridge | Synthetic vs real transfer? | exp_bridge.py |
| 8 | Falsification | Preference vs capability? | exp_falsification.py |
- Strategy Accumulation: Winners gain rule knowledge (Level 0→1→2→3)
- Exclusivity: Level 3 agents specialize in one rule only
- Confidence-based Competition: The most confident correct agent wins
- Fitness Sharing: A 1/√n penalty promotes diversity
- Seeded Initialization: Each agent starts at L1 in one random rule (cold-start solution)
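A condensed sketch of how these five pieces interact in a single generation. The `Agent` structure and `answer_fn` interface below are illustrative assumptions; `competition_v3.py` and `preference_agent.py` hold the real logic:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    levels: dict = field(default_factory=dict)   # rule -> strategy level 0..3
    fitness: float = 0.0

    @property
    def locked_rule(self):
        """Exclusivity: an L3 agent is locked to exactly one rule."""
        return next((r for r, lvl in self.levels.items() if lvl >= 3), None)

def run_generation(agents, task, answer_fn):
    """One round; answer_fn(agent, task) -> (answer, confidence)."""
    rule = task["rule"]
    # Exclusivity: locked specialists only enter competitions on their own rule.
    eligible = [a for a in agents if a.locked_rule in (None, rule)]
    results = [(a, *answer_fn(a, task)) for a in eligible]
    correct = [(a, conf) for a, ans, conf in results if ans == task["answer"]]
    if not correct:
        return
    # Fitness sharing: the niche reward shrinks as 1/sqrt(n) with crowding.
    reward = 1.0 / len(correct) ** 0.5
    # Winner-take-all: only the most confident correct agent accumulates.
    winner = max(correct, key=lambda pair: pair[1])[0]
    winner.fitness += reward
    winner.levels[rule] = min(3, winner.levels.get(rule, 0) + 1)
```

Seeded initialization then corresponds to constructing each `Agent` with a single random rule at level 1, which breaks the cold-start symmetry among otherwise identical agents.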
```
emergent_prompt_evolution/
├── src/genesis/
│   ├── synthetic_rules.py         # 8 rules + categories
│   ├── rule_strategies.py         # 3-level strategies
│   ├── preference_agent.py        # Agent with exclusivity
│   ├── competition_v3.py          # Confidence-based competition
│   ├── llm_client.py              # Unified LLM wrapper
│   ├── theory.py                  # 3 theorems + proofs
│   ├── real_tasks.py              # Multi-domain tasks
│   ├── routing.py                 # 4 routing methods
│   ├── statistics_complete.py     # Full statistical rigor
│   ├── hero_visualization.py      # Publication figures
│   ├── analysis.py                # Bootstrap CIs (10k)
│   └── neurips_metrics.py         # SCI, HHI, Gini
├── experiments/
│   ├── exp_phase2_enhanced.py     # Main causality test
│   ├── exp_practical_benefit.py   # 5-condition comparison
│   ├── exp_falsification.py       # Preference vs capability
│   ├── exp_cost_benefit.py        # ROI analysis
│   ├── exp_bridge.py              # Mechanism transfer
│   ├── exp_fitness_sensitivity.py # Penalty ablation
│   ├── exp_n48_investigation.py   # Scalability analysis
│   └── ...                        # Other experiments
├── paper/
│   ├── main.tex                   # Full NeurIPS submission
│   ├── arxiv_submission.zip       # Ready for arXiv
│   ├── neurips_2025.sty           # NeurIPS style file
│   └── figures/                   # Publication figures
├── results/
│   ├── unified_gemini25/          # 10-seed results
│   ├── practical_benefit/         # 5-condition results
│   ├── fitness_sensitivity/       # Ablation results
│   └── ...                        # Other results
├── docs/
│   ├── DEEP_DIVE.md               # Comprehensive methodology
│   ├── PREFERENCE_DEFINITION.md   # Formal definition
│   ├── COGNITIVE_FRAMING.md       # Revised framing
│   └── AUDIT_LOG.md               # Data integrity
├── CHANGELOG.md                   # Version history
└── README.md                      # This file
```
All results include complete statistical analysis:
| Requirement | Status |
|---|---|
| Cohen's d for all claims | ✅ |
| 95% Confidence Intervals | ✅ |
| Bootstrap CIs (10k resamples) | ✅ |
| Holm-Bonferroni correction | ✅ |
| Power analysis (10 seeds) | ✅ |
| Welch's t-test | ✅ |
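For instance, the percentile bootstrap behind the reported CIs can be sketched in a few lines. This is a minimal version assuming per-seed pass rates as input; the numbers below are illustrative, and `analysis.py` is the authoritative implementation:

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    return means[int(n_resamples * alpha / 2)], means[int(n_resamples * (1 - alpha / 2))]

# Illustrative per-seed swap-test pass rates (not the actual data):
print(bootstrap_ci([0.69, 0.73, 0.71, 0.68, 0.72, 0.70, 0.71, 0.73, 0.69, 0.71]))
```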
New to this project? Read our comprehensive Deep Dive Document, a ground-up mathematical explanation of the entire methodology.
The Deep Dive covers:
- Part I: The Problem and Why It Matters
- Part II: Mathematical Foundations (entropy, fitness sharing, Markov chains)
- Part III: The Mechanism (rules, strategies, competition)
- Part IV: Theoretical Analysis (3 theorems with proofs)
- Part V: Experimental Validation (causality tests, statistics)
- Part VI: Practical Applications (deployment, ROI)
- Part VII: What Makes This Impressive
Prerequisites: Basic probability theory and familiarity with LLMs. All advanced concepts are developed from first principles.
| Paper | Project | Relationship |
|---|---|---|
| Paper 1 | NichePopulation | Foundation: Introduces NichePopulation algorithm with Thompson Sampling + competitive exclusion. Validated across 6 real-world domains. Mean SI = 0.747, Cohen's d > 20. |
| Paper 2 | This Repository | Extension: Adapts NichePopulation for LLM prompt evolution. Rules ↔ Regimes, Strategy Levels ↔ Beta posteriors. Produces human-readable specialists with 70.7% causality validation. |
| Paper 3 | Emergent-Tool-Specialization | Extension: Applies emergent specialization to real LLM tools (Vision, Code, RAG, Web). +83% specialist advantage on tool-gated tasks. |
```
Paper 1: NichePopulation (Foundation)
├── Domain: Real-world time series prediction
├── Agents: Rule-based learners with Beta beliefs
└── Result: Competition induces specialization (SI=0.75)
                          ↓
      ┌──────────────────────────────────────────────────────┐
      │ KEY MECHANISM: Winner-take-all competitive exclusion │
      └──────────────────────────────────────────────────────┘
                          ↓
Paper 2: Preference Specialization (This Work)
├── Domain: LLM task specialization (synthetic rules)
├── Agents: LLM instances with accumulating prompts
└── Result: Prompts cause preference (70.7% causality)
                          ↓
Paper 3: Tool Specialization (Extension)
├── Domain: Real LLM tools (Vision, Code, RAG, Web)
├── Agents: LLM instances with real API access
└── Result: +83% specialist advantage
```
```bibtex
@article{li2025emergent,
  title={Emergent Preference Specialization in LLM Agent Populations
         Through Competitive Selection},
  author={Li, Yuhao},
  journal={arXiv preprint},
  year={2025}
}
```

MIT License - See LICENSE for details.
Part of the Emergent Specialization Research Series
Paper 2 of 3