The LLM Team Composition Paradox: Why the Right Team Isn't Always Diverse: Developing a Contingency Model on the ARC-AGI Benchmark with an OODA-Belbin Framework Inspired by Multicultural Human Teams
This repository contains the official code for the research paper: "The LLM Team Composition Paradox: Why the Right Team Isn't Always Diverse: Developing a Contingency Model on the ARC-AGI Benchmark with an OODA-Belbin Framework Inspired by Multicultural Human Teams".
The experiment investigates the optimal team architecture for Large Language Model (LLM) agents tasked with solving complex "wicked problems" from the Abstraction and Reasoning Corpus (ARC).
This research challenges the prevailing assumption that architecturally complex, diverse, or adaptive teams are universally superior for problem-solving. The central finding is a data-driven contingency model, which demonstrates that the optimal agentic architecture is contingent on the type of problem being solved.
Our findings reveal a series of paradoxes:
- The Simplicity Hypothesis: A significant portion of tasks requiring holistic, abstract reasoning are most effectively solved by a single agent (`Baseline_Single_Agent`), where the overhead of teamwork is a detriment.
- The "Expert Anomaly": A non-diverse, homogeneous team of 'Implementer' agents (`Baseline_Static_Homogeneous`) uniquely excels at high-precision procedural tasks ("Local Geometric Analysis").
- The "Cognitive Labyrinth": The most algorithmically complex problems are mastered not by the most complex adaptive system, but by a cognitively diverse, static heterogeneous team (`Baseline_Static_Heterogeneous`).
- The "Cost of Complexity": The most complex, real-time adaptive framework (`Adaptive_CP-ATS_Budget_Pool`) was the poorest-performing condition, revealing a clear "cost of complexity" in agentic reasoning.
The code in this repository implements the final Hierarchical OODA-Belbin Framework, a two-layer system designed to operationalize our contingency model.
- Strategic Layer (The Selector): A top-level `StrategicSelectorAgent` first analyzes (or "Orients" to) the problem's features. Instead of picking agents, it picks the entire problem-solving architecture (e.g., "Single Agent" or "Homogeneous Team") best suited for that problem type (sketched below).
- Tactical Layer (The Solver): If an adaptive strategy is chosen, a `TeamManagerAgent` manages the team. If the agents get stuck (a state we call "Generative Exhaustion"), a Self-Correction Loop engages, using "Chain-of-Thought" and "Meta-cognitive" prompts to force the agent to debug its own flawed logic.
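To make the strategic layer concrete, here is a minimal, hypothetical sketch of how architecture selection could look. The feature names and thresholds are illustrative assumptions only; the actual deployment rules live in `select_strategy()` in `src/agents/team_manager.py`.

```python
# Illustrative sketch only: the real rules are in src/agents/team_manager.py
# (select_strategy). Feature names and thresholds below are assumptions.
from dataclasses import dataclass


@dataclass
class ProblemFeatures:
    innovation_score: float  # assumed proxy for how abstract/novel the task is
    is_procedural: bool      # assumed flag for local geometric, high-precision tasks
    algorithmic_depth: int   # assumed measure of algorithmic complexity


def select_strategy_sketch(features: ProblemFeatures) -> str:
    """Map problem features to one of the four architectures studied in the paper."""
    if features.innovation_score > 0.8:
        return "Baseline_Single_Agent"          # holistic, abstract reasoning
    if features.is_procedural:
        return "Baseline_Static_Homogeneous"    # high-precision procedural tasks
    if features.algorithmic_depth > 3:
        return "Baseline_Static_Heterogeneous"  # algorithmically complex problems
    return "Adaptive_CP-ATS_Budget_Pool"        # fall back to the adaptive framework
```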
Here is how the core concepts from the paper map directly to the code in this repository:
| Concept (from Paper) | Code Implementation | File |
|---|---|---|
| Strategic Selector Agent | `StrategicSelectorAgent` class | `src/agents/team_manager.py` |
| "OODA Reconnaissance" | `extract_problem_features()`, `calculate_innovation_score()` | `src/agents/team_manager.py` |
| Strategic Deployment Rules | `select_strategy()` method | `src/agents/team_manager.py` |
| Agentic Team Manager (CP-ATS) | `TeamManagerAgent` class | `src/agents/team_manager.py` |
| Belbin Role Mapping | `BELBIN_ROLE_MAP` dictionary | `src/agents/team_manager.py` |
| Cultural Personas | `PERSONAS_DATA` dictionary | `src/prompts/personas.py` |
| Dynamic Team Recomposition | `diagnose_failure()` & `recompose_team()` | `src/agents/team_manager.py` |
| Tactical Self-Correction Loop | `_expand_and_correct()` method | `src/search/ab_mcts.py` |
| "Chain-of-Thought" Prompt | `_create_correction_prompt()` | `src/agents/specialist_agent.py` |
| "Meta-cognitive" Prompt | `_create_meta_cognitive_prompt()` | `src/agents/specialist_agent.py` |
| Gated Scoring Framework | `evaluate_solution()` (Gate 1 & 2 checks) | `src/solution_evaluator.py` |
| Multi-Objective Fitness Function | `get_fitness_vector()` method | `src/solution_evaluator.py` |
| Search Algorithm | `ABMCTS` class | `src/search/ab_mcts.py` |
| Main Experiment Orchestrator | `main.py` | `main.py` |
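As a quick illustration of the gated scoring idea from the table, here is a hypothetical sketch of what a two-gate `evaluate_solution()` check could look like. The gate definitions below (Gate 1: the generated program runs without error; Gate 2: its output exactly matches the expected ARC grid) are assumptions made for illustration; the real checks live in `src/solution_evaluator.py`.

```python
# Hypothetical sketch of a two-gate evaluation; see src/solution_evaluator.py
# for the actual implementation. Gate definitions here are assumptions.
from typing import Callable, List

Grid = List[List[int]]


def evaluate_solution_sketch(solve: Callable[[Grid], Grid],
                             test_input: Grid, expected: Grid) -> dict:
    # Gate 1 (assumed): the candidate program must execute without raising.
    try:
        predicted = solve(test_input)
    except Exception as exc:
        return {"gate1_runs": False, "gate2_correct": False, "error": str(exc)}

    # Gate 2 (assumed): exact cell-for-cell match with the expected output,
    # which is the standard ARC success criterion.
    return {"gate1_runs": True, "gate2_correct": predicted == expected, "error": None}
```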
The repository is organized as follows:

```
.
├── Data tables/
│   ├── Comprehensive successful solutions data and analysis/
│   │   └── Successful solutions data and analysis.csv
│   ├── chi square tables.pdf
│   ├── Summary-Data contamination per condition.csv
│   ├── Summary-Ghost problem contamination (V1 but no V2 version).csv
│   ├── Summary-Percentage of high quality successful solves.csv
│   ├── Summary-Percentage of successful solves per framework-1.csv
│   ├── Summary-Percentage of successful solves per framework.csv
│   ├── Summary-Real success data for the 120 problems experiment.csv
│   ├── Summary-Real successes, Percentage of high quality solutions and data integrity.csv
│   └── Raw successful solutions data and Gemini Pro 2.5 analysis.csv
├── Outputs/
│   ├── output_100_problems_total_failure.zip
│   ├── outputs_120_problems.zip
│   ├── outputs_5_final_runs_20_problems.zip
│   ├── outputs_initial_5_problems.zip
│   └── outputs_repeated_5_problems.zip
├── data - copy ARC dataset inside/
│   └── (ARC-AGI dataset JSON files go here)
├── src/
│   ├── agents/
│   │   ├── specialist_agent.py
│   │   └── team_manager.py
│   ├── prompts/
│   │   └── personas.py
│   ├── search/
│   │   └── ab_mcts.py
│   ├── utils/
│   │   ├── code_executor.py
│   │   └── logger.py
│   ├── llm_client.py
│   ├── problem_loader.py
│   └── solution_evaluator.py
├── main.py
└── requirements.txt
```

Follow these steps to set up and run the full experiment.
First, clone the repository and install the required Python packages.
```bash
git clone https://github.com/your-username/llm-team-paradox.git
cd llm-team-paradox
```

The requirements.txt file lists all necessary libraries. Install them using pip:

```bash
pip install -r requirements.txt
```

The system uses the DeepSeek Coder model via its API. You must set your API key as an environment variable, which `src/llm_client.py` will read.
```bash
# On macOS/Linux
export DEEPSEEK_API_KEY="your-api-key-here"

# On Windows (in Command Prompt)
set DEEPSEEK_API_KEY="your-api-key-here"
```
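For reference, a client-side lookup of that variable might look like the following minimal sketch (a hypothetical illustration, not the actual contents of `src/llm_client.py`):

```python
# Hypothetical sketch of reading the API key; the real client is in src/llm_client.py.
import os


def get_api_key() -> str:
    key = os.environ.get("DEEPSEEK_API_KEY")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set. Export it before running main.py.")
    return key
```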
The code expects the ARC-AGI dataset to be in the folder named `data` at the root of the repository. Your directory structure must look like this:
```
llm-team-paradox/
│
├── data/
│   ├── arc-agi_training_challenges.json
│   ├── arc-agi_training_solutions.json
│   └── ... (any other challenge/solution files)
│
├── src/
│   └── ... (all source code)
│
├── main.py
└── requirements.txt
```

You can configure the experiment by editing the global variables at the top of main.py:
```python
# main.py
# --- CONFIGURATION ---
DATASET_PATH = "./data"  # Make sure this matches your folder name
CHALLENGES_FILE = "json challenges filename"
SOLUTIONS_FILE = "json solutions filename"
OUTPUTS_DIR = Path("outputs")
MAX_ADAPTATION_CYCLES = 5
BUDGET_PER_CYCLE = 50
MAX_CONCURRENT_PROBLEMS = 4  # Warning: setting this too high may exceed API rate limits
BATCH_SIZE = MAX_CONCURRENT_PROBLEMS
```

Once configured, you can start the experiment by running main.py:
```bash
python main.py
```

The script will:
- Initialize all components (LLM client, evaluator, etc.).
- Check the `OUTPUTS_DIR` for a `consolidated_results.csv` file to find and skip any problems that have already been processed; the experiment is fully resumable (see the sketch after this list).
- Load the remaining problems and process them in batches.
- For each problem, the `StrategicSelectorAgent` will choose the optimal strategy.
- The chosen framework will run, and all results will be logged to `consolidated_results.csv`.
- Detailed logs and all generated code for every attempt will be saved in the `OUTPUTS_DIR` subfolders.
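The resume check can be as simple as the following sketch (a hypothetical illustration that assumes a `problem_id` column in `consolidated_results.csv`; the real logic lives in `main.py`):

```python
# Hypothetical sketch of the resume logic; main.py contains the real version.
# Assumes consolidated_results.csv has a "problem_id" column.
import csv
from pathlib import Path


def already_processed(outputs_dir: Path) -> set:
    """Return the set of problem IDs already logged in consolidated_results.csv."""
    results_file = outputs_dir / "consolidated_results.csv"
    if not results_file.exists():
        return set()
    with results_file.open(newline="") as f:
        return {row["problem_id"] for row in csv.DictReader(f)}

# Only problems not yet logged are kept, e.g.:
# remaining = [p for p in all_problems if p["id"] not in already_processed(OUTPUTS_DIR)]
```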
This repository includes the complete data used to generate the paper's findings, separated into two categories:
The `Data tables/` folder contains the final, processed analysis tables and statistical tests. This is the best place to see the evidence for the paper's conclusions.
- `Comprehensive successful solutions data and analysis/`: Contains the master CSV file detailing every successful solve and, most importantly, the qualitative code audit that classified it as a "real success" or a "false positive".
- `Summary-*.csv`: Various summary tables detailing data contamination, high-quality success rates, real success percentages, etc.
- `chi square tables.pdf`: The statistical test results, such as the one showing no significant difference in raw success rates.
The `Outputs/` folder contains the complete raw experimental runs as .zip files. Each file corresponds to a specific Design Science Research (DSR) cycle from the paper. Inside each zip, you will find the raw .txt log files, all generated Python code for every attempt, and the CSV results for that specific run.
- `outputs_initial_5_problems.zip`: DSR Cycle 1. This run produced Anomaly #1, the "a priori team selection problem".
- `outputs_repeated_5_problems.zip`: DSR Cycle 2. This run produced Anomaly #2, the "Universal Failure" against the official metric, which proved the need for real-time adaptation.
- `outputs_120_problems.zip`: The main experiment (120 problems) that generated the core contingency model and the four problem archetypes.
- `output_100_problems_total_failure.zip`: DSR Cycle 3. This run against the "Frontier Challenge" (100 eval problems) produced Anomaly #3, "The Local Optima Trap".
- `outputs_5_final_runs_20_problems.zip`: The final validation run (5 runs on 20 "trap" problems) testing the Self-Correction Loop, which achieved the first-ever robust solve on a "trap" problem.
If you use this research or code in your work, please cite the original paper.
```bibtex
@article{Hariri2025LLMParadox,
  title   = {The LLM Team Composition Paradox: Why the Right Team Isn't Always Diverse},
  author  = {Youssef Hariri},
  journal = {Rennes School of Business and Upgrad},
  year    = {2025}
}
```