Expand scoring evaluation into arXiv preprint #108
Open
Labels: enhancement (New feature or request)
Description
Summary
The scoring evaluation in docs/scoring-evaluation.md has compelling findings (86% concordance between Copeland and Borda, with the weighted sum as the outlier) that deserve a proper academic write-up. An arXiv preprint would:
- Establish thinktank as a research-backed tool, not just another CLI
- Make the work discoverable by researchers working on ensemble code generation
- Create a citable reference for the social-choice-theory approach to agent selection
- Attract academic contributors and reviewers
Proposed paper structure
Title
"Ensemble AI Coding: Applying Social Choice Theory to Multi-Agent Code Selection"
Sections
- Introduction — the ensemble coding hypothesis, pass@k evidence
- Related Work — AlphaCode, CodeT, MBR-Exec, SWE-bench, Kambhampati LLM-Modulo
- System Design — thinktank architecture, worktree isolation, convergence analysis
- Scoring Methods — Weighted Sum, Copeland, Borda (formal definitions)
- Experimental Setup — controlled experiments with fixed N=5 across diverse tasks
- Results — agreement rates, Friedman test, Kendall's W, effect sizes
- Discussion — why pairwise methods outperform, limitations, when weighted is appropriate
- Conclusion — Copeland as default, future work (LLM-as-judge, cross-project)
What needs to happen
- Run controlled experiments: 30+ tasks with fixed N=5 agents each
- Diverse task set: bug fixes, features, refactors across multiple languages
- Formal Friedman test with Nemenyi post-hoc
- Kendall's W for inter-method concordance
- Figures: agreement heatmaps, score distributions, convergence vs accuracy plots
- LLM-as-judge ground truth for a subset of runs
- LaTeX formatting per arXiv cs.SE guidelines
- Link the repo README to the arXiv paper
- Link the paper to the repo as the reference implementation
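For the concordance item above, a small sketch of Kendall's W may help pin down what will be computed. It assumes a single run in which each scoring method produces a complete ranking of the N=5 agents with no ties (rank 1 is best); the rankings below are invented for illustration. The tie-free formula is W = 12 S / (m^2 (n^3 - n)), where m is the number of methods, n is the number of agents, and S is the sum of squared deviations of the per-agent rank sums from their mean.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m complete, tie-free rankings
    of the same n items. rankings[j][i] is the rank method j gives item i."""
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def friedman_chi2(rankings):
    """Friedman chi-square statistic for the same layout (methods as blocks,
    agents as treatments), via the identity chi2 = m * (n - 1) * W."""
    m = len(rankings)
    n = len(rankings[0])
    return m * (n - 1) * kendalls_w(rankings)


# Hypothetical ranks for five agents from three methods (illustrative only).
rankings = [
    [1, 2, 3, 4, 5],  # weighted sum
    [2, 1, 3, 4, 5],  # Copeland
    [2, 1, 3, 5, 4],  # Borda
]

print(round(kendalls_w(rankings), 3))
print(round(friedman_chi2(rankings), 3))
```

W ranges from 0 (no agreement) to 1 (identical rankings). Tied ranks, which Copeland in particular can produce, need the tie-corrected formula, and the Friedman test with Nemenyi post-hoc comparing methods across tasks uses a different layout (tasks as blocks, methods as treatments), so an established statistics package is probably the better choice for the paper's reported numbers.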
References to cite
- Arrow (1951) Social Choice and Individual Values
- Merlin & Valognes (2004) Condorcet-Borda coincidence
- Tetlock & Gardner (2015) Superforecasting
- Kambhampati et al. (2024) LLM-Modulo framework
- Li et al. (2022) AlphaCode
- Chen et al. (2022) CodeT
- Wang et al. (2022) Self-Consistency