Expand scoring evaluation into arXiv preprint #108

@that-github-user

Summary

The scoring evaluation in docs/scoring-evaluation.md has compelling findings (86% Copeland-Borda concordance, with Weighted Sum as the outlier) that deserve a proper academic write-up. An arXiv preprint would:

  1. Establish thinktank as a research-backed tool, not just another CLI
  2. Make the work discoverable by researchers working on ensemble code generation
  3. Create a citable reference for the social-choice-theory approach to agent selection
  4. Attract academic contributors and reviewers

Proposed paper structure

Title

"Ensemble AI Coding: Applying Social Choice Theory to Multi-Agent Code Selection"

Sections

  1. Introduction — the ensemble coding hypothesis, pass@k evidence
  2. Related Work — AlphaCode, CodeT, MBR-Exec, SWE-bench, Kambhampati LLM-Modulo
  3. System Design — thinktank architecture, worktree isolation, convergence analysis
  4. Scoring Methods — Weighted Sum, Copeland, Borda (formal definitions)
  5. Experimental Setup — controlled experiments with fixed N=5 across diverse tasks
  6. Results — agreement rates, Friedman test, Kendall's W, effect sizes
  7. Discussion — why pairwise methods outperform, limitations, when weighted is appropriate
  8. Conclusion — Copeland as default, future work (LLM-as-judge, cross-project)
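
To make the planned formal definitions in section 4 concrete, here is a minimal sketch of Borda and Copeland scoring. It assumes each evaluation criterion produces a strict best-to-worst ranking of the agents; the function names, agent IDs, and example rankings are hypothetical and are not taken from the thinktank codebase. Copeland is shown in its common "pairwise wins minus pairwise losses" form.

```python
from itertools import combinations


def borda_scores(rankings):
    """Borda count: a candidate in position p (0 = best) of an
    n-candidate ranking earns n - 1 - p points from that ranking."""
    scores = {cand: 0 for cand in rankings[0]}
    for ranking in rankings:
        n = len(ranking)
        for position, cand in enumerate(ranking):
            scores[cand] += n - 1 - position
    return scores


def copeland_scores(rankings):
    """Copeland score: pairwise majority wins minus pairwise losses.
    A tied pair contributes nothing to either candidate."""
    candidates = list(rankings[0])
    # Position lookup per ranking, so pairwise comparisons are O(1).
    positions = [{c: i for i, c in enumerate(r)} for r in rankings]
    scores = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_wins = sum(1 for pos in positions if pos[a] < pos[b])
        b_wins = len(positions) - a_wins
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
    return scores


# Hypothetical example: three criteria each rank three agents, best first.
example_rankings = [
    ["agent_a", "agent_b", "agent_c"],
    ["agent_b", "agent_a", "agent_c"],
    ["agent_a", "agent_c", "agent_b"],
]
borda = borda_scores(example_rankings)
copeland = copeland_scores(example_rankings)
```

In this toy input both methods pick `agent_a`, which is the kind of agreement the concordance figure measures. Weighted Sum is left out because its weights and input scale are not specified here.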

What needs to happen

  • Run controlled experiments: 30+ tasks with fixed N=5 agents each
  • Diverse task set: bug fixes, features, refactors across multiple languages
  • Formal Friedman test with Nemenyi post-hoc
  • Kendall's W for inter-method concordance
  • Figures: agreement heatmaps, score distributions, convergence vs accuracy plots
  • LLM-as-judge ground truth for a subset of runs
  • LaTeX formatting per arXiv cs.SE guidelines
  • The repo README links to the arXiv paper
  • The paper links to the repo as reference implementation
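
As a rough illustration of the statistics listed above, the sketch below computes Kendall's W and the Friedman chi-square statistic from a matrix of ranks. It assumes no tied ranks and a complete matrix (every row ranks every item); the function names and the layout (rows are raters or blocks, columns are the items being compared) are choices made for this example, not the evaluation's actual code. The Nemenyi post-hoc step is not covered.

```python
def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance W, assuming no ties.

    rank_matrix[i][j] is the rank (1 = best) that rater i gives item j.
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-item rank totals from their mean.
    """
    m = len(rank_matrix)          # number of raters (rows)
    n = len(rank_matrix[0])       # number of items (columns)
    totals = [sum(row[j] for row in rank_matrix) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def friedman_statistic(rank_matrix):
    """Friedman chi-square via the identity chi2 = m * (n - 1) * W.
    Under the null hypothesis it is approximately chi-square
    distributed with n - 1 degrees of freedom."""
    m = len(rank_matrix)
    n = len(rank_matrix[0])
    return m * (n - 1) * kendalls_w(rank_matrix)


# Hypothetical data: three raters rank five items identically.
full_agreement = [[1, 2, 3, 4, 5]] * 3
# Two raters with exactly opposite rankings of three items.
full_disagreement = [[1, 2, 3], [3, 2, 1]]

w_agree = kendalls_w(full_agreement)
w_disagree = kendalls_w(full_disagreement)
chi2_agree = friedman_statistic(full_agreement)
```

W ranges from 0 (no agreement) to 1 (identical rankings). For the real analysis, a library routine such as `scipy.stats.friedmanchisquare` would also provide the p-value and handle ties.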

References to cite

  • Arrow (1951) Social Choice and Individual Values
  • Merlin & Valognes (2004) Condorcet-Borda coincidence
  • Tetlock & Gardner (2015) Superforecasting
  • Kambhampati (2024) LLM-Modulo framework
  • Li et al. (2022) AlphaCode
  • Chen et al. (2022) CodeT
  • Wang et al. (2022) Self-Consistency

Labels: enhancement
