Multi-Agent Architectures for LLM-Based Code Generation: Evaluating Adaptive Routing and Model Specialization
For complete project details, methodology, and experimental results, see the full report in the `report/` directory.
```
project/
├── data/                    # HumanEval dataset and experiment logs
├── docs/                    # Detailed method documentation
├── notebook/                # Jupyter notebooks for experiments
├── report/                  # LaTeX source of the scientific paper
│   └── figures/             # Architecture diagrams and charts
├── src/                     # Source code
│   ├── __init__.py
│   ├── agents/              # LLM interaction layer
│   │   ├── client.py        # HuggingFace API client with retry logic
│   │   └── llm.py           # Architecture & model configurations
│   ├── data/                # Dataset handling
│   │   └── task_loader.py   # HumanEval task loader
│   ├── evaluation/          # Metrics computation
│   │   └── __init__.py
│   ├── graph/               # LangGraph state machine
│   │   ├── config.py        # Story points → tier mapping
│   │   ├── graph.py         # Graph builder & runner
│   │   ├── nodes.py         # Agent node implementations
│   │   └── state.py         # GraphState TypedDict definition
│   └── models/              # Prompts & schemas
│       ├── llm_responses.py # Pydantic response models
│       └── prompts.py       # System/user prompts for all agents
├── tests/                   # Unit tests for the pipeline
├── main.py                  # Entry point
└── requirements.txt         # Dependencies
```
This project is optimized to run in the cloud using Kaggle Notebooks for reproducible execution.
1. **Clone the Repository**: Clone this repository to your local machine, or download it as a ZIP archive.
2. **Import to Kaggle**:
   - Create a new Notebook on Kaggle.
   - Upload the project files (specifically the contents of the `notebook/` folder) into the Kaggle environment.
3. **Configure Environment**:
   - Enable GPU: In the Notebook settings, verify that a GPU (e.g., T4 x2) is enabled.
   - Set Secrets: Add your API keys in the Kaggle "Secrets" menu (see the sketch after this list):
     - `HF_TOKEN`: Your HuggingFace API key (required).
     - `LANGSMITH_API_KEY`: Your LangSmith key (optional, for tracing).
4. **Run the Notebook**: Select one of the notebooks below and execute the cells.
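Inside a Kaggle Notebook, the secrets can be read and exported as environment variables, for example with the `kaggle_secrets` helper (a minimal sketch; the project's notebooks may wire this differently):

```python
import os
from kaggle_secrets import UserSecretsClient  # available inside Kaggle Notebooks

secrets = UserSecretsClient()

# Required: HuggingFace Inference API token.
os.environ["HF_TOKEN"] = secrets.get_secret("HF_TOKEN")

# Optional: enable LangSmith tracing only if the key was added as a secret.
try:
    os.environ["LANGSMITH_API_KEY"] = secrets.get_secret("LANGSMITH_API_KEY")
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
except Exception:
    pass  # tracing stays disabled when the secret is absent
```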
The project includes pre-configured notebooks for each architectural experiment:
| Notebook | Description |
|---|---|
| `architecture-a.ipynb` | Baseline: Single-agent code generation (Qwen-7B). |
| `architecture-b.ipynb` | Multi-Agent: Full pipeline (Plan/Code/Test/Review) using a single model (Qwen-7B). |
| `architecture-c.ipynb` | Adaptive: Multi-agent with specialized models (1.5B/7B/32B) routed by difficulty. |
| `architecture-c1.ipynb` | Always-Large: Benchmark using the largest model (32B) for all tasks. |
| `ablation-no-s-model.ipynb` | Ablation Study (C2): Removes Tier S to test reliability impact. |
| `*-pr.ipynb` | Prompt Repetition: Variants of A, B, and C using the Prompt Repetition technique. |
This project presents a systematic empirical comparison of Multi-Agent Systems (MAS) for automated code generation using Large Language Models. We evaluate complex agentic pipelines that mimic real-world software engineering processes: Planning, Routing, Development, Testing, and Code Review.
Our research addresses the following Research Questions:
| RQ | Question |
|---|---|
| RQ1 | Does a multi-agent pipeline with role separation (Planner, Developer, Tester, Reviewer) improve functional correctness compared to a single-agent approach? |
| RQ2 | Do specialized models assigned to each role provide measurable benefits over using a single model for all roles? |
| RQ3 | Can adaptive routing (selecting developer model capacity based on estimated task difficulty) reduce computational cost while maintaining or improving quality? |
| RQ4 | Does prompt repetition improve code generation accuracy for the models used in this study? |
The study benchmarks performance using the HumanEval dataset (164 tasks) across various architectural configurations.
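For reference, HumanEval can be loaded from the HuggingFace Hub as sketched below; the project's own loader lives in `src/data/task_loader.py` and may differ in detail:

```python
from datasets import load_dataset

# HumanEval: 164 hand-written Python tasks, each with assertion-based unit tests.
tasks = load_dataset("openai_humaneval", split="test")
print(len(tasks))  # 164

example = tasks[0]
print(example["task_id"])      # e.g. "HumanEval/0"
print(example["prompt"])       # function signature + docstring given to the model
print(example["entry_point"])  # name of the function under test
print(example["test"])         # check() function with assertions, used by the Tester
```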
The system decomposes code generation into five specialized roles, each implemented as a node in a LangGraph state machine:
- Planner: Analyzes task specifications and estimates difficulty using Scrum-style story points
- Router: Directs tasks to appropriately-sized developer models based on difficulty
- Developer: Generates implementation code based on task description and feedback
- Tester: Executes code against assertion-based tests in a sandboxed Python subprocess (10s timeout)
- Reviewer: Analyzes test failures and provides actionable feedback for iteration
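The sketch below shows, in simplified form, how these five roles can be wired as a LangGraph state machine with a test-driven review loop. It is a minimal, runnable illustration with stubbed nodes: the project's actual state, nodes, and builder live in `src/graph/state.py`, `src/graph/nodes.py`, and `src/graph/graph.py`, and the field names and retry cap used here are assumptions.

```python
import subprocess
import sys
from typing import TypedDict

from langgraph.graph import END, StateGraph


# Illustrative state; the real GraphState TypedDict is defined in src/graph/state.py.
class GraphState(TypedDict, total=False):
    task: str          # HumanEval prompt (signature + docstring)
    tests: str         # assertion-based test code for the task
    story_points: int  # Planner's difficulty estimate
    tier: str          # developer tier chosen by the Router ("S" / "M" / "L")
    code: str          # latest Developer implementation
    tests_passed: bool # Tester verdict
    feedback: str      # error log / Reviewer feedback for the next attempt
    attempts: int      # number of Developer attempts so far


# Stub nodes standing in for the LLM-backed implementations in src/graph/nodes.py.
def planner_node(state: GraphState) -> dict:
    return {"story_points": 3}  # real node queries the Planner model

def router_node(state: GraphState) -> dict:
    return {"tier": "M"}  # real node maps story points to a developer tier

def developer_node(state: GraphState) -> dict:
    # Real node calls the selected Qwen developer model with task + feedback.
    return {"code": state["task"], "attempts": state.get("attempts", 0) + 1}

def tester_node(state: GraphState) -> dict:
    # Sandboxed execution: candidate code plus assertions in a subprocess, 10 s timeout.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", state["code"] + "\n" + state["tests"]],
            capture_output=True, text=True, timeout=10,
        )
        return {"tests_passed": proc.returncode == 0, "feedback": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"tests_passed": False, "feedback": "execution timed out after 10 s"}

def reviewer_node(state: GraphState) -> dict:
    return {"feedback": state["feedback"]}  # real node asks the Reviewer model for actionable advice

def after_tests(state: GraphState) -> str:
    # Loop back through the Reviewer while tests fail and retries remain (cap is illustrative).
    return "done" if state["tests_passed"] or state["attempts"] >= 3 else "review"


builder = StateGraph(GraphState)
for name, fn in [("planner", planner_node), ("router", router_node),
                 ("developer", developer_node), ("tester", tester_node),
                 ("reviewer", reviewer_node)]:
    builder.add_node(name, fn)

builder.set_entry_point("planner")
builder.add_edge("planner", "router")
builder.add_edge("router", "developer")
builder.add_edge("developer", "tester")
builder.add_conditional_edges("tester", after_tests, {"review": "reviewer", "done": END})
builder.add_edge("reviewer", "developer")
graph = builder.compile()

result = graph.invoke({"task": "def add(a, b):\n    return a + b",
                       "tests": "assert add(1, 2) == 3"})
print(result["tests_passed"])
```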
Tasks are dynamically routed to different model capacities based on estimated difficulty:
- Story Points 1-2 → Tier S (Qwen 1.5B): simple tasks
- Story Points 3-5 → Tier M (Qwen 7B): medium complexity
- Story Point 8 → Tier L (Qwen 32B): complex tasks
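For illustration, this mapping might look like the sketch below (constant names, the default for unexpected estimates, and the exact HuggingFace model IDs are assumptions; the real mapping lives in `src/graph/config.py`):

```python
# Hypothetical story-point → tier mapping, mirroring what src/graph/config.py describes.
TIER_BY_STORY_POINTS = {
    1: "S", 2: "S",   # simple tasks  → Qwen2.5-Coder 1.5B
    3: "M", 5: "M",   # medium tasks  → Qwen2.5-Coder 7B
    8: "L",           # complex tasks → Qwen2.5-Coder 32B
}

DEVELOPER_MODELS = {
    "S": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    "M": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "L": "Qwen/Qwen2.5-Coder-32B-Instruct",
}

def route(story_points: int) -> str:
    # Default to the largest tier for out-of-range estimates (an assumption, not documented).
    return TIER_BY_STORY_POINTS.get(story_points, "L")
```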
When tests fail, the pipeline doesn't terminate. Instead:
- The Reviewer analyzes error messages and identifies the root cause
- Actionable feedback is generated for the Developer
- The Developer receives: previous code + error logs + reviewer feedback
- A new implementation attempt is made with full context
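As a rough sketch, the retry context handed back to the Developer can be assembled like this (the helper name and wording are hypothetical; the project's actual templates are in `src/models/prompts.py`):

```python
def build_retry_prompt(task: str, previous_code: str, error_log: str, review: str) -> str:
    """Hypothetical helper: combine previous code, test errors, and Reviewer feedback."""
    return (
        f"Task specification:\n{task}\n\n"
        f"Previous implementation:\n{previous_code}\n\n"
        f"Test errors:\n{error_log}\n\n"
        f"Reviewer feedback:\n{review}\n\n"
        "Write a corrected implementation that addresses the feedback above."
    )
```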
If a model fails to produce passing code, the system escalates to the next tier:
- Tier S fails → Escalate to Tier M
- Tier M fails → Escalate to Tier L
- Maximum 2 escalations allowed per task
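A minimal sketch of this escalation rule (function and constant names are illustrative):

```python
# Illustrative escalation policy: move one tier up on failure, at most twice per task.
TIER_ORDER = ["S", "M", "L"]
MAX_ESCALATIONS = 2

def next_tier(current: str, escalations: int) -> str | None:
    """Return the tier to escalate to, or None when escalation is exhausted."""
    if escalations >= MAX_ESCALATIONS or current == "L":
        return None  # already at the largest model or out of escalation budget
    return TIER_ORDER[TIER_ORDER.index(current) + 1]
```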
| Category | Technologies |
|---|---|
| Core | Python 3.10+, LangGraph, LangChain |
| LLM Inference | HuggingFace Inference API (Serverless) |
| Models | Qwen2.5-Coder (1.5B, 7B, 32B), Llama-3-8B |
| Analysis | Radon (Cyclomatic Complexity) |
| Observability | LangSmith (Tracing & Debugging) |
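For example, the cyclomatic complexity of a generated solution can be measured with Radon roughly as follows (a minimal sketch, not the project's evaluation code):

```python
from radon.complexity import cc_visit

source = '''
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0
'''

# cc_visit parses the source and returns one block per function/method/class.
for block in cc_visit(source):
    print(block.name, block.complexity)  # prints: sign 3
```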
The system is built on a directed cyclic graph architecture using LangGraph. It decomposes code generation into specialized roles, moving from a monolithic "black box" approach to an interactive pipeline.
Comparison of the four main architectures: A (Single-Agent), B (Multi-Agent Single-Model), C (Adaptive Multi-Model), and C1 (Always-Large).
The Planner assigns Fibonacci-based story points (1, 2, 3, 5, 8) to estimate task difficulty. The Router maps these to Developer Tiers (S/M/L), matching model capacity to task complexity.
When tests fail, the system escalates to the next tier. The escalating Developer receives: previous code, error messages, and Reviewer feedback. Maximum 2 escalations per task.
Architecture A (Single-Agent baseline): one model (Qwen-7B), one attempt, no retry. The task flows directly to the Developer, the code is tested, and the pipeline terminates regardless of the result. This establishes the performance floor.
Architecture B (Multi-Agent, Single-Model): the full multi-agent pipeline using the same model (Qwen-7B) for all roles. It includes Planner, Router, Developer, Tester, and Reviewer with retry loops, isolating the effect of role separation from differences in model capacity.
Architecture C (Adaptive Multi-Model): a multi-agent pipeline with role-specialized models, using Llama-3-8B for the Planner and Reviewer and Qwen models (1.5B/7B/32B) for the Developers. Tasks are routed based on story-point estimates, with escalation on failure.
Architecture C1 (Always-Large): a multi-agent pipeline that bypasses adaptive routing entirely. All tasks are assigned to the Tier L Developer (Qwen-32B) regardless of estimated difficulty, serving as the upper-bound reference for accuracy.
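Purely as an illustration, the per-architecture developer assignments summarized above could be encoded as follows (a hypothetical sketch; the real model configurations live in `src/agents/llm.py`):

```python
# Hypothetical summary of the developer model(s) available to each architecture.
DEVELOPERS = {
    "A":  ["Qwen/Qwen2.5-Coder-7B-Instruct"],    # single agent, single attempt
    "B":  ["Qwen/Qwen2.5-Coder-7B-Instruct"],    # all roles share one model
    "C":  ["Qwen/Qwen2.5-Coder-1.5B-Instruct",   # Tier S
           "Qwen/Qwen2.5-Coder-7B-Instruct",     # Tier M
           "Qwen/Qwen2.5-Coder-32B-Instruct"],   # Tier L
    "C1": ["Qwen/Qwen2.5-Coder-32B-Instruct"],   # always the largest model
}
```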
Architecture C2 is an ablation study that removes Tier S (1.5B model) entirely from the adaptive routing configuration. Tasks estimated as "easy" (Story Points 1-2) are routed directly to Tier M instead.
Purpose: To validate whether the smallest model (Qwen-1.5B) is a bottleneck causing cascading failures and context pollution in the adaptive configuration.
Results: C2 achieved 72.0% pass rate (vs. 62.2% for C), with escalations dropping from 152 to 74, confirming that Tier S was the primary source of failures.
Technique proposed by Leviathan et al. that duplicates user prompts to improve token attention. Tokens in the second occurrence can attend to all tokens in the first. Tested across all architectures (A-PR, B-PR, C-PR).
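A minimal sketch of what prompt repetition looks like at the message level (the wrapper text and function name are illustrative, not the project's exact template):

```python
def repeat_prompt(user_prompt: str) -> str:
    """Prompt Repetition: include the task twice so that tokens in the second copy
    can attend to every token of the first copy under causal attention."""
    return f"{user_prompt}\n\nThe task is repeated below:\n\n{user_prompt}"

messages = [
    {"role": "system", "content": "You are an expert Python developer."},
    {"role": "user", "content": repeat_prompt("Implement has_close_elements(numbers, threshold).")},
]
```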
- Ivan Necerini
- Jacopo Rialti
- Emanuele Romano
- Marco Donatucci
- Ferdinando Del Vecchio