This project implements a simple code-fixing agent using:
- LangGraph: Agent framework.
- Qwen3-0.6B: LLM for reasoning and code generation.
- HumanEvalPack: Dataset with buggy code (Python).
- ReAct: Agent pattern with the following steps:
  - Reasoning: analyze and reason about the issue.
  - Acting: propose a fix.
  - Observation: test/check the fix.
  - (Optional) Iteration: repeat until the tests pass or a retry limit is reached.
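The loop above can be sketched framework-free. The actual agent wires these steps into a LangGraph graph; `propose_fix` and `run_tests` here are hypothetical callbacks standing in for the LLM call and the test harness:

```python
def react_fix(buggy_code, run_tests, propose_fix, max_iters=3):
    """Iteratively reason about, patch, and test a buggy snippet.

    propose_fix: callable that asks the LLM for a patched version (Reasoning + Acting).
    run_tests: callable that checks a candidate against the task's tests (Observation).
    Both are illustrative stand-ins, not the actual fixagent.py API.
    """
    candidate = buggy_code
    for _ in range(max_iters):
        # Reasoning + Acting: ask the model for an analysis and a patched version
        candidate = propose_fix(candidate)
        # Observation: run the tests against the proposed fix
        if run_tests(candidate):
            return candidate, True
        # (Optional) Iteration: loop again with the failing candidate
    return candidate, False
```
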
```bash
conda create -n fixagent python=3.10 -y
conda activate fixagent
pip install -r requirements.txt
```
Start the agent; the model and dataset are downloaded automatically on first run:

```bash
python fixagent.py
```
Results are saved as CSV files in `results/`.
```
llm-code-fix-agent/
├── reference/         # PDFs
├── results/           # test results on the full dataset (164 cases)
├── evaluation.ipynb   # visualize results (pass@1, latency)
├── fixagent.py        # the code-fix agent
├── README.md
└── requirements.txt
```
| Pass@1 (%) | Avg. Latency (s) |
|---|---|
| 33.54 | 24.25 |
See more details in `evaluation.ipynb`.
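For reference, pass@1 and average latency can be recomputed from a results CSV with the standard library. The column names `passed` and `latency` below are assumptions; the actual files in `results/` may use different headers:

```python
import csv

def summarize(results_csv):
    """Compute pass@1 (%) and average latency (s) from a results CSV.

    Assumes one row per task with a boolean-like 'passed' column and a
    'latency' column in seconds; adjust the names to match the real files.
    """
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    n = len(rows)
    passed = sum(r["passed"].strip().lower() in ("true", "1") for r in rows)
    avg_latency = sum(float(r["latency"]) for r in rows) / n
    return 100.0 * passed / n, avg_latency
```
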
(Due to time constraints, the following are ideas for improving accuracy and evaluation rather than implemented features.)
### Evaluation
- Use more metrics, such as pass@k and token usage.
- Extend to more programming languages and other benchmark datasets.
- Conduct multiple experimental runs to check the consistency of the results, since LLM outputs are not deterministic.
- Test the model on multiple code generation tasks, such as code explanation and synthesis.
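For pass@k with multiple samples per task, the standard unbiased estimator from the HumanEval paper (Chen et al., 2021) is straightforward to add:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n = samples generated per task,
    c = number of correct samples, k <= n. Returns the probability
    that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all tasks gives the dataset-level pass@k; with n = k = 1 it reduces to the pass@1 reported above.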
### LLM
- Apply fine-tuning or instruction-tuning on domain-specific datasets (e.g., buggy–fixed code pairs) to improve accuracy and robustness.
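One way to prepare such a dataset is to format each buggy-fixed pair as an instruction-tuning record. The prompt template and field names below are illustrative choices, not a fixed standard:

```python
def to_instruction_record(buggy, fixed, language="python"):
    """Format one buggy-fixed pair as an instruction-tuning example.

    The instruction/input/output layout mirrors common instruction-tuning
    datasets; adapt the template to the tuning framework actually used.
    """
    return {
        "instruction": f"Fix the bug in the following {language} code.",
        "input": buggy,
        "output": fixed,
    }
```
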
### Agent
- Test more prompting strategies: zero-shot, one-shot, few-shot.
- Integrate additional tools or components, such as multi-agent setups, retrieval-augmented generation (RAG), or external APIs.
- Structured output: this script uses a regex to extract the code solution, but LangGraph offers built-in structured-output support.
- Test agent patterns: ReAct, Plan-and-Solve, Reflection, etc.
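For reference, the regex-based extraction mentioned above can be sketched as follows (the exact pattern in fixagent.py may differ):

```python
import re

def extract_code(llm_output):
    """Pull the first fenced code block out of an LLM response.

    Falls back to the raw text when no fence is found; structured output
    would replace this with schema-validated parsing.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", llm_output, re.DOTALL)
    return match.group(1).strip() if match else llm_output.strip()
```
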