We introduce QualityFlow, a dynamic agentic workflow for program synthesis. Given the English description of a programming problem and a set of unit tests, the model's goal is to synthesize a correct program that solves the problem and passes the tests. QualityFlow comprises large language model (LLM) agents that resemble a software development team, covering code generation, testing, and self-debugging. We propose the LLM Quality Checker, which explicitly "imagines" whether the synthesized programs' execution would conform to the unit tests. The Quality Checks dynamically control the workflow, including actions to submit the final answer, clarify the problem statement, and revert previous workflow steps. Our experiments show that the Quality Checker can precisely accept any correct program, mitigate faulty synthesized tests, and prevent potential workflow deviation. QualityFlow establishes state-of-the-art results on four program synthesis benchmarks: MBPP, HumanEval, and the stricter evaluations MBPP-EvalPlus and HumanEval-EvalPlus.
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks https://arxiv.org/pdf/2501.17167
Install dependencies
conda create -n agentic python=3.12
conda activate agentic
pip3 install openai boto3 tqdm pandas datasets sqlalchemy anthropic psutil transformers deepdiff seaborn pymysql jsonlines
pip3 install simple_parsing scikit-learn
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install mxeval: https://github.com/amazon-science/mxeval
Install codegeex: https://github.com/THUDM/CodeGeeX
Set your Anthropic API key as an environment variable
export ANTHROPIC_API_KEY=[YOUR KEY HERE]
Add CodeGeeX to your Python path
export PYTHONPATH=.:lib/CodeGeeX/
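Before launching a long run, it can help to confirm that both variables above are visible to Python. The following is a minimal sketch, not part of the repository: the function name `preflight` is hypothetical, and it only checks that `ANTHROPIC_API_KEY` is non-empty and that `lib/CodeGeeX` appears in `PYTHONPATH`.

```python
import os
from pathlib import Path


def preflight(env=None, codegeex_dir="lib/CodeGeeX"):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration; it is not shipped with QualityFlow.
    """
    env = os.environ if env is None else env
    problems = []

    # The API key must be present and non-empty.
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        problems.append("ANTHROPIC_API_KEY is not set")

    # PYTHONPATH entries are separated by os.pathsep (":" on Linux/macOS).
    entries = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    wanted = Path(codegeex_dir)
    # Path comparison ignores a trailing slash, so "lib/CodeGeeX/" matches.
    if not any(Path(p) == wanted for p in entries):
        problems.append(f"{codegeex_dir} is not on PYTHONPATH")

    return problems
```

Calling `preflight()` with no arguments inspects the current environment; printing the returned list shows what still needs to be exported.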
For example, running QualityFlow on MBPP:
### without cacher (after installation above, this command should run directly)
python run_sota_mbpp_without_cacher.py
### with MariaDB cacher
python run_sota_mbpp.py
The cacher is a MariaDB server. Set up your own server and configure its address, user, and password in the MariaDBCacherFactory class. The cacher stores all LLM API calls to save time and money, so past experiments can be rerun or continued. You can disable the cache on the command line, which lets you run this project without any database setup.
For example, to run custom experiments from the command line:
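To illustrate the idea of caching LLM API calls, here is a minimal sketch that keys each response by a hash of the full request. It is not the repository's MariaDBCacherFactory: the class name `LLMCallCache`, the table layout, and the `call_llm` callable are assumptions, and `sqlite3` stands in for MariaDB so the example runs without a server.

```python
import hashlib
import json
import sqlite3


class LLMCallCache:
    """Illustrative cache: responses are keyed by a hash of the request."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache "
            "(request_hash TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model, prompt, **params):
        # sort_keys makes the hash independent of keyword order.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call_llm, model, prompt, **params):
        key = self._key(model, prompt, **params)
        row = self.conn.execute(
            "SELECT response FROM llm_cache WHERE request_hash = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call is made
        response = call_llm(model, prompt, **params)  # cache miss
        self.conn.execute(
            "INSERT INTO llm_cache (request_hash, response) VALUES (?, ?)",
            (key, json.dumps(response)),
        )
        self.conn.commit()
        return response
```

Because the key covers the model, the prompt, and every parameter, rerunning an interrupted experiment replays stored answers for the calls it already made and only pays for the new ones.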
### disable cacher
python run_args.py --global_model opus --dataset humaneval --use_cache False
### use MariaDB cacher
python run_args.py --global_model opus --dataset humaneval
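If you prefer to launch these runs from a script, the commands above can be assembled programmatically. This is a sketch under assumptions: the helper names `build_command` and `run_experiment` are hypothetical, and only the flags and values shown above (`--global_model`, `--dataset`, `--use_cache`) are used.

```python
import subprocess
import sys


def build_command(global_model, dataset, use_cache=True):
    """Build the argv list for run_args.py using the flags shown above."""
    cmd = [
        sys.executable, "run_args.py",
        "--global_model", global_model,
        "--dataset", dataset,
    ]
    if not use_cache:
        # The commands above use the cacher unless --use_cache False is given.
        cmd += ["--use_cache", "False"]
    return cmd


def run_experiment(global_model, dataset, use_cache=True):
    """Run one experiment from the repository root; raises on non-zero exit."""
    subprocess.run(build_command(global_model, dataset, use_cache), check=True)
```

For instance, `run_experiment("opus", "humaneval", use_cache=False)` corresponds to the first command above.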
The experiments to reproduce the paper are in
run_paper_experiments.py
A requirements.txt is provided for Ubuntu 20.04 with an NVIDIA A10G GPU, but we advise installing the libraries manually by following the instructions above rather than relying on requirements.txt.
sudo apt update
sudo apt install mariadb-server
sudo systemctl start mariadb
sudo systemctl enable mariadb
sudo mysql_secure_installation
sudo mysql -u root -p
Create DB
sudo mysql -u root -p
CREATE DATABASE your_db_name;
CREATE USER 'your_username'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON your_db_name.* TO 'your_username'@'localhost';
FLUSH PRIVILEGES;
EXIT;
Testing
mysql -u your_username -p your_db_name
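The same connectivity test can be done from Python with pymysql, which is installed by the pip3 command above. This is a minimal sketch: the `QF_DB_*` environment variable names and both function names are hypothetical, and the defaults are the placeholder values used in the SQL statements above.

```python
import os


def db_settings(env=None):
    """Collect connection settings from hypothetical QF_DB_* variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("QF_DB_HOST", "localhost"),
        "user": env.get("QF_DB_USER", "your_username"),
        "password": env.get("QF_DB_PASSWORD", ""),
        "database": env.get("QF_DB_NAME", "your_db_name"),
    }


def check_connection(settings):
    """Return True if a trivial query succeeds against the server."""
    import pymysql  # imported lazily so db_settings works without it

    conn = pymysql.connect(**settings)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
    finally:
        conn.close()
```

With the server running, `check_connection(db_settings())` should return True once the user and database created above exist; a wrong password or missing grant raises a pymysql error instead.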
qualityflow contains the source code
workflow.py contains the QualityFlow workflow
step1_new_programmer.py is the QualityFlow programmer
step2_new_test_designer.py is the QualityFlow test designer
step3_self_debug.py is the self-debugger
com1_code_quality_checker.py is the code quality checker
com2_reinterpretation.py is the clarifier and re-interpreter
com3_test_quality_checker.py is the optional test quality checker
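To give a feel for how the components listed above might fit together, here is a simplified control-flow sketch. It is not the code in workflow.py: the `Agents` container, the function signatures, the ordering of steps, and the choice to revert to the first attempt are all assumptions made for illustration, and the optional test quality checker is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agents:
    """Hypothetical callables standing in for the LLM agents listed above."""
    programmer: Callable[[str], str]                  # problem -> program
    test_designer: Callable[[str], List[str]]         # problem -> tests
    self_debug: Callable[[str, str, List[str]], str]  # problem, program, tests -> program
    code_quality_check: Callable[[str, str], bool]    # problem, program -> accept?
    reinterpret: Callable[[str], str]                 # problem -> clarified problem


def run_workflow(problem: str, agents: Agents, max_debug_rounds: int = 1) -> str:
    # Step 1: write a first program and submit it if the check accepts it.
    first_attempt = agents.programmer(problem)
    if agents.code_quality_check(problem, first_attempt):
        return first_attempt

    # Steps 2 and 3: synthesize tests, then debug against them.
    tests = agents.test_designer(problem)
    program = first_attempt
    for _ in range(max_debug_rounds):
        program = agents.self_debug(problem, program, tests)
        if agents.code_quality_check(problem, program):
            return program

    # Clarify the problem statement and try once more.
    clarified = agents.reinterpret(problem)
    program = agents.programmer(clarified)
    if agents.code_quality_check(problem, program):
        return program

    # Nothing was accepted: revert to the first attempt (an assumption here).
    return first_attempt
```

The point of the sketch is that the quality check, rather than a fixed pipeline, decides at each stage whether to submit, continue, clarify, or revert.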
Initial preparation for public release
ACL submission code base
AAAI submission code base
@article{hu2025qualityflow,
title={QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks},
author={Hu, Yaojie and Zhou, Qiang and Chen, Qihong and Li, Xiaopeng and Liu, Linbo and Zhang, Dejiao and Kachroo, Amit and Oz, Talha and Tripp, Omer},
journal={arXiv preprint arXiv:2501.17167},
year={2025}
}