📃 Paper · 🏆 Leaderboard (Continuously Updated)
- 1. Abstract
- 2. Benchmark Data Analysis
- 3. Leaderboard
- 4. Setup
- 5. Data
- 6. Inference
- 7. Evaluation
- Citation
📘 1. Abstract [Back to Top]
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
📊 2. Benchmark Data Analysis [Back to Top]
- ACEBench covers 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, environment, and more.
- It includes a total of 4,538 APIs in both Chinese and English.
- The distribution of APIs across domains is visualized in the figure below:
- ACEBench consists of three main categories of test samples:
  - Normal: Basic tool-use scenarios.
  - Agent: Multi-turn interactions involving users and environments.
  - Special: Complex scenarios requiring multiple steps or handling infeasible tool calls.
- The data composition is visualized below, showcasing the comprehensive coverage of tool-use capabilities:
🏆 3. Leaderboard [Back to Top]
| Model | Normal | Special | Agent | Overall |
|---|---|---|---|---|
| **Closed-source models** | | | | |
| gpt-4o-2024-11-20 | 0.927 | 0.933 | 0.715 | 0.896 |
| gpt-4-turbo-2024-04-09 | 0.917 | 0.913 | 0.725 | 0.886 |
| qwen-max | 0.887 | 0.740 | 0.685 | 0.817 |
| o1-preview | 0.830 | 0.793 | 0.735 | 0.806 |
| deepseek-chat | 0.926 | 0.733 | 0.350 | 0.785 |
| gpt-4o-mini-2024-07-18 | 0.834 | 0.813 | 0.390 | 0.760 |
| claude-3-5-sonnet-20241022 | 0.835 | 0.820 | 0.350 | 0.756 |
| gemini-1.5-pro | 0.822 | 0.800 | 0.250 | 0.728 |
| o1-mini | 0.774 | 0.673 | 0.610 | 0.722 |
| doubao-pro-32k | 0.750 | 0.593 | 0.235 | 0.628 |
| **Open-source models** | | | | |
| Qwen2.5-Coder-32B-Instruct-local | 0.908 | 0.813 | 0.715 | 0.853 |
| Qwen2.5-32B-Instruct-local | 0.852 | 0.747 | 0.690 | 0.799 |
| Qwen2.5-72B-Instruct-local | 0.873 | 0.773 | 0.525 | 0.793 |
| Qwen2.5-Coder-14B-Instruct-local | 0.868 | 0.647 | 0.525 | 0.756 |
| Qwen2.5-14B-Instruct-local | 0.790 | 0.540 | 0.250 | 0.640 |
| Llama-3.1-70B-Instruct-local | 0.753 | 0.473 | 0.435 | 0.629 |
| Qwen2.5-7B-Instruct-local | 0.759 | 0.447 | 0.125 | 0.578 |
| DeepSeek-Coder-V2-Lite-Instruct-local | 0.688 | 0.413 | 0.015 | 0.511 |
| Qwen2.5-Coder-7B-Instruct-local | 0.735 | 0.193 | 0.125 | 0.496 |
| watt-tool-8B-local | 0.763 | 0.100 | 0.040 | 0.474 |
| ToolACE-8B-local | 0.782 | 0.013 | 0.040 | 0.462 |
| Hammer2.1-7b-local | 0.627 | 0.260 | 0.185 | 0.461 |
| Meta-Llama-3.1-8B-Instruct-local | 0.450 | 0.267 | 0.040 | 0.338 |
| Qwen2.5-Coder-3B-Instruct-local | 0.495 | 0.100 | 0.065 | 0.323 |
| Phi-3-mini-128k-instruct-local | 0.389 | 0.253 | 0.015 | 0.295 |
| Qwen2.5-3B-Instruct-local | 0.408 | 0.127 | 0.065 | 0.280 |
| Llama-3.2-3B-Instruct-local | 0.327 | 0.100 | 0.000 | 0.216 |
| xLAM-7b-r-local | 0.187 | 0.013 | 0.075 | 0.123 |
| Hammer2.1-3b-local | 0.118 | 0.013 | 0.015 | 0.074 |
🛠️ 4. Setup [Back to Top]
Execute the following command to install the required dependencies for inference and evaluation:
```bash
pip install -r requirements.txt
```
🗂️ 5. Data [Back to Top]
All data is stored in the `data_all` directory, divided into English and Chinese parts located in the `data_en` and `data_zh` folders respectively. Each folder contains multiple JSON files named `data_{category}.json`, where `category` is the type of data.

```
data_all/
├── possible_answer_en/
│   ├── data_normal.json
│   ├── data_special.json
│   ├── data_agent.json
├── possible_answer_zh/
│   ├── data_normal.json
│   ├── data_special.json
│   ├── data_agent.json
...
```
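To sanity-check a file locally, a minimal loading sketch follows; the exact path and whether the files are JSON arrays or JSON Lines are assumptions based on the layout above:

```python
# Minimal sketch for inspecting one benchmark file.
# The path below is assumed from the layout described above; adjust to your checkout.
import json
from pathlib import Path

path = Path("data_all/data_en/data_normal.json")
text = path.read_text(encoding="utf-8")

try:
    samples = json.loads(text)  # case 1: the file is a single JSON array
except json.JSONDecodeError:
    # case 2: JSON Lines, one JSON object per line
    samples = [json.loads(line) for line in text.splitlines() if line.strip()]

print(f"loaded {len(samples)} samples from {path}")
```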
🧠 6. Inference [Back to Top]
To run inference with cmodels, use the generate.py
script. This script supports various models, categories, and languages.
python generate.py --model <model_name> --model_path <model_path>
--category <category> --language <language>
Arguments:

- `--model`: Specifies the model to use for inference.
- `--model_path`: Specifies the local path to the model (only for open-source models).
- `--category`: Defines the category of tasks or datasets to evaluate. Available categories can be found in `eval_checker/eval_checker_constant.py`.
- `--language`: Specifies the language of the input/output. Supported languages: "en" (English), "zh" (Chinese).
For a closed-source model:

```bash
python generate.py --model qwen-max --category test_all --language zh
```

For a local model:

```bash
python generate.py --model Qwen2.5-3B-Instruct-local --model_path /mnt/nas/ckpt/Qwen2.5-3B-Instruct --category test_all --language zh
```
- Before running the program, ensure that the `.env` file with the required environment variables is correctly configured (a sample sketch appears after this list). Invoking OpenAI requires external network access, so configure the `https_proxy` and `http_proxy` environment variables. Using the Gemini model requires a proxy located in Japan.
- The model to be evaluated must be mapped in `model_inference/inference_map.py`. Models invoked through the OpenAI API can be added to the APIModelInference list, and custom inference models can be added to the CommonInference list. The names of local models end with `-local`.
- To add a custom evaluation model, add its model class to `model_dict`, referring to `model_inference/model_infer.py` (see the hypothetical registration sketch after this list).
- To evaluate open-source models from Hugging Face, we recommend using LLaMA-Factory to merge LoRA weights before running inference.
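As a rough illustration, a `.env` might look like the sketch below. The proxy variables come from the note above; the API-key variable name is an assumption and may differ in this repo:

```bash
# Sample .env (the API-key variable name is an assumption; adapt to your setup)
OPENAI_API_KEY=your-openai-key
https_proxy=http://127.0.0.1:7890
http_proxy=http://127.0.0.1:7890
```

And a hypothetical sketch of registering a custom model: the `model_dict` name comes from the note above, but the class interface shown here is an assumption, not the repo's actual API:

```python
# Hypothetical registration sketch; the expected class interface is assumed.
class MyCustomModel:
    """Minimal inference wrapper used only for illustration."""

    def __init__(self, model_path: str):
        self.model_path = model_path  # e.g. a local Hugging Face checkpoint

    def generate(self, prompt: str) -> str:
        # Replace this stub with real model loading and decoding logic.
        return f"[stub response from {self.model_path}]"

# Map the CLI model name to the class; local model names end with -local.
model_dict = {
    "my-custom-model-local": MyCustomModel,
}
```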
📈 7. Evaluation [Back to Top]
To evaluate the performance of the models, use the `eval_main.py` script. This script supports various evaluation metrics and works with both open-source and closed-source models.

```bash
python eval_main.py --model <model_name> --category <category> --language <language>
```
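For example, mirroring the inference commands above (the model and category names are illustrative):

```bash
python eval_main.py --model qwen-max --category test_all --language zh
```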
📝 Citation [Back to Top]

If you find our paper and resources useful, please consider citing our paper:

```bibtex
@article{chen2025acebench,
  title={ACEBench: Who Wins the Match Point in Tool Learning?},
  author={Chen, Chen and Hao, Xinlong and Liu, Weiwen and Huang, Xu and Zeng, Xingshan and Yu, Shuai and Li, Dexun and Wang, Shuai and Gan, Weinan and Huang, Yuefeng and others},
  journal={arXiv preprint arXiv:2501.12851},
  year={2025}
}
```