🤔 Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression 📊
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application.
We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans:
- 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval)
- 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT)
- 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill)
Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%--3% drop) but degrades real-world application accuracy by 10%--15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize this analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.
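As a rough illustration of these metrics, the snippet below is a minimal sketch assuming ERank is the effective rank of a representation or logit matrix (the exponential of the entropy of its normalized singular values) and that the ranking correlation is Spearman's rho over the full-precision model's top-k tokens; ACBench's exact formulations may differ.

```python
# Minimal sketch of ERank and top-k ranking correlation (assumed formulations,
# not necessarily ACBench's exact definitions).
import numpy as np
from scipy.stats import spearmanr

def effective_rank(hidden: np.ndarray) -> float:
    """ERank: exponential of the entropy of the normalized singular values."""
    s = np.linalg.svd(hidden, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def topk_ranking_correlation(logits_full: np.ndarray, logits_comp: np.ndarray, k: int = 50) -> float:
    """Spearman correlation between the two models' scores on the full model's top-k tokens."""
    topk = np.argsort(logits_full)[::-1][:k]
    rho, _ = spearmanr(logits_full[topk], logits_comp[topk])
    return float(rho)
```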
Installation:

```bash
git clone https://github.com/pprp/ACBench
cd ACBench
pip install -r requirements.txt
pip install -e .
```

ACBench builds upon and extends several excellent agentic benchmarks and compression toolkits. We integrate these benchmarks into our evaluation pipeline while preserving their original settings. For efficient model serving and evaluation, we use vLLM to deploy the compressed language models.
For detailed implementation and usage instructions, please refer to the corresponding subfolders in the thirdpartys directory. Each subfolder contains the original benchmark code along with our modifications to support compressed-model evaluation. The WorfBench evaluation is integrated directly into acbench.
Taking WorfBench as an example, the evaluation can be run with the following script:
```bash
#!/bin/bash
# Usage: <script> MODEL TEMP QUANT [DEVICE]
MODEL=$1        # path or identifier of the (compressed) model
TEMP=$2         # sampling temperature
QUANT=$3        # quantization method passed to node_eval.py
DEVICE=${4:-6}  # GPU index (defaults to 6)
export CUDA_VISIBLE_DEVICES=$DEVICE

tasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)
MODEL_NAME=$(basename "$MODEL")

# Generate and evaluate workflows for every task with few-shot prompting.
for task in "${tasks[@]}"; do
    python acbench/node_eval.py \
        --task gen_workflow \
        --model_name "${MODEL}" \
        --gold_path ./data/gold_traj/${task}/graph_eval.json \
        --pred_path ./data/pred_traj/${MODEL_NAME}/${task}/${MODEL_NAME}/graph_eval_two_shot.json \
        --task_type "${task}" \
        --few_shot \
        --temperature "${TEMP}" \
        --quantization "${QUANT}"
done
```
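An illustrative invocation might be `bash run_worfbench.sh /path/to/compressed-model 0.0 awq 0`; the script name, model path, and quantization value here are assumptions for illustration, not fixed by the repository. The script then evaluates all seven task types with few-shot prompting enabled.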
For Agentic Tasks:
- WorfBench: Benchmarking Agentic Workflow Generation
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
- KVCache-Factory: Unified KV Cache Compression Methods for Auto-Regressive Models
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
- T-Eval: Evaluating Tool Utilization Capability of LLMs Step by Step
For Compression:
- Wanda: A Simple and Effective Pruning Approach for Large Language Models
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
- QLLM-Eval: Evaluating Quantized Large Language Models
For fast serving, we employ vLLM for evaluation.
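The snippet below is a minimal sketch of loading a quantized checkpoint with vLLM's offline API; the model path and quantization argument are illustrative assumptions, and the repository's own serving setup may differ.

```python
# Minimal sketch: offline generation with vLLM for a quantized model.
# The model path and quantization setting below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/awq-quantized-model", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Plan a workflow for booking a flight."], params)
print(outputs[0].outputs[0].text)
```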
Energy-based analysis:
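The snippet below is a minimal sketch of one common energy formulation over next-token logits (the negative temperature-scaled logsumexp); the exact statistic used in ACBench's analysis may differ.

```python
# Minimal sketch: energy score over next-token logits (assumed logsumexp formulation).
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """E(x) = -T * logsumexp(logits / T); lower energy ~ more confident prediction."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Placeholder comparison of a full-precision and a compressed model (random data).
logits_full = torch.randn(128, 32000)   # (num_prompts, vocab_size)
logits_comp = torch.randn(128, 32000)
print(energy_score(logits_full).mean(), energy_score(logits_comp).mean())
```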
Logits Visualization:
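A minimal sketch of one way to produce such a comparison, assuming we plot the sorted next-token logit spectra of a full-precision and a compressed model; the arrays below are random placeholders standing in for real model outputs.

```python
# Minimal sketch: compare the sorted next-token logit spectra of two models.
# The arrays below are random placeholders; in practice they would be the
# logits produced by the full-precision and the compressed model for one prompt.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
logits_full = rng.normal(size=32000)   # placeholder for full-precision logits (vocab_size,)
logits_comp = rng.normal(size=32000)   # placeholder for compressed-model logits

plt.plot(np.sort(logits_full)[::-1][:200], label="full precision")
plt.plot(np.sort(logits_comp)[::-1][:200], label="compressed")
plt.xlabel("token rank")
plt.ylabel("logit value")
plt.legend()
plt.savefig("logits_comparison.png")
```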
Needle Visualization:
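A minimal sketch of the needle-in-a-haystack idea, assuming we insert a known "needle" sentence at a chosen depth in long filler text and check whether the model's answer recovers it; the prompt construction here is illustrative, not ACBench's exact protocol.

```python
# Minimal sketch of a needle-in-a-haystack probe (illustrative, not ACBench's exact protocol).
def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Repeat filler text up to target_chars and insert the needle at a relative depth in [0, 1]."""
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    pos = int(depth * len(haystack))
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

needle = "The secret passphrase is BLUE-HARBOR-42."
context = build_haystack("The sky was clear and the market was busy. ", needle, depth=0.5, target_chars=20000)
prompt = context + "\n\nQuestion: What is the secret passphrase?"
# Feed `prompt` to the (compressed) model and check whether "BLUE-HARBOR-42" appears in its answer.
```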
If you use our work, please cite:
```bibtex
@inproceedings{dong2025compressed,
  title     = {Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression},
  author    = {Peijie Dong and Zhenheng Tang and Xiang Liu and Lujun Li and Xiaowen Chu and Bo Li},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```