简体中文 | English
📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat (微信) | 🫨 Discord
- [25/11/05] Released v0.2: added quantization support for more models including GLM-4.6, Qwen3-VL, and Qwen3-Omni; open-sourced the Eagle3 speculative-decoding training framework; updated the Diffusion model quantization tools.
- [25/09/30] Open-sourced SpecExit, a new early-exit algorithm for reasoning models. [Paper] | [Docs] | [vLLM code] 🔥🔥🔥
- [25/09/30] Released Tequila, a new ternary quantization algorithm. [Paper] | [Code] 🔥🔥🔥
- [25/09/24] Added NVFP4 PTQ quantization for the Qwen3 series, and open-sourced Qwen3-32B-NVFP4 and Qwen3-235B-A22B-NVFP4 weights.
Previous updates
- [25/09/01] Added FP8 quantization for the open-source translation model Hunyuan-MT-7B; added Eagle3 Torch inference and a benchmark evaluation pipeline.
- [25/08/06] Added FP8 and INT4 quantization for Hunyuan 0.5B/1.8B/4B/7B and Qwen2.5VL 3B/7B/32B/72B, and W4A8-FP8 quantization for the DeepSeek-R1/V3 and Kimi-K2 models. Also open-sourced Eagle3 weights for the Hunyuan 1.8B/4B/7B series.
- [25/07/04] Added quantization for Hunyuan, Qwen2.5, Qwen3, DeepSeek-R1-Distill-Qwen, and other models, covering INT8, FP8, and INT4 algorithms. Also open-sourced Eagle3 weights for the Qwen3 series.
- Highly integrated: mainstream compression algorithms are built into a single toolkit that developers can invoke with one call, making it easy to use.
- Continuous algorithm innovation: beyond the algorithms most widely used in industry, we keep developing better compression algorithms in-house and will open-source them over time.
- Pursuit of extreme performance: both the compression pipeline and the deployment of compressed models are optimized end to end; for example, Qwen3-235B and DeepSeek-R1 can be quantized on a single GPU.
| Scenario | Models | Quantization | Speculative Decoding | Other |
|---|---|---|---|---|
| Text-to-Text (LLM) | | | | |
| Image/Video-to-Text (VLM) | | | | |
| Text-to-Image/Video/3D (Diffusion) | | | - | |
| Speech (TTS/ASR) | | | | |
We recommend installing the latest stable release of AngelSlim directly with pip:

```shell
pip install angelslim
```

Alternatively, clone the repository and install from source:

```shell
cd AngelSlim && python setup.py install
```

See the installation docs for more detailed instructions.
Once AngelSlim is installed, you can get started quickly with the following script, which performs static FP8 quantization of the Qwen3-1.7B model:

1. One-click launch

```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example loads the Hugging Face model, runs PTQ calibration, and produces the quantized model weights.
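For intuition about what the static FP8 step does, here is a toy, pure-Python emulation of per-tensor E4M3 fake-quantization (a sketch only: `fake_quant_fp8_static`, the sample `weights`, and the coarse mantissa rounding are illustrative assumptions, not AngelSlim's actual kernels):

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fake_quant_fp8_static(xs):
    """Quantize-dequantize a tensor with one static per-tensor FP8 scale."""
    amax = max(abs(v) for v in xs)   # in static PTQ this comes from calibration data
    scale = amax / FP8_E4M3_MAX      # a single scale for the whole tensor
    out = []
    for v in xs:
        y = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
        m, e = math.frexp(y)         # y = m * 2**e with 0.5 <= |m| < 1
        m = round(m * 16) / 16       # keep 4 significant bits (1 implicit + 3 mantissa)
        out.append(math.ldexp(m, e) * scale)
    return out, scale

weights = [0.013, -0.27, 0.9, -1.44, 2.0]
deq, scale = fake_quant_fp8_static(weights)
# every value lands within ~6% of the original; the amax round-trips exactly
```

The point of the static variant is that `scale` is frozen after calibration and reused for all later inputs.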
2. Launch from source code

For example, dynamic FP8 quantization of Qwen3-1.7B:

```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```

See the quantization quick-start docs for details.
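The `fp8_dynamic` method used here differs from the `fp8_static` one-click example mainly in when scales are determined: dynamic quantization recomputes activation scales from each incoming batch at runtime, while static quantization fixes them offline from calibration data. A toy sketch of the distinction (hypothetical helpers, not AngelSlim internals):

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def dynamic_scale(activation):
    # dynamic: the scale is recomputed from every batch seen at runtime
    return max(abs(v) for v in activation) / FP8_E4M3_MAX

def static_scale(calibration_batches):
    # static: one scale is fixed offline from calibration data and reused
    amax = max(max(abs(v) for v in batch) for batch in calibration_batches)
    return amax / FP8_E4M3_MAX

calib = [[0.5, -1.0], [2.0, 0.25]]
s_static = static_scale(calib)          # 2.0 / 448, frozen after calibration
s_dynamic = dynamic_scale([0.5, -1.0])  # 1.0 / 448, changes with each batch
```

Dynamic scales track each batch more tightly but add runtime overhead; static scales are free at inference time but depend on the calibration set being representative.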
Once AngelSlim is installed, you can start Eagle3 training with the following scripts:

```shell
# Start the vLLM server
bash scripts/speculative/run_vllm_server.sh
# Generate training data
bash scripts/speculative/generate_data_for_target_model.sh
# Train the Eagle3 model online
bash scripts/speculative/train_eagle3_online.sh
```

For detailed training configuration and PyTorch benchmarking of Eagle3, see the speculative-decoding quick-start docs.
Use the scripts/diffusion/run_diffusion.py script for quantization and inference:

```shell
# Quantize online and run inference
python scripts/diffusion/run_diffusion.py \
    --model-name-or-path black-forest-labs/FLUX.1-schnell \
    --quant-type fp8-per-tensor \
    --prompt "A cat holding a sign that says hello world" \
    --height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
```

For more quantized-inference options, see the Diffusion model quantization docs.
Load a quantized model with transformers for offline inference:

```shell
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```

where MODEL_PATH is the path to the quantized model output.
OpenAI-compatible API services can be deployed with the following inference frameworks:

- vLLM

  Use the vLLM service launch script; vllm>=0.8.5.post1 is recommended, and deploying MoE INT8 quantized models requires vllm>=0.9.2.

  ```shell
  bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
  ```

  where -d sets the visible devices, -t the tensor parallelism, -p the pipeline parallelism, and -g the GPU memory utilization.

- SGLang

  Use the SGLang service launch script; sglang>=0.4.6.post1 is recommended:

  ```shell
  bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
  ```

Send requests through the OpenAI-format API:

```shell
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```

where -p is the input prompt.
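For reference, the request script's flags map onto a standard chat-completions body roughly as follows (a sketch assuming an OpenAI-compatible `/v1/chat/completions` endpoint; `build_chat_request` is a hypothetical helper, and `top_k`/`repetition_penalty` are vLLM/SGLang sampling extensions rather than core OpenAI fields):

```python
import json

def build_chat_request(prompt, system_prompt, max_tokens=4096,
                       temperature=0.7, top_p=0.8, top_k=20,
                       repetition_penalty=1.05):
    """Assemble a chat-completions request body from the script's flags."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,                            # vLLM/SGLang extension
        "repetition_penalty": repetition_penalty,  # vLLM/SGLang extension
    }

body = build_chat_request("Hello, my name is", "You are a helpful assistant.")
payload = json.dumps(body)  # POST to the server started on port 8080 above
```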
Use lm-evaluation-harness to evaluate the accuracy of quantized models; lm-eval>=0.4.8 is recommended:

```shell
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```

where RESULT_PATH is the directory for saving the results, -b the batch size, --tasks the evaluation tasks, and -n the number of few-shot examples.

See the deployment docs for detailed instructions.
Only a subset of models is shown below; see the Benchmark docs for the complete results.

Evaluation results of Hunyuan-Instruct with BF16, FP8, INT4-GPTQ, and INT4-AWQ on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:
| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|---|---|---|---|---|---|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
Evaluation results of the Qwen3 series with BF16, FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, GSM8K, and HUMANEVAL:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
Evaluation results of DeepSeek-R1-0528 with FP8-Block-Wise and W4A8-FP8 on GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
Notes

- The results above were obtained by deploying with the TRT-LLM framework and averaging over 5 runs.
- The evaluation hyperparameters were:

```json
{
  "top_k": 20,
  "top_p": 0.6,
  "temperature": 0.7,
  "output_seq_len": 32768,
  "max_input_seq_len": 16384
}
```
Qwen3-VL Benchmark
Evaluation results of the Qwen3-VL series with BF16, FP8-Static, and FP8-Dynamic on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen3-VL-32B-Instruct | BF16 | 60.11 | 96.08 | 94.64 |
| | FP8-Static | 61.22 | 96.00 | 94.64 |
| | FP8-Dynamic | 60.78 | 96.19 | 94.72 |
| Qwen3-VL-30B-A3B-Instruct | BF16 | 50.44 | 95.28 | 95.36 |
| | FP8-Dynamic | 50.67 | 95.25 | 95.20 |
Qwen2.5VL Benchmark
Evaluation results of the Qwen2.5VL series with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
Qwen3-Omni Text to Text Benchmark
Evaluation results of the Qwen3-Omni series with BF16, FP8-Static, and FP8-Dynamic on aime25, gpqa_diamond, and mmlu_redux:

| Model | Quantization | aime25 | gpqa_diamond | mmlu_redux |
|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 73.32 | 56.77 | 88.09 |
| | FP8-Static | 71.33 | 56.57 | 87.91 |
| | FP8-Dynamic | 73.33 | 55.15 | 88.07 |
Notes

- The results above were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker part).
- The evaluation hyperparameters were:

```json
{
  "top_p": 0.95,
  "temperature": 0.6,
  "do_sample": true,
  "max_model_len": 65536
}
```
Other models, such as GLM, Qwen2.5, and Seed-OSS, were also evaluated on CEVAL, MMLU, and GSM8K using FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization.

Benchmark details
| Model | Quantization | CEVAL | MMLU | GSM8K |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
Speedup results of Eagle3 models for the Qwen3 series on MT-bench, HumanEval, GSM8K, and Alpaca:

| Temperature | Model | MT-bench Speedup | τ | HumanEval Speedup | τ | GSM8K Speedup | τ | Alpaca Speedup | τ | Mean Speedup | τ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
| | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
| | Qwen3-8B | 2.63x | 3.65 | 2.76x | 3.85 | 2.82x | 3.90 | 2.62x | 3.48 | 2.70x | 3.72 |
| | Qwen3-14B | 2.23x | 3.30 | 2.53x | 3.74 | 2.56x | 3.79 | 2.16x | 3.13 | 2.37x | 3.49 |
| | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
| | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
| T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
| | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
| | Qwen3-8B | 1.98x | 2.75 | 2.25x | 3.11 | 2.31x | 3.15 | 2.10x | 2.76 | 2.90x | 2.94 |
| | Qwen3-14B | 1.71x | 2.61 | 1.95x | 2.87 | 2.04x | 3.08 | 1.68x | 2.55 | 2.90x | 2.78 |
| | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
| | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
Speedup results of Eagle3 models for the Hunyuan series on MT-bench, HumanEval, GSM8K, and Alpaca:

| Temperature | Model | MT-bench Speedup | τ | HumanEval Speedup | τ | GSM8K Speedup | τ | Alpaca Speedup | τ | Mean Speedup | τ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
| | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
| | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
| | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
| | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
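Assuming the Mean column is the arithmetic average of the four per-task speedups (an assumption about how the tables were computed, though it reproduces the T=0 rows), the mean can be recomputed; e.g. for Qwen3-1.7B at T=0:

```python
# per-task Eagle3 speedups for Qwen3-1.7B at T=0, from the Qwen3 table above
speedups = {"MT-bench": 2.05, "HumanEval": 2.07, "GSM8K": 2.11, "Alpaca": 1.93}
mean_speedup = sum(speedups.values()) / len(speedups)
print(f"{mean_speedup:.2f}x")  # 2.04x, matching the Mean column
```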
The code in this repository is open-sourced under the License for AngelSlim.
```bibtex
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={7},
  url={https://github.com/Tencent/AngelSlim},
}
```
- AngelSlim is iterating rapidly, with more features on the way. If you have questions or suggestions, please file an issue via GitHub Issues, or join our WeChat technical discussion group.