
Commit 3fe48c2

gushiqiao authored · Update features (#210)
Co-authored-by: gushiqiao <[email protected]>
1 parent 837576a · commit 3fe48c2

File tree

8 files changed (+218, -34 lines changed)


README.md

Lines changed: 16 additions & 4 deletions
@@ -23,9 +23,9 @@
**Chinese doc** is [here](https://llmc-zhcn.readthedocs.io/en/latest/).

- **docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc).
+ **Docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc).

- **aliyun docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]`
+ **Aliyun docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]`

You can download the Docker image that can run llmc with the following command. Users in mainland China are recommended to use Alibaba Cloud Docker.
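As a minimal sketch of that pull command, assuming the Alibaba Cloud registry path quoted above (`[tag]` is the README's placeholder and must be replaced with a published image tag):

```bash
# Pull the llmc image from the Alibaba Cloud registry recommended for users in mainland China.
# Replace [tag] with an actual tag from the registry, e.g. one listed on the Docker hub page.
docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]
```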

@@ -48,6 +48,10 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
## Latest News

+ - **Nov 20, 2024:** 🔥 We now fully support the quantization of ✨`DeepSeekv2(2.5)` and other `MOE` models, as well as ✨`Qwen2VL`, `Llama3.2`, and other `VLM` models. Supported quantization methods include ✅integer quantization, ✅floating-point quantization, and advanced algorithms like ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot.
+
+ - **Nov 12, 2024:** 🔥 We have added support for 💥`static per-tensor activation quantization` across various models and algorithms, covering ✅integer quantization and ✅floating-point quantization to further optimize performance and efficiency. Additionally, we now support exporting ✨`real quantized models` and using the [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) backends for inference acceleration. For more details, refer to the [VLLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html).
+
- **Sep 26, 2024:** 🔥 We now support exporting 💥`FP8 quantized(E4M3, E5M2)` models from 🚀`LLMC` to advanced inference backends such as [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang). For detailed usage, please refer to the [VLLM documentation](https://llmc-en.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-en.readthedocs.io/en/latest/backend/sglang.html).

- **Sep 24, 2024:** 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨`Llama-3.1-405B`, quantized using 🚀`LLMC` in `save_lightllm` mode. You can download the model parameters [here](https://huggingface.co/Dongz/llama31-405b-quant).

@@ -106,11 +110,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
- 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.

- - 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE and ✅VLM models (see [Supported Model List](#supported-model-list)).
+ - 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE (DeepSeekv2, DeepSeekv2.5) and ✅VLM (Llama3.2-vision, Qwen-VL) models (see [Supported Model List](#supported-model-list)).

- 💥**Multi-backend Compatibility**: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile (see Section `Backend` [here](https://llmc-en.readthedocs.io/en/latest/)).

- - 💥**Performance Efficiency**: Enables quantization of large LLMs, such as ✨`Llama3.1-405B` and ✨`OPT-175B`, with PPL evaluation on a `single A100/H100/H800 GPU`.
+ - 💥**Performance Efficiency**: Enables quantization of large LLMs, such as ✨`Llama3.1-405B` and ✨`DeepSeekV2-236B`, with PPL evaluation on a `single A100/H100/H800 GPU`.

## Usage

@@ -156,6 +160,14 @@ Please refer to the 🚀`Quick Start` section in the [documentation](https://llm
[SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)

+ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)
+
+ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
+
+ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
+
+ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+
You can add your own model type referring to files under `llmc/models/*.py`.

## Supported Backend List

README_ja.md

Lines changed: 14 additions & 2 deletions
@@ -48,6 +48,10 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
## Latest News

+ - **Nov 20, 2024:** 🔥 We now fully support quantization of `MOE` models such as ✨`DeepSeekv2(2.5)` and `VLM` models such as ✨`Qwen2VL` and `Llama3.2`. Supported quantization methods include ✅integer quantization, ✅floating-point quantization, and advanced algorithms such as ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot.
+
+ - **Nov 12, 2024:** 🔥 We have added support for 💥`static per-tensor activation quantization` across various models and algorithms, covering ✅integer quantization and ✅floating-point quantization to further optimize performance and efficiency. We also support exporting ✨`real quantized models` and using the [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) backends for inference acceleration. For details, see the [VLLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html).
+
- **Sep 26, 2024:** 🔥 ✨`FP8 quantized (E4M3, E5M2)` models can now be exported from 🚀`LLMC` to advanced inference backends such as VLLM and SGLang. For detailed usage, see the [VLLM documentation](https://llmc-en.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-en.readthedocs.io/en/latest/backend/sglang.html).

- **Sep 24, 2024:** 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨`Llama-3.1-405B`, quantized using 🚀`LLMC` in `save_lightllm` mode. The model parameters can be downloaded [here](https://huggingface.co/Dongz/llama31-405b-quant).

@@ -104,11 +108,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
- 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, and ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.

- - 💥**Wide Model Support**: Supports a diverse range of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, and ✅Qwen2, as well as ✅MOE and ✅VLM models (see the [Supported Model List](#supported-model-list)).
+ - 💥**Wide Model Support**: Supports a diverse range of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, and ✅Qwen2, as well as ✅MOE (DeepSeekv2, DeepSeekv2.5) and ✅VLM (Llama3.2-vision, Qwen-VL) models (see the [Supported Model List](#supported-model-list)).

- 💥**Multi-backend Compatibility**: Integrates seamlessly with multiple backends for enhanced deployment flexibility. Various quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, providing great flexibility (see the `Backend` section [here](https://llmc-en.readthedocs.io/en/latest/)).

- - 💥**Performance Efficiency**: Supports quantization of large LLMs such as ✨`Llama3.1-405B` and ✨`OPT-175B`, with PPL evaluation on a `single A100/H100/H800 GPU`.
+ - 💥**Performance Efficiency**: Supports quantization of large LLMs such as ✨`Llama3.1-405B` and ✨`DeepSeekV2-236B`, with PPL evaluation on a `single A100/H100/H800 GPU`.

## Usage

@@ -154,6 +158,14 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
[SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)

+ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)
+
+ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
+
+ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
+
+ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+
To add your own model type, refer to the files under `llmc/models/*.py`.

## Supported Backend List

README_zh.md

Lines changed: 16 additions & 4 deletions
@@ -23,7 +23,7 @@
**Chinese docs** are [here](https://llmc-zhcn.readthedocs.io/en/latest/).

- **docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc).
+ **Docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc).

**Aliyun docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]`

@@ -48,11 +48,15 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
## Latest News

+ - **Nov 20, 2024:** 🔥 We now fully support quantization of ✨`DeepSeekv2(2.5)` and other `MOE` models, as well as ✨`Qwen2VL`, `Llama3.2`, and other `VLM` models. Supported quantization schemes include ✅integer quantization, ✅floating-point quantization, and advanced algorithms such as ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot.
+
+ - **Nov 12, 2024:** 🔥 We have added 💥`static per-tensor activation quantization` support for various models and algorithms, covering ✅integer quantization and ✅floating-point quantization, to further optimize performance and efficiency. We also support exporting ✨`real quantized models` and using the [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) backends for inference acceleration; see the [VLLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html) for details.
+
- **Sep 26, 2024:** 🔥 We now support exporting 💥`FP8 quantized (E4M3, E5M2)` models from 🚀`LLMC` to advanced inference backends such as [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang). For detailed usage, see the [VLLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html).

- **Sep 24, 2024:** 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨`Llama-3.1-405B`, quantized with 🚀`LLMC` using the `save_lightllm` mode. You can download the model parameters [here](https://huggingface.co/Dongz/llama31-405b-quant).

- - **Sep 23, 2024:** 🔥 We now support exporting ✨`real quantized (INT4, INT8)` models from 🚀`LLMC` to high-level inference backends such as [VLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), and [MLC-LLM](https://github.com/mlc-ai/mlc-llm) for quantized inference deployment, enabling ✨`reduced memory usage` and ✨`faster inference`.
+ - **Sep 23, 2024:** 🔥 We now support exporting ✨`real quantized (INT4, INT8)` models from 🚀`LLMC` to advanced inference backends such as [VLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), and [MLC-LLM](https://github.com/mlc-ai/mlc-llm) for quantized inference deployment, enabling ✨`reduced memory usage` and ✨`faster inference`.
For detailed usage, please refer to the [VLLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html), [SGLang documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html), [AutoAWQ documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/autoawq.html), and [MLC-LLM documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/mlcllm.html).

- **Sep 9, 2024:** 🔥 We provide some best-practice configurations to help improve performance (see the best practices [here](https://llmc-zhcn.readthedocs.io/en/latest/)).
@@ -104,11 +108,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
- 💥**Supported Formats**: Supports ✨`quantization` (integer and floating-point) and ✨`sparsification`, specifically including ✅weight-activation quantization, ✅weight-only quantization, and ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsification.

- - 💥**Wide Model Support**: Supports a variety of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, and more, as well as ✅MOE and ✅VLM models (see the [Supported Model List](#supported-model-list)).
+ - 💥**Wide Model Support**: Supports a variety of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, and more, as well as ✅MOE (DeepSeekv2, DeepSeekv2.5) and ✅VLM (Llama3.2-vision, Qwen-VL) models (see the [Supported Model List](#supported-model-list)).

- 💥**Multi-backend Compatibility**: Integrates seamlessly with multiple backends for greater deployment flexibility. Various quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly flexible (see the ✨`Inference Backend` section [here](https://llmc-zhcn.readthedocs.io/en/latest/)).

- - 💥**Performance Efficiency**: Supports quantization of large-scale LLMs such as ✨`Llama3.1-405B` and ✨`OPT-175B`, with PPL evaluation on a `single A100/H100/H800 GPU`.
+ - 💥**Performance Efficiency**: Supports quantization of large-scale LLMs such as ✨`Llama3.1-405B` and ✨`DeepSeekV2-236B`, with PPL evaluation on a `single A100/H100/H800 GPU`.

## Usage

@@ -154,6 +158,14 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
[SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)

+ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)
+
+ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
+
+ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
+
+ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+
You can refer to the `llmc/models/*.py` files to add your own model type.

## Supported Backend List
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
base:
    seed: &seed 42
model:
    type: model_type
    path: model path
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: calib data path
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: False
    path: eval data path
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 1
    inference_per_block: False
quant:
    method: Awq
    quant_type: float-quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
        static: True
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        quant_out: True
save:
    save_sgl: True
    save_path: /path/to/save_for_sgl_awq_fp8/
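Since both the `weight` and `act` blocks set `use_qtorch: True`, this config relies on the QPyTorch dependency described in the SGLang documentation change below; a minimal environment sketch:

```bash
# LLMC's floating-point (FP8) quantization path uses QPyTorch when use_qtorch is True.
pip install qtorch
```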

docs/en/source/backend/sglang.md

Lines changed: 28 additions & 5 deletions
@@ -102,10 +102,9 @@ quant:
Additionally, if AWQ does not meet accuracy requirements, we recommend using the **Quarot + GPTQ** combined algorithm as introduced in [this section](https://llmc-en.readthedocs.io/en/latest/practice/quarot_gptq.html) to further improve accuracy. The corresponding [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/w8a8_combin) is also provided.

- **FP8**
-
- In the FP8 quantization, it typically offers marginally better precision than INT8. In some cases, the use of the RTN (Round to Nearest) algorithm is sufficient. However, we still recommend utilizing the AWQ algorithm for enhanced quantization accuracy. The specific implementation can be referenced from the AWQ FP8 [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml).
+ **FP8-Dynamic**
+
+ In FP8 quantization, **LLMC** supports per-channel weight quantization and dynamic per-token activation quantization. In this case, the RTN (Round to Nearest) algorithm is sufficient. However, we recommend using the AWQ algorithm for better quantization accuracy. For implementation details, refer to the AWQ FP8 [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml).

```yaml
# configs/quantization/backend/sglang/fp8/awq_fp8.yml
@@ -131,14 +130,38 @@ quant:
        quant_out: True
```

- Please ensure that the `quant_type` is set to `float_quant`, which represents floating-point quantization. Additionally, set `use_qtorch` to `True`, as `LLMC`'s floating-point quantization implementation relies on functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library.
+ Ensure that `quant_type` is set to `float-quant` to indicate floating-point quantization. Additionally, set `use_qtorch` to `True`, as **LLMC**'s FP8 implementation depends on certain functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library.

- You can install [QPyTorch](https://github.com/Tiiiger/QPyTorch) using the following command:
+ Install [QPyTorch](https://github.com/Tiiiger/QPyTorch) with the following command:

```bash
pip install qtorch
```

+ **FP8-Static**
+
+ In FP8 quantization, **LLMC** also supports per-tensor weight quantization and static per-tensor activation quantization. In this case, we recommend using the AWQ algorithm while adjusting the activation ranges. Refer to the AWQ FP8 static quantization [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml).

```yaml
# configs/quantization/backend/sglang/fp8/awq_fp8_static.yml
quant:
    method: Awq
    quant_type: float-quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
        static: True
```

### 1.3.3 Exporting Real Quantized Model
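As a usage illustration, here is a minimal, hedged sketch of serving an exported checkpoint with SGLang's standard `launch_server` entry point; the path simply mirrors the `save_path` placeholder from the config above, and FP8 checkpoints may need additional flags (see the SGLang backend documentation referenced earlier).

```bash
# Hypothetical serving sketch: point SGLang's server at the directory LLMC saved to.
# The path is the placeholder save_path from the FP8 static config above.
python -m sglang.launch_server --model-path /path/to/save_for_sgl_awq_fp8/ --port 30000
```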
