diff --git a/README.md b/README.md index 8163c1cd8..3a0849d17 100644 --- a/README.md +++ b/README.md @@ -23,9 +23,9 @@ **Chinese doc** is [here](https://llmc-zhcn.readthedocs.io/en/latest/). -**docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc). +**Docker hub** is [here](https://hub.docker.com/r/llmcompression/llmc). -**aliyun docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]` +**Aliyun docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]` You can download the Docker image that can run llmc with the following command. Users in mainland China are recommended to use Alibaba Cloud Docker. @@ -48,6 +48,10 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates ## Latest News +- **Nov 20, 2024:** 🔥 We now fully support the quantization of ✨`DeepSeekv2(2.5)` and other `MOE` models, as well as ✨`Qwen2VL`, `Llama3.2`, and other `VLM` models. Supported quantization methods include ✅integer quantization, ✅floating-point quantization, and advanced algorithms like ✅AWQ, ✅GPTQ, ✅SmoothQuant, and ✅Quarot. + +- **Nov 12, 2024:** 🔥 We have added support for 💥`static per-tensor activation quantization` across various models and algorithms, covering ✅integer quantization and ✅floating-point quantization to further optimize performance and efficiency. Additionally, we now support exporting ✨`real quantized models` and using the [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) backends for inference acceleration. For more details, refer to the [VLLM documentation](https://llmc-en.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-en.readthedocs.io/en/latest/backend/sglang.html). + - **Sep 26, 2024:** 🔥 We now support exporting 💥`FP8 quantized(E4M3, E5M2)` models from 🚀`LLMC` to advanced inference backends such as [VLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang). For detailed usage, please refer to the [VLLM documentation](https://llmc-en.readthedocs.io/en/latest/backend/vllm.html) and [SGLang documentation](https://llmc-en.readthedocs.io/en/latest/backend/sglang.html). - **Sep 24, 2024:** 🔥 We have officially released ✅INT4 and ✅INT8 models of ✨`Llama-3.1-405B`, quantized using 🚀`LLMC` in `save_lightllm` mode. You can download the model parameters [here](https://huggingface.co/Dongz/llama31-405b-quant). @@ -106,11 +110,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates - 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity. -- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE and ✅VLM models (see [Supported Model List](#supported-model-list)). +- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE (DeepSeekv2, DeepSeekv2.5) and ✅VLM (Llama3.2-Vision, Qwen-VL) models (see [Supported Model List](#supported-model-list)). - 💥**Multi-backend Compatibility**: Seamlessly integrates with various backends for enhanced deployment flexibility. 
Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile(see Section `Backend` [here](https://llmc-en.readthedocs.io/en/latest/)). -- 💥**Performance Efficiency**: Enables quantization of large LLMs, such as ✨`Llama3.1-405B` and ✨`OPT-175B`, with PPL evaluation on a `single A100/H100/H800 GPU`. +- 💥**Performance Efficiency**: Enables quantization of large LLMs, such as ✨`Llama3.1-405B` and ✨`DeepSeekV2-236B`, with PPL evaluation on a `single A100/H100/H800 GPU`. ## Usage @@ -156,6 +160,14 @@ Please refer to the 🚀`Quick Start` section in the [documentation](https://llm ✅ [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966) +✅ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) + +✅ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) + +✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) + +✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL) + You can add your own model type referring to files under `llmc/models/*.py`. ## Supported Backend List diff --git a/README_ja.md b/README_ja.md index c8eb45123..f48871d94 100644 --- a/README_ja.md +++ b/README_ja.md @@ -48,6 +48,10 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates ## 最新情報 +- **2024年11月20日:** 🔥 私たちは現在、✨`DeepSeekv2(2.5)`などの`MOE`モデルおよび✨`Qwen2VL`、`Llama3.2`などの`VLM`モデルの量子化を完全にサポートしています。対応する量子化手法には、✅整数量子化、✅浮動小数点量子化、さらに✅AWQ、✅GPTQ、✅SmoothQuant、✅Quarotといった高度なアルゴリズムが含まれます。 + +- **2024年11月12日:** 🔥 私たちは💥`アクティベーション静的per-tensor`量子化のサポートを、様々なモデルやアルゴリズムに追加しました。これにより、✅整数量子化および✅浮動小数点量子化をカバーし、性能と効率をさらに最適化します。また、✨`真の量子化モデル`のエクスポートをサポートし、[VLLM](https://github.com/vllm-project/vllm)および[SGLang](https://github.com/sgl-project/sglang)バックエンドを使用した推論の高速化も可能です。詳細は[VLLMドキュメント](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html)および[SGLangドキュメント](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html)をご参照ください。 + - **2024年9月26日:** 🔥 `LLMC`からの✨ `FP8量子化(E4M3、E5M2)`モデルを、VLLMやSGLangのような高度な推理バックエンドにエクスポートできるようになりました。🚀 詳細な使用方法については、[VLLMのドキュメント](https://llmc-en.readthedocs.io/en/latest/backend/vllm.html)と[SGLangのドキュメント](https://llmc-en.readthedocs.io/en/latest/backend/sglang.html)を参照してください。 - **2024年9月24日:** 🔥 私たちは正式に ✨`Llama-3.1-405B` の ✅INT4 と ✅INT8 モデルをリリースしました。これらは 🚀`LLMC` の `save_lightllm` モードを使用して量子化されています。モデルパラメータは[こちら](https://huggingface.co/Dongz/llama31-405b-quant)からダウンロードできます。 @@ -104,11 +108,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates - 💥**サポートされているフォーマット**: ✨`量子化`(整数および浮動小数点)と ✨`疎性` の両方をサポートし、具体的には ✅重量-活性化、✅重量のみ、✅混合精度量子化、および ✅構造化疎性 と ✅非構造化疎性 を含みます。 -- 💥**広範なモデルサポート**: 多様な ✨`LLMモデル` をサポートしており、✅LLama、✅Mistral、✅InternLM2、✅Qwen2 など、さらに ✅MOE モデルや ✅VLM モデルもサポートしています([サポートされているモデルリスト](#supported-model-list)を参照してください)。 +- 💥**広範なモデルサポート**: 多様な ✨`LLMモデル` をサポートしており、✅LLama、✅Mistral、✅InternLM2、✅Qwen2 など、さらに ✅MOE(DeepSeekv2, DeepSeekv2.5) モデルや ✅VLM(Llama3.2-Vision, Qwen-VL) モデルもサポートしています([サポートされているモデルリスト](#supported-model-list)を参照してください)。 - 💥**マルチバックエンドの互換性**: 複数のバックエンドとシームレスに統合し、展開の柔軟性を強化します。さまざまな量子化設定およびモデルフォーマットが、✅VLLM、✅Sglang、✅LightLLM、✅MLC-LLM、✅AutoAWQ など、幅広いバックエンドおよびハードウェアプラットフォームと互換性があり、高い柔軟性を実現しています(`Backend`セクションは[こちら](https://llmc-en.readthedocs.io/en/latest/)をご覧ください)。 -- 💥**パフォーマンス効率**: ✨`Llama3.1-405B` や ✨`OPT-175B` などの大規模LLMの量子化をサポートし、`単一の A100/H100/H800 GPU` でPPL評価を可能にします。 +- 💥**パフォーマンス効率**: ✨`Llama3.1-405B` や ✨`DeepSeekV2-236B` 
などの大規模LLMの量子化をサポートし、`単一の A100/H100/H800 GPU` でPPL評価を可能にします。 ## 使用方法 @@ -154,6 +158,14 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates ✅ [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966) +✅ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) + +✅ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) + +✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) + +✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL) + 独自のモデルタイプを追加するには、`llmc/models/*.py` ファイルを参照してください。 ## サポートされているバックエンドリスト diff --git a/README_zh.md b/README_zh.md index 1f7d6ea3a..8ff248a46 100644 --- a/README_zh.md +++ b/README_zh.md @@ -23,7 +23,7 @@ **中文文档**在[此处](https://llmc-zhcn.readthedocs.io/en/latest/)。 -**docker hub**在[此处](https://hub.docker.com/r/llmcompression/llmc)。 +**Docker hub**在[此处](https://hub.docker.com/r/llmcompression/llmc)。 **阿里云docker**: `registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:[tag]` @@ -48,11 +48,15 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates ## 最新消息 +- **2024年11月20日:** 🔥 我们现已全面支持✨`DeepSeekv2(2.5)`等`MOE`模型以及✨`Qwen2VL`、`Llama3.2`等`VLM`模型的量化。支持的量化方案包括✅整型量化、✅浮点量化,以及✅AWQ、✅GPTQ、✅SmoothQuant 和 ✅Quarot 等先进算法。 + +- **2024年11月12日:** 🔥 我们新增对各种模型和算法的💥`激活静态 per-tensor量化`支持,涵盖✅整型量化和✅浮点量化,进一步优化性能和效率。同时支持导出`✨真实量化模型`,并使用 [VLLM](https://github.com/vllm-project/vllm)和[SGLang](https://github.com/sgl-project/sglang)后端进行推理加速,具体请参阅[VLLM文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html)和[SGLang文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html)。 + - **2024年9月26日:** 🔥 我们现在支持从🚀 `LLMC`导出💥 `FP8 量化(E4M3,E5M2)`模型到一些先进的推理后端,例如[VLLM](https://github.com/vllm-project/vllm)和[SGLang](https://github.com/sgl-project/sglang)。关于详细使用方法,请参阅[VLLM文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html)和[SGLang文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html)。 - **2024年9月24日:** 🔥 我们正式发布了 ✨`Llama-3.1-405B` 的 ✅INT4 和 ✅INT8 模型,这些模型通过 🚀`LLMC` 使用 `save_lightllm` 模式进行量化。你可以在[此处](https://huggingface.co/Dongz/llama31-405b-quant)下载模型参数。 -- **2024年9月23日:** 🔥 我们现在支持从 🚀`LLMC` 导出 ✨`真正量化的(INT4, INT8)` 模型到高级推理后端,例如 [VLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), 和 [MLC-LLM](https://github.com/mlc-ai/mlc-llm) 用于量化推理部署,从而实现 ✨`减少内存使用` 和 ✨`加快推理速度`。 +- **2024年9月23日:** 🔥 我们现在支持从 🚀`LLMC` 导出 ✨`真正量化的(INT4, INT8)` 模型到先进推理后端,例如 [VLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), 和 [MLC-LLM](https://github.com/mlc-ai/mlc-llm) 用于量化推理部署,从而实现 ✨`减少内存使用` 和 ✨`加快推理速度`。 详细使用方法,请参考 [VLLM 文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/vllm.html)、[SGLang 文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/sglang.html)、[AutoAWQ 文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/autoawq.html) 和 [MLC-LLM 文档](https://llmc-zhcn.readthedocs.io/en/latest/backend/mlcllm.html)。 - **2024年9月9日:** 🔥 我们提供了一些最佳实践配置,帮助提升性能(参见最佳实践[此处](https://llmc-zhcn.readthedocs.io/en/latest/))。 @@ -104,11 +108,11 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates - 💥**支持的格式**: 支持 ✨`量化`(整型和浮点)和 ✨`稀疏化`,具体包括 ✅权重激活量化、✅权重量化、✅混合精度量化,以及 ✅结构化 和 ✅非结构化稀疏化。 -- 💥**广泛模型支持**: 支持多种 ✨`LLM模型`,包括 ✅LLama、✅Mistral、✅InternLM2、✅Qwen2 等,以及 ✅MOE 和 ✅VLM 模型(参见[支持的模型列表](#supported-model-list))。 +- 💥**广泛模型支持**: 支持多种 ✨`LLM模型`,包括 ✅LLama、✅Mistral、✅InternLM2、✅Qwen2 等,以及 
✅MOE(DeepSeekv2, Deepseekv2.5) 和 ✅VLM(Llama3.2-vision, Qwen-vl) 模型(参见[支持的模型列表](#supported-model-list))。 - 💥**多后端兼容性**: 无缝集成多个后端,增强部署灵活性。多种量化设置和模型格式兼容广泛的后端和硬件平台,例如 ✅VLLM、✅Sglang、✅LightLLM、✅MLC-LLM 和 ✅AutoAWQ,使其高度灵活(参见✨`推理后端` 章节 [此处](https://llmc-zhcn.readthedocs.io/en/latest/))。 -- 💥**性能效率**: 支持大规模LLM的量化,例如 ✨`Llama3.1-405B` 和 ✨`OPT-175B`,并可在 `单个 A100/H100/H800 GPU` 上评估 PPL。 +- 💥**性能效率**: 支持大规模LLM的量化,例如 ✨`Llama3.1-405B` 和 ✨`DeepSeekV2-236B`,并可在 `单个 A100/H100/H800 GPU` 上评估 PPL。 ## 使用指南 @@ -154,6 +158,14 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates ✅ [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966) +✅ [DeepSeekv2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) + +✅ [LLaMA V3.2 Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) + +✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) + +✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL) + 你可以参考 `llmc/models/*.py` 文件添加自己的模型类型。 ## 支持的后端列表 diff --git a/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml b/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml new file mode 100644 index 000000000..bd7881759 --- /dev/null +++ b/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml @@ -0,0 +1,50 @@ +base: + seed: &seed 42 +model: + type: model_type + path: model path + tokenizer_mode: slow + torch_dtype: auto +calib: + name: pileval + download: False + path: calib data path + n_samples: 128 + bs: -1 + seq_len: 512 + preproc: general + seed: *seed +eval: + eval_pos: [fake_quant] + name: wikitext2 + download: False + path: eval data path + seq_len: 2048 + # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False". + # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True". + bs: 1 + inference_per_block: False +quant: + method: Awq + quant_type: float-quant + weight: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + act: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + static: True + special: + trans: True + trans_version: v2 + weight_clip: True + quant_out: True +save: + save_sgl: True + save_path: /path/to/save_for_sgl_awq_fp8/ diff --git a/docs/en/source/backend/sglang.md b/docs/en/source/backend/sglang.md index 7575aaa3b..adac59f59 100644 --- a/docs/en/source/backend/sglang.md +++ b/docs/en/source/backend/sglang.md @@ -102,10 +102,9 @@ quant: Additionally, if AWQ does not meet accuracy requirements, we recommend using the **Quarot + GPTQ** combined algorithm as introduced in [this section](https://llmc-en.readthedocs.io/en/latest/practice/quarot_gptq.html) to further improve accuracy. The corresponding [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/w8a8_combin) is also provided. -**FP8** - -In the FP8 quantization, it typically offers marginally better precision than INT8. In some cases, the use of the RTN (Round to Nearest) algorithm is sufficient. However, we still recommend utilizing the AWQ algorithm for enhanced quantization accuracy. The specific implementation can be referenced from the AWQ FP8 [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml). +**FP8-Dynamic** +In FP8 quantization, **LLMC** supports weight quantization per-channel and activation quantization dynamically per-token. 
In this case, the RTN (Round to Nearest) algorithm is sufficient. However, we recommend using the AWQ algorithm for better quantization accuracy. For implementation details, refer to the AWQ FP8 [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml). ```yaml # configs/quantization/backend/sglang/fp8/awq_fp8.yml @@ -131,14 +130,38 @@ quant: quant_out: True ``` -Please ensure that the `quant_type` is set to `float_quant`, which represents floating-point quantization. Additionally, set `use_qtorch` to `True`, as `LLMC`'s floating-point quantization implementation relies on functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library. +Ensure that `quant_type` is set to `float_quant` to indicate floating-point quantization. Additionally, set `use_qtorch` to `True`, as **LLMC**'s FP8 implementation depends on certain functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library. -You can install [QPyTorch](https://github.com/Tiiiger/QPyTorch) using the following command: +Install [QPyTorch](https://github.com/Tiiiger/QPyTorch) with the following command: ```bash pip install qtorch ``` +**FP8-Static** + +In FP8 quantization, **LLMC** also supports weight quantization per-tensor and activation quantization statically per-tensor. In this case, we recommend using the AWQ algorithm while adjusting the activation ranges. Refer to the AWQ FP8 static quantization [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml). + +```yaml +# configs/quantization/backend/sglang/fp8/awq_fp8_static.yml +quant: + method: Awq + quant_type: float-quant + weight: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + act: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + static: True +``` + ### 1.3.3 Exporting Real Quantized Model ```yaml diff --git a/docs/en/source/backend/vllm.md b/docs/en/source/backend/vllm.md index 1d832b91d..78ca00a92 100644 --- a/docs/en/source/backend/vllm.md +++ b/docs/en/source/backend/vllm.md @@ -17,13 +17,15 @@ pip install vllm In **VLLM**'s fixed-point integer quantization, the following common formats are supported: -- **W4A16**: Weights are int4, activations are float16; -- **W8A16**: Weights are int8, activations are float16; -- **W8A8**: Weights are int8, activations are int8; -- **FP8 (E4M3, E5M2)**: Weights are float8, activations are float8; -- **Per-channel/group quantization**: Quantization is applied per channel or per group; -- **Per-token dynamic quantization**: Dynamic quantization per token, which further improves quantization accuracy and efficiency; -- **Weight/activation symmetric quantization**: quantization parameters include scale. +- **W4A16**: Weights are int4, activations are float16. +- **W8A16**: Weights are int8, activations are float16. +- **W8A8**: Weights are int8, activations are int8. +- **FP8 (E4M3, E5M2)**: Weights are float8, activations are float8. +- **Per-channel/group weight quantization**: Quantization applied per channel or group. +- **Per-tensor weight quantization**: Quantization applied per tensor. +- **Per-token dynamic activation quantization**: Dynamic quantization for each token to further improve precision. +- **Per-tensor static activation quantization**: Static quantization for each tensor to enhance efficiency. 
+- **Symmetric weight/activation quantization**: Quantization parameters include scale. Therefore, when quantizing models with **LLMC**, make sure that the bit settings for weights and activations are in formats supported by **VLLM**. @@ -112,10 +114,9 @@ quant: If AWQ cannot meet accuracy requirements, we recommend using the **Quarot + GPTQ combination algorithm** described in [this chapter](https://llmc-en.readthedocs.io/en/latest/practice/quarot_gptq.html) to further improve accuracy. The corresponding [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/w8a8_combin) is also provided. -**FP8** - -In the FP8 quantization, it typically offers marginally better precision than INT8. In some cases, the use of the RTN (Round to Nearest) algorithm is sufficient. However, we still recommend utilizing the AWQ algorithm for enhanced quantization accuracy. The specific implementation can be referenced from the AWQ FP8 [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8.yml). +**FP8-Dynamic** +In FP8 quantization, **LLMC** supports weight quantization per-channel and activation quantization dynamically per-token. In this case, the RTN (Round to Nearest) algorithm is sufficient. However, we recommend using the AWQ algorithm for better quantization accuracy. For implementation details, refer to the AWQ FP8 [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8.yml). ```yaml # configs/quantization/backend/vllm/fp8/awq_fp8.yml @@ -141,14 +142,38 @@ quant: quant_out: True ``` -Please ensure that the `quant_type` is set to `float_quant`, which represents floating-point quantization. Additionally, set `use_qtorch` to `True`, as `LLMC`'s floating-point quantization implementation relies on functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library. +Ensure that `quant_type` is set to `float_quant` to indicate floating-point quantization. Additionally, set `use_qtorch` to `True`, as **LLMC**'s FP8 implementation depends on certain functionalities from the [QPyTorch](https://github.com/Tiiiger/QPyTorch) library. -You can install [QPyTorch](https://github.com/Tiiiger/QPyTorch) using the following command: +Install [QPyTorch](https://github.com/Tiiiger/QPyTorch) with the following command: ```bash pip install qtorch ``` +**FP8-Static** + +In FP8 quantization, **LLMC** also supports weight quantization per-tensor and activation quantization statically per-tensor. In this case, we recommend using the AWQ algorithm while adjusting the activation ranges. Refer to the AWQ FP8 static quantization [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8_static.yml). 
+ +```yaml +# configs/quantization/backend/vllm/fp8/awq_fp8_static.yml +quant: + method: Awq + quant_type: float-quant + weight: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + act: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + static: True +``` + ### 1.3.3 Exporting Real Quantized Model ```yaml diff --git a/docs/zh_cn/source/backend/sglang.md b/docs/zh_cn/source/backend/sglang.md index 430a561fb..72fd2279a 100644 --- a/docs/zh_cn/source/backend/sglang.md +++ b/docs/zh_cn/source/backend/sglang.md @@ -106,12 +106,12 @@ quant: 此外,如果 AWQ 无法满足精度需求,我们建议使用 [章节](https://llmc-zhcn.readthedocs.io/en/latest/practice/quarot_gptq.html) 介绍的 **Quarot+GPTQ 组合算法** 来进一步提升精度。在此也给出相应的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/w8a8_combin) -**FP8** +**FP8-Dynamic** -在 FP8 的量化中,其精度通常略优于 INT8,而且在某些情况下,使用RTN(Round to Nearest)算法就足够了。然而,我们仍然建议使用AWQ算法以获得更好的量化精度。具体的实现可以参考AWQ FP8的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml)。 +在 FP8 的量化中,**LLMC** 支持权重per-channel,激活动态per-token的量化,在这种情况下,使用RTN(Round to Nearest)算法就足够了。然而,我们仍然建议使用AWQ算法以获得更好的量化精度。具体的实现可以参考AWQ FP8的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8.yml)。 ```yaml -# configs/quantization/backend/sglang/fp8/awq_fp8.yml +# configs/quantization/backend/sglang/fp8/awq_fp8.yml quant: method: Awq quant_type: float_quant weight: # Support ["e4m3", "e5m2"] bit: e4m3 symmetric: True granularity: per_channel use_qtorch: True act: # Support ["e4m3", "e5m2"] bit: e4m3 symmetric: True granularity: per_token use_qtorch: True special: trans: True trans_version: v2 weight_clip: True quant_out: True ``` + 请确保将 `quant_type` 设置为 `float_quant`,表示浮点量化。同时,将 `use_qtorch` 设置为 `True`,因为 `LLMC` 的浮点量化实现依赖 [QPyTorch](https://github.com/Tiiiger/QPyTorch) 库中的部分功能。 您可以使用以下命令来安装 [QPyTorch](https://github.com/Tiiiger/QPyTorch): ```bash pip install qtorch ``` @@ -141,6 +142,30 @@ pip install qtorch ``` +**FP8-Static** + +在 FP8 的量化中,**LLMC** 同时也支持权重per-tensor,激活静态per-tensor的量化,在这种情况下,我们建议使用AWQ算法,调整下激活的范围,可以参考AWQ FP8静态量化的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/sglang/fp8/awq_fp8_static.yml)。 + +```yaml +# configs/quantization/backend/sglang/fp8/awq_fp8_static.yml +quant: + method: Awq + quant_type: float-quant + weight: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + act: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + static: True +``` + ### 1.3.3 真实量化模型导出 diff --git a/docs/zh_cn/source/backend/vllm.md b/docs/zh_cn/source/backend/vllm.md index ae72fd383..b035df9d6 100644 --- a/docs/zh_cn/source/backend/vllm.md +++ b/docs/zh_cn/source/backend/vllm.md @@ -21,8 +21,10 @@ pip install vllm - **W8A16**:权重为 int8,激活为 float16; - **W8A8**:权重为 int8,激活为 int8; - **FP8 (E4M3, E5M2)**:权重为 float8,激活为 float8; -- **权重 per-channel/group 量化**:按通道或按组进行量化; -- **激活 per-token 动态量化**:针对每个 token 的动态量化方式,进一步提升量化精度和效率。 +- **权重 per-channel/group 量化**:按通道或按组进行量化; +- **权重 per-tensor 量化**:按tensor进行量化; +- **激活 per-token 动态量化**:针对每个 token 的动态量化方式,进一步提升量化精度。 +- **激活 per-tensor 静态量化**:针对每个 tensor 的静态量化方式,进一步提升效率。 - **权重\激活对称量化**:量化参数包括scale; 因此,在使用 **LLMC** 进行模型量化时,必须确保权重和激活的比特数设置为 VLLM 支持的格式。 @@ -117,9 +119,9 @@ quant: 此外,如果 AWQ 无法满足精度需求,我们建议使用 [章节](https://llmc-zhcn.readthedocs.io/en/latest/practice/quarot_gptq.html) 介绍的 **Quarot+GPTQ 组合算法** 来进一步提升精度。在此也给出相应的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/w8a8_combin) -**FP8** +**FP8-Dynamic** -在 FP8 的量化中,其精度通常略优于 INT8,而且在某些情况下,使用RTN(Round to 
Nearest)算法就足够了。然而,我们仍然建议使用AWQ算法以获得更好的量化精度。具体的实现可以参考AWQ FP8的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8.yml)。 +在 FP8 的量化中,**LLMC** 支持权重per-channel,激活动态per-token的量化,在这种情况下,使用RTN(Round to Nearest)算法就足够了。然而,我们仍然建议使用AWQ算法以获得更好的量化精度。具体的实现可以参考AWQ FP8的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8.yml)。 ```yaml # configs/quantization/backend/vllm/fp8/awq_fp8.yml @@ -153,6 +155,29 @@ quant: pip install qtorch ``` +**FP8-Static** + +在 FP8 的量化中,**LLMC** 同时也支持权重per-tensor,激活静态per-tensor的量化,在这种情况下,我们建议使用AWQ算法,调整下激活的范围,可以参考AWQ FP8静态量化的[配置文件](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/vllm/fp8/awq_fp8_static.yml)。 + +```yaml +# configs/quantization/backend/vllm/fp8/awq_fp8_static.yml +quant: + method: Awq + quant_type: float-quant + weight: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + act: + # Support ["e4m3", "e5m2"] + bit: e4m3 + symmetric: True + granularity: per_tensor + use_qtorch: True + static: True +``` ### 1.3.3 真实量化模型导出
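The FP8 sections touched by this patch distinguish dynamic per-token activation scaling from static per-tensor scaling, so a small, self-contained PyTorch sketch may help make the difference concrete. It is illustrative only: none of it comes from the llmc codebase, and the helper names and the toy calibration value are assumptions made for the example.

```python
# Illustrative sketch only -- NOT from the llmc codebase. It contrasts the two
# FP8 (E4M3) activation-quantization modes documented above: dynamic per-token
# scaling versus a static per-tensor scale obtained from calibration data.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def fake_quant_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 E4M3 with the given scale, then dequantize back."""
    scale = scale.clamp(min=1e-12)  # avoid division by zero on all-zero inputs
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8.to(x.dtype) * scale


def dynamic_per_token(x: torch.Tensor) -> torch.Tensor:
    # One scale per token (row), recomputed from the live activations at run time.
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    return fake_quant_fp8(x, scale)


def static_per_tensor(x: torch.Tensor, calib_amax: float) -> torch.Tensor:
    # A single scale for the whole tensor, fixed ahead of time from calibration
    # statistics (the role played by the calib section in the configs above).
    scale = torch.tensor(calib_amax / FP8_E4M3_MAX, dtype=x.dtype)
    return fake_quant_fp8(x, scale)


if __name__ == "__main__":
    act = torch.randn(4, 8)              # toy activation: 4 tokens, 8 channels
    calib_amax = act.abs().max().item()  # stand-in for a calibrated abs-max
    err_dyn = (act - dynamic_per_token(act)).abs().mean().item()
    err_sta = (act - static_per_tensor(act, calib_amax)).abs().mean().item()
    print(f"dynamic per-token MAE: {err_dyn:.4e}")
    print(f"static per-tensor MAE: {err_sta:.4e}")
```

The static variant typically loses a little accuracy relative to per-token scaling because one calibrated scale must cover every token, which is why the documentation above recommends pairing it with AWQ to adjust the activation ranges before export.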