# lightx2v Quantized Inference

[lightx2v](https://github.com/ModelTC/lightx2v) is an efficient inference backend designed specifically for the demands of video generation models. By optimizing memory management and computational efficiency, it significantly accelerates the inference process.

**LLMC** supports exporting the quantized model formats required by **lightx2v** and provides strong support for multiple quantization algorithms (AWQ, GPTQ, SmoothQuant, and others), maintaining high quantization accuracy while improving inference speed. Combining **LLMC** with **lightx2v** enables accelerated inference and reduced memory usage without compromising accuracy, making it well suited to scenarios that require efficient video model processing.

---

## 1.1 Environment Setup

To use **lightx2v** for quantized inference, first install and configure the environment:
```bash
# Clone the repository and its submodules
git clone https://github.com/ModelTC/lightx2v.git lightx2v && cd lightx2v
git submodule update --init --recursive

# Create and activate the conda environment
conda create -n lightx2v python=3.11 && conda activate lightx2v
pip install -r requirements.txt

# Reinstall transformers separately to avoid version conflicts
pip install transformers==4.45.2

# Install flash-attention 2
cd lightx2v/3rd/flash-attention && pip install --no-cache-dir -v -e .

# Install flash-attention 3 (only if using Hopper GPUs; run from the flash-attention directory above)
cd hopper && pip install --no-cache-dir -v -e .
```
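
Optionally, verify the environment before moving on. The snippet below is a quick, illustrative check (assuming a CUDA machine) that PyTorch and the flash-attention build installed above are importable:

```python
# Optional environment check: confirm PyTorch sees the GPU and that the
# flash-attention package built above can be imported.
import torch
import flash_attn  # built from lightx2v/3rd/flash-attention in the steps above

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash_attn", flash_attn.__version__)
```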

---

## 1.2 Quantization Formats

**lightx2v** supports the following low-precision quantization schemes:

- **W8A8**: int8 weights and int8 activations.
- **FP8 (E4M3)**: float8 weights and activations.
- **Per-channel** weight quantization.
- **Per-token dynamic** activation quantization for improved precision.
- **Symmetric** quantization for both weights and activations (scale only, no zero-point).

When quantizing a model with **LLMC**, make sure the weight and activation bit-widths match one of the formats supported by **lightx2v**.
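
For intuition, the scheme above can be sketched in a few lines of PyTorch. This is illustrative only (not lightx2v's kernel code): symmetric int8, per-channel scales for weights, and per-token scales computed dynamically for activations.

```python
# Illustrative sketch only (not lightx2v's actual kernels): symmetric int8
# quantization with per-channel weight scales and per-token activation scales.
import torch

def quant_weight_per_channel_int8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel, no zero-point
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quant_act_per_token_int8(x: torch.Tensor):
    # x: [num_tokens, hidden]; one scale per token, computed at runtime (dynamic)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

w, x = torch.randn(4096, 4096), torch.randn(16, 4096)
qw, sw = quant_weight_per_channel_int8(w)
qx, sx = quant_act_per_token_int8(x)
# The int8 GEMM plus rescaling approximates the original floating-point matmul:
y_approx = (qx.float() * sx) @ (qw.float() * sw).t()
```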

---

## 1.3 Quantizing Models with LLMC

### 1.3.1 Calibration Data

For the Wan2.1 model on the I2V task, for example, a calibration dataset is provided in this [directory](https://github.com/ModelTC/llmc/tree/main/assets/wan_i2v/calib). Users can add more samples as needed.

### 1.3.2 Choosing a Quantization Algorithm

#### **W8A8**

We recommend **SmoothQuant** for the W8A8 setting.
Refer to the SmoothQuant W8A8 [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/video_gen/wan_i2v/smoothquant_w_a.yaml):

```yaml
quant:
    video_gen:
        method: SmoothQuant
        weight:
            bit: 8
            symmetric: True
            granularity: per_channel
        act:
            bit: 8
            symmetric: True
            granularity: per_token
        special:
            alpha: 0.75
```
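
The `alpha` hyperparameter controls how much of the activation outlier magnitude is migrated into the weights before quantization. Below is a minimal, illustrative sketch of SmoothQuant's smoothing step (not LLMC's internal implementation):

```python
# Illustrative sketch of SmoothQuant's per-channel smoothing with alpha = 0.75.
# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s and
# weights are multiplied by s, so X @ W.T is mathematically unchanged but the
# activation outliers become easier to quantize.
import torch

def smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.75):
    # x: [tokens, in_features] calibration activations, w: [out_features, in_features]
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    wgt_max = w.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    s = (act_max.pow(alpha) / wgt_max.pow(1.0 - alpha)).clamp(min=1e-5)
    return x / s, w * s

x = torch.randn(64, 4096) * (torch.rand(4096) * 10)  # simulate outlier channels
w = torch.randn(4096, 4096)
x_s, w_s = smooth(x, w)
print((x @ w.t() - x_s @ w_s.t()).abs().max())  # ~0 up to floating-point error
```

A larger `alpha` shifts more of the quantization difficulty from activations to weights; `0.75` is the value used in the configurations shown here.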

If SmoothQuant does not meet the precision requirement, use **AWQ** for better accuracy. See the corresponding [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/video_gen/wan_i2v/awq_w_a.yaml).

#### **FP8-Dynamic**

LLMC supports FP8 quantization with per-channel weights and per-token dynamic activations. SmoothQuant is again recommended; see the FP8 [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/lightx2v/fp8/awq_fp8.yml):

```yaml
quant:
    video_gen:
        method: SmoothQuant
        weight:
            quant_type: float-quant
            bit: e4m3
            symmetric: True
            granularity: per_channel
            use_qtorch: True
        act:
            quant_type: float-quant
            bit: e4m3
            symmetric: True
            granularity: per_token
            use_qtorch: True
        special:
            alpha: 0.75
```
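
For intuition, E4M3 quantization scales each channel (or token) into the format's representable range (largest finite value 448) and then casts. The sketch below assumes a recent PyTorch build that exposes `torch.float8_e4m3fn`; it is illustrative and not LLMC's QPyTorch-based simulation path:

```python
# Illustrative sketch of symmetric FP8 (E4M3) weight quantization with
# per-channel scales. 448 is the largest finite value of float8_e4m3fn.
# LLMC itself simulates float quantization with QPyTorch during calibration.
import torch

FP8_E4M3_MAX = 448.0

def quant_weight_per_channel_fp8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / FP8_E4M3_MAX
    q = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(4096, 4096)
qw, sw = quant_weight_per_channel_fp8(w)
w_dequant = qw.to(torch.float32) * sw  # approximate reconstruction of w
```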

Ensure `quant_type` is set to `float-quant` and `use_qtorch` to `True`, as **LLMC** relies on [QPyTorch](https://github.com/Tiiiger/QPyTorch) for float quantization.

Install QPyTorch with:

```bash
pip install qtorch
```

### 1.3.3 Exporting the Quantized Model

```yaml
save:
    save_lightx2v: True
    save_path: /path/to/save_for_lightx2v/
```

Set `save_lightx2v` to `True`. LLMC will then export the weights as `torch.int8` or `torch.float8_e4m3fn` tensors, together with their quantization parameters, for direct loading in **lightx2v**.
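
As a quick sanity check, you can inspect the exported checkpoint and confirm the weight dtypes. The snippet below is hypothetical: it assumes the export is a standard safetensors file, and the file name shown is illustrative rather than LLMC's fixed output name.

```python
# Hypothetical sanity check of an exported checkpoint. The file name below is
# illustrative; point it at whatever LLMC wrote under save.save_path.
from safetensors.torch import load_file

state_dict = load_file("/path/to/save_for_lightx2v/model.safetensors")  # hypothetical name
for name, tensor in list(state_dict.items())[:10]:
    # Quantized weights should appear as torch.int8 or torch.float8_e4m3fn,
    # accompanied by their scale tensors.
    print(name, tuple(tensor.shape), tensor.dtype)
```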

### 1.3.4 Running LLMC

Set the LLMC path and the config path in the run script, then execute it:

```bash
# scripts/run_llmc.sh
llmc=llmc_path   # path to your local llmc repository
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=sq_for_lightx2v
config=${llmc}/configs/quantization/video_gen/wan_i2v/smoothquant_w_a.yaml
```

After LLMC finishes, the quantized model is saved to the directory specified by `save.save_path`.

### 1.3.5 Evaluation

For the I2V task with the Wan2.1 model, an evaluation dataset is provided [here](https://github.com/ModelTC/llmc/tree/main/assets/wan_i2v/eval). Set the following in the config file:

```yaml
eval:
    eval_pos: [fake_quant]
    type: video_gen
    name: i2v
    download: False
    path: ../assets/wan_i2v/eval/
    bs: 1
    target_height: 480
    target_width: 832
    num_frames: 81
    guidance_scale: 5.0
    output_video_path: ./output_videos_sq/
```

LLMC will generate evaluation videos with the fake-quantized (simulated quantization) model.

---

## 1.4 Inference with lightx2v

### 1.4.1 Weight Structure Conversion

After LLMC exports the model, convert its weight layout to the structure **lightx2v** expects using the [conversion script](https://github.com/ModelTC/lightx2v/blob/main/examples/diffusers/converter.py):

```bash
python converter.py -s /path/to/save_for_lightx2v/ -o /path/to/output/ -d backward
```

The converted model is saved under `/path/to/output/`.

### 1.4.2 Offline Inference

In the [inference script](https://github.com/ModelTC/lightx2v/blob/main/scripts/run_wan_i2v_advanced_ptq.sh), set `model_path` to `/path/to/output/` and `lightx2v_path` to your local lightx2v path, then run:

```bash
bash run_wan_i2v_advanced_ptq.sh
```