# lightx2v Quantized Inference

[lightx2v](https://github.com/ModelTC/lightx2v) is an efficient inference backend designed specifically for the demands of video generation models. By optimizing memory management and computational efficiency, it significantly accelerates the inference process.

**LLMC** supports exporting the quantized model formats required by **lightx2v** and provides strong support for multiple quantization algorithms (AWQ, GPTQ, SmoothQuant, and others), maintaining high quantization accuracy while improving inference speed. Combining **LLMC** with **lightx2v** enables accelerated inference and reduced memory usage without compromising accuracy, making it well suited to scenarios that require efficient video model processing.

---

## 1.1 Environment Setup

To use **lightx2v** for quantized inference, first install and configure the environment:
```bash
# Clone the repository and its submodules
git clone https://github.com/ModelTC/lightx2v.git lightx2v && cd lightx2v
git submodule update --init --recursive

# Create and activate the conda environment
conda create -n lightx2v python=3.11 && conda activate lightx2v
pip install -r requirements.txt

# Reinstall transformers separately to avoid version conflicts
pip install transformers==4.45.2

# Install flash-attention 2
cd lightx2v/3rd/flash-attention && pip install --no-cache-dir -v -e .

# Install flash-attention 3 (only if using Hopper GPUs; run from the flash-attention directory above)
cd hopper && pip install --no-cache-dir -v -e .
```
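
Optionally, verify the environment before moving on. The snippet below is a quick, illustrative check (assuming a CUDA machine) that PyTorch and the flash-attention build installed above are importable:

```python
# Optional environment check: confirm PyTorch sees the GPU and that the
# flash-attention package built above can be imported.
import torch
import flash_attn  # built from lightx2v/3rd/flash-attention in the steps above

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash_attn", flash_attn.__version__)
```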

---

## 1.2 Quantization Formats

**lightx2v** supports the following low-precision quantization schemes:

- **W8A8**: int8 weights and int8 activations.
- **FP8 (E4M3)**: float8 weights and activations.
- **Per-channel** weight quantization.
- **Per-token dynamic** activation quantization for improved precision.
- **Symmetric** quantization for both weights and activations (scale only, no zero-point).

When quantizing a model with **LLMC**, make sure the weight and activation bit-widths match one of the formats supported by **lightx2v**.
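
For intuition, the scheme above can be sketched in a few lines of PyTorch. This is illustrative only (not lightx2v's kernel code): symmetric int8, per-channel scales for weights, and per-token scales computed dynamically for activations.

```python
# Illustrative sketch only (not lightx2v's actual kernels): symmetric int8
# quantization with per-channel weight scales and per-token activation scales.
import torch

def quant_weight_per_channel_int8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel, no zero-point
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quant_act_per_token_int8(x: torch.Tensor):
    # x: [num_tokens, hidden]; one scale per token, computed at runtime (dynamic)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

w, x = torch.randn(4096, 4096), torch.randn(16, 4096)
qw, sw = quant_weight_per_channel_int8(w)
qx, sx = quant_act_per_token_int8(x)
# The int8 GEMM plus rescaling approximates the original floating-point matmul:
y_approx = (qx.float() * sx) @ (qw.float() * sw).t()
```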

---

## 1.3 Quantizing Models with LLMC

### 1.3.1 Calibration Data

For the Wan2.1 model on the I2V task, for example, a calibration dataset is provided in this [directory](https://github.com/ModelTC/llmc/tree/main/assets/wan_i2v/calib). Users can add more samples as needed.

### 1.3.2 Choosing a Quantization Algorithm

#### **W8A8**

We recommend **SmoothQuant** for the W8A8 setting.
Refer to the SmoothQuant W8A8 [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/video_gen/wan_i2v/smoothquant_w_a.yaml):

```yaml
quant:
    video_gen:
        method: SmoothQuant
        weight:
            bit: 8
            symmetric: True
            granularity: per_channel
        act:
            bit: 8
            symmetric: True
            granularity: per_token
        special:
            alpha: 0.75
```
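
The `alpha` hyperparameter controls how much of the activation outlier magnitude is migrated into the weights before quantization. Below is a minimal, illustrative sketch of SmoothQuant's smoothing step (not LLMC's internal implementation):

```python
# Illustrative sketch of SmoothQuant's per-channel smoothing with alpha = 0.75.
# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s and
# weights are multiplied by s, so X @ W.T is mathematically unchanged but the
# activation outliers become easier to quantize.
import torch

def smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.75):
    # x: [tokens, in_features] calibration activations, w: [out_features, in_features]
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    wgt_max = w.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    s = (act_max.pow(alpha) / wgt_max.pow(1.0 - alpha)).clamp(min=1e-5)
    return x / s, w * s

x = torch.randn(64, 4096) * (torch.rand(4096) * 10)  # simulate outlier channels
w = torch.randn(4096, 4096)
x_s, w_s = smooth(x, w)
print((x @ w.t() - x_s @ w_s.t()).abs().max())  # ~0 up to floating-point error
```

A larger `alpha` shifts more of the quantization difficulty from activations to weights; `0.75` is the value used in the configurations shown here.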

If SmoothQuant does not meet the precision requirement, use **AWQ** for better accuracy. See the corresponding [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/video_gen/wan_i2v/awq_w_a.yaml).

#### **FP8-Dynamic**

LLMC supports FP8 quantization with per-channel weights and per-token dynamic activations. SmoothQuant is again recommended; see the FP8 [configuration](https://github.com/ModelTC/llmc/tree/main/configs/quantization/backend/lightx2v/fp8/awq_fp8.yml):

```yaml
quant:
    video_gen:
        method: SmoothQuant
        weight:
            quant_type: float-quant
            bit: e4m3
            symmetric: True
            granularity: per_channel
            use_qtorch: True
        act:
            quant_type: float-quant
            bit: e4m3
            symmetric: True
            granularity: per_token
            use_qtorch: True
        special:
            alpha: 0.75
```
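
For intuition, E4M3 quantization scales each channel (or token) into the format's representable range (largest finite value 448) and then casts. The sketch below assumes a recent PyTorch build that exposes `torch.float8_e4m3fn`; it is illustrative and not LLMC's QPyTorch-based simulation path:

```python
# Illustrative sketch of symmetric FP8 (E4M3) weight quantization with
# per-channel scales. 448 is the largest finite value of float8_e4m3fn.
# LLMC itself simulates float quantization with QPyTorch during calibration.
import torch

FP8_E4M3_MAX = 448.0

def quant_weight_per_channel_fp8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / FP8_E4M3_MAX
    q = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(4096, 4096)
qw, sw = quant_weight_per_channel_fp8(w)
w_dequant = qw.to(torch.float32) * sw  # approximate reconstruction of w
```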

Ensure `quant_type` is set to `float-quant` and `use_qtorch` to `True`, as **LLMC** relies on [QPyTorch](https://github.com/Tiiiger/QPyTorch) for float quantization.

Install QPyTorch with:

```bash
pip install qtorch
```

### 1.3.3 Exporting the Quantized Model

```yaml
save:
    save_lightx2v: True
    save_path: /path/to/save_for_lightx2v/
```

Set `save_lightx2v` to `True`. LLMC will then export the weights as `torch.int8` or `torch.float8_e4m3fn` tensors, together with their quantization parameters, for direct loading in **lightx2v**.
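
As a quick sanity check, you can inspect the exported checkpoint and confirm the weight dtypes. The snippet below is hypothetical: it assumes the export is a standard safetensors file, and the file name shown is illustrative rather than LLMC's fixed output name.

```python
# Hypothetical sanity check of an exported checkpoint. The file name below is
# illustrative; point it at whatever LLMC wrote under save.save_path.
from safetensors.torch import load_file

state_dict = load_file("/path/to/save_for_lightx2v/model.safetensors")  # hypothetical name
for name, tensor in list(state_dict.items())[:10]:
    # Quantized weights should appear as torch.int8 or torch.float8_e4m3fn,
    # accompanied by their scale tensors.
    print(name, tuple(tensor.shape), tensor.dtype)
```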

### 1.3.4 Running LLMC

Set the LLMC path and the config path in the run script, then execute it:

```bash
# scripts/run_llmc.sh
llmc=llmc_path   # path to your local llmc repository
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=sq_for_lightx2v
config=${llmc}/configs/quantization/video_gen/wan_i2v/smoothquant_w_a.yaml
```

After LLMC finishes, the quantized model is saved to the directory specified by `save.save_path`.

### 1.3.5 Evaluation

For the I2V task with the Wan2.1 model, an evaluation dataset is provided [here](https://github.com/ModelTC/llmc/tree/main/assets/wan_i2v/eval). Set the following in the config file:

```yaml
eval:
    eval_pos: [fake_quant]
    type: video_gen
    name: i2v
    download: False
    path: ../assets/wan_i2v/eval/
    bs: 1
    target_height: 480
    target_width: 832
    num_frames: 81
    guidance_scale: 5.0
    output_video_path: ./output_videos_sq/
```

LLMC will generate evaluation videos with the fake-quantized (simulated quantization) model.

---

## 1.4 Inference with lightx2v

### 1.4.1 Weight Structure Conversion

After LLMC exports the model, convert its weight layout to the structure **lightx2v** expects using the [conversion script](https://github.com/ModelTC/lightx2v/blob/main/examples/diffusers/converter.py):

```bash
python converter.py -s /path/to/save_for_lightx2v/ -o /path/to/output/ -d backward
```

The converted model is saved under `/path/to/output/`.

### 1.4.2 Offline Inference

In the [inference script](https://github.com/ModelTC/lightx2v/blob/main/scripts/run_wan_i2v_advanced_ptq.sh), set `model_path` to `/path/to/output/` and `lightx2v_path` to your local lightx2v path, then run:

```bash
bash run_wan_i2v_advanced_ptq.sh
```