diff --git a/docs/steps/choosing-algo.md b/docs/steps/choosing-algo.md index cb20ba2698..6cfd8d9100 100644 --- a/docs/steps/choosing-algo.md +++ b/docs/steps/choosing-algo.md @@ -24,6 +24,64 @@ Weight and activation quantization is best for maximum throughput on modern hard !!! note AWQ and GPTQ are typically used for weight-only quantization but can also be applied to weight and activation quantization workflows. +### AWQ details + +The AWQ recipe uses the `AWQModifier`, which adjusts model scales ahead of weight quantization: + +```python +recipe = [ + AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]), +] +``` + +AWQ requires layer mappings to identify where to apply activation-aware scaling. Mappings for common model families are built in. For example, the Llama mapping looks like: + +```python +[ + AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]), + AWQMapping("re:.*v_proj", ["re:.*o_proj"]), + AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_proj", "re:.*up_proj"]), + AWQMapping("re:.*up_proj", ["re:.*down_proj"]), +] +``` + +!!! note + Mappings define which layers get smoothed, while `targets` and `ignore` define which layers get quantized. A layer in the `ignore` list that is matched by a mapping will still be smoothed but not quantized. + +To add support for a new model family, supply your own mappings via the `mappings` argument or contribute them to the [mappings registry](/src/llmcompressor/modifiers/awq/mappings.py). + +### AutoRound details + +AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. It processes each decoder layer sequentially using block-wise output reconstruction error as the training objective.
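The effect of the trainable rounding offset can be sketched in a few lines of plain Python. The function below (`quantize_with_offset`, a name invented for illustration; this is a toy sketch, not the library implementation) fake-quantizes one weight as `round(w/scale + v)`: with `v` near 0 this is ordinary round-to-nearest, while nudging `v` flips borderline rounding decisions, which is the degree of freedom AutoRound's V parameter tunes against the block reconstruction loss.

```python
def quantize_with_offset(w: float, scale: float, v: float, qmin: int = -8, qmax: int = 7) -> float:
    """Toy INT4 fake-quantization with a learnable rounding offset `v`
    (illustrates the idea behind AutoRound's V parameter; not the
    library implementation)."""
    q = round(w / scale + v)       # v in roughly [-0.5, 0.5] shifts the rounding decision
    q = max(qmin, min(qmax, q))    # clip to the signed INT4 range
    return q * scale               # dequantize back to a float weight

# w/scale = 1.3: rounds down by default, but a positive offset rounds it up
assert quantize_with_offset(0.13, 0.1, v=0.0) == 0.1  # round(1.3) -> 1
assert quantize_with_offset(0.13, 0.1, v=0.3) == 0.2  # round(1.6) -> 2
```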
+ +**When to use AutoRound:** + +- **INT4 for large models (≈30B+):** Performance comparable to other PTQ methods; accuracy drop is generally minimal at this scale. +- **INT4 for small-to-medium models:** Likely to deliver higher accuracy than other PTQ methods. +- **Sub-4-bit (INT2/INT3):** Shows 10–20% absolute accuracy improvements over PTQ methods, matching QAT performance at 1–2 orders of magnitude lower tuning cost. +- **New data types (MXFP4/NVFP4):** Consistently outperforms RTN in accuracy for emerging floating-point formats. + +**Key parameters:** + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `scheme` | Quantization scheme (e.g. `W4A16`, `W8A16`) | — | +| `iters` | Tuning iterations per block | 200 | +| `batch_size` | Batch size for calibration | 8 | +| `lr` | Learning rate; auto-set to `1.0/iters` if `None` | `None` | + +**Recommended configurations:** + +| Mode | Batch Size | Iters | Seq Length | Samples | Speed | Memory | Accuracy | +|------|------------|-------|------------|---------|-------|--------|----------| +| `default` | 8 | 200 | 2048 | 128 | Fast | Medium | Good | +| `best` | 8 | 1000 | 2048 | 512 | Slow | High | Best | +| `light` | 8 | 50 | 2048 | 128 | Fastest | Medium | Slight drop | +| `fast` | 4 | 200 | 512 | 128 | Fastest | Low | Good | + +!!! note + AutoRound currently supports WNA16, NVFP4, and W8A8-FP8 quantization schemes. Support for additional schemes is planned; follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968). 
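Mirroring the AWQ recipe above, an AutoRound recipe looks roughly as follows. Treat this as an illustrative fragment: the `AutoRoundModifier` class name appears elsewhere in this repository, but its exact import path and accepted arguments should be checked against your llm-compressor version.

```python
# Illustrative recipe fragment -- verify AutoRoundModifier's import path
# and argument names against your llm-compressor version.
recipe = [
    AutoRoundModifier(
        scheme="W4A16",      # quantization scheme, as in the parameter table above
        ignore=["lm_head"],
        iters=200,           # tuning iterations per block (default)
        batch_size=8,        # calibration batch size (default)
    ),
]
```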
+ ## KV cache and attention quantization KV cache quantization reduces memory usage for long context inference: diff --git a/examples/autoround/README.md b/examples/autoround/README.md deleted file mode 100644 index 66c13a0500..0000000000 --- a/examples/autoround/README.md +++ /dev/null @@ -1,80 +0,0 @@ -# `AutoRound` Quantization - -`llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM. - -AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. - -## Installation - -To get started, install: - -```bash -git clone https://github.com/vllm-project/llm-compressor.git -cd llm-compressor -pip install -e . -``` - -## When to Use AutoRound - -In summary, AutoRound demonstrates leading or on-par performance at 4-bit precision, with clear advantages for sub-4-bit, as reported in **SignRoundV1** ([paper](https://arxiv.org/pdf/2309.05516)), **SignRoundV2** ([paper](http://arxiv.org/abs/2512.04746)) and the **Intel Low-Bit Open LLM Leaderboard** ([link](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)), - -**INT4 for Large Models (≈30B and above)** -AutoRound achieves performance comparable to other PTQ methods, as the accuracy drop for these large models is generally minimal. 
- -**INT4 for Small-to-Medium LLMs** -AutoRound is likely to deliver higher accuracy than existing PTQ methods, making it particularly effective for smaller models. See SignRoundV1 And Low Bit Open LLM Leaderboard for accuracy data. - -**Sub-4-Bit Quantization (INT2/INT3)** -As the bit-width decreases, AutoRound shows increasing benefits, achieving 10–20% absolute accuracy improvements over PTQ methods, while matching QAT performance at 1–2 orders of magnitude lower tuning cost. See SignRound V2 for details. - -**New Data Types (MXFP4 / NVFP4)** -For emerging floating-point formats, AutoRound consistently outperforms RTN in accuracy, demonstrating strong forward compatibility with evolving quantization standards. See SignRound V2 for details. - -### Key Parameters -- `scheme`: Quantization scheme (e.g., `W4A16`, `W8A16`, more schemes will be supported soon) -- `iters`: Number of tuning iterations per block. Default: 200 -- `batch_size`: Batch size for calibration. Default: 8 -- `lr`: Learning rate for tuning. If `None`, auto-set to `1.0/iters`. Default: `None` -- `NUM_CALIBRATION_SAMPLES`: Number of calibration samples. Default: 128 -- `MAX_SEQUENCE_LENGTH`: Sequence length of calibration samples. Default: 2048 - - -### Quantization Configurations - -The accuracy of the quantized model is configured by tuning-related parameters. 
AutoRound provides four recommended configurations to balance accuracy and quantization speed: - -| Mode | Batch Size | Iterations | Sequence Length | Calibration Samples | Learning Rate | Quantization Speed | Memory Usage | Accuracy | -|---------|------------|------------|-----------------|---------------------|---------------|--------------------|--------------|------------| -|`default`| 8 | 200 | 2048 | 128 | Auto | 🚀🚀 | 🟡 Medium | 🎯🎯 Good | -|`best` | 8 | 1000 | 2048 | 512 | Auto | 🚀 | 🔴 High | 🏆 Best | -|`light` | 8 | 50 | 2048 | 128 | 5e-3 | 🚀🚀🚀 | 🟡 Medium | 🎯🎯 (slight drop in some cases) | -|`fast` | 4 | 200 | 512 | 128 | Auto | 🚀🚀🚀 | 🟢 Low | 🎯 | - -> [!TIP] -> - Use `best` for production models where accuracy is critical -> - Use `light` for rapid iteration during development (2-3× speedup) -> - Use `fast` when GPU memory is limited or for quick evaluation -> - The `default` recipe provides a good balance for most use cases - -> [!NOTE] -> These configurations are based on our experiments and may vary depending on the model architecture. 
- - -### Support Matrix -| Scheme | Examples | Note | -| ------------------- | ------------------------------------------------------------------------- | ------------------------------------- | -| `wNa16` | [llama3_example](./quantization_w4a16/llama3_example.py) | | -| `wNa16` | [qwen3_example](./quantization_w4a16/qwen3_example.py) | Multiple cards for `Qwen3-235B-A22B` | -| `wNa16` + `FP8KV` | [llama3_example](./quantization_kv_cache/llama3_example.py) | | -| `W8A8-FP8` Static | [llama4_example](./quantization_w8a8_fp8/llama4_static_quant_example.py) | | -| `W8A8-FP8` Dynamic | [llama4_example](./quantization_w8a8_fp8/llama4_dynamic_quant_example.py) | | -| `NVFP4` | [llama3.1_example](./quantization_w4a4_fp4/llama3.1_example.py) | | -| `MXFP4` | [qwen3_example](../../experimental/mxfp4/autoround_qwen3_example.py) | | - - -### Known Issues -Currently, `llm-compressor` supports applying AutoRound only on the WNA16, NVFP4, and W8A8-FP8 quantization schemes. Support for additional schemes is planned. You can follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968). - -### Questions or Feature Requests? - -Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round). diff --git a/examples/awq/README.md b/examples/awq/README.md deleted file mode 100644 index 321d77a960..0000000000 --- a/examples/awq/README.md +++ /dev/null @@ -1,47 +0,0 @@ -# AWQ Quantization # - -Activation Aware Quantization (AWQ) is a state-of-the-art technique to quantize the weights of large language models which involves using a small calibration dataset to calibrate the model. The AWQ algorithm utilizes calibration data to derive scaling factors which reduce the dynamic range of weights while minimizing accuracy loss to the most salient weight values. 
- -The AWQ implementation found in LLM Compressor is derived from the pioneering work of [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) and with assistance from its original maintainer, [@casper-hansen](https://github.com/casper-hansen). - -## AWQ Recipe ## - -The AWQ recipe has been inferfaced as follows, where the `AWQModifier` adjusts model scales ahead of efficient weight quantization by the `QuantizationModifier` - -```python -recipe = [ - AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]), -] -``` - -## Compressing Your Own Model ## -To use your own model, start with an existing example change the `model_id` to match your own model stub. -```python -model_id = "path/to/your/model" -model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto") -``` - -## Adding Mappings ## -In order to target weight and activation scaling locations within the model, the `AWQModifier` must be provided an AWQ mapping. For example, the AWQ mapping for the Llama family of models looks like this: - -```python -[ - AWQMapping( - "re:.*input_layernorm", - ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], - ), - AWQMapping("re:.*v_proj", ["re:.*o_proj"]), - AWQMapping( - "re:.*post_attention_layernorm", - ["re:.*gate_proj", "re:.*up_proj"], - ), - AWQMapping( - "re:.*up_proj", - ["re:.*down_proj"], - ), -] -``` - -Note: the mappings define which layers get smoothed whereas targets and ignore define which layers get quantized. So if you include a layer in the ignore list that is going to get matched due to the included mappings, it will get smoothed but not quantized. - -To support other model families, you can supply your own mappings via the `mappings` argument with instantiating the `AWQModifier`, or you can add them to the registry [here](/src/llmcompressor/modifiers/awq/mappings.py) (contributions are welcome!) 
diff --git a/examples/awq/RESULTS.md b/examples/awq/RESULTS.md deleted file mode 100644 index 61c1e8abda..0000000000 --- a/examples/awq/RESULTS.md +++ /dev/null @@ -1,61 +0,0 @@ -# AWQ + FP8 Quantization Results - -**Model:** Meta-Llama-3-8B-Instruct -**Hardware:** 8x NVIDIA A100-SXM4-80GB -**Date:** Feb 10, 2026 - -## Summary - -Ran the example scripts with both FP8 schemes (FP8_DYNAMIC and FP8_BLOCK) on Meta-Llama-3-8B-Instruct, then evaluated on GSM8K. - -This PR adds `RESULTS.md` with reproducible workflow for evaluating AWQ+FP8 quantization schemes on GSM8K. - -## GSM8K Results - -| Scheme | Strict Match | Flexible Extract | -|--------|-------------|------------------| -| **FP8_DYNAMIC** | 76.42% | 76.19% | -| **FP8_BLOCK** | 75.21% | 74.98% | - -**Evaluation details:** -- 1,319 test samples -- Batch size: 16 -- Model: Meta-Llama-3-8B-Instruct - -## Discussion - -This behavior where FP8_BLOCK underperforms FP8_DYNAMIC contradicts our expectation since for RTN FP8_BLOCK outperforms FP8_DYNAMIC, however there are 2 important things to notice. -1) FP8_BLOCK quantization creates quantization `groups` whose size is equivalent to the number of elements in a block, whereas FP8_DYNAMIC quantization creates quantization `groups` - whose size is equal to the in_features. Thus as long as in_features is less than the block size (128x128=16384) the number of weight scales will actually be higher for per channel quantization. - For Meta-Llama-3-8B-Instruct the per-channel weight quantization of the FP8_DYNAMIC scheme has more scales than FP8_BLOCK for every weight. 
-2) Its also noteworthy that for AWQ, the scale factors being searched for during AWQ align directly with the quantization scales of the per channel weight quantization, this is likely why AWQ yields - such a large improvement for FP8_DYNAMIC - -## Model Checkpoints - -- FP8_DYNAMIC: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-dynamic -- FP8_BLOCK: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block - -## Setup - -Use the existing example scripts from the repo: -```bash -cd examples/awq -python fp8_dynamic_llama_example.py -python fp8_block_llama_example.py -``` - -## Evaluation - -Run GSM8K evaluation using lm-eval: - -```bash -lm_eval \ - --model vllm \ - --model_args pretrained=,dtype=auto \ - --tasks gsm8k \ - --batch_size 16 \ - --output_path -``` - -**Important:** Setting `batch_size=16` is critical. The default `auto` picks 1, which significantly increases evaluation time. diff --git a/examples/awq/llama_example_ddp.py b/examples/ddp/llama3/w4a16/awq/llama_ddp_example.py similarity index 100% rename from examples/awq/llama_example_ddp.py rename to examples/ddp/llama3/w4a16/awq/llama_ddp_example.py diff --git a/examples/quantization_w4a16/llama3_ddp_example.py b/examples/ddp/llama3/w4a16/llama3_ddp_example.py similarity index 100% rename from examples/quantization_w4a16/llama3_ddp_example.py rename to examples/ddp/llama3/w4a16/llama3_ddp_example.py diff --git a/examples/quantization_w8a8_int8/benchmark_smoothquant_ddp.py b/examples/ddp/llama3/w8a8_int8/benchmark_smoothquant_ddp.py similarity index 100% rename from examples/quantization_w8a8_int8/benchmark_smoothquant_ddp.py rename to examples/ddp/llama3/w8a8_int8/benchmark_smoothquant_ddp.py diff --git a/examples/quantization_w8a8_int8/smoothquant_ddp_example.py b/examples/ddp/llama3/w8a8_int8/smoothquant_ddp_example.py similarity index 100% rename from examples/quantization_w8a8_int8/smoothquant_ddp_example.py rename to 
examples/ddp/llama3/w8a8_int8/smoothquant_ddp_example.py diff --git a/examples/autoround/ddp/ddp_qwen3_example.py b/examples/ddp/qwen3/w4a16/autoround/qwen3_ddp_example.py similarity index 100% rename from examples/autoround/ddp/ddp_qwen3_example.py rename to examples/ddp/qwen3/w4a16/autoround/qwen3_ddp_example.py diff --git a/examples/awq/qwen3_moe_example_ddp.py b/examples/ddp/qwen3/w4a16/awq/qwen3_moe_ddp_example.py similarity index 100% rename from examples/awq/qwen3_moe_example_ddp.py rename to examples/ddp/qwen3/w4a16/awq/qwen3_moe_ddp_example.py diff --git a/examples/quantizing_moe/deepseek_r1_example.py b/examples/models/deepseek_r1/w4a16/deepseek_r1_example.py similarity index 100% rename from examples/quantizing_moe/deepseek_r1_example.py rename to examples/models/deepseek_r1/w4a16/deepseek_r1_example.py diff --git a/examples/model_free_ptq/deepseek_r1_nvfp4_fp8_block.py b/examples/models/deepseek_r1/w8a8_fp8/deepseek_r1_nvfp4_fp8_block.py similarity index 100% rename from examples/model_free_ptq/deepseek_r1_nvfp4_fp8_block.py rename to examples/models/deepseek_r1/w8a8_fp8/deepseek_r1_nvfp4_fp8_block.py diff --git a/examples/quantization_w8a8_fp8/gemma2_example.py b/examples/models/gemma2/w8a8_fp8/gemma2_example.py similarity index 100% rename from examples/quantization_w8a8_fp8/gemma2_example.py rename to examples/models/gemma2/w8a8_fp8/gemma2_example.py diff --git a/examples/quantization_w8a8_int8/gemma2_example.py b/examples/models/gemma2/w8a8_int8/gemma2_example.py similarity index 100% rename from examples/quantization_w8a8_int8/gemma2_example.py rename to examples/models/gemma2/w8a8_int8/gemma2_example.py diff --git a/examples/multimodal_vision/gemma3_example.py b/examples/models/gemma3/w4a16/gemma3_example.py similarity index 100% rename from examples/multimodal_vision/gemma3_example.py rename to examples/models/gemma3/w4a16/gemma3_example.py diff --git a/examples/quantizing_moe/glm4_7_example.py b/examples/models/glm4/w4a16/awq/glm4_example.py 
similarity index 100% rename from examples/quantizing_moe/glm4_7_example.py rename to examples/models/glm4/w4a16/awq/glm4_example.py diff --git a/examples/quantizing_moe/glm5_example.py b/examples/models/glm5/w4a16/awq/glm5_example.py similarity index 100% rename from examples/quantizing_moe/glm5_example.py rename to examples/models/glm5/w4a16/awq/glm5_example.py diff --git a/examples/quantization_w4a8/gpt_oss_20b_example.py b/examples/models/gpt_oss/w4a8/gpt_oss_20b_example.py similarity index 100% rename from examples/quantization_w4a8/gpt_oss_20b_example.py rename to examples/models/gpt_oss/w4a8/gpt_oss_20b_example.py diff --git a/examples/quantization_w8a8_fp8/README_granite4.md b/examples/models/granite4/w8a8_fp8/README.md similarity index 100% rename from examples/quantization_w8a8_fp8/README_granite4.md rename to examples/models/granite4/w8a8_fp8/README.md diff --git a/examples/quantization_w8a8_fp8/granite4_example.py b/examples/models/granite4/w8a8_fp8/granite4_example.py similarity index 100% rename from examples/quantization_w8a8_fp8/granite4_example.py rename to examples/models/granite4/w8a8_fp8/granite4_example.py diff --git a/examples/multimodal_vision/idefics3_example.py b/examples/models/idefics3/w4a16/idefics3_example.py similarity index 100% rename from examples/multimodal_vision/idefics3_example.py rename to examples/models/idefics3/w4a16/idefics3_example.py diff --git a/examples/multimodal_vision/README_internvl3.md b/examples/models/internvl3/w8a8_fp8/README.md similarity index 100% rename from examples/multimodal_vision/README_internvl3.md rename to examples/models/internvl3/w8a8_fp8/README.md diff --git a/examples/multimodal_vision/internvl3_example.py b/examples/models/internvl3/w8a8_fp8/internvl3_example.py similarity index 100% rename from examples/multimodal_vision/internvl3_example.py rename to examples/models/internvl3/w8a8_fp8/internvl3_example.py diff --git a/examples/model_free_ptq/kimi_k2_thinking_nvfp4a16.py 
b/examples/models/kimi_k2/w4a16_fp4/kimi_k2_thinking_nvfp4a16.py similarity index 100% rename from examples/model_free_ptq/kimi_k2_thinking_nvfp4a16.py rename to examples/models/kimi_k2/w4a16_fp4/kimi_k2_thinking_nvfp4a16.py diff --git a/examples/model_free_ptq/kimi_k2_thinking_fp8_block.py b/examples/models/kimi_k2/w8a8_fp8/kimi_k2_thinking_fp8_block.py similarity index 100% rename from examples/model_free_ptq/kimi_k2_thinking_fp8_block.py rename to examples/models/kimi_k2/w8a8_fp8/kimi_k2_thinking_fp8_block.py diff --git a/examples/autoround/quantization_kv_cache/llama3_example.py b/examples/models/llama3/kv_cache/autoround/llama3_example.py similarity index 100% rename from examples/autoround/quantization_kv_cache/llama3_example.py rename to examples/models/llama3/kv_cache/autoround/llama3_example.py diff --git a/examples/quantization_w4a16/README.md b/examples/models/llama3/w4a16/README.md similarity index 100% rename from examples/quantization_w4a16/README.md rename to examples/models/llama3/w4a16/README.md diff --git a/examples/autoround/quantization_w4a16/README.md b/examples/models/llama3/w4a16/autoround/README.md similarity index 100% rename from examples/autoround/quantization_w4a16/README.md rename to examples/models/llama3/w4a16/autoround/README.md diff --git a/examples/autoround/quantization_w4a16/llama3_example.py b/examples/models/llama3/w4a16/autoround/llama3_example.py similarity index 100% rename from examples/autoround/quantization_w4a16/llama3_example.py rename to examples/models/llama3/w4a16/autoround/llama3_example.py diff --git a/examples/awq/llama_example.py b/examples/models/llama3/w4a16/awq/llama_example.py similarity index 100% rename from examples/awq/llama_example.py rename to examples/models/llama3/w4a16/awq/llama_example.py diff --git a/examples/awq/llama_example_with_masking.py b/examples/models/llama3/w4a16/awq/llama_example_with_masking.py similarity index 100% rename from examples/awq/llama_example_with_masking.py rename to 
examples/models/llama3/w4a16/awq/llama_example_with_masking.py diff --git a/examples/quantization_w4a16/llama3_example.py b/examples/models/llama3/w4a16/llama3_example.py similarity index 100% rename from examples/quantization_w4a16/llama3_example.py rename to examples/models/llama3/w4a16/llama3_example.py diff --git a/examples/multimodal_vision/mllama_example.py b/examples/models/llama3/w4a16/mllama_example.py similarity index 100% rename from examples/multimodal_vision/mllama_example.py rename to examples/models/llama3/w4a16/mllama_example.py diff --git a/examples/quantization_w4a16_fp4/mxfp4/llama3_example.py b/examples/models/llama3/w4a16_fp4/mxfp4/llama3_example.py similarity index 100% rename from examples/quantization_w4a16_fp4/mxfp4/llama3_example.py rename to examples/models/llama3/w4a16_fp4/mxfp4/llama3_example.py diff --git a/examples/quantization_w4a16_fp4/nvfp4/llama3_example.py b/examples/models/llama3/w4a16_fp4/nvfp4/llama3_example.py similarity index 100% rename from examples/quantization_w4a16_fp4/nvfp4/llama3_example.py rename to examples/models/llama3/w4a16_fp4/nvfp4/llama3_example.py diff --git a/examples/quantization_w4a4_fp4/README.md b/examples/models/llama3/w4a4_fp4/README.md similarity index 100% rename from examples/quantization_w4a4_fp4/README.md rename to examples/models/llama3/w4a4_fp4/README.md diff --git a/examples/autoround/quantization_w4a4_fp4/README.md b/examples/models/llama3/w4a4_fp4/autoround/README.md old mode 100755 new mode 100644 similarity index 53% rename from examples/autoround/quantization_w4a4_fp4/README.md rename to examples/models/llama3/w4a4_fp4/autoround/README.md index 52f688f94b..6488578926 --- a/examples/autoround/quantization_w4a4_fp4/README.md +++ b/examples/models/llama3/w4a4_fp4/autoround/README.md @@ -4,21 +4,7 @@ AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. 
The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. -## Installation - -To get started, install: - -```bash -git clone https://github.com/vllm-project/llm-compressor.git -cd llm-compressor -pip install -e . -``` - -## Quickstart - -The example includes end-to-end scripts for applying the AutoRound quantization algorithm. - -### Llama 3.1 Example +## Llama 3.1 Example ```bash python3 llama3.1_example.py @@ -26,7 +12,7 @@ python3 llama3.1_example.py The resulting model `Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM. -#### Evaluate Accuracy +### Evaluate Accuracy With the model created, we can now load and run in vLLM (after installing). @@ -47,26 +33,25 @@ lm_eval --model vllm \ --batch_size 'auto' ``` -##### meta-llama/Meta-Llama-3.1-8B-Instruct +#### meta-llama/Meta-Llama-3.1-8B-Instruct |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7710|± |0.0116| | | |strict-match | 5|exact_match|↑ |0.7043|± |0.0126| -##### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier) +#### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier) |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7248|± |0.0123| | | |strict-match | 5|exact_match|↑ |0.6611|± |0.0130| - -##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0) +#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0) |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| 
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7362|± |0.0121| | | |strict-match | 5|exact_match|↑ |0.6702|± |0.0129| -##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200) +#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200) |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7210|± |0.0124| @@ -74,59 +59,6 @@ lm_eval --model vllm \ > Note: quantized model accuracy may vary slightly due to nondeterminism. -### Qwen3-VL Example - -```bash -python3 qwen3_vl_example.py -``` - -The resulting model `Qwen3-VL-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM. - -#### Evaluate Accuracy - -Run the following to test accuracy on GSM-8K and ChartQA: - -```bash -lm_eval --model vllm-vlm \ - --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \ - --tasks gsm8k \ - --num_fewshot 5 \ - --batch_size 'auto' - -lm_eval --model vllm-vlm \ - --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \ - --tasks chartqa \ - --batch_size 'auto' \ - --apply_chat_template -``` - -##### Qwen/Qwen3-VL-8B-Instruct (Baseline) -|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| -|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8628|± |0.0095| -| | |strict-match | 5|exact_match|↑ |0.8453|± |0.0100| - -| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| -|-------|------:|------|-----:|-----------------|---|-----:|---|-----:| -|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.7908|± |0.0081| -| | |none | 0|exact_match |↑ |0.5592|± |0.0099| -| | |none | 0|relaxed_accuracy |↑ |0.7696|± |0.0084| - - -##### Qwen3-VL-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200) -|Tasks|Version| Filter 
|n-shot| Metric | |Value | |Stderr| -|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8415|± |0.0101| -| | |strict-match | 5|exact_match|↑ |0.8408|± |0.0101| - -| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| -|-------|------:|------|-----:|-----------------|---|-----:|---|-----:| -|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.8220|± |0.0077| -| | |none | 0|exact_match |↑ |0.5748|± |0.0099| -| | |none | 0|relaxed_accuracy |↑ |0.8044|± |0.0079| - -> Note: quantized model accuracy may vary slightly due to nondeterminism. - ### Questions or Feature Request? Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round). diff --git a/examples/autoround/quantization_w4a4_fp4/llama3.1_example.py b/examples/models/llama3/w4a4_fp4/autoround/llama3.1_example.py similarity index 100% rename from examples/autoround/quantization_w4a4_fp4/llama3.1_example.py rename to examples/models/llama3/w4a4_fp4/autoround/llama3.1_example.py diff --git a/examples/quantization_w4a4_fp4/llama3_example.py b/examples/models/llama3/w4a4_fp4/llama3_example.py similarity index 100% rename from examples/quantization_w4a4_fp4/llama3_example.py rename to examples/models/llama3/w4a4_fp4/llama3_example.py diff --git a/examples/quantization_w4a4_fp4/llama3_example_prefetch.py b/examples/models/llama3/w4a4_fp4/llama3_example_prefetch.py similarity index 100% rename from examples/quantization_w4a4_fp4/llama3_example_prefetch.py rename to examples/models/llama3/w4a4_fp4/llama3_example_prefetch.py diff --git a/examples/quantization_w4a4_fp4/llama3_gptq_example.py b/examples/models/llama3/w4a4_fp4/llama3_gptq_example.py similarity index 100% rename from examples/quantization_w4a4_fp4/llama3_gptq_example.py rename to examples/models/llama3/w4a4_fp4/llama3_gptq_example.py diff --git a/examples/awq/w4a8_fp8_llama_example.py 
b/examples/models/llama3/w4a8_fp8/awq/w4a8_fp8_llama_example.py
similarity index 100%
rename from examples/awq/w4a8_fp8_llama_example.py
rename to examples/models/llama3/w4a8_fp8/awq/w4a8_fp8_llama_example.py
diff --git a/examples/quantization_w4a8_fp8/llama3_example.py b/examples/models/llama3/w4a8_fp8/llama3_example.py
similarity index 100%
rename from examples/quantization_w4a8_fp8/llama3_example.py
rename to examples/models/llama3/w4a8_fp8/llama3_example.py
diff --git a/examples/quantization_w8a8_fp8/README.md b/examples/models/llama3/w8a8_fp8/README.md
similarity index 100%
rename from examples/quantization_w8a8_fp8/README.md
rename to examples/models/llama3/w8a8_fp8/README.md
diff --git a/examples/awq/fp8_block_llama_example.py b/examples/models/llama3/w8a8_fp8/awq/fp8_block_llama_example.py
similarity index 100%
rename from examples/awq/fp8_block_llama_example.py
rename to examples/models/llama3/w8a8_fp8/awq/fp8_block_llama_example.py
diff --git a/examples/awq/fp8_dynamic_llama_example.py b/examples/models/llama3/w8a8_fp8/awq/fp8_dynamic_llama_example.py
similarity index 100%
rename from examples/awq/fp8_dynamic_llama_example.py
rename to examples/models/llama3/w8a8_fp8/awq/fp8_dynamic_llama_example.py
diff --git a/examples/quantization_w8a8_fp8/llama3.2_vision_example.py b/examples/models/llama3/w8a8_fp8/llama3.2_vision_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/llama3.2_vision_example.py
rename to examples/models/llama3/w8a8_fp8/llama3.2_vision_example.py
diff --git a/examples/quantization_w8a8_fp8/llama3_example.py b/examples/models/llama3/w8a8_fp8/llama3_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/llama3_example.py
rename to examples/models/llama3/w8a8_fp8/llama3_example.py
diff --git a/examples/quantization_w8a8_int8/README.md b/examples/models/llama3/w8a8_int8/README.md
similarity index 100%
rename from examples/quantization_w8a8_int8/README.md
rename to examples/models/llama3/w8a8_int8/README.md
diff --git a/examples/quantization_w8a8_int8/llama3_example.py b/examples/models/llama3/w8a8_int8/llama3_example.py
similarity index 100%
rename from examples/quantization_w8a8_int8/llama3_example.py
rename to examples/models/llama3/w8a8_int8/llama3_example.py
diff --git a/examples/multimodal_vision/llama4_example.py b/examples/models/llama4/w4a16/llama4_example.py
similarity index 100%
rename from examples/multimodal_vision/llama4_example.py
rename to examples/models/llama4/w4a16/llama4_example.py
diff --git a/examples/quantization_w4a4_fp4/llama4_example.py b/examples/models/llama4/w4a4_fp4/llama4_example.py
similarity index 100%
rename from examples/quantization_w4a4_fp4/llama4_example.py
rename to examples/models/llama4/w4a4_fp4/llama4_example.py
diff --git a/examples/autoround/quantization_w8a8_fp8/llama4_dynamic_quant_example.py b/examples/models/llama4/w8a8_fp8/autoround/llama4_dynamic_quant_example.py
similarity index 100%
rename from examples/autoround/quantization_w8a8_fp8/llama4_dynamic_quant_example.py
rename to examples/models/llama4/w8a8_fp8/autoround/llama4_dynamic_quant_example.py
diff --git a/examples/autoround/quantization_w8a8_fp8/llama4_static_quant_example.py b/examples/models/llama4/w8a8_fp8/autoround/llama4_static_quant_example.py
similarity index 100%
rename from examples/autoround/quantization_w8a8_fp8/llama4_static_quant_example.py
rename to examples/models/llama4/w8a8_fp8/autoround/llama4_static_quant_example.py
diff --git a/examples/quantization_w8a8_fp8/llama4_fp8_block_example.py b/examples/models/llama4/w8a8_fp8/llama4_fp8_block_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/llama4_fp8_block_example.py
rename to examples/models/llama4/w8a8_fp8/llama4_fp8_block_example.py
diff --git a/examples/multimodal_vision/llava_example.py b/examples/models/llava/w4a16/llava_example.py
similarity index 100%
rename from examples/multimodal_vision/llava_example.py
rename to examples/models/llava/w4a16/llava_example.py
diff --git a/examples/quantization_w8a8_fp8/llava1.5_example.py b/examples/models/llava/w8a8_fp8/llava1.5_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/llava1.5_example.py
rename to examples/models/llava/w8a8_fp8/llava1.5_example.py
diff --git a/examples/multimodal_vision/medgemma_example.py b/examples/models/medgemma/w4a16/medgemma_example.py
similarity index 100%
rename from examples/multimodal_vision/medgemma_example.py
rename to examples/models/medgemma/w4a16/medgemma_example.py
diff --git a/examples/multimodal_vision/mistral3_chat_template.json b/examples/models/mistral3/w4a16/mistral3_chat_template.json
similarity index 100%
rename from examples/multimodal_vision/mistral3_chat_template.json
rename to examples/models/mistral3/w4a16/mistral3_chat_template.json
diff --git a/examples/multimodal_vision/mistral3_example.py b/examples/models/mistral3/w4a16/mistral3_example.py
similarity index 100%
rename from examples/multimodal_vision/mistral3_example.py
rename to examples/models/mistral3/w4a16/mistral3_example.py
diff --git a/examples/quantizing_moe/mixtral_example.py b/examples/models/mixtral/w8a8_fp8/mixtral_example.py
similarity index 100%
rename from examples/quantizing_moe/mixtral_example.py
rename to examples/models/mixtral/w8a8_fp8/mixtral_example.py
diff --git a/examples/models/omnicoder/w8a8_fp8/omnicoder_fp8_dynamic.py b/examples/models/omnicoder/w8a8_fp8/omnicoder_fp8_dynamic.py
new file mode 100644
index 0000000000..d41d33af64
--- /dev/null
+++ b/examples/models/omnicoder/w8a8_fp8/omnicoder_fp8_dynamic.py
@@ -0,0 +1,21 @@
+from llmcompressor import model_free_ptq
+
+MODEL_ID = "Tesslate/OmniCoder-9B"
+SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
+
+# Apply FP8-Dynamic quantization to the model.
+# Once quantized, the model is saved in
+# compressed-tensors format to SAVE_DIR.
+model_free_ptq(
+    model_stub=MODEL_ID,
+    save_directory=SAVE_DIR,
+    scheme="FP8_DYNAMIC",
+    ignore=[
+        "lm_head",
+        "re:.*model.embed_tokens.*",
+        "re:.*visual.*",
+        "re:.*conv1d.*",
+    ],
+    max_workers=15,
+    device="cuda:0",
+)
diff --git a/examples/multimodal_vision/phi3_vision_example.py b/examples/models/phi3/w4a16/phi3_vision_example.py
similarity index 100%
rename from examples/multimodal_vision/phi3_vision_example.py
rename to examples/models/phi3/w4a16/phi3_vision_example.py
diff --git a/examples/multimodal_vision/pixtral_example.py b/examples/models/pixtral/w4a16/pixtral_example.py
similarity index 100%
rename from examples/multimodal_vision/pixtral_example.py
rename to examples/models/pixtral/w4a16/pixtral_example.py
diff --git a/examples/quantizing_moe/qwen_example.py b/examples/models/qwen1.5/w4a16/qwen1.5_moe_example.py
similarity index 100%
rename from examples/quantizing_moe/qwen_example.py
rename to examples/models/qwen1.5/w4a16/qwen1.5_moe_example.py
diff --git a/examples/multimodal_vision/qwen_2_5_vl_example.py b/examples/models/qwen2.5/w4a16/qwen2.5_vl_example.py
similarity index 100%
rename from examples/multimodal_vision/qwen_2_5_vl_example.py
rename to examples/models/qwen2.5/w4a16/qwen2.5_vl_example.py
diff --git a/examples/quantization_w8a8_fp8/qwen_2_5_vl_example.py b/examples/models/qwen2.5/w8a8_fp8/qwen2.5_vl_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/qwen_2_5_vl_example.py
rename to examples/models/qwen2.5/w8a8_fp8/qwen2.5_vl_example.py
diff --git a/examples/multimodal_audio/qwen2_audio.py b/examples/models/qwen2/w4a16/qwen2_audio_example.py
similarity index 100%
rename from examples/multimodal_audio/qwen2_audio.py
rename to examples/models/qwen2/w4a16/qwen2_audio_example.py
diff --git a/examples/multimodal_vision/qwen2_vl_example.py b/examples/models/qwen2/w4a16/qwen2_vl_example.py
similarity index 100%
rename from examples/multimodal_vision/qwen2_vl_example.py
rename to examples/models/qwen2/w4a16/qwen2_vl_example.py
diff --git a/examples/quantization_w8a8_fp8/qwen2vl_example.py b/examples/models/qwen2/w8a8_fp8/qwen2_vl_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/qwen2vl_example.py
rename to examples/models/qwen2/w8a8_fp8/qwen2_vl_example.py
diff --git a/examples/quantization_w4a16_fp4/mxfp4/qwen3.5_example.py b/examples/models/qwen3.5/w4a16_fp4/mxfp4/qwen3.5_example.py
similarity index 100%
rename from examples/quantization_w4a16_fp4/mxfp4/qwen3.5_example.py
rename to examples/models/qwen3.5/w4a16_fp4/mxfp4/qwen3.5_example.py
diff --git a/examples/quantization_w4a16_fp4/nvfp4/qwen3.5_example.py b/examples/models/qwen3.5/w4a16_fp4/nvfp4/qwen3.5_example.py
similarity index 100%
rename from examples/quantization_w4a16_fp4/nvfp4/qwen3.5_example.py
rename to examples/models/qwen3.5/w4a16_fp4/nvfp4/qwen3.5_example.py
diff --git a/examples/quantization_w4a4_fp4/qwen3_5_example.py b/examples/models/qwen3.5/w4a4_fp4/qwen3.5_example.py
similarity index 100%
rename from examples/quantization_w4a4_fp4/qwen3_5_example.py
rename to examples/models/qwen3.5/w4a4_fp4/qwen3.5_example.py
diff --git a/examples/autoround/quantization_w4a16/qwen3_example.py b/examples/models/qwen3/w4a16/autoround/qwen3_example.py
similarity index 100%
rename from examples/autoround/quantization_w4a16/qwen3_example.py
rename to examples/models/qwen3/w4a16/autoround/qwen3_example.py
diff --git a/examples/awq/qwen3_coder_moe_example.py b/examples/models/qwen3/w4a16/awq/qwen3_coder_moe_example.py
similarity index 100%
rename from examples/awq/qwen3_coder_moe_example.py
rename to examples/models/qwen3/w4a16/awq/qwen3_coder_moe_example.py
diff --git a/examples/awq/qwen3_moe_example.py b/examples/models/qwen3/w4a16/awq/qwen3_moe_example.py
similarity index 100%
rename from examples/awq/qwen3_moe_example.py
rename to examples/models/qwen3/w4a16/awq/qwen3_moe_example.py
diff --git a/examples/awq/qwen3-vl-30b-a3b-Instruct-example.py b/examples/models/qwen3/w4a16/awq/qwen3_vl_30b_a3b_example.py
similarity index 100%
rename from examples/awq/qwen3-vl-30b-a3b-Instruct-example.py
rename to examples/models/qwen3/w4a16/awq/qwen3_vl_30b_a3b_example.py
diff --git a/examples/multimodal_vision/qwen3_vl_example.py b/examples/models/qwen3/w4a16/awq/qwen3_vl_example.py
similarity index 100%
rename from examples/multimodal_vision/qwen3_vl_example.py
rename to examples/models/qwen3/w4a16/awq/qwen3_vl_example.py
diff --git a/examples/multimodal_vision/qwen3_omni_example.py b/examples/models/qwen3/w4a16/qwen3_omni_example.py
similarity index 100%
rename from examples/multimodal_vision/qwen3_omni_example.py
rename to examples/models/qwen3/w4a16/qwen3_omni_example.py
diff --git a/examples/quantization_w4a16_fp4/mxfp4/qwen3_example.py b/examples/models/qwen3/w4a16_fp4/mxfp4/qwen3_example.py
similarity index 100%
rename from examples/quantization_w4a16_fp4/mxfp4/qwen3_example.py
rename to examples/models/qwen3/w4a16_fp4/mxfp4/qwen3_example.py
diff --git a/examples/quantization_w4a16_fp4/nvfp4/qwen3_example.py b/examples/models/qwen3/w4a16_fp4/nvfp4/qwen3_example.py
similarity index 100%
rename from examples/quantization_w4a16_fp4/nvfp4/qwen3_example.py
rename to examples/models/qwen3/w4a16_fp4/nvfp4/qwen3_example.py
diff --git a/examples/models/qwen3/w4a4_fp4/autoround/README.md b/examples/models/qwen3/w4a4_fp4/autoround/README.md
new file mode 100644
index 0000000000..7249b518bf
--- /dev/null
+++ b/examples/models/qwen3/w4a4_fp4/autoround/README.md
@@ -0,0 +1,61 @@
+# `AutoRound` Quantization
+
+`llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM.
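+
+As a quick orientation, applying AutoRound follows the same `oneshot` recipe flow used elsewhere in `llm-compressor`. The sketch below is illustrative only (the import path, calibration dataset, and modifier arguments are assumptions, not the tested configuration); see `qwen3_vl_example.py` in this directory for the complete script:
+
+```python
+from llmcompressor import oneshot
+from llmcompressor.modifiers.autoround import AutoRoundModifier  # import path is an assumption
+
+# Tune per-block rounding (V) and clipping (alpha, beta) parameters,
+# then quantize Linear weights to NVFP4, skipping the output head.
+recipe = AutoRoundModifier(targets="Linear", scheme="NVFP4", iters=200, ignore=["lm_head"])
+
+oneshot(
+    model="Qwen/Qwen3-VL-8B-Instruct",
+    dataset="open_platypus",  # any small calibration set works
+    recipe=recipe,
+    max_seq_length=2048,
+    num_calibration_samples=128,
+)
+```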
+
+AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
+
+## Qwen3-VL Example
+
+```bash
+python3 qwen3_vl_example.py
+```
+
+The resulting model `Qwen3-VL-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM.
+
+### Evaluate Accuracy
+
+Run the following to test accuracy on GSM-8K and ChartQA:
+
+```bash
+lm_eval --model vllm-vlm \
+  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --batch_size 'auto'
+
+lm_eval --model vllm-vlm \
+  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
+  --tasks chartqa \
+  --batch_size 'auto' \
+  --apply_chat_template
+```
+
+#### Qwen/Qwen3-VL-8B-Instruct (Baseline)
+|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8628|± |0.0095|
+| | |strict-match | 5|exact_match|↑ |0.8453|± |0.0100|
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
+|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.7908|± |0.0081|
+| | |none | 0|exact_match |↑ |0.5592|± |0.0099|
+| | |none | 0|relaxed_accuracy |↑ |0.7696|± |0.0084|
+
+#### Qwen3-VL-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
+|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8415|± |0.0101|
+| | |strict-match | 5|exact_match|↑ |0.8408|± |0.0101|
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
+|chartqa| 0|none | 0|anywhere_accuracy|↑ |0.8220|± |0.0077|
+| | |none | 0|exact_match |↑ |0.5748|± |0.0099|
+| | |none | 0|relaxed_accuracy |↑ |0.8044|± |0.0079|
+
+> Note: quantized model accuracy may vary slightly due to nondeterminism.
+
+### Questions or Feature Requests?
+
+Please open an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round).
diff --git a/examples/autoround/quantization_w4a4_fp4/qwen3_vl_example.py b/examples/models/qwen3/w4a4_fp4/autoround/qwen3_vl_example.py
similarity index 100%
rename from examples/autoround/quantization_w4a4_fp4/qwen3_vl_example.py
rename to examples/models/qwen3/w4a4_fp4/autoround/qwen3_vl_example.py
diff --git a/examples/quantization_w4a4_fp4/qwen_30b_a3b.py b/examples/models/qwen3/w4a4_fp4/qwen3_30b_a3b_example.py
similarity index 100%
rename from examples/quantization_w4a4_fp4/qwen_30b_a3b.py
rename to examples/models/qwen3/w4a4_fp4/qwen3_30b_a3b_example.py
diff --git a/examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py b/examples/models/qwen3/w4a4_fp4/qwen3_vl_moe_example.py
similarity index 100%
rename from examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py
rename to examples/models/qwen3/w4a4_fp4/qwen3_vl_moe_example.py
diff --git a/examples/quantization_w8a8_fp8/fp8_block_example.py b/examples/models/qwen3/w8a8_fp8/fp8_block_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/fp8_block_example.py
rename to examples/models/qwen3/w8a8_fp8/fp8_block_example.py
diff --git a/examples/model_free_ptq/qwen3_fp8_block.py b/examples/models/qwen3/w8a8_fp8/qwen3_fp8_block.py
similarity index 100%
rename from examples/model_free_ptq/qwen3_fp8_block.py
rename to examples/models/qwen3/w8a8_fp8/qwen3_fp8_block.py
diff --git a/examples/quantization_w8a8_fp8/qwen3_reranker_example.py b/examples/models/qwen3/w8a8_fp8/qwen3_reranker_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/qwen3_reranker_example.py
rename to examples/models/qwen3/w8a8_fp8/qwen3_reranker_example.py
diff --git a/examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py b/examples/models/qwen3/w8a8_fp8/qwen3_vl_moe_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py
rename to examples/models/qwen3/w8a8_fp8/qwen3_vl_moe_example.py
diff --git a/examples/awq/qwen3_next_example.py b/examples/models/qwen3_next/w4a16/awq/qwen3_next_example.py
similarity index 100%
rename from examples/awq/qwen3_next_example.py
rename to examples/models/qwen3_next/w4a16/awq/qwen3_next_example.py
diff --git a/examples/quantization_w4a4_fp4/qwen3_next_example.py b/examples/models/qwen3_next/w4a4_fp4/qwen3_next_example.py
similarity index 100%
rename from examples/quantization_w4a4_fp4/qwen3_next_example.py
rename to examples/models/qwen3_next/w4a4_fp4/qwen3_next_example.py
diff --git a/examples/quantization_w8a8_fp8/qwen3_next_example.py b/examples/models/qwen3_next/w8a8_fp8/qwen3_next_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/qwen3_next_example.py
rename to examples/models/qwen3_next/w8a8_fp8/qwen3_next_example.py
diff --git a/examples/multimodal_audio/whisper_example.py b/examples/models/whisper/w4a16/whisper_example.py
similarity index 100%
rename from examples/multimodal_audio/whisper_example.py
rename to examples/models/whisper/w4a16/whisper_example.py
diff --git a/examples/quantization_w8a8_fp8/whisper_example.py b/examples/models/whisper/w8a8_fp8/whisper_example.py
similarity index 100%
rename from examples/quantization_w8a8_fp8/whisper_example.py
rename to examples/models/whisper/w8a8_fp8/whisper_example.py