Commit ef26dc4

Merge branch 'main' into bdellabe/awq-w4a8
2 parents: 584a432 + 16de22f

242 files changed, +3464 -3960 lines changed


.github/workflows/test-check-transformers.yaml

Lines changed: 6 additions & 6 deletions
@@ -16,6 +16,10 @@ env:
   CADENCE: "commit"
   HF_TOKEN: ${{ secrets.HF_TOKEN_READ }}
 
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
 jobs:
   detect-changes:
     runs-on: ubuntu-latest
@@ -97,14 +101,10 @@ jobs:
         if: (success() || failure()) && steps.install.outcome == 'success'
         run: |
           pytest -v tests/llmcompressor/transformers/oneshot
-      - name: Running Sparsification Tests
-        if: (success() || failure()) && steps.install.outcome == 'success'
-        run: |
-          pytest tests/llmcompressor/transformers/sparsification -v
-      - name: Running OBCQ Tests
+      - name: Running SparseGPT Tests
         if: (success() || failure()) && steps.install.outcome == 'success'
         run: |
-          pytest -v tests/llmcompressor/transformers/obcq
+          pytest -v tests/llmcompressor/transformers/sparsegpt
       - name: Running Tracing Tests
         if: (success() || failure()) && steps.install.outcome == 'success'
         run: |

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -804,3 +804,8 @@ wandb/
 timings/
 output_finetune/
 env_log.json
+
+# uv artifacts
+uv.lock
+.venv/
+

DEVELOPING.md

Lines changed: 1 addition & 2 deletions
@@ -24,8 +24,7 @@ make style
 make quality
 ```
 
-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.
 
 **EXAMPLE: test changes locally**

Makefile

Lines changed: 1 addition & 4 deletions
@@ -26,15 +26,12 @@ quality:
 	@echo "Running python quality checks";
 	ruff check $(CHECKDIRS);
 	ruff format --check $(CHECKDIRS);
-	isort --check-only $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;
 
 # style the code according to accepted standards for the repo
 style:
 	@echo "Running python styling";
+	ruff check --fix $(CHECKDIRS);
 	ruff format $(CHECKDIRS);
-	isort $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;
 
 # run tests for the repo
 test:

README.md

Lines changed: 10 additions & 2 deletions
@@ -22,18 +22,26 @@
   <img alt="LLM Compressor Flow" src="https://github.com/user-attachments/assets/adf07594-6487-48ae-af62-d9555046d51b" width="80%" />
 </p>
 
+---
+
+💬 Join us on the [vLLM Community Slack](https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack) and share your questions, thoughts, or ideas in:
+
+- `#sig-quantization`
+- `#llm-compressor`
+
+---
+
 ## 🚀 What's New!
 
 Big updates have landed in LLM Compressor! To get a more in-depth look, check out the [deep-dive](https://x.com/RedHat_AI/status/1937865425687093554).
 
 Some of the exciting new features include:
 
+* **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py).
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 * **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
-* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
-* **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)

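The multiple-modifier bullet added above is the headline change in this README hunk. As a rough illustration only, a mixed-precision recipe along those lines might look like the sketch below; the modifier arguments, regex targets, and the `oneshot` call are assumptions patterned on llm-compressor's public AWQ/GPTQ examples rather than code from this commit, and the linked `examples/quantization_non_uniform/quantization_multiple_modifiers.py` remains the authoritative reference.

```python
# Hedged sketch of mixed-precision quantization with two modifiers.
# Modifier arguments, regex targets, and the oneshot() call are
# assumptions modeled on llm-compressor's published examples, not code
# taken from this commit.
from datasets import load_dataset
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Calibration data would be prepared as in the AWQ examples touched by
# this commit (chat template applied to "messages"); elided for brevity.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:256]")

recipe = [
    # AWQ W4A16 scoped to the attention projections
    AWQModifier(
        targets=[r"re:.*self_attn\.(q|k|v|o)_proj$"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
    # GPTQ W8A8 scoped to the MLP projections
    GPTQModifier(
        targets=[r"re:.*mlp\.(gate|up|down)_proj$"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```

The point of interest is simply that two modifiers can coexist in one recipe, each scoped to a different set of layers.
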
docs/developer/developing.md

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@ make style
 make quality
 ```
 
-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.
 
 **EXAMPLE: test changes locally**

docs/guides/saving_a_model.md

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ If you need more control, you can wrap `save_pretrained` manually:
 
 ```python
 from transformers import AutoModelForCausalLM
-from llmcompressor.transformers.sparsification.compressed_tensors_utils import modify_save_pretrained
+from llmcompressor.transformers.compression.compressed_tensors_utils import modify_save_pretrained
 
 # Load model
 model = AutoModelForCausalLM.from_pretrained("your-model")

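To show where that import-path change lands in practice, here is a minimal continuation of the guide's snippet. It assumes the guide's usual flow of wrapping the model and then calling `save_pretrained`; the `save_compressed=True` keyword is not visible in this hunk, so treat it as an assumption.

```python
from transformers import AutoModelForCausalLM

# Updated import path from this commit
from llmcompressor.transformers.compression.compressed_tensors_utils import (
    modify_save_pretrained,
)

# Load model
model = AutoModelForCausalLM.from_pretrained("your-model")

# Wrap save_pretrained so it can emit compressed-tensors checkpoints
modify_save_pretrained(model)

# Assumed keyword: save_compressed=True writes the compressed format
model.save_pretrained("your-model-compressed", save_compressed=True)
```
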
examples/awq/llama_example.py

Lines changed: 3 additions & 3 deletions
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }

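For readers comparing the old and new calibration setup: pile-val rows expose a raw "text" field, while ultrachat rows carry a full chat in "messages", which is why the `preprocess` body changes in step with the dataset id. The surrounding example roughly wires this together as sketched below; the sample slice, shuffle seed, and `map` step are assumptions based on the example's visible comments, not lines shown in this hunk.

```python
# Hedged sketch of the calibration pipeline around this hunk. Only the
# dataset id, split, and preprocess() body come from the diff; the
# 256-sample slice and shuffle seed are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; the example defines its own MODEL_ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 256

ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    # ultrachat rows already hold a whole conversation in "messages", so
    # the chat template is applied to the full list instead of wrapping a
    # single "text" field in a synthetic user turn.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)
```
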
examples/awq/qwen3_moe_example.py

Lines changed: 3 additions & 3 deletions
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }

examples/multimodal_vision/gemma3_example.py

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ def data_collator(batch):
         scheme="W4A16",
         ignore=[
             "lm_head",
-            "re:model\.vision_tower.*",
-            "re:model\.multi_modal_projector.*",
+            r"re:model\.vision_tower.*",
+            r"re:model\.multi_modal_projector.*",
         ],
     ),
 ]

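The only change here is switching the ignore patterns to raw strings. A quick standalone check of why that matters (nothing below is specific to llm-compressor):

```python
# "\." inside a normal string literal is an unrecognized escape sequence
# (a DeprecationWarning, and a SyntaxWarning on newer Python versions);
# a raw string passes the backslash through explicitly. Both spellings
# currently yield the same characters, so this is a warning fix rather
# than a behavior change.
import re

pattern = r"model\.vision_tower.*"
assert pattern == "model\\.vision_tower.*"  # same characters, spelled safely

# The escaped dot keeps the match literal.
assert re.match(pattern, "model.vision_tower.encoder")
assert re.match(pattern, "modelXvision_tower.encoder") is None
```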