
Commit 371103f

Merge branch 'main' into feature/intermediates-cache-prefetch
2 parents 22b62b9 + 370c04c commit 371103f

File tree: 195 files changed (+3434 −4244 lines)


.MAINTAINERS

Lines changed: 0 additions & 17 deletions
This file was deleted.

.claude/skills/style.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# Code Quality Check
+
+After completing a batch of code changes:
+
+1. Run `make style` to auto-format the code
+2. Run `make quality` to validate code quality
+3. Fix any issues reported by `make quality`
+
+When writing code, keep lines under 88 characters and avoid unnecessary indentation.
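The 88-character limit mentioned in this skill file can be checked locally before running the make targets. A minimal sketch using POSIX awk; the sample file and its path are illustrative, not part of the skill:

```shell
# Create a small illustrative file, then flag any line longer than
# 88 characters, mirroring the limit stated in the skill above.
printf 'def f():\n    return 1\n' > /tmp/example.py
out=$(awk 'length($0) > 88 { print FILENAME ": line " FNR " exceeds 88 chars"; bad = 1 }
           END { exit bad }' /tmp/example.py && echo "within 88 chars")
echo "$out"
```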

.claude/skills/test.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Running Tests
+
+Nearly all tests in this project require a GPU.
+
+When running tests:
+
+1. First check if `canhazgpu` is available: `which canhazgpu`
+2. If available, MUST run tests using `canhazgpu` with appropriate GPU allocation
+3. Use the format: `canhazgpu --gpus 1 -- python3 -m pytest tests/...`
+
+Example:
+```bash
+# Check if canhazgpu is available
+which canhazgpu
+
+# Run tests with canhazgpu (required if available)
+canhazgpu --gpus 1 -- python3 -m pytest tests/test_example.py
+```
+
+Note: Adjust the `--gpus` count based on test requirements (typically 1 GPU is sufficient for most tests).
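The availability check in step 1 can be folded into a small wrapper so the same invocation works with or without the tool. A sketch; the plain-pytest fallback is an assumption of this note (the skill file only mandates `canhazgpu` when present), and the GPU count and test path are illustrative:

```shell
# Choose the runner: prefix with canhazgpu when it is installed,
# otherwise fall back to invoking pytest directly.
if command -v canhazgpu >/dev/null 2>&1; then
    runner="canhazgpu --gpus 1 -- python3 -m pytest"
else
    runner="python3 -m pytest"
fi
echo "would run: $runner tests/test_example.py"
```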

.github/mergify.yml

Lines changed: 39 additions & 0 deletions
@@ -1,4 +1,43 @@
+queue_rules:
+  - name: default
+    merge_method: merge
+    commit_message_template: |
+      {{ title }} (#{{ number }})
+
+      {{ body }}
+
+      Signed-off-by: Mergify <noreply@mergify.com>
+    queue_conditions:
+      - check-success=DCO
+      - check-success=quality-check
+      - check-success=transformers-tests
+      - check-success=base-tests (3.10)
+      - check-success=base-tests (3.13)
+      - check-success=pytorch-tests (3.10)
+      - check-success=pytorch-tests (3.13)
+      - check-success=markdown-link-check
+
 pull_request_rules:
+  - name: Automatically merge when ready
+    conditions:
+      - base=main
+      - label=ready
+      - "#approved-reviews-by>=2"
+      - check-success=DCO
+      - check-success=quality-check
+      - check-success=transformers-tests
+      - check-success=base-tests (3.10)
+      - check-success=base-tests (3.13)
+      - check-success=pytorch-tests (3.10)
+      - check-success=pytorch-tests (3.13)
+      - check-success=markdown-link-check
+      - check-success=ready-label-check
+      - -conflict
+      - -draft
+    actions:
+      queue:
+        name: default
+
   - name: label-documentation
     description: Automatically apply documentation label
     conditions:

.gitignore

Lines changed: 7 additions & 2 deletions
@@ -128,6 +128,13 @@ venv.bak/
 /site
 docs/.cache/*
 
+# zensical docs build (generated by pre-build scripts)
+docs/api/llmcompressor/
+docs/examples/
+docs/experimental/
+docs/developer/code-of-conduct.md
+docs/developer/contributing.md
+
 # mypy
 .mypy_cache/
 ### Example user template template
@@ -811,5 +818,3 @@ env_log.json
 # uv artifacts
 uv.lock
 .venv/
-
-.claude/

.readthedocs.yaml

Lines changed: 10 additions & 16 deletions
@@ -1,22 +1,16 @@
-# Read the Docs configuration file
-# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
-
-# Required
 version: 2
 
-# Set the OS, Python version, and other tools you might need
 build:
   os: ubuntu-24.04
   tools:
     python: "3.12"
-
-# Build documentation with Mkdocs
-mkdocs:
-  configuration: mkdocs.yml
-
-python:
-  install:
-    - method: pip
-      path: .
-      extra_requirements:
-        - dev
+  jobs:
+    install:
+      - pip install -e ".[dev]"
+    build:
+      html:
+        - python docs/scripts/zensical_gen_files.py
+        - zensical build
+    post_build:
+      - mkdir -p $READTHEDOCS_OUTPUT/html/
+      - cp --recursive site/* $READTHEDOCS_OUTPUT/html/

README.md

Lines changed: 4 additions & 5 deletions
@@ -37,13 +37,12 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
-* **Batched Calibration Support**: LLM Compressor now supports calibration with batch sizes > 1. A new [`batch_size`](src/llmcompressor/args/dataset_arguments.py#L70) argument has been added to the `dataset_arguments` enabling the option to improve quantization speed. Default `batch_size` is currently set to 1
+* **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs/guides/big_models_and_distributed/model_loading.md).
+* **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples/quantization_w4a16/llama3_ddp_example.py).
+* **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models
 * **New Model-Free PTQ Pathway**: A new model-free PTQ pathway has been added to LLM Compressor, called [`model_free_ptq`](src/llmcompressor/entrypoints/model_free/__init__.py#L36). This pathway allows you to quantize your model without the requirement of Hugging Face model definition and is especially useful in cases where `oneshot` may fail. This pathway is currently supported for data-free pathways only i.e FP8 quantization and was leveraged to quantize the [Mistral Large 3 model](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512). Additional [examples](examples/model_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2
 * **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization. KV Cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme including a new `per-head` quantization scheme. Support for these checkpoints is on-going in vLLM and scripts to get started have been added to the [experimental folder](experimental/attention)
-* **Generalized AWQ Support**: The AWQModifier has been updated to support quantization schemes beyond W4A16 (e.g W4AFp8). In particular, AWQ no longer constrains that the quantization config needs to have the same settings for `group_size`, `symmetric`, and `num_bits` for each config_group
-* **AutoRound Quantization Support**: Added [`AutoRoundModifier`](examples/autoround) for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance
-* **Experimental MXFP4 Support**: Models can now be quantized using an [`MXFP4`](https://github.com/vllm-project/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py#L208) pre-set scheme. Examples can be found under the [experimental folder](experimental/mxfp4/llama3_mxfp4.py). This pathway is still experimental as support and validation with vLLM is still a WIP.
-* **R3 Transform Support**: LLM Compressor now supports applying transforms to attention in the style of SpinQuant's R3 rotation. Note: this feature is currently not yet supported in vLLM. An example applying R3 can be found in the [experimental folder](experimental/attention/llama3_attention_r3_nvfp4.py)
+
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)

docs/.nav.yml

Lines changed: 21 additions & 8 deletions
@@ -1,14 +1,16 @@
 nav:
   - Home: index.md
-  - Why use LLM Compressor?: getting-started/why-llmcompressor.md
-  - Choosing the right compression scheme: getting-started/choosing-scheme.md
-  - Choosing the right compression algorithm: getting-started/choosing-algo.md
+  - Why use LLM Compressor?: steps/why-llmcompressor.md
+  - Compressing your model, step-by-step:
+      - Choosing your model: steps/choosing-model.md
+      - Choosing the right compression scheme: steps/choosing-scheme.md
+      - Choosing the right compression algorithm: steps/choosing-algo.md
+      - Choosing a dataset: steps/choosing-dataset.md
+      - Compressing your model: steps/compress.md
+      - Deploying with vLLM: steps/deploy.md
   - Getting started:
       - getting-started/index.md
       - Installing LLM Compressor: getting-started/install.md
-  - Compressing your Model: getting-started/compress.md
-  - Deploying with vLLM: getting-started/deploy.md
-  - FAQ: getting-started/faq.md
   - Key Models:
       - key-models/index.md
      - Llama 4:
@@ -24,14 +26,25 @@ nav:
       - key-models/mistral-large-3/index.md
       - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - Guides:
+      - Big Models and Distributed Support:
+          - Model Loading: guides/big_models_and_distributed/model_loading.md
+          - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
+          - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
       - Compression Schemes: guides/compression_schemes.md
       - Saving a Model: guides/saving_a_model.md
-      - Observers: observers.md
+      - Observers: guides/observers.md
+      - Memory Requirements: guides/memory.md
+      - Runtime Performance: guides/runtime.md
   - Examples:
-      - examples/index.md
+      - examples/README.md
       - examples/*
+  - Experimental:
+      - experimental/README.md
+      - experimental/*
   - Developer:
       - developer/index.md
       - developer/*
   - API Reference:
       - api/*
+  - FAQ:
+      - faq/faq.md

docs/DEVELOPMENT.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Getting started with LLM Compressor docs
+
+```bash
+cd docs
+```
+
+- Install the dependencies:
+
+```bash
+make install
+```
+
+- Clean the previous build (optional but recommended):
+
+```bash
+make clean
+```
+
+- Generate docs content (files, API references, and navigation):
+
+```bash
+make gen
+```
+
+- Serve the docs locally (runs `gen` automatically):
+
+```bash
+make serve
+```
+
+This will start a local server. You can now open your browser and view the documentation.
+
+- Build the static site (runs `gen` automatically):
+
+```bash
+make build
+```
+
+- List all available targets:
+
+```bash
+make help
+```
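Taken together, the targets above give a one-shot local rebuild. A sketch, assuming you start at the repository root and use the docs Makefile from this commit:

```shell
cd docs
make clean        # drop generated artifacts (optional but recommended)
make build        # runs `gen` first, then builds the static site
# the rendered site is written to docs/site/
```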

docs/Makefile

Lines changed: 16 additions & 10 deletions
@@ -1,26 +1,32 @@
-# Minimal mkdocs makefile
+# Minimal zensical makefile
 
-PYTHON := python3
-MKDOCS_CMD := mkdocs
-MKDOCS_CONF := ../mkdocs.yml
+ZENSICAL_CMD := zensical
+ZENSICAL_CONF := ../zensical.toml
 
-.PHONY: help install serve build clean
+.PHONY: help install gen serve build clean
 
 help:
 	@echo "Available targets:"
 	@echo "  install   Install dependencies globally"
+	@echo "  gen       Generate docs content (files + API + nav)"
 	@echo "  serve     Serve docs locally"
 	@echo "  build     Build static site"
 	@echo "  clean     Remove build artifacts"
 
 install:
 	pip install -e "../[dev]"
 
-serve:
-	$(MKDOCS_CMD) serve --livereload -f $(MKDOCS_CONF)
+gen:
+	cd .. && python docs/scripts/zensical_gen_files.py
 
-build:
-	$(MKDOCS_CMD) build -f $(MKDOCS_CONF)
+serve: gen
+	cd .. && $(ZENSICAL_CMD) serve
+
+build: gen
+	cd .. && $(ZENSICAL_CMD) build
 
 clean:
-	rm -rf site/ .cache/
+	rm -rf site/ .cache/ api/llmcompressor/
+	rm -rf examples/ experimental/
+	rm -f developer/code-of-conduct.md developer/contributing.md
+	cd .. && python3 docs/scripts/zensical_gen_files.py --clean

0 commit comments
