Commit 1f8ae1d

Move to zensical
Signed-off-by: Aidan Reilly <aireilly@redhat.com>
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
1 parent 12aa563 commit 1f8ae1d

10 files changed: +717 -123 lines changed

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -128,6 +128,13 @@ venv.bak/
 /site
 docs/.cache/*
 
+# zensical docs build (generated by pre-build scripts)
+docs/api/llmcompressor/
+docs/examples/
+docs/experimental/
+docs/developer/code-of-conduct.md
+docs/developer/contributing.md
+
 # mypy
 .mypy_cache/
 ### Example user template template

.readthedocs.yaml

Lines changed: 10 additions & 16 deletions
@@ -1,22 +1,16 @@
-# Read the Docs configuration file
-# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
-
-# Required
 version: 2
 
-# Set the OS, Python version, and other tools you might need
 build:
   os: ubuntu-24.04
   tools:
     python: "3.12"
-
-# Build documentation with Mkdocs
-mkdocs:
-  configuration: mkdocs.yml
-
-python:
-  install:
-    - method: pip
-      path: .
-      extra_requirements:
-        - dev
+  jobs:
+    install:
+      - pip install -e ".[dev]"
+    build:
+      html:
+        - python docs/scripts/zensical_gen_files.py
+        - zensical build
+    post_build:
+      - mkdir -p $READTHEDOCS_OUTPUT/html/
+      - cp --recursive site/* $READTHEDOCS_OUTPUT/html/
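
The `post_build` job above creates the output directory and then copies the built site into it. As a rough illustration of that step only, here is a minimal Python sketch. The function name `publish_site` is hypothetical and is not part of this commit; the commit itself uses the two shell commands shown in the diff.

```python
import shutil
import tempfile
from pathlib import Path


def publish_site(site_dir: Path, output_root: Path) -> Path:
    """Copy a built site into <output_root>/html, creating it if needed.

    Hypothetical helper mirroring the two post_build shell commands.
    """
    html_dir = output_root / "html"
    html_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    # dirs_exist_ok=True merges into the directory created above.
    # Unlike the shell glob `site/*`, this also copies top-level dotfiles.
    shutil.copytree(site_dir, html_dir, dirs_exist_ok=True)
    return html_dir


# Small demo in a throwaway directory standing in for the real build tree.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    site.mkdir()
    (site / "index.html").write_text("<h1>docs</h1>")
    html_dir = publish_site(site, Path(tmp) / "output")
    print(sorted(p.name for p in html_dir.iterdir()))
```

Running the copy as a separate step keeps the site generator unaware of where the hosting platform expects its HTML.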

docs/.nav.yml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 nav:
-  - Home: index.md
+  - Home: README.md
   - Why use LLM Compressor?: steps/why-llmcompressor.md
   - Compressing your model, step-by-step:
     - Choosing your model: steps/choosing-model.md

docs/Makefile

Lines changed: 15 additions & 10 deletions
@@ -1,26 +1,31 @@
-# Minimal mkdocs makefile
+# Minimal zensical makefile
 
-PYTHON := python3
-MKDOCS_CMD := mkdocs
-MKDOCS_CONF := ../mkdocs.yml
+ZENSICAL_CMD := zensical
+ZENSICAL_CONF := ../zensical.toml
 
-.PHONY: help install serve build clean
+.PHONY: help install gen serve build clean
 
 help:
 	@echo "Available targets:"
 	@echo " install Install dependencies globally"
+	@echo " gen Generate docs content (files + API + nav)"
 	@echo " serve Serve docs locally"
 	@echo " build Build static site"
 	@echo " clean Remove build artifacts"
 
 install:
 	pip install -e "../[dev]"
 
-serve:
-	$(MKDOCS_CMD) serve --livereload -f $(MKDOCS_CONF)
+gen:
+	cd .. && python docs/scripts/zensical_gen_files.py
 
-build:
-	$(MKDOCS_CMD) build -f $(MKDOCS_CONF)
+serve: gen
+	cd .. && $(ZENSICAL_CMD) serve
+
+build: gen
+	cd .. && $(ZENSICAL_CMD) build
 
 clean:
-	rm -rf site/ .cache/
+	rm -rf site/ .cache/ api/llmcompressor/
+	rm -rf examples/ experimental/
+	rm -f developer/code-of-conduct.md developer/contributing.md
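
The `gen` target runs `docs/scripts/zensical_gen_files.py`, which is not shown in this excerpt. Judging only from the generated paths listed in `.gitignore` and removed by `clean`, such a script might copy source files into `docs/` before the build. The sketch below is hypothetical: the destination paths come from this commit, while the source file names, the `generate` function, and the copy-only behavior are assumptions (the real script also generates API pages and navigation, per the `help` text).

```python
import shutil
import tempfile
from pathlib import Path

# Destinations are taken from the .gitignore entries in this commit.
# The source names on the left are assumptions, not confirmed by the diff.
FILE_COPIES = {
    "CODE_OF_CONDUCT.md": "docs/developer/code-of-conduct.md",
    "CONTRIBUTING.md": "docs/developer/contributing.md",
}
TREE_COPIES = {
    "examples": "docs/examples",
    "experimental": "docs/experimental",
}


def generate(repo_root: Path) -> list[Path]:
    """Copy Markdown sources into docs/ and return the paths written."""
    written = []
    for src, dst in FILE_COPIES.items():
        src_path, dst_path = repo_root / src, repo_root / dst
        if src_path.is_file():
            dst_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src_path, dst_path)
            written.append(dst_path)
    for src, dst in TREE_COPIES.items():
        src_dir = repo_root / src
        if not src_dir.is_dir():
            continue
        # Mirror only the Markdown files, preserving the folder layout.
        for md in sorted(src_dir.rglob("*.md")):
            target = repo_root / dst / md.relative_to(src_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(md, target)
            written.append(target)
    return written


# Demo against a throwaway repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CONTRIBUTING.md").write_text("# Contributing")
    (root / "examples" / "awq").mkdir(parents=True)
    (root / "examples" / "awq" / "README.md").write_text("# AWQ")
    for path in generate(root):
        print(path.relative_to(root))
```

Because everything the script writes is reproducible from tracked sources, ignoring those paths in `.gitignore` and deleting them in `clean` is a consistent choice.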

docs/README.md

Lines changed: 65 additions & 17 deletions
@@ -1,25 +1,73 @@
-# Getting started with LLM Compressor docs
+# What is LLM Compressor?
 
-```bash
-cd docs
-```
+**LLM Compressor** is an easy-to-use library for optimizing large language models for deployment with vLLM. It provides a comprehensive toolkit for applying state-of-the-art compression algorithms to reduce model size, lower hardware requirements, and improve inference performance.
 
-- Install the dependencies:
+<p align="center">
+  <img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/>
+</p>
 
-```bash
-make install
-```
+## Which challenges does LLM Compressor address?
 
-- Clean the previous build (optional but recommended):
+Model optimization through quantization and pruning addresses the key challenges of deploying AI at scale:
 
-```bash
-make clean
-```
+| Challenge | How LLM Compressor helps |
+|-----------|--------------------------|
+| GPU and infrastructure costs | Reduces memory requirements by 50-75%, enabling deployment on fewer GPUs |
+| Response latency | Reduces data movement overhead because quantized weights load faster |
+| Request throughput | Utilizes lower-precision tensor cores for faster computation |
+| Energy consumption | Smaller models consume less power during inference |
 
-- Serve the docs:
+For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md)
 
-```bash
-make serve
-```
+## New in this release
 
-This will start a local server at http://localhost:8000. You can now open your browser and view the documentation.
+Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
+
+!!! info "Batched Calibration Support"
+    LLM Compressor now supports calibration with batch sizes greater than 1. A new `batch_size` argument has been added to the `dataset_arguments`, which can improve quantization speed. The default `batch_size` is currently 1.
+
+!!! info "New Model-Free PTQ Pathway"
+    A new model-free PTQ pathway, `model_free_ptq`, has been added to LLM Compressor. It lets you quantize a model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. The pathway currently supports only data-free schemes, such as FP8 quantization, and was used to quantize the Mistral Large 3 model. Additional examples illustrate how LLM Compressor can be used for Kimi K2.
+
+!!! info "Extended KV Cache and Attention Quantization Support"
+    LLM Compressor now supports attention quantization. KV cache quantization, which previously supported only per-tensor scales, now supports any quantization scheme, including a new per-head scheme. Support for these checkpoints is ongoing in vLLM, and scripts to get started have been added to the [experimental](https://github.com/vllm-project/llm-compressor/tree/main/experimental) folder.
+
+!!! info "Generalized AWQ Support"
+    The `AWQModifier` now supports quantization schemes beyond W4A16 (for example, W4AFp8). In particular, AWQ no longer requires every `config_group` in the quantization config to share the same `group_size`, `symmetric`, and `num_bits` settings.
+
+!!! info "AutoRound Quantization Support"
+    Added `AutoRoundModifier` for quantization using AutoRound, an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
+
+!!! info "Experimental MXFP4 Support"
+    Models can now be quantized using an MXFP4 preset scheme. Examples can be found in the experimental folder. This pathway is still experimental because support and validation with vLLM are still in progress.
+
+## Supported algorithms and techniques
+
+| Algorithm | Description | Use Case |
+|-----------|-------------|----------|
+| **RTN** (Round-to-Nearest) | Fast baseline quantization | Quick compression with minimal setup |
+| **GPTQ** | Weighted quantization with calibration | High-accuracy 4 and 8 bit weight quantization |
+| **AWQ** | Activation-aware weight quantization | Preserves accuracy for important weights |
+| **SmoothQuant** | Outlier handling for W8A8 | Improved activation quantization |
+| **SparseGPT** | Pruning with quantization | 2:4 sparsity patterns |
+| **SpinQuant** | Rotation-based transforms | Improved low-bit accuracy |
+| **QuIP** | Incoherence processing | Advanced quantization preprocessing |
+| **FP8 KV Cache** | KV cache quantization | Long context inference on Hopper-class and newer GPUs |
+| **AutoRound** | Optimizes rounding and clipping ranges via sign-gradient descent | Broad compatibility |
+
+## Supported quantization schemes
+
+LLM Compressor supports applying multiple formats in a given model.
+
+| Format | Targets | Compute Capability | Use Case |
+|--------|---------|-------------------|----------|
+| **W4A16/W8A16** | Weights | 8.0 (Ampere and up) | Optimize for latency on older hardware |
+| **W8A8-INT8** | Weights and activations | 7.5 (Turing and up) | Balanced performance and compatibility |
+| **W8A8-FP8** | Weights and activations | 8.9 (Ada Lovelace and up) | High throughput on modern GPUs |
+| **NVFP4/MXFP4** | Weights and activations | 10.0 (Blackwell) | Maximum compression on latest hardware |
+| **W4AFP8** | Weights and activations | 8.9 (Ada Lovelace and up) | Low-bit weights with dynamic FP8 activations |
+| **W4AINT8** | Weights and activations | 7.5 (Turing and up) | Low-bit weights with dynamic INT8 activations |
+| **2:4 Sparse** | Weights | 8.0 (Ampere and up) | Sparsity-accelerated inference |
+
+!!! note
+    Listed compute capability indicates the minimum architecture required for hardware acceleration.
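
The new README makes two quantitative points that a short worked example may make concrete: the 50-75% memory reduction, and the minimum compute capability per format. The sketch below is illustrative only. The capability values are transcribed from the table in the diff; the helper names `accelerated_formats` and `weight_memory_gb` are hypothetical and are not LLM Compressor APIs. The memory estimate assumes a 16-bit baseline and ignores quantization scales and zero points.

```python
# Minimum compute capability per format, transcribed from the README table.
MIN_CAPABILITY = {
    "W4A16/W8A16": (8, 0),
    "W8A8-INT8": (7, 5),
    "W8A8-FP8": (8, 9),
    "NVFP4/MXFP4": (10, 0),
    "W4AFP8": (8, 9),
    "W4AINT8": (7, 5),
    "2:4 Sparse": (8, 0),
}


def accelerated_formats(capability: tuple[int, int]) -> list[str]:
    """Return the formats whose minimum compute capability is met."""
    return [
        name
        for name, minimum in MIN_CAPABILITY.items()
        if capability >= minimum  # tuples compare major first, then minor
    ]


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    Ignores scales, zero points, activations, and the KV cache.
    """
    return num_params * bits_per_weight / 8 / 1e9


# An Ampere-class GPU (compute capability 8.0) as an example input.
print(accelerated_formats((8, 0)))

# Weight storage for a model with 8 billion parameters.
baseline = weight_memory_gb(8e9, 16)
for bits in (16, 8, 4):
    size = weight_memory_gb(8e9, bits)
    print(f"{bits}-bit: {size:.1f} GB, {100 * (1 - size / baseline):.0f}% smaller")
```

Moving from 16-bit to 8-bit weights halves the storage and 4-bit weights quarter it, which matches the 50-75% range quoted in the challenges table.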

docs/index.md

Lines changed: 0 additions & 73 deletions
This file was deleted.
