
Commit cd2054e

Merge branch 'kylesayrs/quantization-observer-tests' into kylesayrs/group-activation-quantization

2 parents: 1138be5 + 178d0ae

File tree: 90 files changed, +1095 −266 lines


.coveragerc

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[run]
+patch = subprocess

.github/workflows/test-check-transformers.yaml

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,10 @@ env:
   CADENCE: "commit"
   HF_TOKEN: ${{ secrets.HF_TOKEN_READ }}

+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
 jobs:
   detect-changes:
     runs-on: ubuntu-latest

DEVELOPING.md

Lines changed: 1 addition & 2 deletions
@@ -24,8 +24,7 @@ make style
 make quality
 ```

-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.

 **EXAMPLE: test changes locally**

Makefile

Lines changed: 1 addition & 4 deletions
@@ -26,15 +26,12 @@ quality:
 	@echo "Running python quality checks";
 	ruff check $(CHECKDIRS);
 	ruff format --check $(CHECKDIRS);
-	isort --check-only $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # style the code according to accepted standards for the repo
 style:
 	@echo "Running python styling";
+	ruff check --fix $(CHECKDIRS);
 	ruff format $(CHECKDIRS);
-	isort $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # run tests for the repo
 test:

docs/developer/developing.md

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@ make style
 make quality
 ```

-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.

 **EXAMPLE: test changes locally**

docs/getting-started/deploy.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -24,11 +24,13 @@ Before deploying your model, ensure you have the following prerequisites:
2424
vLLM provides a Python API for easy integration with your applications, enabling you to load and use your compressed model directly in your Python code. To test the compressed model, use the following code:
2525

2626
```python
27-
from vllm import LLM
27+
from vllm import LLM, SamplingParams
2828

2929
model = LLM("./TinyLlama-1.1B-Chat-v1.0-INT8")
30-
output = model.generate("What is machine learning?", max_tokens=256)
31-
print(output)
30+
sampling_params = SamplingParams(max_tokens=256)
31+
outputs = model.generate("What is machine learning?", sampling_params)
32+
for output in outputs:
33+
print(output.outputs[0].text)
3234
```
3335

3436
After running the above code, you should see the generated output from your compressed model. This confirms that the model is loaded and ready for inference.
@@ -39,7 +41,7 @@ vLLM also provides an HTTP server for serving your model via a RESTful API that
3941
To start the HTTP server, use the following command:
4042

4143
```bash
42-
vllm serve "./TinyLlama-1.1B-Chat-v1.0-INT8"
44+
vllm serve "TinyLlama-1.1B-Chat-v1.0-INT8"
4345
```
4446

4547
By default, the server will run on `localhost:8000`. You can change the host and port by using the `--host` and `--port` flags. Now that the server is running, you can send requests to it using any HTTP client. For example, you can use `curl` to send a request:

docs/getting-started/install.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ If you need a specific version of LLM Compressor, you can specify the version number
 pip install llmcompressor==0.5.1
 ```

-Replace `0.1.0` with your desired version number.
+Replace `0.5.1` with your desired version number.

 ### Install from Source

docs/guides/saving_a_model.md

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ If you need more control, you can wrap `save_pretrained` manually:
6969

7070
```python
7171
from transformers import AutoModelForCausalLM
72-
from llmcompressor.transformers.sparsification import modify_save_pretrained
72+
from llmcompressor.transformers.sparsification.compressed_tensors_utils import modify_save_pretrained
7373

7474
# Load model
7575
model = AutoModelForCausalLM.from_pretrained("your-model")
@@ -88,7 +88,11 @@ model.save_pretrained(
8888
### Saving with Custom Sparsity Configuration
8989

9090
```python
91-
from compressed_tensors.sparsification import SparsityCompressionConfig
91+
from transformers import AutoModelForCausalLM
92+
from compressed_tensors import SparsityCompressionConfig
93+
94+
# Load model
95+
model = AutoModelForCausalLM.from_pretrained("your-model")
9296

9397
# Create custom sparsity config
9498
custom_config = SparsityCompressionConfig(

examples/multimodal_vision/gemma3_example.py

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ def data_collator(batch):
         scheme="W4A16",
         ignore=[
             "lm_head",
-            "re:model\.vision_tower.*",
-            "re:model\.multi_modal_projector.*",
+            r"re:model\.vision_tower.*",
+            r"re:model\.multi_modal_projector.*",
         ],
     ),
 ]
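The change above only adds the raw-string prefix `r`; the patterns themselves are unchanged. A small standalone sketch of why that prefix matters for these `re:`-style ignore patterns:

```python
import re

# In a normal string literal, "\." is an invalid escape sequence: the string still
# contains a literal backslash followed by a dot, but recent Python versions emit a
# SyntaxWarning for it. The raw-string prefix keeps the pattern byte-for-byte the
# same while making the intent explicit and silencing the warning.
pattern = r"re:model\.vision_tower.*".removeprefix("re:")

assert re.match(pattern, "model.vision_tower.encoder.layer0")  # escaped dot matches "."
assert re.match(pattern, "modelXvision_tower") is None         # "\." is not a wildcard
```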
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# `fp8` Weight and Activation Quantization for Granite 4

`llmcompressor` supports quantizing weights and activations to `fp8` for memory savings and inference acceleration with `vllm`.

For Granite 4, in addition to the typical `nn.Linear` layers in the `mamba` or `mlp` modules, there are three "Linear-like" layers in `GraniteMoeHybridMoe` (the `moe` module) that could be quantized as well. Of the three, the `router` should usually be kept in high precision for accuracy reasons. Users can therefore choose to quantize the other two layers, `input_linear` and `output_linear`, for better model compression.

Note that `input_linear` and `output_linear` are `GraniteMoeHybridParallelExperts`, which subclasses `nn.Module` instead of `nn.Linear` because it needs to store its weights in 3D, i.e. `[num_experts, out_feat, in_feat]`. Because llm-compressor can only handle `nn.Linear` at the moment, our simple workaround is to:

1. **Swap `GraniteMoeHybridParallelExperts` with `GraniteMoeHybridParallelExpertsLinear`**

   The custom class is equivalent to the original one, except that it subclasses `nn.Linear` and stores 2D weights. MoE expert weight tensors are converted from 3D to 2D, i.e. from `[num_experts, out_feat, in_feat]` to `[num_experts * out_feat, in_feat]`.
2. **Perform dynamic fp8 quantization**

   The new class is compatible with typical per-channel weight quantization, so llm-compressor can identify these layers and process them normally. The resulting scales have shape `[num_experts * out_feat, 1]`.
3. **Reshape the weights and scales back to 3D before saving the checkpoint** (a short sketch of these reshapes follows below)

> `fp8` computation is supported on Nvidia GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
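For concreteness, here is a minimal sketch of the reshapes behind steps 1-3 above, using made-up sizes rather than Granite 4's real ones; the actual conversion lives in the `GraniteMoeHybridParallelExpertsLinear` helpers used by the example:

```python
import torch

# Made-up expert sizes for illustration only
num_experts, out_feat, in_feat = 8, 1024, 2048
w3d = torch.randn(num_experts, out_feat, in_feat)

# Step 1: 3D expert weights -> one 2D weight that looks like a single nn.Linear
w2d = w3d.reshape(num_experts * out_feat, in_feat)

# Step 2: per-channel weight quantization then produces one scale per output row,
# i.e. a [num_experts * out_feat, 1] tensor (computed here only to show the shape)
scales = w2d.abs().amax(dim=1, keepdim=True)
print(scales.shape)  # torch.Size([8192, 1])

# Step 3: the weights (and scales) are reshaped back to 3D before the checkpoint is saved
w3d_restored = w2d.reshape(num_experts, out_feat, in_feat)
assert torch.equal(w3d, w3d_restored)
```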
## Installation

To get started, install:

```bash
pip install llmcompressor
```

This checkpoint format needs the latest vllm (ver >= 0.10.1.1) to run correctly. Additional dependencies and environment variables needed are:
1. Dependencies: `vllm>=0.10.1.1`, `lm_eval>=0.4.9.1`, `flash-attn==2.7.3`, `torch>=2.7.1`
2. ENV VAR: `VLLM_USE_V1=0`, `VLLM_WORKER_MULTIPROC_METHOD=spawn`
## Quickstart

`granite4_example.py` demonstrates the quantization of `mamba`, `mlp`, and those "Linear-like" input/output layers with minimal changes to `llm-compressor`.

```bash
python3 granite4_example.py
```

The resulting model, `ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter`, is ready to be loaded into vLLM.

## Code Walkthrough

Now, we will step through the code in the example. There are three steps:
1) Load model
2) Apply quantization
3) Evaluate accuracy in vLLM

### 1) Load Model

Load the model using `AutoModelForCausalLM`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-4.0-tiny-preview"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
### 2) Apply Quantization

We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
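For intuition, here is a minimal sketch of what dynamic, per-token `fp8` activation quantization amounts to; this illustrates the idea rather than llm-compressor's actual kernels. Scales are derived from each token's own values at runtime, which is why no calibration data is needed.

```python
import torch

x = torch.randn(4, 2048)                        # 4 tokens, hidden size 2048
fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# One scale per token, computed on the fly from that token's max magnitude
scale = x.abs().amax(dim=-1, keepdim=True) / fp8_max

x_fp8 = (x / scale).to(torch.float8_e4m3fn)     # quantize
x_dequant = x_fp8.to(torch.float32) * scale     # dequantize to inspect the error
print((x - x_dequant).abs().max())              # small round-off error
```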
Note that we replace the 3D MoE expert layers with their 2D equivalent counterparts before quantization and convert them back to 3D before saving the model.

```python
from compressed_tensors.utils import replace_module
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Note: `GraniteMoeHybridParallelExperts` is the 3D expert layer from transformers'
# GraniteMoeHybrid model; `GraniteMoeHybridParallelExpertsLinear` is the custom 2D
# class described above (see granite4_example.py for its definition).

skip_router_only = True  # assume we want to quantize the input/output moe layers

ignore_lay = ["lm_head"]
if skip_router_only:
    # swap moe linears to a custom class
    for n, m in model.named_modules():
        if isinstance(m, GraniteMoeHybridParallelExperts):
            new_mod = GraniteMoeHybridParallelExpertsLinear.from_3d_expert(m)
            replace_module(model, n, new_mod)
    ignore_lay += ["re:.*block_sparse_moe.router"]
    SAVE_DIR = "ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter"

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets=["Linear", "GraniteMoeHybridParallelExpertsLinear"],
    scheme="FP8_DYNAMIC",
    ignore=ignore_lay,
)

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Revert weights of MoE experts to 3D format (num_experts, output_size, input_size)
for n, m in model.named_modules():
    if isinstance(m, GraniteMoeHybridParallelExpertsLinear):
        m.to_3d_expert()

# Save the model.
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
We have successfully created an `fp8` model!

### 3) Evaluate Accuracy

Install `vllm` and `lm-evaluation-harness`:

```bash
pip install vllm lm_eval
```

Load and run the model in `vllm` and evaluate accuracy with `lm_eval` on `gsm8k`:
1. **Base model**

   ```bash
   export MODEL=ibm-granite/granite-4.0-tiny-preview
   export OPT_FLAGS=tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.95,enable_prefix_caching=False,max_model_len=8192
   lm_eval --model vllm \
     --model_args pretrained=$MODEL,$OPT_FLAGS,add_bos_token=True \
     --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
   ```

   > Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.

   |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
   |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
   |gsm8k| 3|flexible-extract| 5|exact_match||0.602|± |0.0135|
   | | |strict-match | 5|exact_match||0.583|± |0.0136|

2. **FP8 model**

   ```bash
   export MODEL=$PWD/ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter
   lm_eval --model vllm \
     --model_args pretrained=$MODEL,$OPT_FLAGS,add_bos_token=True \
     --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
   ```

   |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
   |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
   |gsm8k| 3|flexible-extract| 5|exact_match||0.6164|± |0.0134|
   | | |strict-match | 5|exact_match||0.5974|± |0.0135|

We can see that the resulting FP8 model looks comparable to (and sometimes slightly better than) the baseline.
> NOTE: If running with `hf` instead of `vllm`, such as with the command below, there will be an error related to `weight_scale` when the FP8 checkpoint is used.
>
> `lm_eval --model hf --model_args pretrained=$MODEL --batch_size 16 --trust_remote_code --tasks gsm8k`

### Questions or Feature Request?

Please open up an issue on `vllm-project/llm-compressor`.
