
Commit cd2054e

Merge branch 'kylesayrs/quantization-observer-tests' into kylesayrs/group-activation-quantization

2 parents: 1138be5 + 178d0ae

File tree: 90 files changed, +1095 −266 lines


.coveragerc

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[run]
+patch = subprocess

.github/workflows/test-check-transformers.yaml

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,10 @@ env:
   CADENCE: "commit"
   HF_TOKEN: ${{ secrets.HF_TOKEN_READ }}

+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
 jobs:
   detect-changes:
     runs-on: ubuntu-latest

DEVELOPING.md

Lines changed: 1 addition & 2 deletions
@@ -24,8 +24,7 @@ make style
 make quality
 ```

-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.

 **EXAMPLE: test changes locally**

Makefile

Lines changed: 1 addition & 4 deletions
@@ -26,15 +26,12 @@ quality:
 	@echo "Running python quality checks";
 	ruff check $(CHECKDIRS);
 	ruff format --check $(CHECKDIRS);
-	isort --check-only $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # style the code according to accepted standards for the repo
 style:
 	@echo "Running python styling";
+	ruff check --fix $(CHECKDIRS);
 	ruff format $(CHECKDIRS);
-	isort $(CHECKDIRS);
-	flake8 $(CHECKDIRS) --max-line-length 88 --extend-ignore E203,W605;

 # run tests for the repo
 test:

docs/developer/developing.md

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@ make style
 make quality
 ```

-This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
-repository's code matches its standards.
+This will run automatic code styling using `ruff` to test that the repository's code matches its standards.

 **EXAMPLE: test changes locally**

docs/getting-started/deploy.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -24,11 +24,13 @@ Before deploying your model, ensure you have the following prerequisites:
2424
vLLM provides a Python API for easy integration with your applications, enabling you to load and use your compressed model directly in your Python code. To test the compressed model, use the following code:
2525

2626
```python
27-
from vllm import LLM
27+
from vllm import LLM, SamplingParams
2828

2929
model = LLM("./TinyLlama-1.1B-Chat-v1.0-INT8")
30-
output = model.generate("What is machine learning?", max_tokens=256)
31-
print(output)
30+
sampling_params = SamplingParams(max_tokens=256)
31+
outputs = model.generate("What is machine learning?", sampling_params)
32+
for output in outputs:
33+
print(output.outputs[0].text)
3234
```
3335

3436
After running the above code, you should see the generated output from your compressed model. This confirms that the model is loaded and ready for inference.
@@ -39,7 +41,7 @@ vLLM also provides an HTTP server for serving your model via a RESTful API that
3941
To start the HTTP server, use the following command:
4042

4143
```bash
42-
vllm serve "./TinyLlama-1.1B-Chat-v1.0-INT8"
44+
vllm serve "TinyLlama-1.1B-Chat-v1.0-INT8"
4345
```
4446

4547
By default, the server will run on `localhost:8000`. You can change the host and port by using the `--host` and `--port` flags. Now that the server is running, you can send requests to it using any HTTP client. For example, you can use `curl` to send a request:

docs/getting-started/install.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ If you need a specific version of LLM Compressor, you can specify the version number
 pip install llmcompressor==0.5.1
 ```

-Replace `0.1.0` with your desired version number.
+Replace `0.5.1` with your desired version number.

 ### Install from Source

docs/guides/saving_a_model.md

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ If you need more control, you can wrap `save_pretrained` manually:
6969

7070
```python
7171
from transformers import AutoModelForCausalLM
72-
from llmcompressor.transformers.sparsification import modify_save_pretrained
72+
from llmcompressor.transformers.sparsification.compressed_tensors_utils import modify_save_pretrained
7373

7474
# Load model
7575
model = AutoModelForCausalLM.from_pretrained("your-model")
@@ -88,7 +88,11 @@ model.save_pretrained(
8888
### Saving with Custom Sparsity Configuration
8989

9090
```python
91-
from compressed_tensors.sparsification import SparsityCompressionConfig
91+
from transformers import AutoModelForCausalLM
92+
from compressed_tensors import SparsityCompressionConfig
93+
94+
# Load model
95+
model = AutoModelForCausalLM.from_pretrained("your-model")
9296

9397
# Create custom sparsity config
9498
custom_config = SparsityCompressionConfig(

examples/multimodal_vision/gemma3_example.py

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ def data_collator(batch):
         scheme="W4A16",
         ignore=[
             "lm_head",
-            "re:model\.vision_tower.*",
-            "re:model\.multi_modal_projector.*",
+            r"re:model\.vision_tower.*",
+            r"re:model\.multi_modal_projector.*",
         ],
     ),
 ]
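The change above only adds the raw-string prefix `r`; the patterns themselves are unchanged. A small standalone sketch of why that prefix matters for these `re:`-style ignore patterns:

```python
import re

# In a normal string literal, "\." is an invalid escape sequence: the string still
# contains a literal backslash followed by a dot, but recent Python versions emit a
# SyntaxWarning for it. The raw-string prefix keeps the pattern byte-for-byte the
# same while making the intent explicit and silencing the warning.
pattern = r"re:model\.vision_tower.*".removeprefix("re:")

assert re.match(pattern, "model.vision_tower.encoder.layer0")  # escaped dot matches "."
assert re.match(pattern, "modelXvision_tower") is None         # "\." is not a wildcard
```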
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# `fp8` Weight and Activation Quantization for Granite 4

`llmcompressor` supports quantizing weights and activations to `fp8` for memory savings and inference acceleration with `vllm`.

For Granite 4, in addition to the typical `nn.Linear` layers in the `mamba` or `mlp` modules, there are three "Linear-like" layers in `GraniteMoeHybridMoe` (the `moe` module) that could be quantized as well. Of the three, the `router` should usually be kept in high precision for accuracy reasons. Users can therefore choose to quantize the other two layers, `input_linear` and `output_linear`, for better model compression.

Note that `input_linear` and `output_linear` are `GraniteMoeHybridParallelExperts`, which subclasses `nn.Module` instead of `nn.Linear` because it needs to store its weights in 3D, i.e. `[num_experts, out_feat, in_feat]`. Because llm-compressor can only handle `nn.Linear` at the moment, our simple workaround is to:

1. **Swap `GraniteMoeHybridParallelExperts` with `GraniteMoeHybridParallelExpertsLinear`**

   The custom class is equivalent to the original one, except that it subclasses `nn.Linear` and stores 2D weights. MoE expert weight tensors are converted from 3D to 2D, i.e. from `[num_experts, out_feat, in_feat]` to `[num_experts * out_feat, in_feat]`.
2. **Perform dynamic fp8 quantization**

   The new class is compatible with typical per-channel weight quantization, so llm-compressor can identify these layers and process them normally. The resulting scales have shape `[num_experts * out_feat, 1]`.
3. **Reshape the weights and scales back to 3D before saving the checkpoint** (a short sketch of these reshapes follows below)

> `fp8` computation is supported on Nvidia GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
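For concreteness, here is a minimal sketch of the reshapes behind steps 1-3 above, using made-up sizes rather than Granite 4's real ones; the actual conversion lives in the `GraniteMoeHybridParallelExpertsLinear` helpers used by the example:

```python
import torch

# Made-up expert sizes for illustration only
num_experts, out_feat, in_feat = 8, 1024, 2048
w3d = torch.randn(num_experts, out_feat, in_feat)

# Step 1: 3D expert weights -> one 2D weight that looks like a single nn.Linear
w2d = w3d.reshape(num_experts * out_feat, in_feat)

# Step 2: per-channel weight quantization then produces one scale per output row,
# i.e. a [num_experts * out_feat, 1] tensor (computed here only to show the shape)
scales = w2d.abs().amax(dim=1, keepdim=True)
print(scales.shape)  # torch.Size([8192, 1])

# Step 3: the weights (and scales) are reshaped back to 3D before the checkpoint is saved
w3d_restored = w2d.reshape(num_experts, out_feat, in_feat)
assert torch.equal(w3d, w3d_restored)
```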
## Installation

To get started, install:

```bash
pip install llmcompressor
```

This checkpoint format needs the latest vllm (ver >= 0.10.1.1) to run correctly. Additional dependencies and environment variables needed are:
1. Dependencies: `vllm>=0.10.1.1`, `lm_eval>=0.4.9.1`, `flash-attn==2.7.3`, `torch>=2.7.1`
2. ENV VAR: `VLLM_USE_V1=0`, `VLLM_WORKER_MULTIPROC_METHOD=spawn`
## Quickstart

`granite4_example.py` demonstrates the quantization of `mamba`, `mlp`, and those "Linear-like" input/output layers with minimal changes to `llm-compressor`.

```bash
python3 granite4_example.py
```

The resulting model, `ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter`, is ready to be loaded into vLLM.

## Code Walkthrough

Now, we will step through the code in the example. There are three steps:
1) Load model
2) Apply quantization
3) Evaluate accuracy in vLLM

### 1) Load Model

Load the model using `AutoModelForCausalLM`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-4.0-tiny-preview"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
### 2) Apply Quantization

We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
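For intuition, here is a minimal sketch of what dynamic, per-token `fp8` activation quantization amounts to; this illustrates the idea rather than llm-compressor's actual kernels. Scales are derived from each token's own values at runtime, which is why no calibration data is needed.

```python
import torch

x = torch.randn(4, 2048)                        # 4 tokens, hidden size 2048
fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# One scale per token, computed on the fly from that token's max magnitude
scale = x.abs().amax(dim=-1, keepdim=True) / fp8_max

x_fp8 = (x / scale).to(torch.float8_e4m3fn)     # quantize
x_dequant = x_fp8.to(torch.float32) * scale     # dequantize to inspect the error
print((x - x_dequant).abs().max())              # small round-off error
```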
Note that we replace the 3D MoE expert layers with their 2D equivalent counterparts before quantization and convert them back to 3D before saving the model.

```python
from compressed_tensors.utils import replace_module
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Note: `GraniteMoeHybridParallelExperts` is the 3D expert layer from transformers'
# GraniteMoeHybrid model; `GraniteMoeHybridParallelExpertsLinear` is the custom 2D
# class described above (see granite4_example.py for its definition).

skip_router_only = True  # assume we want to quantize the input/output moe layers

ignore_lay = ["lm_head"]
if skip_router_only:
    # swap moe linears to a custom class
    for n, m in model.named_modules():
        if isinstance(m, GraniteMoeHybridParallelExperts):
            new_mod = GraniteMoeHybridParallelExpertsLinear.from_3d_expert(m)
            replace_module(model, n, new_mod)
    ignore_lay += ["re:.*block_sparse_moe.router"]
    SAVE_DIR = "ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter"

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets=["Linear", "GraniteMoeHybridParallelExpertsLinear"],
    scheme="FP8_DYNAMIC",
    ignore=ignore_lay,
)

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Revert weights of MoE experts to 3D format (num_experts, output_size, input_size)
for n, m in model.named_modules():
    if isinstance(m, GraniteMoeHybridParallelExpertsLinear):
        m.to_3d_expert()

# Save the model.
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
We have successfully created an `fp8` model!

### 3) Evaluate Accuracy

Install `vllm` and `lm-evaluation-harness`:

```bash
pip install vllm lm_eval
```

Load and run the model in `vllm` and evaluate accuracy with `lm_eval` on `gsm8k`:
1. **Base model**

   ```bash
   export MODEL=ibm-granite/granite-4.0-tiny-preview
   export OPT_FLAGS=tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.95,enable_prefix_caching=False,max_model_len=8192
   lm_eval --model vllm \
     --model_args pretrained=$MODEL,$OPT_FLAGS,add_bos_token=True \
     --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
   ```

   > Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.

   |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
   |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
   |gsm8k| 3|flexible-extract| 5|exact_match||0.602|± |0.0135|
   | | |strict-match | 5|exact_match||0.583|± |0.0136|

2. **FP8 model**

   ```bash
   export MODEL=$PWD/ibm-granite-4-tiny-fp8-dynamic-skipMoeRouter
   lm_eval --model vllm \
     --model_args pretrained=$MODEL,$OPT_FLAGS,add_bos_token=True \
     --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
   ```

   |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
   |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
   |gsm8k| 3|flexible-extract| 5|exact_match||0.6164|± |0.0134|
   | | |strict-match | 5|exact_match||0.5974|± |0.0135|

We can see that the resulting FP8 model looks comparable to (and sometimes slightly better than) the baseline.
> NOTE: If running with `hf` instead of `vllm`, such as with the command below, there will be an error related to `weight_scale` when the FP8 checkpoint is used.
>
> `lm_eval --model hf --model_args pretrained=$MODEL --batch_size 16 --trust_remote_code --tasks gsm8k`

### Questions or Feature Request?

Please open up an issue on `vllm-project/llm-compressor`.
