21 changes: 21 additions & 0 deletions examples/quantization_attention/README.md
@@ -0,0 +1,21 @@
# Attention Quantization in LLM Compressor #
LLM Compressor supports applying static attention quantization to models.

## FP8 Attention Example ##
For an example applying attention quantization, see [llama3_attention.py](/experimental/attention/llama3_attention.py).

```python
recipe = QuantizationModifier(
config_groups={
"attention": QuantizationScheme(
targets=["LlamaAttention"],
input_activations=QuantizationArgs(
num_bits=8, type="float", strategy="attn_head"
),
)
}
)
```

Accuracy should be almost identical to the base model for FP8 attention.
Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
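The `attn_head` strategy calibrates one static scale per attention head rather than one per tensor. A minimal numpy sketch of the idea follows; this is an illustration only, not the LLM Compressor implementation, and it uses round-to-integer as a stand-in for true FP8 E4M3 rounding (448.0 is the real E4M3 maximum):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def quantize_per_head(x: np.ndarray):
    """Fake-quantize activations of shape (heads, seq, head_dim)
    using one static scale per attention head."""
    # Calibrate one scale per head from the max absolute activation
    amax = np.abs(x).max(axis=(1, 2), keepdims=True)
    scale = amax / FP8_E4M3_MAX
    # Scale down, round (simplified stand-in for FP8 rounding), clamp to range
    q = np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q * scale

# Example: 8 heads, 4 tokens, head_dim 16
x = np.random.randn(8, 4, 16).astype(np.float32)
q, scale = quantize_per_head(x)
x_hat = dequantize(q, scale)
```

Because each head gets its own scale, an outlier in one head does not inflate the quantization step size of the others.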
@@ -1,10 +1,10 @@
from compressed_tensors.offload import dispatch_model
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
20 changes: 1 addition & 19 deletions experimental/attention/README.md
@@ -1,23 +1,5 @@
# Attention Quantization in LLM Compressor #
LLM Compressor supports applying static attention quantization to models. Please note that attention quantization support in vLLM is still in progress and is not fully supported as of this writing.

## FP8 Attention Example ##
For an example applying attention quantization, see [llama3_attention.py](/experimental/attention/llama3_attention.py).

```python
recipe = QuantizationModifier(
config_groups={
"attention": QuantizationScheme(
targets=["LlamaAttention"],
input_activations=QuantizationArgs(
num_bits=8, type="float", strategy="attn_head"
),
)
}
)
```

Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
LLM Compressor supports applying static attention quantization to models. Please note that NVFP4 attention quantization and R3 support in vLLM are still in progress and are not fully supported as of this writing.

## NVFP4 Attention + R3 Example ##
Attention quantization can be improved using the R3 transform, as described by [SpinQuant](https://arxiv.org/abs/2405.16406). This transform reduces the presence of outliers in the attention activation distribution, thereby improving accuracy recovery.
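The intuition behind a rotation like R3 can be sketched with a normalized Hadamard matrix in numpy: rotating activations by an orthogonal matrix spreads a single-channel outlier across all channels, shrinking the maximum magnitude a quantizer must cover, while remaining exactly invertible. This is an illustrative sketch of the rotation idea only, not the LLM Compressor or SpinQuant transform API:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2),
    normalized so the result is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 64
H = hadamard(n)

# Activation vector with one large outlier channel
x = np.random.randn(n) * 0.1
x[3] = 50.0

# Rotating spreads the outlier's energy across all channels
x_rot = x @ H
```

Because `H` is orthogonal, the rotation can be folded into adjacent weights (`x @ W == (x @ H) @ (H.T @ W)`), so the model's output is unchanged while the quantizer sees a much flatter activation distribution.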