21 changes: 21 additions & 0 deletions examples/quantization_attention/README.md
@@ -0,0 +1,21 @@
# Attention Quantization in LLM Compressor #
LLM Compressor supports applying static attention quantization to models.

## FP8 Attention Example ##
For an example applying attention quantization, see [llama3_attention.py](/experimental/attention/llama3_attention.py).

```python
recipe = QuantizationModifier(
config_groups={
"attention": QuantizationScheme(
targets=["LlamaAttention"],
input_activations=QuantizationArgs(
num_bits=8, type="float", strategy="attn_head"
),
)
}
)
```

Accuracy should be almost identical to the base model for FP8 attention.
Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
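The `attn_head` strategy calibrates one static scale per attention head rather than one per tensor. A minimal numpy sketch of the idea follows; this is an illustration only, not the LLM Compressor implementation, and it uses round-to-integer as a stand-in for true FP8 E4M3 rounding (448.0 is the real E4M3 maximum):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def quantize_per_head(x: np.ndarray):
    """Fake-quantize activations of shape (heads, seq, head_dim)
    using one static scale per attention head."""
    # Calibrate one scale per head from the max absolute activation
    amax = np.abs(x).max(axis=(1, 2), keepdims=True)
    scale = amax / FP8_E4M3_MAX
    # Scale down, round (simplified stand-in for FP8 rounding), clamp to range
    q = np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q * scale

# Example: 8 heads, 4 tokens, head_dim 16
x = np.random.randn(8, 4, 16).astype(np.float32)
q, scale = quantize_per_head(x)
x_hat = dequantize(q, scale)
```

Because each head gets its own scale, an outlier in one head does not inflate the quantization step size of the others.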
@@ -1,10 +1,10 @@
from compressed_tensors.offload import dispatch_model
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
20 changes: 1 addition & 19 deletions experimental/attention/README.md
@@ -1,23 +1,5 @@
# Attention Quantization in LLM Compressor #
LLM Compressor supports applying static attention quantization to models. Please note that attention quantization support in vLLM is still in progress and is not fully supported as of this writing.

## FP8 Attention Example ##
For an example applying attention quantization, see [llama3_attention.py](/experimental/attention/llama3_attention.py).

```python
recipe = QuantizationModifier(
config_groups={
"attention": QuantizationScheme(
targets=["LlamaAttention"],
input_activations=QuantizationArgs(
num_bits=8, type="float", strategy="attn_head"
),
)
}
)
```

Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
LLM Compressor supports applying static attention quantization to models. Please note that NVFP4 attention quantization and R3 support in vLLM are still in progress and are not fully supported as of this writing.

## NVFP4 Attention + R3 Example ##
Attention quantization can be improved using the R3 transform, as described by [SpinQuant](https://arxiv.org/abs/2405.16406). This transform reduces the presence of outliers in the attention activation distribution, thereby improving accuracy recovery.
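The intuition behind a rotation like R3 can be sketched with a normalized Hadamard matrix in numpy: rotating activations by an orthogonal matrix spreads a single-channel outlier across all channels, shrinking the maximum magnitude a quantizer must cover, while remaining exactly invertible. This is an illustrative sketch of the rotation idea only, not the LLM Compressor or SpinQuant transform API:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2),
    normalized so the result is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 64
H = hadamard(n)

# Activation vector with one large outlier channel
x = np.random.randn(n) * 0.1
x[3] = 50.0

# Rotating spreads the outlier's energy across all channels
x_rot = x @ H
```

Because `H` is orthogonal, the rotation can be folded into adjacent weights (`x @ W == (x @ H) @ (H.T @ W)`), so the model's output is unchanged while the quantizer sees a much flatter activation distribution.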