This directory contains example scripts for quantizing LLMs using the static per-tensor FP8 quantization scheme.
To get started, install the necessary dependencies by executing the following commands:
```shell
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
```shell
python3 mixtral_example.py
```

This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized Mixtral-8x7B-Instruct-v0.1 model. The model is calibrated using the `ultrachat_200k` dataset; no fine-tuning is performed.
You can follow the detailed steps below or simply run the example script with:

```shell
python mixtral_example.py
```

In this step, you'll choose a base model for quantization, a dataset for calibration, and a quantization recipe.
- Models: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
- Datasets: Can also be from a local directory or the Hugging Face Hub.
- Recipes: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    scheme="FP8",
    targets="Linear",
    ignore=["lm_head", "re:.*block_sparse_moe.gate"],
)
```

NOTE: `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
The `oneshot` method applies the selected recipe to your model and dataset without requiring any fine-tuning. The model will be quantized and saved to `Mixtral-8x7B-Instruct-v0.1-FP8`.
```python
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM

# Load the base model from the Hugging Face Hub (a local path also works).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype="auto"
)
dataset = "ultrachat_200k"  # registered calibration dataset name

output_dir = "Mixtral-8x7B-Instruct-v0.1-FP8"

oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    save_compressed=True,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`).
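To make "static per-tensor" concrete, here is a minimal numpy sketch (not the library's implementation) of how a single static scale can be derived from a calibration maximum and applied to an entire tensor. The constant 448.0 is the largest finite value in FP8 E4M3; the cast to actual FP8 values, which also rounds the mantissa, is omitted for brevity:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_fp8_scale(w: np.ndarray) -> float:
    # One static scale for the entire tensor, derived from the observed max.
    return float(np.abs(w).max() / FP8_E4M3_MAX)

w = np.array([[0.5, -448.0], [12.0, 3.25]])
scale = per_tensor_fp8_scale(w)

# Quantize: scale down and clip into the representable FP8 range.
# (The actual cast to FP8 E4M3 values is omitted here.)
q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
w_hat = q * scale  # dequantize
```

Because the scale is computed once from calibration data and reused at inference, no per-batch statistics are needed at serving time.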
The repository supports multiple quantization techniques configured via a recipe. Supported strategies include tensor, group, and channel quantization.
In the above example, quantization is specified by the FP8 scheme. For other preset schemes, refer to the quantization schemes in the compressed-tensors library.
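The strategies differ only in the granularity at which scales are computed. A rough numpy illustration of the resulting scale shapes (shapes only; not the library's internals):

```python
import numpy as np

w = np.random.default_rng(0).normal(size=(4, 256))  # (out_features, in_features)

# "tensor": a single scalar scale for the whole weight
tensor_scale = np.abs(w).max()

# "channel": one scale per output channel (row)
channel_scales = np.abs(w).max(axis=1)  # shape (4,)

# "group": one scale per group of 128 consecutive input columns
group_size = 128
group_scales = np.abs(w.reshape(4, -1, group_size)).max(axis=2)  # shape (4, 2)
```

Finer granularity (channel, group) tracks local weight ranges more closely at the cost of storing more scales.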
A custom scheme can also be specified using `config_groups`:

```python
# Example of defining a custom quantization scheme
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "input_activations": None,
        "output_activations": None,
        "weights": {
            "num_bits": 8,
            "type": "int",
            "symmetric": True,
            "strategy": "group",
            "group_size": 128,
        },
    },
}
recipe = GPTQModifier(config_groups=config_groups)
```
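As a rough sketch of what the `weights` section above requests (8-bit symmetric integer quantization with one scale per group of 128 columns), here is an illustrative numpy round-to-nearest version. Note that GPTQ itself uses error-compensated rounding rather than plain rounding; this only shows the numeric format the config describes:

```python
import numpy as np

def quantize_group_symmetric_int8(w: np.ndarray, group_size: int = 128):
    """Symmetric int8 quantization with one scale per group of columns."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    # Symmetric: zero-point is 0, scale maps the group max to 127.
    scales = np.abs(g).max(axis=2, keepdims=True) / 127.0
    q = np.clip(np.round(g / scales), -127, 127).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(2)

w = np.random.default_rng(1).normal(size=(2, 256)).astype(np.float32)
q, scales = quantize_group_symmetric_int8(w)
```

The resulting recipe can be passed to `oneshot` in place of the `QuantizationModifier` recipe shown earlier.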