This directory contains example scripts for quantizing LLMs using the static per-tensor FP8 quantization scheme.
To get started, install the necessary dependencies by executing the following commands:
```shell
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
```shell
python3 mixtral_example.py
```

This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized Mixtral-8x7B-Instruct-v0.1 model. The model is calibrated using the `ultrachat_200k` dataset; no fine-tuning is performed.
You can follow the detailed steps below or simply run the example script with:

```shell
python mixtral_example.py
```

In this step, you'll choose a base model for quantization, a dataset for calibration, and a quantization recipe.
- Models: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
- Datasets: Can also be from a local directory or the Hugging Face Hub.
- Recipes: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    scheme="FP8",
    targets="Linear",
    ignore=["lm_head", "re:.*block_sparse_moe.gate"],
)
```

NOTE: `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
The `oneshot` method applies the selected recipe to your model and dataset without requiring any fine-tuning. The model will be quantized and saved to `Mixtral-8x7B-Instruct-v0.1-FP8`.
```python
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM

# Load the base model from the Hugging Face Hub (a local path also works).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype="auto"
)
dataset = "ultrachat_200k"  # registered calibration dataset name

output_dir = "Mixtral-8x7B-Instruct-v0.1-FP8"

oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    save_compressed=True,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`).
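To make "static per-tensor" concrete, here is a minimal numpy sketch (not the library's implementation) of how a single static scale can be derived from a calibration maximum and applied to an entire tensor. The constant 448.0 is the largest finite value in FP8 E4M3; the cast to actual FP8 values, which also rounds the mantissa, is omitted for brevity:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_fp8_scale(w: np.ndarray) -> float:
    # One static scale for the entire tensor, derived from the observed max.
    return float(np.abs(w).max() / FP8_E4M3_MAX)

w = np.array([[0.5, -448.0], [12.0, 3.25]])
scale = per_tensor_fp8_scale(w)

# Quantize: scale down and clip into the representable FP8 range.
# (The actual cast to FP8 E4M3 values is omitted here.)
q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
w_hat = q * scale  # dequantize
```

Because the scale is computed once from calibration data and reused at inference, no per-batch statistics are needed at serving time.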
The repository supports multiple quantization techniques configured via a recipe. Supported strategies include tensor, group, and channel quantization.
In the above example, quantization is specified by the FP8 scheme. For other preset schemes, refer to the quantization schemes in the compressed-tensors library.
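The strategies differ only in the granularity at which scales are computed. A rough numpy illustration of the resulting scale shapes (shapes only; not the library's internals):

```python
import numpy as np

w = np.random.default_rng(0).normal(size=(4, 256))  # (out_features, in_features)

# "tensor": a single scalar scale for the whole weight
tensor_scale = np.abs(w).max()

# "channel": one scale per output channel (row)
channel_scales = np.abs(w).max(axis=1)  # shape (4,)

# "group": one scale per group of 128 consecutive input columns
group_size = 128
group_scales = np.abs(w.reshape(4, -1, group_size)).max(axis=2)  # shape (4, 2)
```

Finer granularity (channel, group) tracks local weight ranges more closely at the cost of storing more scales.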
A custom scheme can also be specified using `config_groups`:

```python
# Example of defining a custom quantization scheme
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "input_activations": None,
        "output_activations": None,
        "weights": {
            "num_bits": 8,
            "type": "int",
            "symmetric": True,
            "strategy": "group",
            "group_size": 128,
        },
    },
}
recipe = GPTQModifier(config_groups=config_groups)
```
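As a rough sketch of what the `weights` section above requests (8-bit symmetric integer quantization with one scale per group of 128 columns), here is an illustrative numpy round-to-nearest version. Note that GPTQ itself uses error-compensated rounding rather than plain rounding; this only shows the numeric format the config describes:

```python
import numpy as np

def quantize_group_symmetric_int8(w: np.ndarray, group_size: int = 128):
    """Symmetric int8 quantization with one scale per group of columns."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    # Symmetric: zero-point is 0, scale maps the group max to 127.
    scales = np.abs(g).max(axis=2, keepdims=True) / 127.0
    q = np.clip(np.round(g / scales), -127, 127).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(2)

w = np.random.default_rng(1).normal(size=(2, 256)).astype(np.float32)
q, scales = quantize_group_symmetric_int8(w)
```

The resulting recipe can be passed to `oneshot` in place of the `QuantizationModifier` recipe shown earlier.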