Llama4 FP8 Example

Code Walkthrough

Let's walk through the main steps of the quantization process:

1. Load the model
2. Configure the quantization algorithm and scheme
3. Apply quantization
4. Confirm that generations from the quantized model look sane
5. Save to disk in compressed-tensors format

1. Load Model

Load the model with AutoModelForCausalLM and its tokenizer with AutoTokenizer:

from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# dtype="auto" keeps the dtype recorded in the checkpoint instead of forcing one.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

2. Configure the Quantization Algorithm and Scheme

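# Quantize every Linear layer with the FP8_BLOCK scheme, except the modules
# listed in `ignore`. Entries prefixed with "re:" are regular expressions.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)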
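
To see which modules the ignore list above would skip, the matching can be sketched in plain Python. This is a rough illustration, not the library's implementation: it assumes that `re:` entries are applied with `re.match` against the module name and that plain entries are compared against the module name or class name. The helper `is_ignored` and the module names in the checks are hypothetical.

```python
import re

# Ignore entries copied from the recipe.
IGNORE = [
    "re:.*lm_head",
    "re:.*self_attn",
    "re:.*router",
    "re:.*vision_model.*",
    "re:.*multi_modal_projector.*",
    "Llama4TextAttention",
]


def is_ignored(module_name, class_name, ignore=IGNORE):
    """Hypothetical helper: True if a module would be skipped by the list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: the pattern is matched from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry in (module_name, class_name):
            # Assumption: plain entries match a module name or class name.
            return True
    return False


# Hypothetical module names, for illustration only.
attn_skipped = is_ignored("model.layers.0.self_attn.q_proj", "Linear")
router_skipped = is_ignored("model.layers.0.feed_forward.router", "Linear")
mlp_skipped = is_ignored("model.layers.0.feed_forward.up_proj", "Linear")
print(attn_skipped, router_skipped, mlp_skipped)
```

Under these assumptions, attention projections and routers stay in their original precision, while the remaining Linear layers are quantized.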

3. Apply Quantization

# Apply the recipe to the model in place. No calibration dataset is passed here.
oneshot(model=model, recipe=recipe)
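
As a mental model for block-wise FP8 quantization, the sketch below computes one scale per tile of a small weight matrix. It assumes that each tile's scale maps its largest magnitude to 448, the largest finite value in the FP8 E4M3 format. The helper `block_scales` and the 2x2 tile size are illustrative choices, not taken from the library, and real tiles are much larger.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def block_scales(weight, block_rows, block_cols):
    """Hypothetical helper: one scale per tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block_rows):
        row_scales = []
        for c0 in range(0, cols, block_cols):
            # Largest magnitude inside this tile (edge tiles may be smaller).
            block_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block_cols, cols))
            )
            # Dividing the tile by this scale maps block_max to FP8_E4M3_MAX.
            row_scales.append(block_max / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


weight = [
    [0.5, -1.0, 8.0, 2.0],
    [0.25, 0.75, -4.0, 1.0],
]
scales = block_scales(weight, block_rows=2, block_cols=2)
print(scales)
```

Because each tile has its own scale, the large values in the right-hand tile do not reduce the resolution available to the small values in the left-hand tile.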

4. Confirm Generations of the Quantized Model Look Sane

print("========== SAMPLE GENERATION ==============")
# Place the model on the available devices before running generation.
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
# Generate a short continuation as a quick sanity check of output quality.
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

5. Save to Disk in Compressed-Tensors Format

# Resolves to "Llama-4-Scout-17B-16E-Instruct-FP8-BLOCK".
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-BLOCK"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
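
After saving, a quick way to confirm the checkpoint was written with quantization metadata is to inspect its config.json. The sketch below assumes the metadata is stored under a `quantization_config` key. The helper `read_quantization_config` is hypothetical, and the stand-in directory and its contents exist only so the example runs without the real model.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(save_dir):
    """Hypothetical helper: return config.json's quantization_config, or None."""
    config_path = Path(save_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization_config")


# Stand-in checkpoint directory with illustrative contents.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {
        "model_type": "llama4",
        "quantization_config": {"quant_method": "compressed-tensors"},
    }
    (Path(tmp) / "config.json").write_text(json.dumps(stand_in))
    found = read_quantization_config(tmp)
    print(found)
```

For the real checkpoint, the call would be `read_quantization_config(SAVE_DIR)`. A `None` result would suggest the model was saved without quantization metadata.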
