This script demonstrates how to apply 2:4 structured sparsity with and without FP8 quantization to the Meta-Llama-3-8B-Instruct model using the llm-compressor library. The compressed model is optimized for memory efficiency and faster inference on supported GPUs.
Note: FP8 dynamic precision computation is supported on Nvidia GPUs with CUDA Compute Capability 9.0 and above.
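For context, 2:4 ("two of four") structured sparsity means that within every contiguous group of four weights at most two are non-zero, which is the pattern supported GPUs can accelerate. A purely illustrative check of that pattern (not part of the script):
import torch

def is_2of4_sparse(weight: torch.Tensor) -> bool:
    # Illustration only; assumes the number of elements is divisible by 4.
    groups = weight.flatten().reshape(-1, 4)        # groups of 4 consecutive values
    nonzeros_per_group = (groups != 0).sum(dim=1)   # non-zeros per group
    return bool((nonzeros_per_group <= 2).all())    # 2:4 pattern: at most 2 non-zeros per group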
To get started, install the llm-compressor library and its dependencies:
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
Run the script with the following commands:
- Without FP8 Quantization:
python3 llama3_8b_2of4.py
- With FP8 Quantization:
python3 llama3_8b_2of4.py --fp8
The script compresses the Meta-Llama-3-8B-Instruct model using:
- 2:4 Structured Sparsity: Applies structured pruning to reduce weights by 50%.
- FP8 Quantization (Optional): Enables dynamic quantization for further memory savings.
- Model: meta-llama/Meta-Llama-3-8B-Instruct
- Dataset: HuggingFaceH4/ultrachat_200k (train split)
- Calibration Samples: 512
- Maximum Sequence Length: 2048
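These settings correspond to a standard Hugging Face loading and calibration setup along the following lines (a sketch only; the script's own dataset preprocessing, such as applying the chat template, is omitted):
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load the model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load the training split of the calibration dataset and subsample it.
ds = load_dataset(DATASET_ID, split="train_sft")  # ultrachat_200k's training split
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))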
- Select Model, Dataset, and Recipe
The model and dataset are predefined in the script. The recipe is adjusted dynamically based on whether FP8 quantization is enabled (--fp8 flag).
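The fp8_enabled value used in the recipe below can come from a simple command-line switch, for example (a sketch; the script's actual argument handling may differ):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--fp8", action="store_true", help="Also apply FP8 dynamic quantization")
args = parser.parse_args()
fp8_enabled = args.fp8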
Example Recipe:
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
SparseGPTModifier(
sparsity=0.5,
mask_structure="2:4",
targets=[r"re:model.layers.\d*$"],
)
]
if fp8_enabled:
recipe.append(
QuantizationModifier(
targets=["Linear"],
ignore=["lm_head"],
scheme="FP8_DYNAMIC",
),
    )
- Apply Compression
The script applies compression using the oneshot function:
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=2048,
num_calibration_samples=512,
)
The compressed model and tokenizer are saved to the output directory:
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
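The save_dir used above can be derived from the FP8 flag so that each variant gets its own directory; a hypothetical naming helper (the script may construct the name differently):
model_name = MODEL_ID.split("/")[-1]  # "Meta-Llama-3-8B-Instruct"
suffix = "2of4-W8A8-FP8-Dynamic-Per-Token" if fp8_enabled else "2of4-sparse"
save_dir = f"{model_name}-{suffix}"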
Output Directories:
- Without FP8: Meta-Llama-3-8B-Instruct-2of4-sparse
- With FP8: Meta-Llama-3-8B-Instruct-2of4-W8A8-FP8-Dynamic-Per-Token
To save the model on disk without sparse compression:
model.save_pretrained(save_dir, save_compressed=True, disable_sparse_compression=True)
tokenizer.save_pretrained(save_dir)
Note: Saving a model with both the save_compressed and disable_sparse_compression options will still compress the model using the quantization compressor; however, instead of the more disk-efficient sparsity compressor(s), the dense sparsity compressor will be used. The dense sparsity compressor saves model parameters as-is and does not leverage sparsity for disk-efficient storage. These options only affect how the model is saved on disk and do not impact the actual pruning or quantization processes.
After compression, the script validates the model by generating a sample output:
========== SAMPLE GENERATION ============
Hello my name is ...
=========================================
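That check can be reproduced with a few lines of standard transformers generation (a sketch; the script's exact generation arguments may differ):
# Sketch: quick sanity-check generation with the compressed model and its tokenizer.
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))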
Notes:
- Ensure your GPU has sufficient memory (at least ~25 GB) to run the compression script.
- Use the --fp8 flag to enable FP8 quantization.
- Modify the MODEL_ID and DATASET_ID variables to use other models or datasets.
To run the compressed model with vLLM, install vLLM using pip install vllm, then use the script below:
# run_model.py
import argparse
from vllm import LLM, SamplingParams
def run_inference(model_path, tensor_parallel_size, prompt="Hello my name is:"):
"""
Loads a model and performs inference using LLM.
"""
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.8,
top_p=0.95,
)
# Load the model
model = LLM(
model=model_path,
enforce_eager=True,
dtype="auto",
tensor_parallel_size=tensor_parallel_size,
)
# Generate inference
outputs = model.generate(prompt, sampling_params=sampling_params)
return outputs[0].outputs[0].text
def main():
""" Main function to handle CLI and process the model. """
# Argument parsing
parser = argparse.ArgumentParser(description="Run inference on a single model and print results.")
parser.add_argument(
"model_path",
type=str,
help="Path to the model to perform inference."
)
parser.add_argument(
"--tensor_parallel_size",
type=int,
default=1,
help="Tensor parallel size for the model. Default is 1."
)
args = parser.parse_args()
model_path = args.model_path
tensor_parallel_size = args.tensor_parallel_size
prompt = "Hello my name is:"
# Load model and perform inference
    inference_result = run_inference(model_path, tensor_parallel_size, prompt)
print("="* 20)
print("Model:", model_path)
print(prompt, inference_result)
if __name__ == "__main__":
    main()
Command to run the model:
python3 run_model.py <MODEL_PATH>
Example:
python3 run_model.py Meta-Llama-3-8B-Instruct-2of4-W8A8-FP8-Dynamic-Per-Token
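To shard the model across multiple GPUs, pass the tensor-parallel flag defined in the script, for example:
python3 run_model.py Meta-Llama-3-8B-Instruct-2of4-sparse --tensor_parallel_size 2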