---
title: Quantize and Launch the vLLM server
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Access the Model from Hugging Face

Before quantizing, authenticate with Hugging Face using a personal access token. You can generate one from your [Hugging Face Hub](https://huggingface.co/) account under Access Tokens:

```bash
huggingface-cli login --token $hf_token
```
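The command above reads the token from a shell variable named `hf_token`. If you have not already defined it, set it first (the variable name is just the one used in this example; substitute your own token):

```bash
export hf_token=<your_hugging_face_access_token>
```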

## Quantization Script Template

Create the `vllm_quantize_model.py` script shown below to quantize the model:
```python
import argparse
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser(
        description="Quantize a model using LLM Compressor with customizable mode, scheme, and group size."
    )
    parser.add_argument(
        "model_id",
        type=str,
        help="Model identifier or path (e.g., 'meta-llama/Llama-2-13b-chat-hf' or '/path/to/model')",
    )
    parser.add_argument(
        "--mode",
        type=str,
        choices=["int4", "int8"],
        required=True,
        help="Quantization mode: int4 or int8",
    )
    parser.add_argument(
        "--scheme",
        type=str,
        choices=["channelwise", "groupwise"],
        required=True,
        help="Quantization scheme for weights (groupwise is only supported for int4)",
    )
    parser.add_argument(
        "--groupsize",
        type=int,
        default=32,
        help="Group size for groupwise quantization (only used when scheme is groupwise). Defaults to 32.",
    )
    args = parser.parse_args()

    # Validate unsupported configuration
    if args.mode == "int8" and args.scheme == "groupwise":
        raise ValueError("Groupwise int8 is unsupported. Please use channelwise for int8.")

    # Extract a base model name from the model id or path for the output directory
    if "/" in args.model_id:
        base_model_name = args.model_id.split("/")[-1]
    else:
        base_model_name = os.path.basename(args.model_id)

    # Determine output directory based on mode and scheme
    if args.mode == "int4":
        output_dir = f"{base_model_name}-w4a8-{args.scheme}"
    else:  # int8
        output_dir = f"{base_model_name}-w8a8-{args.scheme}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Define quantization arguments based on mode and chosen scheme.
    if args.mode == "int8":
        # Only channelwise is supported for int8.
        weights_args = QuantizationArgs(
            num_bits=8,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            dynamic=False,
        )
    else:  # int4 mode
        if args.scheme == "channelwise":
            strategy = QuantizationStrategy.CHANNEL
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                symmetric=True,
                dynamic=False,
            )
        else:  # groupwise
            strategy = QuantizationStrategy.GROUP
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                group_size=args.groupsize,
                symmetric=True,
                dynamic=False,
            )

    # Activation quantization remains the same for both modes.
    activations_args = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )

    # Create a quantization scheme for Linear layers.
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=activations_args,
    )

    # Create a quantization modifier. We ignore the "lm_head" layer.
    modifier = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])

    # Apply quantization and save the quantized model.
    oneshot(
        model=model,
        recipe=modifier,
        tokenizer=tokenizer,
        output_dir=output_dir,
    )
    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```
Then run `vllm_quantize_model.py`. The command below generates an INT8 quantized version of the model using channelwise quantization, which reduces memory usage while maintaining model accuracy:

```bash
cd /home/ubuntu/
python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
```
The quantized model is saved locally at:
`/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.
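
Optionally, you can verify the result before serving it. The output directory should contain the quantized weights along with a `config.json` that includes a quantization configuration written by LLM Compressor; the exact fields depend on the `llmcompressor` and `compressed-tensors` versions installed:

```bash
# List the saved model files and check the recorded quantization method
ls /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise
grep -o '"quant_method"[^,]*' /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise/config.json
```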

## Launch the vLLM server

The vLLM server exposes the OpenAI-compatible `/v1/chat/completions` API. This Learning Path uses it for single-prompt testing with `curl` and for batch testing with a custom Python script that simulates multiple concurrent requests.
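
For reference, a chat completion request to this endpoint has the shape shown below. This is only a sketch of the request format; it assumes the server's default port of 8000 and that the model is addressed by the path it is served from. The actual single-prompt and batch tests are covered later in this Learning Path:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise",
        "messages": [{"role": "user", "content": "Briefly explain INT8 quantization."}],
        "max_tokens": 128
      }'
```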

Once the model is quantized, launch the vLLM server to enable CPU-based inference. This configuration uses `tcmalloc` and the optimized `OpenBLAS` build to improve performance and reduce latency:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/home/ubuntu/OpenBLAS/libopenblas.so \
ONEDNN_DEFAULT_FPMATH_MODE=BF16 \
VLLM_TARGET_DEVICE=cpu \
VLLM_CPU_KVCACHE_SPACE=32 \
VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \
vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \
--dtype float32 --swap-space 16
```
This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is set dynamically based on the number of available cores to maximize parallelism on Arm CPUs.
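
Once the server finishes loading and reports that it is ready, you can confirm it is up from another terminal by querying the OpenAI-compatible model list endpoint. This assumes the default port of 8000 (adjust if you pass a different `--port`):

```bash
curl http://localhost:8000/v1/models
```
The response should list the model under the path it was served from, `/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.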