
Commit 0e9b405

Merge pull request #1815 from ranimandepudi/main
Quantize and Run a Large Language Model using vLLM on Arm Servers
2 parents 5661e42 + 0954bfe commit 0e9b405

File tree: 6 files changed (+554, -1 lines)
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
---
title: Overview and Environment Setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

[vLLM](https://github.com/vllm-project/vllm) is an open-source, high-throughput inference engine designed to efficiently serve large language models (LLMs). It offers an OpenAI-compatible API, supports dynamic batching, and is optimized for low-latency performance, making it suitable for both real-time and batch inference workloads.

This learning path walks through how to combine vLLM with INT8 quantization techniques to reduce memory usage and improve inference speed, enabling large models like Llama 3.1 to run effectively on Arm-based CPUs.

The model featured in this guide, [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), is sourced from Hugging Face, quantized using `llmcompressor`, and deployed using vLLM.

Testing for this learning path was performed on an AWS Graviton instance (c8g.16xlarge). The instructions are intended for Arm-based servers running Ubuntu 24.04 LTS.

## Learning Path Setup

This learning path uses a Python virtual environment (`venv`) to manage dependencies in an isolated workspace. This approach ensures a clean environment, avoids version conflicts, and makes it easy to reproduce results, especially when using custom-built packages like `vLLM` and `PyTorch`.

### Set up the Python environment

To get started, create a virtual environment and activate it as shown below:

```bash
sudo apt update
sudo apt install -y python3 python3-venv
python3 -m venv vllm_env
source vllm_env/bin/activate
pip install --upgrade pip
```
This creates a local Python environment named `vllm_env` and upgrades pip to the latest version.
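
To confirm that the environment is active, you can check that the `python` on your `PATH` now resolves inside `vllm_env`:

```bash
# Both commands should point at the interpreter inside vllm_env
which python
python --version
```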

### Install system dependencies

These packages are needed to build libraries like OpenBLAS and manage system-level performance:

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-pip
sudo apt install -y python-is-python3
```
Set the system default compilers to version 12:

```bash
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
  --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
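
To confirm the new defaults, check the versions reported by the compilers:

```bash
# Both should report major version 12 after the update-alternatives step
gcc --version
g++ --version
```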
Next, install the [`tcmalloc`](https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html?device=arm) memory allocator, which helps improve performance during inference:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```
This library will be preloaded during model serving to reduce latency and improve memory efficiency.
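
If you want to confirm where the library landed, you can list it directly. On Ubuntu 24.04 for Arm it is typically installed under `/usr/lib/aarch64-linux-gnu/`, which is the path preloaded later when launching the server:

```bash
# Verify the tcmalloc shared library installed by the package
ls -l /usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```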

### Install OpenBLAS

OpenBLAS is an optimized linear algebra library that improves performance for matrix-heavy operations, which are common in LLM inference. To get the best performance on Arm CPUs, it's recommended to build OpenBLAS from source.

Run these commands to clone and build OpenBLAS:
```bash
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
git checkout ef9e3f715
```
{{% notice Note %}}
This commit is known to work reliably with Arm CPU optimizations (BF16, OpenMP) and has been tested in this learning path. Using it ensures consistent behavior. You can try `main`, but newer commits may introduce changes that haven't been validated here.
{{% /notice %}}

```bash
make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3
make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3 PREFIX=/home/ubuntu/OpenBLAS/dist install
```
This will build and install OpenBLAS into `/home/ubuntu/OpenBLAS/dist` with optimizations for Arm CPUs.
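
As a quick sanity check, you can confirm the build artifacts exist. The shared library in the source tree is the one preloaded later when serving the model, and the `dist` prefix holds the installed copies:

```bash
# Library produced in the source tree by 'make'
ls -l /home/ubuntu/OpenBLAS/libopenblas.so
# Headers and libraries installed under the PREFIX passed to 'make install'
ls /home/ubuntu/OpenBLAS/dist
```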

### Install Python dependencies

Once the system libraries are in place, install the Python packages required for model quantization and serving. You’ll use prebuilt CPU wheels for vLLM and PyTorch, and install additional tools like `llmcompressor` and `torchvision`.

Before proceeding, make sure the following files are downloaded to your home directory:
```bash
vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl
torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl
```
These are required to complete the installation and model quantization steps.

Now, navigate to your home directory:
```bash
cd /home/ubuntu/
```

Install the vLLM wheel. This wheel contains the CPU-optimized version of `vLLM`, built specifically for Arm architecture. Installing it from a local `.whl` file ensures compatibility with the rest of your environment and avoids potential conflicts from nightly or default pip installations.

```bash
pip install vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl --force-reinstall
```
Install `llmcompressor`, which is used to quantize the model:
```bash
pip install llmcompressor
```
Install torchvision (nightly version for CPU):
```bash
pip install --force-reinstall torchvision==0.22.0.dev20250213 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```
Install the custom PyTorch CPU wheel. It is prebuilt for Arm CPU architectures and includes the necessary optimizations for running inference. Installing it locally ensures compatibility with your environment and avoids conflicts with default pip packages.
```bash
pip install torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl --force-reinstall --no-deps
```

You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
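
You can optionally verify that the CPU wheels were picked up correctly. The exact version strings depend on the wheels you installed, but all of the imports should succeed without errors:

```bash
python -c "import torch; print(torch.__version__)"
python -c "import vllm; print(vllm.__version__)"
python -c "import llmcompressor; print('llmcompressor imported successfully')"
```
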
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
title: Quantize and Launch the vLLM server
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Access the Model from Hugging Face

Before quantizing, authenticate with Hugging Face using a personal access token. You can generate one from your [Hugging Face Hub](https://huggingface.co/) account under Access Tokens:

```bash
huggingface-cli login --token $hf_token
```
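
If the login succeeds, you can optionally confirm that you are authenticated before continuing:

```bash
# Prints the username associated with the stored token
huggingface-cli whoami
```
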
## Quantization Script Template

Create the `vllm_quantize_model.py` script shown below to quantize the model:
```python
import argparse
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser(
        description="Quantize a model using LLM Compressor with customizable mode, scheme, and group size."
    )
    parser.add_argument(
        "model_id",
        type=str,
        help="Model identifier or path (e.g., 'meta-llama/Llama-2-13b-chat-hf' or '/path/to/model')",
    )
    parser.add_argument(
        "--mode",
        type=str,
        choices=["int4", "int8"],
        required=True,
        help="Quantization mode: int4 or int8",
    )
    parser.add_argument(
        "--scheme",
        type=str,
        choices=["channelwise", "groupwise"],
        required=True,
        help="Quantization scheme for weights (groupwise is only supported for int4)",
    )
    parser.add_argument(
        "--groupsize",
        type=int,
        default=32,
        help="Group size for groupwise quantization (only used when scheme is groupwise). Defaults to 32."
    )
    args = parser.parse_args()

    # Validate unsupported configuration
    if args.mode == "int8" and args.scheme == "groupwise":
        raise ValueError("Groupwise int8 is unsupported. Please use channelwise for int8.")

    # Extract a base model name from the model id or path for the output directory
    if "/" in args.model_id:
        base_model_name = args.model_id.split("/")[-1]
    else:
        base_model_name = os.path.basename(args.model_id)

    # Determine output directory based on mode and scheme
    if args.mode == "int4":
        output_dir = f"{base_model_name}-w4a8-{args.scheme}"
    else:  # int8
        output_dir = f"{base_model_name}-w8a8-{args.scheme}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Define quantization arguments based on mode and chosen scheme.
    if args.mode == "int8":
        # Only channelwise is supported for int8.
        weights_args = QuantizationArgs(
            num_bits=8,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            dynamic=False,
        )
    else:  # int4 mode
        if args.scheme == "channelwise":
            strategy = QuantizationStrategy.CHANNEL
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                symmetric=True,
                dynamic=False,
            )
        else:  # groupwise
            strategy = QuantizationStrategy.GROUP
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                group_size=args.groupsize,
                symmetric=True,
                dynamic=False
            )

    # Activation quantization remains the same for both modes.
    activations_args = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )

    # Create a quantization scheme for Linear layers.
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=activations_args,
    )

    # Create a quantization modifier. We ignore the "lm_head" layer.
    modifier = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])

    # Apply quantization and save the quantized model.
    oneshot(
        model=model,
        recipe=modifier,
        tokenizer=tokenizer,
        output_dir=output_dir,
    )
    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```
Then run the quantization script `vllm_quantize_model.py`. This generates an INT8 quantized version of the model using channelwise weight quantization, which reduces memory usage while maintaining model accuracy:

```bash
cd /home/ubuntu/
python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
```
The output model will be saved locally at `/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.
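
You can confirm that the quantized checkpoint was written by listing the output directory. You should see the usual Hugging Face layout (model config, tokenizer files, and safetensors weight shards); exact filenames can vary with library versions:

```bash
ls /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise
```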

## Launch the vLLM server

The vLLM server exposes the OpenAI-compatible `/v1/chat/completions` API. In this learning path it is used for single-prompt testing with `curl` and for batch testing with a custom Python script that simulates multiple concurrent requests.

Once the model is quantized, launch the vLLM server to enable CPU-based inference. This configuration uses `tcmalloc` and the optimized `OpenBLAS` build to improve performance and reduce latency:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/home/ubuntu/OpenBLAS/libopenblas.so \
ONEDNN_DEFAULT_FPMATH_MODE=BF16 \
VLLM_TARGET_DEVICE=cpu \
VLLM_CPU_KVCACHE_SPACE=32 \
VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \
vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \
  --dtype float32 --swap-space 16
```
This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is dynamically set based on the number of available cores to maximize parallelism on Arm CPUs.
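
Once the server reports that it is ready, you can send a single test prompt from another terminal. This is a minimal sketch that assumes the default port `8000` and that the model is addressed by the same path passed to `vllm serve`; adjust both if your setup differs:

```bash
# Send one chat completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise",
        "messages": [
          {"role": "user", "content": "Summarize what vLLM does in one sentence."}
        ],
        "max_tokens": 64
      }'
```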