
Commit c7cf14a

Merge pull request #2540 from pareenaverma/content_review
Tech review of INT4 vllm LP
2 parents 6345817 + db0058c commit c7cf14a

File tree

4 files changed (+126, -62 lines)


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md

Lines changed: 63 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -8,101 +8,128 @@ layout: learningpathall
88

99
## What is vLLM?
1010

11-
vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the LLM inference prefill and decode phases with:
12-
13-
- Continuous batching to keep hardware busy across many requests.
14-
- KV cache management to sustain concurrency during generation.
15-
- Token streaming so results appear as they are produced.
16-
17-
You interact with vLLM in multiple ways:
18-
19-
- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
20-
- Python API: load a model and generate locally when needed.
21-
22-
vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.
11+
vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs).
12+
It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases.
13+
14+
### Key Features
15+
* Continuous Batching – Dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput.
16+
* KV Cache Management – Efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
17+
* Token Streaming – Streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios.
18+
### Interaction Modes
19+
You can use vLLM in two main ways:
20+
* OpenAI-Compatible REST Server:
21+
vLLM provides a `/v1/chat/completions` endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK.
22+
* Python API:
23+
Load and serve models programmatically within your own Python scripts for flexible local inference and evaluation (see the short example below).
24+
25+
vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference.
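For a quick look at the Python API route mentioned above, here is a minimal sketch; the model name is only an example, and any Hugging Face model supported by vLLM can be substituted:

```python
from vllm import LLM, SamplingParams

# Load a small Hugging Face chat model locally (example model; swap in your own).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")

# Sampling settings: light sampling with a capped output length.
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions for a small batch of prompts in one call.
outputs = llm.generate(
    ["What is vLLM?", "Name one benefit of continuous batching."],
    params,
)

for output in outputs:
    print(output.outputs[0].text)
```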
2326

2427
## What you build
2528

26-
You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.
29+
In this learning path, you will build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
30+
This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations.
31+
After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed.
2732

2833
## Why this is fast on Arm
2934

35+
vLLM’s performance on Arm servers is driven by both software optimization and hardware-level acceleration.
36+
Each component of this optimized build contributes to higher throughput and lower latency during inference:
37+
3038
- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
31-
- 4‑bit weight quantization: INT4 quantization support & acceleration by Arm KleidiAI microkernels.
32-
- Efficient MoE execution: Fused INT4 quantized expert layers reduce memory traffic and improve throughput.
33-
- Optimized Paged attention: Arm SIMD tuned paged attention implementation in vLLM.
34-
- System tuning: Thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.
39+
- 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions.
40+
- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks.
41+
- Optimized paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
42+
- System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
43+
Additionally, enabling tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads.
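The system-tuning points above translate into a few environment settings at launch time. The following is a minimal sketch; the tcmalloc library path, core range, and KV cache size are assumptions for a typical Ubuntu aarch64 machine and should be adjusted to your system:

```bash
# Preload tcmalloc to reduce allocator contention (from libtcmalloc-minimal4; path may differ).
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

# Pin vLLM's CPU worker threads to a fixed core range (example: cores 0-31).
export VLLM_CPU_OMP_THREADS_BIND=0-31

# Reserve space (in GiB) for the KV cache on the CPU backend.
export VLLM_CPU_KVCACHE_SPACE=32
```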
3544

3645
## Before you begin
3746

38-
- Use Python 3.12 on Ubuntu 22.04+
39-
- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB free disk.
47+
Verify that your environment meets the following requirements:
48+
49+
* Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later.
50+
* Hardware requirements: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.
4051

41-
Install the minimum system package used by vLLM on Arm:
52+
This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
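To confirm that a machine meets these requirements before building, you can run a few standard checks (these commands are not part of the original steps; they are shown for convenience):

```bash
uname -m               # expect: aarch64
nproc                  # vCPU count (32 or more recommended)
free -h                # memory (64 GB or more recommended)
df -h .                # free disk space in the working directory
python3.12 --version   # confirm Python 3.12 is installed
```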
4253

54+
### Install Build Dependencies
55+
56+
Install the following packages required for compiling vLLM and its dependencies on Arm64:
4357
```bash
4458
sudo apt-get update -y
4559
sudo apt-get install -y build-essential cmake libnuma-dev
46-
sudo apt install python3.12-venv python3.12-dev
60+
sudo apt install -y python3.12-venv python3.12-dev
4761
```
4862

49-
Optional performance helper you can install now or later:
63+
You can optionally install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
5064

5165
```bash
5266
sudo apt-get install -y libtcmalloc-minimal4
5367
```
5468

5569
{{% notice Note %}}
56-
On aarch64, vLLM’s CPU backend automatically builds with Arm Compute Library via oneDNN.
70+
On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library (ACL) through oneDNN.
71+
This ensures optimized Arm kernels are used for matrix multiplications, layer normalization, and activation functions without additional configuration.
5772
{{% /notice %}}
5873

59-
## Build vLLM for aarch64 CPU
74+
## Build vLLM for Arm64 CPU
75+
You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.
6076

61-
Create and activate a virtual environment:
77+
1. Create and Activate a Python Virtual Environment
78+
It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
6279

6380
```bash
6481
python3.12 -m venv vllm_env
6582
source vllm_env/bin/activate
6683
python3 -m pip install --upgrade pip
6784
```
6885

69-
Clone vLLM and install build requirements:
86+
2. Clone vLLM and Install Build Requirements
87+
Download the official vLLM source code and install its CPU-specific build dependencies:
7088

7189
```bash
7290
git clone https://github.com/vllm-project/vllm.git
7391
cd vllm
7492
git checkout 5fb4137
7593
pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
7694
```
95+
The specific commit (`5fb4137`) pins a verified version of vLLM that adds Arm CPUs to the list of supported build targets, ensuring compatibility and optimized performance on Arm-based systems.
7796

78-
Build a wheel targeted at CPU:
97+
3. Build the vLLM Wheel for CPU
98+
Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:
7999

80100
```bash
81101
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
82102
```
103+
The output wheel will appear under `dist/` and include all compiled C++/PyBind modules.
83104

84-
Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:
105+
4. Install the Wheel
106+
Install the freshly built wheel into your active environment:
85107

86108
```bash
87109
pip install --force-reinstall dist/*.whl # fresh install
88110
# pip install --no-deps --force-reinstall dist/*.whl # incremental build
89111
```
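As an optional sanity check (not part of the original steps), confirm that the installed wheel imports cleanly in the active virtual environment:

```bash
python3 -c "import vllm; print(vllm.__version__)"
```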
90112

91113
{{% notice Tip %}}
92-
Do NOT delete vLLM repo. Local vLLM repository is required for corect inferencing on aarch64 CPU after installing the wheel.
114+
Do not delete the local vLLM source directory.
115+
The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
93116
{{% /notice %}}
94117

95-
## Quick validation via offline inferencing
118+
## Quick Validation via Offline Inference
96119

97-
Run the built‑in chat example to confirm the build:
120+
Once your Arm-optimized vLLM build completes, validate it by running a small offline inference example. This confirms that the CPU backend and the oneDNN and ACL optimizations were compiled into your build correctly.
121+
Run the built-in chat example included in the vLLM repository:
98122

99123
```bash
100124
python examples/offline_inference/basic/chat.py \
101125
--dtype=bfloat16 \
102126
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
103127
```
104128

105-
You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.
129+
Explanation:
130+
* `--dtype=bfloat16` runs inference in bfloat16 precision. Recent Arm processors support the BFloat16 (BF16) number format in PyTorch; for example, AWS Graviton3 and Graviton4 processors support BF16.
131+
* `--model` specifies a small Hugging Face model for testing (TinyLlama-1.1B-Chat), ideal for functional validation before deploying larger models.
132+
You should see token streaming in the console, followed by a generated output confirming that vLLM’s inference pipeline is working correctly.
106133

107134
```output
108135
Generated Outputs:
@@ -117,5 +144,8 @@ Processed prompts: 100%|██████████████████
117144
```
118145

119146
{{% notice Note %}}
120-
As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in near future.
147+
As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
121148
{{% /notice %}}
149+
150+
You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
151+
Next, you’ll proceed to model quantization, where you’ll compress LLM weights to INT4 precision using llmcompressor and benchmark the resulting performance improvements.

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md

Lines changed: 19 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -5,33 +5,39 @@ weight: 3
55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8+
## Accelerating LLMs with 4-bit Quantization
89

9-
You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this guide, we use `deepseek-ai/DeepSeek-V2-Lite` as the example model which gets accelerated by the INT4 path in vLLM using Arm KleidiAI microkernels.
10+
You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the `deepseek-ai/DeepSeek-V2-Lite` model to 4-bit integer (INT4) weights.
11+
The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
1012

1113
## Install quantization tools
1214

13-
Install the vLLM model quantization packages
15+
Install the quantization dependencies used by vLLM and the llmcompressor toolkit:
1416

1517
```bash
1618
pip install --no-deps compressed-tensors
1719
pip install llmcompressor
1820
```
19-
20-
Reinstall your locally built vLLM if you rebuilt it:
21+
* `compressed-tensors` provides the underlying tensor storage and compression utilities used for quantized model formats.
22+
* `llmcompressor` includes quantization, pruning, and weight clustering utilities compatible with Hugging Face Transformers and vLLM runtime formats.
23+
24+
If you recently rebuilt vLLM, reinstall your locally built wheel to ensure compatibility with the quantization extensions:
2125

2226
```bash
2327
pip install --no-deps dist/*.whl
2428
```
2529

26-
If your chosen model is gated on Hugging Face, authenticate first:
30+
Authenticate with Hugging Face (if required):
31+
32+
If the model you plan to quantize is gated on Hugging Face (for example, DeepSeek or proprietary models), log in with your Hugging Face credentials before downloading the model weights:
2733

2834
```bash
2935
huggingface-cli login
3036
```
3137

32-
## INT4 Quantization recipe
38+
## INT4 Quantization Recipe
3339

34-
Save the following as `quantize_vllm_models.py`:
40+
Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:
3541

3642
```python
3743
import argparse
@@ -124,22 +130,26 @@ if __name__ == "__main__":
124130
main()
125131
```
126132

127-
This script creates a Arm KleidiAI 4‑bit quantized copy of the vLLM model and saves it to a new directory.
133+
This script creates an Arm KleidiAI INT4-quantized copy of the model and saves it to a new directory.
128134

129135
## Quantize DeepSeek‑V2‑Lite model
130136

131137
### Quantization parameter tuning
138+
Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs.
139+
132140
1. You can choose the `minmax` method (faster quantization) or the `mse` method (more accurate but slower).
133141
2. `channelwise` is a good default for most models.
134142
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
135143

144+
Execute the following command to quantize the DeepSeek-V2-Lite model:
145+
136146
```bash
137147
# DeepSeek example
138148
python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
139149
--scheme channelwise --method mse
140150
```
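The command above uses the channelwise scheme. To experiment with groupwise quantization instead, a hedged variant of the same command (assuming the script accepts `--scheme groupwise` together with the `--groupsize` option mentioned above) would look like this:

```bash
# Groupwise variant: slower to quantize, but can improve accuracy
python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme groupwise --groupsize 32 --method mse
```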
141151

142-
The 4-bit quantized DeepSeek-V2-Lite will be stored the directory:
152+
The channelwise command above generates an INT4 quantized model directory such as:
143153

144154
```text
145155
DeepSeek-V2-Lite-w4a8dyn-mse-channelwise

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md

Lines changed: 36 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -6,13 +6,20 @@ weight: 4
66
layout: learningpathall
77
---
88

9-
## About batch sizing in vLLM
9+
## Batch Sizing in vLLM
1010

11-
vLLM enforces two limits to balance memory use and throughput: a per‑sequence length (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit.
11+
vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:
12+
* `max_model_len` — The maximum sequence length (number of tokens per request).
13+
No single prompt or generated sequence can exceed this limit.
14+
* `max_num_batched_tokens` — The total number of tokens processed in one batch across all requests.
15+
The sum of input and output tokens from all concurrent requests must stay within this limit.
16+
17+
Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated.
18+
On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.
1219

1320
## Serve an OpenAI‑compatible API
1421

15-
Start the server with sensible CPU default parameters and a quantized model:
22+
Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance.
1623

1724
```bash
1825
export VLLM_TARGET_DEVICE=cpu
@@ -27,9 +34,19 @@ vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
2734
--dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
2835
```
2936

37+
The server now exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint.
38+
39+
You can test it using any OpenAI-style client library to measure tokens-per-second throughput and response latency on your Arm-based server.
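For example, a quick request with `curl` (the model name must match the directory you passed to `vllm serve`; the prompt is arbitrary):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
        "messages": [{"role": "user", "content": "Give me one fact about Arm CPUs."}],
        "max_tokens": 64
      }'
```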
40+
3041
## Run multi‑request batch
42+
After verifying a single request in the previous section, simulate concurrent load against the OpenAI-compatible server to exercise vLLM’s continuous batching scheduler.
3143

32-
After confirming a single request works explained in previous example, simulate concurrent load with a small OpenAI API compatible client. Save this as `batch_test.py`:
44+
About the client:
45+
* It uses `AsyncOpenAI` with `base_url="http://localhost:8000/v1"` to target the local vLLM server.
46+
* A semaphore caps concurrency at 8 simultaneous requests (adjust `CONCURRENCY` to scale the load).
47+
* `max_tokens` limits the tokens generated per request, which directly affects batch size and KV cache use.
48+
49+
Save the code below in a file named `batch_test.py`:
3350

3451
```python
3552
import asyncio
@@ -88,7 +105,7 @@ if __name__ == "__main__":
88105
asyncio.run(main())
89106
```
90107

91-
Run 8 concurrent requests against your server:
108+
Run 8 concurrent requests:
92109

93110
```bash
94111
python3 batch_test.py
@@ -108,19 +125,28 @@ This validates multi‑request behavior and shows aggregate throughput in the se
108125
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
109126
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
110127
```
111-
## Optional: Serving BF16 non-quantized model
128+
## Optional: Serve a BF16 (Non-Quantized) Model
112129

113-
For a BF16 path on Arm, vLLM is acclerated by direct oneDNN integration in vLLM which allows aarch64 model to be hyperoptimized.
130+
For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64).
114131

115132
```bash
116133
vllm serve deepseek-ai/DeepSeek-V2-Lite \
117134
--dtype bfloat16 --max-model-len 4096 \
118135
--max-num-batched-tokens 4096
119136
```
137+
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.
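One simple way to compare the two deployments is to replay the same concurrent load from `batch_test.py` against each server and note the average generation throughput reported in the server logs. A sketch, assuming the `model` field in `batch_test.py` is updated to match whichever model is currently being served:

```bash
# With the INT4 server running:
time python3 batch_test.py    # note wall-clock time and the server's reported tokens/s

# Stop the INT4 server, start the BF16 server above, then repeat:
time python3 batch_test.py
```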
120138

121139
## Go Beyond: Power Up Your vLLM Workflow
122-
Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:
140+
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.
141+
142+
**Try Different Models**
143+
Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration:
144+
* Meta Llama 2 / Llama 3 – Strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance.
145+
* Qwen / Qwen-Chat – High-quality multilingual and instruction-tuned models.
146+
* Gemma (Google) – Compact and efficient architecture; ideal for edge or cost-optimized serving.
147+
148+
You can quantize and serve them using the same `quantize_vllm_models.py` recipe; just update the model name.
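For example, a Llama run with the same recipe might look like this (the model ID is illustrative, and gated models require `huggingface-cli login` first):

```bash
# Apply the same INT4 recipe to a different Hugging Face model
python3 quantize_vllm_models.py meta-llama/Llama-3.1-8B-Instruct \
    --scheme channelwise --method mse
```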
123149

124-
* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) like Llama, Qwen or Gemma.
150+
**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
125151

126-
* **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
152+
You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.
