Updated the document to reflect the new title and improved clarity in the key features and instructions. Enhanced formatting and consistency throughout the text.
content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md (61 additions, 30 deletions)
---
title: Build and validate vLLM for Arm64 inference on Azure Cobalt 100
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference.

## Key features

* Continuous batching: dynamically merges incoming inference requests into larger batches, maximizing Arm CPU utilization and overall throughput
* KV cache management: efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead
* Token streaming: streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios

## Interaction modes

You can use vLLM in two main ways:

- Using an OpenAI-compatible REST server: vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK (a minimal sketch follows this list)
- Using a Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation
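
As a minimal sketch of the server mode (the model name and port are examples, and this assumes the OpenAI-compatible entrypoint is available in your build), you can start the server and query the chat endpoint:

```bash
# Start the OpenAI-compatible server (example model; default port is 8000)
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype bfloat16 &

# Once the server is up, query the /v1/chat/completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "What is vLLM?"}]
      }'
```
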
vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference.

## What you'll build

In this Learning Path, you'll build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations.

After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed.

## Why this is fast on Arm

vLLM achieves high performance on Arm servers by combining software optimization with hardware-level acceleration. Each component of this optimized build contributes to higher throughput and lower latency during inference:

- Arm-optimized kernels: the aarch64 vLLM build uses oneDNN directly with the Arm Compute Library to accelerate matrix multiplications, normalization, and activation functions, all tuned for Arm's aarch64 architecture.
- 4‑bit weight quantization: vLLM supports INT4 quantized models, which run faster on Arm because KleidiAI microkernels use the DOT-product instructions (SDOT/UDOT) available on Arm CPUs.
- Optimized paged attention: the paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
- Efficient MoE execution: for Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers and minimize bandwidth bottlenecks.
- System tuning: setting thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters, while tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads (see the sketch after this list).

These optimizations work together to deliver higher throughput and lower latency for LLM inference on Arm servers.
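
As a minimal system-tuning sketch (the tcmalloc library path and core range are assumptions for an Ubuntu aarch64 instance; adjust both for your system), you can preload tcmalloc and pin the process to a fixed set of cores:

```bash
# Preload tcmalloc to reduce allocator contention (path is distro-specific)
sudo apt-get install -y libtcmalloc-minimal4
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

# Pin threads to a fixed core range for balanced scheduling (example: 32 cores)
export OMP_NUM_THREADS=32
taskset -c 0-31 python examples/offline_inference/basic/chat.py \
  --dtype=bfloat16 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
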

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space

This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
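
To confirm these requirements (a quick sketch; the commands assume a standard Ubuntu image), check the Python version, architecture, core count, memory, and free disk space:

```bash
python3 --version   # expect Python 3.12.x
uname -m            # expect aarch64
nproc               # expect 32 or more vCPUs
free -g | grep Mem  # expect roughly 64 GB of RAM
df -h .             # confirm at least 64 GB of free disk space
```
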

## Install build dependencies

Install the following packages required for compiling vLLM and its dependencies on Arm64:
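
The exact package list may vary with your distribution; as a representative sketch (assuming Ubuntu 22.04 on aarch64, with package names that are assumptions rather than the verified list), you would install a compiler toolchain, CMake, Ninja, Python headers, and NUMA libraries:

```bash
# Assumed package set for Ubuntu 22.04 on aarch64 - adjust for your distribution
sudo apt-get update
sudo apt-get install -y \
  build-essential gcc-12 g++-12 \
  cmake ninja-build git \
  python3-dev python3-venv python3-pip \
  libnuma-dev libtcmalloc-minimal4
```
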
This ensures optimized Arm kernels are used for matrix multiplications, layer normalization, and activation functions.

## Build vLLM for Arm64 CPU

You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.

## Create and activate a Python virtual environment

It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:

```bash
# Create and activate an isolated environment, then update pip
python3 -m venv vllm_env
source vllm_env/bin/activate
python3 -m pip install --upgrade pip
```

## Clone vLLM and install build requirements

Download the official vLLM source code and install its CPU-specific build dependencies. The specific commit (5fb4137) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance for Arm-based systems.
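
A sketch of these steps (the repository URL follows the upstream vLLM project, the commit hash comes from the text above, and the requirements file name is an assumption that varies between vLLM versions):

```bash
# Clone the upstream vLLM repository and check out the pinned commit
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 5fb4137

# Install CPU-specific build requirements (file name may differ in your checkout)
pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
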

## Build the vLLM wheel for CPU

Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference. The output wheel will appear under dist/ and include all compiled C++/PyBind modules.
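
A sketch of the wheel build (assuming the standard vLLM CPU build flow; your checkout may use different flags or a pip-based build command):

```bash
# Build a CPU-targeted wheel; the output appears under dist/
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
```
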

## Install the wheel

Install the freshly built wheel into your active environment:
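
A sketch of the install step (the exact wheel filename depends on the version you just built):

```bash
# Install the wheel produced by the build; the filename will vary
pip install dist/vllm-*.whl
```
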
Do not delete the local vLLM source directory.
The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
{{% /notice %}}

## Validate your build with offline inference

Run a quick test to confirm your Arm-optimized vLLM build works as expected. Use the built-in chat example to perform offline inference and verify that oneDNN and Arm Compute Library optimizations are active.

```bash
python examples/offline_inference/basic/chat.py \
  --dtype=bfloat16 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

This command runs a small Hugging Face model in bfloat16 precision, streaming generated tokens to the console. If you see token streaming and generated text, your vLLM build is correctly configured for Arm64 inference, confirming that the CPU-specific backend and the oneDNN and ACL optimizations were compiled into your build.

As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined `pip` install workflow for aarch64, simplifying future deployments on Arm servers.
{{% /notice %}}
You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.