diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md
index 9209cebfc..a20648437 100644
--- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md
@@ -8,57 +8,74 @@ layout: learningpathall
 ## What is vLLM?
 
-vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the LLM inference prefill and decode phases with:
-
-- Continuous batching to keep hardware busy across many requests.
-- KV cache management to sustain concurrency during generation.
-- Token streaming so results appear as they are produced.
-
-You interact with vLLM in multiple ways:
-
-- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
-- Python API: load a model and generate locally when needed.
-
-vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.
+vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs).
+It is designed to make LLM inference fast, memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases.
+
+### Key features
+
+- Continuous batching: dynamically combines incoming inference requests into a single large batch, maximizing hardware utilization and throughput.
+- KV cache management: efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
+- Token streaming: streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios.
+
+### Interaction modes
+
+You can use vLLM in two main ways:
+
+- OpenAI-compatible REST server: vLLM exposes a `/v1/chat/completions` endpoint that follows the OpenAI API schema, making it a drop-in backend for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK.
+- Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation.
+
+vLLM supports Hugging Face Transformers models out of the box and scales seamlessly from single-prompt testing to production batch inference.
 
 ## What you build
 
-You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.
+In this Learning Path, you build a CPU-optimized version of vLLM targeting the Arm64 (aarch64) architecture, integrated with oneDNN and the Arm Compute Library (ACL).
+This build enables high-performance LLM inference on Arm servers by using Arm-optimized math libraries and kernels.
+After compiling, you validate the build by running a local chat example to confirm functionality and measure baseline inference speed.
 
 ## Why this is fast on Arm
 
+vLLM’s performance on Arm servers comes from both software optimization and hardware-level acceleration.
+Each component of this optimized build contributes to higher throughput and lower latency during inference:
+
 - Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
-- 4‑bit weight quantization: INT4 quantization support & acceleration by Arm KleidiAI microkernels.
-- Efficient MoE execution: Fused INT4 quantized expert layers reduce memory traffic and improve throughput.
-- Optimized Paged attention: Arm SIMD tuned paged attention implementation in vLLM.
-- System tuning: Thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.
+- 4-bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this path with KleidiAI microkernels that take advantage of dot-product (SDOT/UDOT) instructions.
+- Efficient MoE execution: for Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks.
+- Optimized paged attention: the paged attention mechanism, which manages the KV cache during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
+- System tuning: thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters, while enabling tcmalloc (thread-caching malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads.
 
 ## Before you begin
 
-- Use Python 3.12 on Ubuntu 22.04+
-- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB free disk.
+Verify that your environment meets the following requirements:
+
+- Python version: Python 3.12 on Ubuntu 22.04 LTS or later.
+- Hardware: at least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.
 
-Install the minimum system package used by vLLM on Arm:
+This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
+
+### Install build dependencies
+
+Install the following packages required for compiling vLLM and its dependencies on Arm64:
 
 ```bash
 sudo apt-get update -y
 sudo apt-get install -y build-essential cmake libnuma-dev
-sudo apt install python3.12-venv python3.12-dev
+sudo apt install -y python3.12-venv python3.12-dev
 ```
 
-Optional performance helper you can install now or later:
+You can optionally install `tcmalloc`, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
 
 ```bash
 sudo apt-get install -y libtcmalloc-minimal4
 ```
 
 {{% notice Note %}}
-On aarch64, vLLM’s CPU backend automatically builds with Arm Compute Library via oneDNN.
+On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library (ACL) through oneDNN.
+This ensures optimized Arm kernels are used for matrix multiplications, layer normalization, and activation functions without additional configuration.
 {{% /notice %}}
 
-## Build vLLM for aarch64 CPU
+## Build vLLM for Arm64 CPU
+
+You’ll now build vLLM optimized for Arm (aarch64) servers, with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.
 
-Create and activate a virtual environment:
+1. Create and activate a Python virtual environment. Building vLLM inside an isolated environment prevents conflicts between system and project dependencies:
 
 ```bash
 python3.12 -m venv vllm_env
@@ -66,7 +83,8 @@ source vllm_env/bin/activate
 python3 -m pip install --upgrade pip
 ```
 
-Clone vLLM and install build requirements:
+2. Clone vLLM and install build requirements. Download the official vLLM source code and install its CPU-specific build dependencies:
 
 ```bash
 git clone https://github.com/vllm-project/vllm.git
@@ -74,14 +92,18 @@ cd vllm
 git checkout 5fb4137
 pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
 ```
+The specific commit (`5fb4137`) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance on Arm-based systems.
 
-Build a wheel targeted at CPU:
+3. Build the vLLM wheel for CPU. Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:
 
 ```bash
 VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
 ```
+The output wheel appears under `dist/` and includes all compiled C++/PyBind modules.
 
-Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:
+4. Install the wheel into your active environment. For incremental rebuilds, use `--no-deps` to avoid reinstalling dependencies:
 
 ```bash
 pip install --force-reinstall dist/*.whl # fresh install
@@ -89,12 +111,14 @@ pip install --force-reinstall dist/*.whl # fresh install
 ```
 
 {{% notice Tip %}}
-Do NOT delete vLLM repo. Local vLLM repository is required for corect inferencing on aarch64 CPU after installing the wheel.
+Do not delete the local vLLM source directory.
+The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
 {{% /notice %}}
 
-## Quick validation via offline inferencing
+## Quick validation via offline inference
 
-Run the built‑in chat example to confirm the build:
+Once your Arm-optimized vLLM build completes, validate it by running a small offline inference example. This confirms that the CPU-specific backend, along with the oneDNN and ACL optimizations, was compiled correctly into your build.
+Run the built-in chat example included in the vLLM repository:
 
 ```bash
 python examples/offline_inference/basic/chat.py \
@@ -102,7 +126,10 @@ python examples/offline_inference/basic/chat.py \
 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
 ```
 
-You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.
+Explanation of the flags:
+
+- `--dtype=bfloat16` runs inference in bfloat16 precision. Recent Arm processors support the BFloat16 (BF16) number format in PyTorch; for example, AWS Graviton3 and Graviton4 processors support BFloat16.
+- `--model` specifies a small Hugging Face model for testing (TinyLlama-1.1B-Chat), ideal for functional validation before deploying larger models.
+
+You should see token streaming in the console, followed by a generated output confirming that vLLM’s inference pipeline is working correctly.
 
 ```output
 Generated Outputs:
@@ -117,5 +144,8 @@ Processed prompts: 100%|██████████████████
 ```
 
 {{% notice Note %}}
-As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in near future.
+As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
 {{% /notice %}}
+
+You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
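+
+As an optional extra check, you can also drive the build through vLLM’s Python API, described earlier in the interaction modes. The snippet below is a minimal sketch that assumes the same TinyLlama model used above and the `vllm_env` environment you created earlier; adjust the prompt and sampling settings as needed.
+
+```python
+from vllm import LLM, SamplingParams
+
+# Load the validation model in bfloat16, matching the chat example above
+llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")
+
+# Keep generation short for a quick functional check
+sampling = SamplingParams(temperature=0.7, max_tokens=64)
+
+outputs = llm.generate(["Explain what vLLM is in one sentence."], sampling)
+for output in outputs:
+    print(output.outputs[0].text)
+```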
+Next, you’ll proceed to model quantization, where you’ll compress LLM weights to INT4 precision using `llmcompressor` and benchmark the resulting performance improvements.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md
index 056010811..102ea00e0 100644
--- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md
@@ -5,33 +5,39 @@ weight: 3
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Accelerating LLMs with 4-bit quantization
 
-You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this guide, we use `deepseek-ai/DeepSeek-V2-Lite` as the example model which gets accelerated by the INT4 path in vLLM using Arm KleidiAI microkernels.
+You can accelerate many LLMs on Arm CPUs with 4-bit quantization. In this section, you’ll quantize the `deepseek-ai/DeepSeek-V2-Lite` model to 4-bit integer (INT4) weights.
+The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
 
 ## Install quantization tools
 
-Install the vLLM model quantization packages
+Install the quantization dependencies used by vLLM and the `llmcompressor` toolkit:
 
 ```bash
 pip install --no-deps compressed-tensors
 pip install llmcompressor
 ```
-
-Reinstall your locally built vLLM if you rebuilt it:
+
+- `compressed-tensors` provides the underlying tensor storage and compression utilities used for quantized model formats.
+- `llmcompressor` includes quantization, pruning, and weight clustering utilities compatible with Hugging Face Transformers and vLLM runtime formats.
+
+If you recently rebuilt vLLM, reinstall your locally built wheel to ensure compatibility with the quantization extensions:
 
 ```bash
 pip install --no-deps dist/*.whl
 ```
 
-If your chosen model is gated on Hugging Face, authenticate first:
+Authenticate with Hugging Face (if required):
+
+If the model you plan to quantize is gated on Hugging Face (for example, DeepSeek or proprietary models), log in before downloading the model weights:
 
 ```bash
 huggingface-cli login
 ```
 
-## INT4 Quantization recipe
+## INT4 quantization recipe
 
-Save the following as `quantize_vllm_models.py`:
+Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:
 
 ```python
 import argparse
@@ -124,22 +130,26 @@ if __name__ == "__main__":
     main()
 ```
 
-This script creates a Arm KleidiAI 4‑bit quantized copy of the vLLM model and saves it to a new directory.
+This script creates an Arm KleidiAI INT4 quantized copy of the vLLM model and saves it to a new directory.
 
 ## Quantize DeepSeek‑V2‑Lite model
 
 ### Quantization parameter tuning
 
+Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs. Example commands follow the list below.
+
 1. You can choose `minmax` (faster model quantization) or `mse` (more accurate but slower model quantization) method.
 2. `channelwise` is a good default for most models.
 3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
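+
+If you want to experiment with the groupwise scheme described above, the invocation below is a sketch that reuses the script’s `--scheme`, `--groupsize`, and `--method` flags; the channelwise command used in the rest of this Learning Path follows it.
+
+```bash
+# Optional groupwise variant; can improve accuracy further (see the tuning notes above)
+python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
+    --scheme groupwise --groupsize 32 --method mse
+```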
+
+Execute the following command to quantize the DeepSeek-V2-Lite model:
+
 ```bash
 # DeepSeek example
 python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
     --scheme channelwise --method mse
 ```
 
-The 4-bit quantized DeepSeek-V2-Lite will be stored the directory:
+This will generate an INT4 quantized model directory such as:
 
 ```text
 DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md
index dae180671..0e208af88 100644
--- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md
@@ -6,13 +6,20 @@ weight: 4
 layout: learningpathall
 ---
 
-## About batch sizing in vLLM
+## Batch sizing in vLLM
 
-vLLM enforces two limits to balance memory use and throughput: a per‑sequence length (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit.
+vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:
+
+- `max_model_len`: the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit.
+- `max_num_batched_tokens`: the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit.
+
+Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated.
+On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.
 
 ## Serve an OpenAI‑compatible API
 
-Start the server with sensible CPU default parameters and a quantized model:
+Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables tuned for CPU performance:
 
 ```bash
 export VLLM_TARGET_DEVICE=cpu
@@ -27,9 +34,19 @@ vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
 --dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
 ```
 
+The server now exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint.
+
+You can test it with any OpenAI-style client to measure tokens-per-second throughput and response latency on your Arm-based server, as shown in the example below.
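+
+The following request is a minimal sketch using `curl`; it assumes the server above is still running on the default port 8000 and that the quantized model directory name matches the one generated earlier.
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
+        "messages": [{"role": "user", "content": "Give me one fun fact about Arm CPUs."}],
+        "max_tokens": 64
+      }'
+```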
+
 ## Run multi‑request batch
+
+After verifying a single request in the previous section, simulate concurrent load against the OpenAI-compatible server to exercise vLLM’s continuous batching scheduler.
 
-After confirming a single request works explained in previous example, simulate concurrent load with a small OpenAI API compatible client. Save this as `batch_test.py`:
+About the client:
+
+- It uses `AsyncOpenAI` with `base_url="http://localhost:8000/v1"` to target the vLLM server.
+- A semaphore caps concurrency at 8 simultaneous requests (adjust `CONCURRENCY` to scale the load).
+- `max_tokens` limits the number of generated tokens per request, which directly affects batch size and KV cache usage.
+
+Save the code below in a file named `batch_test.py`:
 
 ```python
 import asyncio
@@ -88,7 +105,7 @@ if __name__ == "__main__":
     asyncio.run(main())
 ```
 
-Run 8 concurrent requests against your server:
+Run 8 concurrent requests:
 
 ```bash
 python3 batch_test.py
@@ -108,19 +125,28 @@ This validates multi‑request behavior and shows aggregate throughput in the se
 (APIServer pid=4474) INFO:     127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 (APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
 ```
-## Optional: Serving BF16 non-quantized model
+## Optional: serve a BF16 (non-quantized) model
 
-For a BF16 path on Arm, vLLM is acclerated by direct oneDNN integration in vLLM which allows aarch64 model to be hyperoptimized.
+For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration, which routes to Arm-optimized kernels via ACL on aarch64.
 
 ```bash
 vllm serve deepseek-ai/DeepSeek-V2-Lite \
 --dtype bfloat16 --max-model-len 4096 \
 --max-num-batched-tokens 4096
 ```
+Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance and accuracy trade-offs on your Arm system.
 
 ## Go Beyond: Power Up Your vLLM Workflow
 
-Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:
+Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.
+
+**Try different models:** Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration:
+
+- Meta Llama 2 and Llama 3: strong general-purpose baselines, well suited to comparing BF16 and INT4 performance.
+- Qwen and Qwen-Chat: high-quality multilingual and instruction-tuned models.
+- Gemma (Google): a compact and efficient architecture, ideal for edge or cost-optimized serving.
+
+You can quantize and serve them with the same `quantize_vllm_models.py` recipe; just update the model name.
 
-* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) like Llama, Qwen or Gemma.
+**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui).
 
-* **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
\ No newline at end of file
+You can continue exploring how Arm’s efficiency, oneDNN and ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md
index 2b404b1df..ecc422b3f 100644
--- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md
@@ -1,5 +1,5 @@
 ---
-title: High throughput LLM serving using vLLM on Arm Servers
+title: Optimized LLM inference with vLLM on Arm-based servers
 
 draft: true
 cascade:
@@ -7,19 +7,18 @@ cascade:
 
 minutes_to_complete: 60
 
-who_is_this_for: This learning path is for software developers and AI engineers who want to build an optimized vLLM for Arm servers, quantize models to INT4, and serve them through an OpenAI‑compatible API.
+who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, and serve them efficiently through an OpenAI-compatible API.
 
 learning_objectives:
-  - Build an optimized vLLM for aarch64 with oneDNN + Arm Compute Library.
-  - Set up dependencies including PyTorch and llmcompressor dependencies.
-  - Quantize an LLM (DeepSeek‑V2‑Lite) to 4‑bit weights.
-  - Run and serve the quantized model using vLLM & test BF16 non‑quantized serving.
+  - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL).
+  - Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries.
+  - Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision.
+  - Run and serve both quantized and BF16 (non-quantized) variants using vLLM.
   - Use OpenAI‑compatible endpoints and understand sequence and batch limits.
 
 prerequisites:
-  - An Arm-based Linux server (Ubuntu 22.04+ recommended) with 32+ vCPUs, 64+ GB RAM, and 32+ GB free disk.
+  - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space.
   - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization.
-  - Optional: a Hugging Face token to access gated models.
 
 author:
   - Nikhil Gupta
@@ -37,8 +36,7 @@ tools_software_languages:
   - Generative AI
   - Python
   - PyTorch
-  - llmcompressor
-
+
 further_reading:
   - resource:
     title: vLLM Documentation