diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md
new file mode 100644
index 000000000..bc4bfb3e3
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md
@@ -0,0 +1,120 @@
+---
+title: Overview and Optimized Build
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## What is vLLM?
+
+vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the prefill and decode phases of LLM inference with:
+
+- Continuous batching to keep hardware busy across many requests.
+- KV cache management to sustain concurrency during generation.
+- Token streaming so results appear as they are produced.
+
+You interact with vLLM in multiple ways:
+
+- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
+- Python API: load a model and generate locally when needed.
+
+vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.
+
+## What you build
+
+You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.
+
+## Why this is fast on Arm
+
+- Optimized kernels: the aarch64 vLLM build uses oneDNN directly with the Arm Compute Library for key operations.
+- 4‑bit weight quantization: INT4 weights are accelerated by Arm KleidiAI microkernels.
+- Efficient MoE execution: fused INT4 quantized expert layers reduce memory traffic and improve throughput.
+- Optimized paged attention: an Arm SIMD‑tuned paged attention implementation in vLLM.
+- System tuning: thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.
+
+## Before you begin
+
+- Use Python 3.12 on Ubuntu 22.04 or later.
+- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB of free disk space; you can verify this with the quick check below.
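+
+To confirm that the machine meets these requirements, run a quick check. This is a minimal sketch using standard Linux utilities; the exact output format varies by distribution:
+
+```bash
+nproc       # number of vCPUs (aim for 32 or more)
+free -g     # memory in GiB (aim for 64 or more)
+df -h .     # free disk space in the current directory (aim for 32 GB or more)
+```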
+
+Install the minimal system package required by vLLM on Arm:
+
+```bash
+sudo apt-get update -y
+sudo apt-get install -y libnuma-dev
+```
+
+Optionally, install the `tcmalloc` performance helper now or later:
+
+```bash
+sudo apt-get install -y libtcmalloc-minimal4
+```
+
+{{% notice Note %}}
+On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library via oneDNN.
+{{% /notice %}}
+
+## Build vLLM for aarch64 CPU
+
+Create and activate a virtual environment:
+
+```bash
+python3 -m venv vllm_env
+source vllm_env/bin/activate
+python -m pip install --upgrade pip
+```
+
+Clone vLLM and install the build requirements:
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout 5fb4137
+pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
+```
+
+Build a wheel targeted at CPU:
+
+```bash
+VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
+```
+
+Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:
+
+```bash
+pip install --force-reinstall dist/*.whl             # fresh install
+# pip install --no-deps --force-reinstall dist/*.whl # incremental build
+```
+
+{{% notice Tip %}}
+Do not delete the vLLM repository. The local source tree is still required for correct inference on aarch64 CPUs after you install the wheel.
+{{% /notice %}}
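+
+As an optional sanity check before running the full example below, confirm that the freshly built wheel imports cleanly. The version string you see depends on the commit you checked out:
+
+```bash
+python -c "import vllm; print(vllm.__version__)"
+```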
+
+## Quick validation with offline inference
+
+Run the built‑in chat example to confirm the build:
+
+```bash
+python examples/offline_inference/basic/chat.py \
+  --dtype=bfloat16 \
+  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
+```
+
+You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.
+
+```output
+Generated Outputs:
+--------------------------------------------------------------------------------
+Prompt: None
+
+Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
+--------------------------------------------------------------------------------
+Adding requests: 100%|████████████████████████████████████████████| 10/10 [00:00<00:00, 9552.05it/s]
+Processed prompts: 100%|████████████████████████████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
+...
+```
+
+{{% notice Note %}}
+As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in the near future.
+{{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md
new file mode 100644
index 000000000..a5d472ccc
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md
@@ -0,0 +1,148 @@
+---
+title: Quantize an LLM to INT4 for the Arm platform
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. This section uses `deepseek-ai/DeepSeek-V2-Lite` as the example model; its INT4 path in vLLM is accelerated by Arm KleidiAI microkernels.
+
+## Install quantization tools
+
+Install the model quantization packages:
+
+```bash
+pip install --no-deps compressed-tensors
+pip install llmcompressor
+```
+
+Reinstall your locally built vLLM if you rebuilt it:
+
+```bash
+pip install --no-deps dist/*.whl
+```
+
+If your chosen model is gated on Hugging Face, authenticate first:
+
+```bash
+huggingface-cli login
+```
+
+## INT4 quantization recipe
+
+Save the following as `quantize_vllm_models.py`:
+
+```python
+import argparse
+import os
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+from compressed_tensors.quantization import QuantizationScheme
+from compressed_tensors.quantization.quant_args import (
+    QuantizationArgs,
+    QuantizationStrategy,
+    QuantizationType,
+)
+
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization.quantization import QuantizationModifier
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Quantize a model using INT4 with minmax or mse and dynamic activation quantization.")
+    parser.add_argument("model_id", type=str, help="Model identifier or path")
+    parser.add_argument("--method", type=str, choices=["minmax", "mse"], default="mse", help="Quantization method")
+    parser.add_argument("--scheme", type=str, choices=["channelwise", "groupwise"], required=True, help="Quantization scheme for weights")
+    parser.add_argument("--groupsize", type=int, default=32, help="Group size for groupwise quantization")
+    args = parser.parse_args()
+
+    # Extract the base model name and derive the output directory name
+    base_model_name = os.path.basename(args.model_id.rstrip("/"))
+    act_tag = "a8dyn"
+    suffix = f"{args.method}-{args.scheme}"
+    if args.scheme == "groupwise":
+        suffix += f"-g{args.groupsize}"
+    output_dir = f"{base_model_name}-w4{act_tag}-{suffix}"
+
+    print(f"Loading model '{args.model_id}'...")
+    model = AutoModelForCausalLM.from_pretrained(
+        args.model_id, trust_remote_code=True
+    )
+    model = model.to(torch.float32)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+
+    # Weight quantization args: symmetric INT4, per-channel or per-group
+    strategy = QuantizationStrategy.CHANNEL if args.scheme == "channelwise" else QuantizationStrategy.GROUP
+    weights_args = QuantizationArgs(
+        num_bits=4,
+        type=QuantizationType.INT,
+        strategy=strategy,
+        symmetric=True,
+        dynamic=False,
+        group_size=args.groupsize if args.scheme == "groupwise" else None,
+        observer=args.method,
+    )
+
+    # Activation quantization: dynamic, asymmetric INT8 per token
+    input_acts = QuantizationArgs(
+        num_bits=8,
+        type=QuantizationType.INT,
+        strategy=QuantizationStrategy.TOKEN,
+        symmetric=False,
+        dynamic=True,
+        observer=None,
+    )
+    output_acts = None
+
+    # Create quantization scheme and recipe
+    scheme = QuantizationScheme(
+        targets=["Linear"],
+        weights=weights_args,
+        input_activations=input_acts,
+        output_activations=output_acts,
+    )
+    recipe = QuantizationModifier(
+        config_groups={"group_0": scheme},
+        ignore=["lm_head"],
+    )
+
+    # Run one-shot quantization and save the result
+    oneshot(
+        model=model,
+        recipe=recipe,
+        tokenizer=tokenizer,
+        output_dir=output_dir,
+        trust_remote_code_model=True,
+    )
+
+    print(f"Quantized model saved to: {output_dir}")
+
+
+if __name__ == "__main__":
+    main()
+```
+
+This script creates an Arm KleidiAI‑ready, 4‑bit quantized copy of the model and saves it to a new directory.
+
+## Quantize the DeepSeek‑V2‑Lite model
+
+### Quantization parameter tuning
+1. Choose `minmax` for faster quantization or `mse` for more accurate (but slower) quantization.
+2. `channelwise` is a good default for most models.
+3. `groupwise` can improve accuracy further; `--groupsize 32` is common (see the groupwise sketch after the example below).
+
+```bash
+# DeepSeek example
+python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
+  --scheme channelwise --method mse
+```
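+
+If you want to experiment with groupwise quantization instead, the same script accepts a group size. The command and the resulting directory name below are an illustration that follows the script's own naming scheme:
+
+```bash
+# Optional groupwise alternative to the channelwise command above
+python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
+  --scheme groupwise --method minmax --groupsize 32
+# Output directory: DeepSeek-V2-Lite-w4a8dyn-minmax-groupwise-g32
+```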
+
+The 4‑bit quantized DeepSeek‑V2‑Lite produced by the channelwise command is stored in the directory:
+
+```text
+DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
+```
+
+You will load this quantized model directory with vLLM in the next step.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md
new file mode 100644
index 000000000..dae180671
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md
@@ -0,0 +1,126 @@
+---
+title: Serve high-throughput inference with vLLM
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## About batch sizing in vLLM
+
+vLLM enforces two limits to balance memory use and throughput: a per‑sequence length limit (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit. In the command below, both limits are set to 4096 tokens.
+
+## Serve an OpenAI‑compatible API
+
+Start the server with sensible CPU defaults and the quantized model:
+
+```bash
+export VLLM_TARGET_DEVICE=cpu
+export VLLM_CPU_KVCACHE_SPACE=32
+export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
+export VLLM_MLA_DISABLE=1
+export ONEDNN_DEFAULT_FPMATH_MODE=BF16
+export OMP_NUM_THREADS="$(nproc)"
+export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
+
+vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
+  --dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
+```
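+
+Before moving on to concurrent traffic, it is worth sending a single request by hand. This sketch assumes the server is listening on the default `localhost:8000`; adjust the prompt and `max_tokens` to taste:
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
+        "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
+        "max_tokens": 128
+      }'
+```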
+
+## Run a multi‑request batch
+
+After confirming that a single request works, simulate concurrent load with a small client that uses the OpenAI‑compatible API. Save this as `batch_test.py`:
+
+```python
+import asyncio
+import time
+from openai import AsyncOpenAI
+
+# vLLM's OpenAI-compatible server
+client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+model = "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"  # model name as served by vLLM
+
+# Batch of 8 prompts
+messages_list = [
+    [{"role": "user", "content": "Explain Big O notation with two examples."}],
+    [{"role": "user", "content": "Show a simple recursive function and explain how it works."}],
+    [{"role": "user", "content": "Draft a polite email requesting a project deadline extension."}],
+    [{"role": "user", "content": "Explain what a hash function is and common uses."}],
+    [{"role": "user", "content": "Explain binary search and its time complexity."}],
+    [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
+    [{"role": "user", "content": "Explain how caching improves performance with a simple analogy."}],
+    [{"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}],
+]
+
+CONCURRENCY = 8
+
+async def run_one(i: int, messages):
+    resp = await client.chat.completions.create(
+        model=model,
+        messages=messages,
+        max_tokens=128,  # adjust to your configuration
+    )
+    return i, resp.choices[0].message.content
+
+async def main():
+    t0 = time.time()
+    sem = asyncio.Semaphore(CONCURRENCY)
+
+    async def guarded_run(i, msgs):
+        async with sem:
+            try:
+                return await run_one(i, msgs)
+            except Exception as e:
+                return i, f"[ERROR] {type(e).__name__}: {e}"
+
+    tasks = [asyncio.create_task(guarded_run(i, msgs)) for i, msgs in enumerate(messages_list, start=1)]
+    results = await asyncio.gather(*tasks)  # order corresponds to the tasks list
+
+    # Print outputs in input order
+    results.sort(key=lambda x: x[0])
+    for idx, out in results:
+        print(f"\n=== Output {idx} ===\n{out}\n")
+
+    print(f"Batch completed in: {time.time() - t0:.2f}s")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+Run 8 concurrent requests against your server:
+
+```bash
+python3 batch_test.py
+```
+
+This validates multi‑request behavior and shows aggregate throughput in the server logs:
+
+```output
+(APIServer pid=4474) INFO 11-10 01:00:56 [loggers.py:221] Engine 000: Avg prompt throughput: 19.7 tokens/s, Avg generation throughput: 187.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
+(APIServer pid=4474) INFO: 127.0.0.1:44060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
+```
+
+## Optional: serve the BF16 non-quantized model
+
+You can also serve the original, non‑quantized model. On Arm, the BF16 path is accelerated by vLLM's direct oneDNN integration with the Arm Compute Library:
+
+```bash
+vllm serve deepseek-ai/DeepSeek-V2-Lite \
+  --dtype bfloat16 --max-model-len 4096 \
+  --max-num-batched-tokens 4096
+```
+
+## Go beyond: power up your vLLM workflow
+
+Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:
+
+* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) such as Llama, Qwen, or Gemma.
+
+* **Connect a chat client:** Link your server with OpenAI-compatible UIs such as [Open WebUI](https://github.com/open-webui/open-webui); a minimal sketch follows this list.
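+
+As a sketch of that second idea, you can point an Open WebUI container at your vLLM endpoint. The environment variable names and image tag below are assumptions based on Open WebUI's OpenAI-compatible configuration; check its documentation for your version:
+
+```bash
+# Run Open WebUI and point it at the vLLM server on the host (assumed settings)
+docker run -d -p 3000:8080 \
+  --add-host=host.docker.internal:host-gateway \
+  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
+  -e OPENAI_API_KEY=EMPTY \
+  --name open-webui ghcr.io/open-webui/open-webui:main
+```
+
+Then browse to `http://localhost:3000` and select the served model.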
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md
new file mode 100644
index 000000000..2b404b1df
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md
@@ -0,0 +1,67 @@
+---
+title: High-throughput LLM serving using vLLM on Arm servers
+
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 60
+
+who_is_this_for: This learning path is for software developers and AI engineers who want to build an optimized vLLM for Arm servers, quantize models to INT4, and serve them through an OpenAI‑compatible API.
+
+learning_objectives:
+  - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library.
+  - Set up dependencies, including PyTorch and llmcompressor.
+  - Quantize an LLM (DeepSeek‑V2‑Lite) to 4‑bit weights.
+  - Run and serve the quantized model using vLLM, and test BF16 non‑quantized serving.
+  - Use OpenAI‑compatible endpoints and understand sequence and batch limits.
+
+prerequisites:
+  - An Arm-based Linux server (Ubuntu 22.04+ recommended) with 32+ vCPUs, 64+ GB RAM, and 32+ GB free disk.
+  - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization.
+  - (Optional) A Hugging Face token to access gated models.
+
+author:
+  - Nikhil Gupta
+
+### Tags
+skilllevels: Introductory
+subjects: ML
+armips:
+  - Neoverse
+operatingsystems:
+  - Linux
+tools_software_languages:
+  - vLLM
+  - LLM
+  - Generative AI
+  - Python
+  - PyTorch
+  - llmcompressor
+
+further_reading:
+  - resource:
+      title: vLLM Documentation
+      link: https://docs.vllm.ai/
+      type: documentation
+  - resource:
+      title: vLLM GitHub Repository
+      link: https://github.com/vllm-project/vllm
+      type: github
+  - resource:
+      title: Hugging Face Model Hub
+      link: https://huggingface.co/models
+      type: website
+  - resource:
+      title: Build and Run vLLM on Arm Servers
+      link: /learning-paths/servers-and-cloud-computing/vllm/
+      type: website
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1
+layout: "learningpathall"
+learning_path_main_page: "yes"
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_next-steps.md
new file mode 100644
index 000000000..c3db0de5a
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---