
Commit 6b13787

[feat]: Add learning path for vLLM high throughput serving on aarch64 server CPUs
Signed-off-by: Nikhil Gupta <[email protected]>
1 parent e8bad70 commit 6b13787

File tree

5 files changed: +466 −0 lines changed

Lines changed: 120 additions & 0 deletions

---
title: Overview and Optimized Build
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the LLM prefill and decode phases with:

- Continuous batching to keep hardware busy across many requests.
- KV cache management to sustain concurrency during generation.
- Token streaming so results appear as they are produced.

You interact with vLLM in multiple ways:

- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
- Python API: load a model and generate locally when needed.

vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.

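For a quick feel for the Python API path, the snippet below is a minimal offline generation sketch; the model name matches the TinyLlama model used later in this Learning Path, and the sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# Minimal offline generation sketch (model and sampling values are examples)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate a single completion and print the text
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```
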
## What you build

You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.

## Why this is fast on Arm

- Optimized kernels: the aarch64 vLLM build uses oneDNN directly with the Arm Compute Library for key operations.
- 4‑bit weight quantization: INT4 quantization is supported and accelerated by Arm KleidiAI microkernels.
- Efficient MoE execution: fused INT4 quantized expert layers reduce memory traffic and improve throughput.
- Optimized paged attention: an Arm SIMD‑tuned paged attention implementation in vLLM.
- System tuning: thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.

## Before you begin

- Use Python 3.12 on Ubuntu 22.04 or later.
- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB of free disk space.

Install the minimum system package used by vLLM on Arm:

```bash
sudo apt-get update -y
sudo apt-get install -y libnuma-dev
```

Optional performance helper you can install now or later:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```

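If you install `tcmalloc`, you can preload it so vLLM picks it up. The library path below is the usual Ubuntu aarch64 location (it is also used in the serving step later) and may differ on your distribution:

```bash
# Preload tcmalloc for lower allocator overhead (path may vary by distribution)
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```
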
{{% notice Note %}}
On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library via oneDNN.
{{% /notice %}}

## Build vLLM for aarch64 CPU

Create and activate a virtual environment:

```bash
python3 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

Clone vLLM and install build requirements:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 5fb4137
pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
```

Build a wheel targeted at CPU:

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
```

Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:

```bash
pip install --force-reinstall dist/*.whl              # fresh install
# pip install --no-deps --force-reinstall dist/*.whl  # incremental build
```

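As a quick sanity check that the wheel landed in the active virtual environment, you can import the package and print its version (an optional check, not part of the build itself):

```bash
python -c "import vllm; print(vllm.__version__)"
```
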
{{% notice Tip %}}
Do not delete the vLLM repository. The local checkout is required for correct inference on aarch64 CPUs after installing the wheel.
{{% /notice %}}

## Quick validation with offline inference

Run the built‑in chat example to confirm the build:

```bash
python examples/offline_inference/basic/chat.py \
    --dtype=bfloat16 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.

```output
Generated Outputs:
--------------------------------------------------------------------------------
Prompt: None

Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
--------------------------------------------------------------------------------
Adding requests: 100%|████████████████████| 10/10 [00:00<00:00, 9552.05it/s]
Processed prompts: 100%|████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
...
```

{{% notice Note %}}
As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in the near future.
{{% /notice %}}

Lines changed: 148 additions & 0 deletions

---
title: Quantize an LLM to INT4 for Arm Platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. This guide uses `deepseek-ai/DeepSeek-V2-Lite` as the example model, which is accelerated by the INT4 path in vLLM using Arm KleidiAI microkernels.

## Install quantization tools

Install the vLLM model quantization packages:

```bash
pip install --no-deps compressed-tensors
pip install llmcompressor
```

Reinstall your locally built vLLM if you rebuilt it:

```bash
pip install --no-deps dist/*.whl
```

If your chosen model is gated on Hugging Face, authenticate first:

```bash
huggingface-cli login
```

## INT4 quantization recipe

Save the following as `quantize_vllm_models.py`:

```python
import argparse
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization.quantization import QuantizationModifier


def main():
    parser = argparse.ArgumentParser(description="Quantize a model using INT4 with minmax or mse and dynamic activation quantization.")
    parser.add_argument("model_id", type=str, help="Model identifier or path")
    parser.add_argument("--method", type=str, choices=["minmax", "mse"], default="mse", help="Quantization method")
    parser.add_argument("--scheme", type=str, choices=["channelwise", "groupwise"], required=True, help="Quantization scheme for weights")
    parser.add_argument("--groupsize", type=int, default=32, help="Group size for groupwise quantization")
    args = parser.parse_args()

    # Extract base model name for output dir
    base_model_name = os.path.basename(args.model_id.rstrip("/"))
    act_tag = "a8dyn"
    suffix = f"{args.method}-{args.scheme}"
    if args.scheme == "groupwise":
        suffix += f"-g{args.groupsize}"
    output_dir = f"{base_model_name}-w4{act_tag}-{suffix}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, trust_remote_code=True
    )
    model = model.to(torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Weight quantization args
    strategy = QuantizationStrategy.CHANNEL if args.scheme == "channelwise" else QuantizationStrategy.GROUP
    weights_args = QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=strategy,
        symmetric=True,
        dynamic=False,
        group_size=args.groupsize if args.scheme == "groupwise" else None,
        observer=args.method,
    )

    # Activation quantization
    input_acts = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )
    output_acts = None

    # Create quantization scheme and recipe
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=input_acts,
        output_activations=output_acts,
    )
    recipe = QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    )

    # Run quantization
    oneshot(
        model=model,
        recipe=recipe,
        tokenizer=tokenizer,
        output_dir=output_dir,
        trust_remote_code_model=True,
    )

    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```

This script creates an Arm KleidiAI‑accelerated 4‑bit quantized copy of the model and saves it to a new directory.

## Quantize the DeepSeek‑V2‑Lite model

### Quantization parameter tuning

1. Choose the `minmax` method for faster quantization, or `mse` for more accurate but slower quantization.
2. `channelwise` is a good default for most models.
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.

```bash
# DeepSeek example
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme channelwise --method mse
```

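If you want to try the `groupwise` scheme instead, the same script accepts the flags below. This is an optional alternative run, and the output directory name changes accordingly (for example, `DeepSeek-V2-Lite-w4a8dyn-mse-groupwise-g32`):

```bash
# Optional: groupwise variant (can improve accuracy further; see the tuning notes above)
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme groupwise --groupsize 32 --method mse
```
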
The 4‑bit quantized DeepSeek‑V2‑Lite model from the channelwise command above is stored in the directory:

```text
DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
```

You will load this quantized model directory with vLLM in the next step.

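Before moving on, you can optionally confirm that the quantization artifacts were written. A directory listing (a simple check; exact file names vary by model and llmcompressor version) should show a model config, tokenizer files, and quantized safetensors weights:

```bash
ls DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
```
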
Lines changed: 126 additions & 0 deletions

---
title: Serve high throughput inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## About batch sizing in vLLM

vLLM enforces two limits to balance memory use and throughput: a per‑sequence length limit (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit. For example, with both limits set to 4096, one request can use the full 4096‑token context, while several shorter prompts can be batched together as long as their combined token count stays at or below 4096.

## Serve an OpenAI‑compatible API

Start the server with sensible CPU default parameters and a quantized model:

```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
    --dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
```

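Once the server is up, you can confirm a single request end to end before generating batch traffic. The `curl` call below is a minimal sketch against the OpenAI‑compatible endpoint, assuming the default port 8000:

```bash
# Single-request smoke test against the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
        "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
        "max_tokens": 64
      }'
```
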
## Run a multi‑request batch

After confirming that a single request works, as shown above, simulate concurrent load with a small OpenAI‑compatible client. Save this as `batch_test.py`:

```python
import asyncio
import time
from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"  # vLLM server model

# Batch of 8 prompts
messages_list = [
    [{"role": "user", "content": "Explain Big O notation with two examples."}],
    [{"role": "user", "content": "Show a simple recursive function and explain how it works."}],
    [{"role": "user", "content": "Draft a polite email requesting a project deadline extension."}],
    [{"role": "user", "content": "Explain what a hash function is and common uses."}],
    [{"role": "user", "content": "Explain binary search and its time complexity."}],
    [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
    [{"role": "user", "content": "Explain how caching improves performance with a simple analogy."}],
    [{"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}],
]

CONCURRENCY = 8

async def run_one(i: int, messages):
    resp = await client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=128,  # Change as per configuration
    )
    return i, resp.choices[0].message.content

async def main():
    t0 = time.time()
    sem = asyncio.Semaphore(CONCURRENCY)

    async def guarded_run(i, msgs):
        async with sem:
            try:
                return await run_one(i, msgs)
            except Exception as e:
                return i, f"[ERROR] {type(e).__name__}: {e}"

    tasks = [asyncio.create_task(guarded_run(i, msgs)) for i, msgs in enumerate(messages_list, start=1)]
    results = await asyncio.gather(*tasks)  # order corresponds to tasks list

    # Print outputs in input order
    results.sort(key=lambda x: x[0])
    for idx, out in results:
        print(f"\n=== Output {idx} ===\n{out}\n")

    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Run 8 concurrent requests against your server:

```bash
python3 batch_test.py
```

This validates multi‑request behavior and shows aggregate throughput in the server logs.

```output
(APIServer pid=4474) INFO 11-10 01:00:56 [loggers.py:221] Engine 000: Avg prompt throughput: 19.7 tokens/s, Avg generation throughput: 187.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=4474) INFO: 127.0.0.1:44060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```

## Optional: Serve a BF16 non-quantized model

For a BF16 path on Arm, vLLM is accelerated by its direct oneDNN integration, which provides optimized aarch64 kernels for the non-quantized model:

```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --dtype bfloat16 --max-model-len 4096 \
    --max-num-batched-tokens 4096
```

## Go Beyond: Power Up Your vLLM Workflow

Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:

* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) such as Llama, Qwen, or Gemma.

* **Connect a chat client:** Link your server with OpenAI-compatible UIs such as [Open WebUI](https://github.com/open-webui/open-webui).
