5 PyPI packages optimized for different Intel/AMD CPU instruction sets
Your support encourages me to keep creating and maintaining open-source projects. If you found value in this project, consider buying me a coffee to fuel those sleepless nights.
- Overview
- Quick Start
- Package Variants
- Which Package Should I Use?
- Installation
- Docker Deployment
- Usage Examples
- CPU Compatibility Guide
- Performance Tips
- Troubleshooting
- Links & Resources
This project provides 5 CPU-optimized PyPI packages built from the upstream vLLM source code. Each package is compiled with specific Intel/AMD CPU instruction set flags to maximize inference performance on different CPU generations.
Why CPU inference?
- No expensive GPU required
- Run LLMs on any server, laptop, or edge device
- Lower power consumption and operational costs
- Ideal for development, testing, and moderate-scale deployments
- ARM64 support for AWS Graviton, Apple Silicon, and Raspberry Pi
Key Features:
- State-of-the-art serving throughput with PagedAttention
- Continuous batching for high concurrency
- OpenAI-compatible API server
- Support for 100+ popular models (Llama, Mistral, Qwen, etc.)
- Quantization support: GPTQ, AWQ, INT4, INT8, FP8
- Multi-LoRA serving
- Streaming outputs (see the example below)
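For example, streaming output from the OpenAI-compatible server can be consumed with the standard openai client. This is a minimal sketch; it assumes a server is already running on localhost:8000, as shown in the Docker and server examples later in this document:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible vLLM server is already listening on localhost:8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="facebook/opt-125m",  # whichever model the server was started with
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```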
# Detect CPU features and get install command
pkg=vllm-cpu
grep -q avx512f /proc/cpuinfo && pkg=vllm-cpu-avx512
grep -q avx512_vnni /proc/cpuinfo && pkg=vllm-cpu-avx512vnni
grep -q avx512_bf16 /proc/cpuinfo && pkg=vllm-cpu-avx512bf16
grep -q amx_bf16 /proc/cpuinfo && pkg=vllm-cpu-amxbf16
printf "\n\tRUN:\n\t\tuv pip install $pkg\n"
# For most modern Intel/AMD CPUs (baseline, works everywhere)
pip install vllm-cpu --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# For Intel Sapphire Rapids / 4th Gen Xeon (best performance)
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", device="cpu")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
docker run -p 8000:8000 mekayelanik/vllm-cpu:noavx512-latest --model facebook/opt-125m
Performance Hierarchy: vllm-cpu-amxbf16 > vllm-cpu-avx512bf16 > vllm-cpu-avx512vnni > vllm-cpu-avx512 > vllm-cpu
Is your CPU ARM64 (Graviton, Apple Silicon, Pi)?
└─ Yes → vllm-cpu
└─ No (x86_64) →
Does lscpu show "amx_bf16"?
└─ Yes → vllm-cpu-amxbf16
└─ No →
Does lscpu show "avx512_bf16"?
└─ Yes → vllm-cpu-avx512bf16
└─ No →
Does lscpu show "avx512vnni"?
└─ Yes → vllm-cpu-avx512vnni
└─ No →
Does lscpu show "avx512f"?
└─ Yes → vllm-cpu-avx512
└─ No → vllm-cpu
Run this to detect your CPU features and get the install command:
pkg=vllm-cpu
grep -q avx512f /proc/cpuinfo && pkg=vllm-cpu-avx512
grep -q avx512_vnni /proc/cpuinfo && pkg=vllm-cpu-avx512vnni
grep -q avx512_bf16 /proc/cpuinfo && pkg=vllm-cpu-avx512bf16
grep -q amx_bf16 /proc/cpuinfo && pkg=vllm-cpu-amxbf16
printf "\n\tRUN:\n\t\tuv pip install $pkg\n"
Output example:
RUN:
uv pip install vllm-cpu-amxbf16
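After installation, a quick import confirms that the wheel loads on your CPU; if you picked a variant above your CPU's instruction set, this is typically where the "Illegal instruction" crash described under Troubleshooting shows up:

```python
# Quick post-install sanity check
import vllm

print(vllm.__version__)
```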
- OS: Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+, etc.)
- Python: 3.10, 3.11, 3.12, or 3.13
- Windows: Use WSL2 (Windows Subsystem for Linux)
- macOS: Use Docker or build from source
# Replace PACKAGE with your variant (vllm-cpu, vllm-cpu-avx512, etc.)
pip install PACKAGE --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
All Variants:
# Baseline (ARM64 + x86_64 without AVX512)
pip install vllm-cpu --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# AVX512 (Intel Skylake-X, AMD Zen 4+)
pip install vllm-cpu-avx512 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# AVX512 + VNNI (Intel Cascade Lake+)
pip install vllm-cpu-avx512vnni --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# AVX512 + VNNI + BF16 (Intel Cooper Lake+, AMD Zen 4 EPYC)
pip install vllm-cpu-avx512bf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# AVX512 + VNNI + BF16 + AMX (Intel Sapphire Rapids+)
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
uv is a fast Python package manager. Install it first:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then install vLLM:
uv pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# Create and activate virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# Install specific version (e.g., 0.12.0)
pip install vllm-cpu==0.12.0 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# For versions 0.8.5-0.12.0, use .post2 releases (includes CPU platform fix)
pip install vllm-cpu==0.12.0.post2 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
Pre-built Docker images are available on Docker Hub and GitHub Container Registry.
| Variant | Docker Hub | GHCR | Platforms |
|---|---|---|---|
| noavx512 | mekayelanik/vllm-cpu:noavx512-latest | ghcr.io/mekayelanik/vllm-cpu:noavx512-latest | amd64, arm64 |
| avx512 | mekayelanik/vllm-cpu:avx512-latest | ghcr.io/mekayelanik/vllm-cpu:avx512-latest | amd64 |
| avx512vnni | mekayelanik/vllm-cpu:avx512vnni-latest | ghcr.io/mekayelanik/vllm-cpu:avx512vnni-latest | amd64 |
| avx512bf16 | mekayelanik/vllm-cpu:avx512bf16-latest | ghcr.io/mekayelanik/vllm-cpu:avx512bf16-latest | amd64 |
| amxbf16 | mekayelanik/vllm-cpu:amxbf16-latest | ghcr.io/mekayelanik/vllm-cpu:amxbf16-latest | amd64 |
Tag Format: <variant>-<version> (e.g., noavx512-0.12.0, amxbf16-0.11.2)
# Pull the image
docker pull mekayelanik/vllm-cpu:noavx512-latest
# Run OpenAI-compatible API server
docker run -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
mekayelanik/vllm-cpu:noavx512-latest \
--model facebook/opt-125m \
--host 0.0.0.0
Mount your Hugging Face cache to avoid re-downloading models:
docker run -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=your_token_here \
mekayelanik/vllm-cpu:amxbf16-latest \
--model meta-llama/Llama-2-7b-chat-hf \
--max-model-len 4096
docker run -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-e OMP_NUM_THREADS=16 \
-e MKL_NUM_THREADS=16 \
--cpus=16 \
mekayelanik/vllm-cpu:avx512bf16-latest \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--max-model-len 8192 \
--dtype bfloat16
---
services:
vllm:
image: mekayelanik/vllm-cpu:amxbf16-latest
ports:
- "8000:8000"
volumes:
- huggingface-cache:/root/.cache/huggingface
environment:
- OMP_NUM_THREADS=8
- MKL_NUM_THREADS=8
- HF_TOKEN=${HF_TOKEN}
command: ["--model", "microsoft/phi-2", "--host", "0.0.0.0", "--max-model-len", "2048"]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
deploy:
resources:
limits:
cpus: '8'
memory: 32G
volumes:
huggingface-cache:
Optimized for high-load production environments on Intel Xeon (Sapphire Rapids+) or AMD EPYC servers with 32+ cores and 128+ GB RAM.
---
services:
vllm-cpu-prod:
image: mekayelanik/vllm-cpu:amxbf16-latest
container_name: vllm-cpu-prod
restart: always
network_mode: host
cap_add:
- SYS_NICE
- IPC_LOCK
security_opt:
- seccomp=unconfined
shm_size: 16g
volumes:
- /mnt/nvme/vllm-data:/data
- /mnt/nvme/models:/data/models:ro
- type: tmpfs
target: /tmp
tmpfs:
size: 4G
environment:
# Model
- VLLM_MODEL=Qwen/Qwen3-8B
- HF_HOME=/data/models
- HF_HUB_OFFLINE=1
# Server
- VLLM_SERVER_HOST=0.0.0.0
- VLLM_SERVER_PORT=8000
- VLLM_API_KEY=${VLLM_API_KEY:-}
# CPU Optimization (CRITICAL)
# KV Cache: (RAM - Model Size - 8GB) / 2
- VLLM_CPU_KVCACHE_SPACE=40
- VLLM_CPU_OMP_THREADS_BIND=0-31
- VLLM_CPU_NUM_OF_RESERVED_CPU=2
# Threading (physical cores - reserved)
- OMP_NUM_THREADS=30
- MKL_NUM_THREADS=30
- OMP_PROC_BIND=close
- OMP_PLACES=cores
# Memory
- MALLOC_TRIM_THRESHOLD_=65536
- LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
# Logging
- VLLM_LOGGING_LEVEL=WARNING
- VLLM_NO_USAGE_STATS=1
command:
- "--max-model-len"
- "32768"
- "--dtype"
- "bfloat16"
- "--max-num-seqs"
- "256"
- "--max-num-batched-tokens"
- "32768"
- "--disable-log-requests"
- "--enable-chunked-prefill"
ulimits:
memlock:
soft: -1
hard: -1
nofile:
soft: 1048576
hard: 1048576
logging:
driver: json-file
options:
max-size: "100m"
max-file: "5"
KV Cache Sizing Guide:
| System RAM | Model Size | VLLM_CPU_KVCACHE_SPACE |
|---|---|---|
| 32 GB | 7B (~14GB) | 5 |
| 64 GB | 7B (~14GB) | 20 |
| 128 GB | 7B (~14GB) | 50 |
| 256 GB | 70B (~140GB) | 50 |
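The table follows the rule of thumb noted in the production compose file above, (RAM − model size − ~8 GB headroom) / 2. A small helper (illustrative only) makes the arithmetic explicit:

```python
def kvcache_space_gb(system_ram_gb: float, model_size_gb: float, headroom_gb: float = 8.0) -> int:
    """Rule of thumb: (RAM - model size - ~8 GB headroom) / 2, rounded down."""
    return max(1, int((system_ram_gb - model_size_gb - headroom_gb) / 2))

print(kvcache_space_gb(64, 14))   # ~21, close to the 20 suggested for 64 GB / 7B
print(kvcache_space_gb(128, 14))  # ~53, close to the 50 suggested for 128 GB / 7B
```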
# For AWS Graviton, Apple Silicon, Raspberry Pi
docker pull mekayelanik/vllm-cpu:noavx512-latest
# The image auto-detects ARM64 and uses appropriate optimizations
docker run -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
mekayelanik/vllm-cpu:noavx512-latest \
--model facebook/opt-125m
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(
model="microsoft/phi-2",
device="cpu",
dtype="bfloat16", # Use bfloat16 for better performance on supported CPUs
max_model_len=2048
)
# Generate text
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256
)
prompts = [
"Explain quantum computing in simple terms:",
"Write a Python function to reverse a string:",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt}")
print(f"Generated: {output.outputs[0].text}\n")Start the server:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--device cpu \
--host 0.0.0.0 \
--port 8000
Use with the OpenAI client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.2",
messages=[
{"role": "user", "content": "What is the capital of France?"}
]
)
print(response.choices[0].message.content)
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Text completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"prompt": "The meaning of life is",
"max_tokens": 100
}'
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-1.3b", device="cpu")
# Process many prompts efficiently with continuous batching
prompts = [f"Question {i}: What is {i} + {i}?" for i in range(100)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=50))
for output in outputs:
    print(output.outputs[0].text)
| Generation | Example CPUs | Recommended Package |
|---|---|---|
| Skylake-X (2017) | Core i9-7900X, Xeon W-2195 | vllm-cpu-avx512 |
| Cascade Lake (2019) | Xeon Platinum 8280, Core i9-10980XE | vllm-cpu-avx512vnni |
| Cooper Lake (2020) | Xeon Platinum 8380H (3rd Gen) | vllm-cpu-avx512bf16 |
| Sapphire Rapids (2023) | Xeon w9-3495X, Xeon Platinum 8480+ (4th Gen) | vllm-cpu-amxbf16 |
| Emerald Rapids (2024) | Xeon Platinum 8592+ (5th Gen) | vllm-cpu-amxbf16 |
| Granite Rapids (2024) | Xeon 6 (6th Gen) | vllm-cpu-amxbf16 |
| Consumer (no AVX512) | Core i5/i7/i9 12th-14th Gen | vllm-cpu |
| Generation | Example CPUs | Recommended Package |
|---|---|---|
| Zen 2 (2019) | Ryzen 3000, EPYC 7002 | vllm-cpu |
| Zen 3 (2020) | Ryzen 5000, EPYC 7003 | vllm-cpu |
| Zen 4 (2022) | Ryzen 7000, EPYC 9004 | vllm-cpu-avx512 or vllm-cpu-avx512bf16 |
| Zen 5 (2024) | Ryzen 9000, EPYC 9005 | vllm-cpu-avx512bf16 |
Note: AMD CPUs do not support AMX. Use vllm-cpu-avx512bf16 as the maximum for AMD.
| Platform | Example | Recommended Package |
|---|---|---|
| AWS Graviton 2/3/4 | c7g, m7g, r7g instances | vllm-cpu |
| Apple Silicon | M1, M2, M3, M4 (via Docker/Lima) | vllm-cpu |
| Ampere Altra | Various cloud instances | vllm-cpu |
| Raspberry Pi 4/5 | ARM Cortex-A72/A76 | vllm-cpu |
# Linux
lscpu | grep -E "avx512|vnni|bf16|amx"
# Detailed flags
cat /proc/cpuinfo | grep flags | head -1
Flag Meanings:
- avx512f → AVX-512 Foundation (use vllm-cpu-avx512)
- avx512vnni → Vector Neural Network Instructions (use vllm-cpu-avx512vnni)
- avx512_bf16 → BFloat16 support (use vllm-cpu-avx512bf16)
- amx_bf16 → Advanced Matrix Extensions (use vllm-cpu-amxbf16)
Using the wrong package leaves performance on the table. Always use the most optimized package your CPU supports.
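The threading tip below asks for the number of physical cores rather than hyper-threads. A quick way to look both up (a sketch that assumes psutil is installed; it is not required by these packages):

```python
import os

import psutil  # pip install psutil

print("logical CPUs:  ", os.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))
```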
# Set to number of physical cores (not threads)
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
llm = LLM(model="your-model", device="cpu", dtype="bfloat16")
llm = LLM(
model="your-model",
device="cpu",
max_model_len=4096, # Reduce if OOM
gpu_memory_utilization=0.9 # Adjust for CPU memory
)
For multi-socket systems:
# Run on specific NUMA node
numactl --cpunodebind=0 --membind=0 python your_script.py
# INT8 quantized model for lower memory usage
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ", device="cpu", quantization="gptq")Cause: CPU platform detection issue in versions 0.8.5-0.12.0.
Solution 1: Use .post2 releases (recommended):
pip install vllm-cpu==0.12.0.post2 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
Solution 2: Apply manual fix:
import os, sys, importlib.metadata as m
v = next((d.metadata['Version'] for d in m.distributions() if d.metadata['Name'].startswith('vllm-cpu')), None)
if v:
    p = next((p for p in sys.path if 'site-packages' in p and os.path.isdir(p)), None)
    if p:
        d = os.path.join(p, 'vllm-0.0.0.dist-info'); os.makedirs(d, exist_ok=True)
        open(os.path.join(d, 'METADATA'), 'w').write(f'Metadata-Version: 2.1\nName: vllm\nVersion: {v}+cpu\n')
        print(f'Fixed: vllm version set to {v}+cpu')
Cause: Using a package with instructions your CPU doesn't support.
Solution: Detect your CPU features and install the right package:
pkg=vllm-cpu
grep -q avx512f /proc/cpuinfo && pkg=vllm-cpu-avx512
grep -q avx512_vnni /proc/cpuinfo && pkg=vllm-cpu-avx512vnni
grep -q avx512_bf16 /proc/cpuinfo && pkg=vllm-cpu-avx512bf16
grep -q amx_bf16 /proc/cpuinfo && pkg=vllm-cpu-amxbf16
printf "\n\tRUN:\n\t\tuv pip install $pkg\n"
Solution: Reduce model memory usage:
llm = LLM(
model="your-model",
device="cpu",
max_model_len=2048, # Reduce context length
dtype="bfloat16" # Use lower precision
)
Checklist:
- Are you using the most optimized package for your CPU?
- Are OMP_NUM_THREADS and MKL_NUM_THREADS set correctly?
- Is the model quantized (GPTQ/AWQ) for faster inference?
- Are you on a NUMA system without proper binding?
Solution: Remove all installed vLLM variants and reinstall:
pip uninstall vllm vllm-cpu vllm-cpu-avx512 vllm-cpu-avx512vnni vllm-cpu-avx512bf16 vllm-cpu-amxbf16 -y
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
vLLM supports 100+ models including:
| Category | Models |
|---|---|
| LLMs | Llama 2/3/3.1/3.2, Mistral, Mixtral, Qwen 2/2.5/3, Phi-2/3/4, Gemma/Gemma 2/Gemma 3, DeepSeek V2/V3/R1, Yi, Falcon, Command-R |
| Reasoning | DeepSeek-R1, DeepSeek-R1-Distill, Qwen3 (with thinking mode), QwQ |
| Medical | MedGemma, BioMistral, Med-PaLM (via API) |
| Embedding | E5-Mistral, GTE, BGE, Nomic-Embed, Jina |
| Multimodal | LLaVA, LLaVA-NeXT, Qwen-VL, Qwen2.5-VL, InternVL, Pixtral, MiniCPM-V, Molmo |
| Code | CodeLlama, DeepSeek-Coder, StarCoder 1/2, CodeGemma, Codestral, Qwen2.5-Coder |
| MoE | Mixtral 8x7B/8x22B, DeepSeek-MoE, Qwen-MoE, DBRX, Arctic, DeepSeek-V3 |
Full list: vLLM Supported Models
Understanding what each instruction set provides helps you make informed decisions.
AVX-512 (Advanced Vector Extensions 512) extends SIMD (Single Instruction, Multiple Data) operations to 512-bit registers, doubling the width from AVX2's 256-bit registers. This means:
- 2x wider vectors: Process 16 floats or 8 doubles per instruction
- 32 vector registers: Up from 16 in AVX2
- Mask registers: Efficient conditional operations
Impact on LLM inference: Matrix multiplications and attention computations run faster with wider vectors.
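To confirm which SIMD level the CPU kernels will actually use, you can inspect the CPU flags directly; recent PyTorch builds also report their dispatch level. A hedged sketch (it assumes torch.backends.cpu.get_cpu_capability() is available in your PyTorch version):

```python
import torch

# Reported dispatch level of PyTorch's CPU kernels, e.g. "AVX512" or "AVX2"
print(torch.backends.cpu.get_cpu_capability())

# Raw CPU flags -- the same information the shell detection script uses
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()
print("avx512f supported:", "avx512f" in flags)
```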
VNNI adds specialized instructions for neural network inference:
- VPDPBUSD: Multiply-accumulate for INT8 data
- VPDPWSSD: Multiply-accumulate for INT16 data
Impact on LLM inference: Faster quantized (INT8) inference, reduced memory bandwidth requirements.
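Conceptually, VNNI fuses the widen-multiply-accumulate that INT8 kernels otherwise perform in separate steps. In NumPy terms (purely illustrative; the real kernels do this per 512-bit lane in hardware):

```python
import numpy as np

acts = np.random.randint(-128, 128, size=(1, 4096), dtype=np.int8)
weights = np.random.randint(-128, 128, size=(4096, 4096), dtype=np.int8)

# Widen to 32-bit, multiply, and accumulate -- the pattern VNNI executes in one instruction
acc = acts.astype(np.int32) @ weights.astype(np.int32)
print(acc.shape, acc.dtype)  # (1, 4096) int32
```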
BFloat16 is a 16-bit floating-point format optimized for deep learning:
- Same range as FP32 (8 exponent bits)
- Lower precision (7 mantissa bits vs 23)
- 2x memory efficiency compared to FP32
Impact on LLM inference: Faster training and inference with minimal accuracy loss, half the memory usage.
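The range-versus-precision trade-off is easy to inspect with torch.finfo (PyTorch is already installed as a vLLM dependency); the output shows BF16 keeping FP32's range while FP16 does not:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.1e}")
```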
AMX introduces tile-based matrix operations with dedicated accelerators:
- Tile registers: 8 tile registers of 1KB each
- TMUL: Tile matrix multiply unit
- Native BF16 support: Fast BF16 matrix operations
Impact on LLM inference: 2-8x faster matrix operations compared to AVX-512 alone, especially for transformer attention layers.
| Variable | Description | Default |
|---|---|---|
| OMP_NUM_THREADS | OpenMP thread count | All cores |
| MKL_NUM_THREADS | Intel MKL thread count | All cores |
| VLLM_CPU_KVCACHE_SPACE | KV cache size in GB | 4 |
| VLLM_CPU_OMP_THREADS_BIND | Thread binding strategy | auto |
| HF_TOKEN | Hugging Face access token | None |
| HF_HOME | Hugging Face cache directory | ~/.cache/huggingface |
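Exporting these variables in the shell (as in the examples throughout this document) is the usual approach. If you prefer to set them from Python, do it before vLLM, and therefore PyTorch/OpenMP, is imported, since they are read at library initialization. A hedged sketch with illustrative values:

```python
import os

# Illustrative values -- tune for your machine
os.environ.setdefault("OMP_NUM_THREADS", "16")
os.environ.setdefault("MKL_NUM_THREADS", "16")
os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "8")  # GB

from vllm import LLM  # import only after the environment is set

llm = LLM(model="microsoft/phi-2", device="cpu", max_model_len=2048)
```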
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--device cpu \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--dtype bfloat16 \
--max-num-seqs 256 \
--max-num-batched-tokens 32768 \
--disable-log-requests
Serve multiple LoRA adapters from a single base model:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
device="cpu",
enable_lora=True,
max_loras=4
)
# Serve different LoRAs for different requests
lora_request = LoRARequest("sql-lora", 1, "/path/to/sql-lora")
outputs = llm.generate(
["Write a SQL query to..."],
SamplingParams(max_tokens=100),
lora_request=lora_request
)
Use a smaller draft model to accelerate generation:
llm = LLM(
model="meta-llama/Llama-2-70b-chat-hf",
device="cpu",
speculative_model="meta-llama/Llama-2-7b-chat-hf",
num_speculative_tokens=5
)
import time
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-1.3b", device="cpu")
prompts = ["Hello, my name is"] * 10
sampling_params = SamplingParams(max_tokens=100)
start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generated {total_tokens} tokens in {elapsed:.2f}s")
print(f"Throughput: {total_tokens/elapsed:.2f} tokens/sec")For Raspberry Pi 4B, Pi 5, and similar ARM-based SBCs, use GGUF quantized models for optimal performance on resource-constrained devices. vLLM supports GGUF format experimentally.
Recommended Models for SBCs:
| Model | Size | Context | Type | Best For |
|---|---|---|---|---|
| gemma3:270m | 292 MB | 32K | Text | Pi 4B (2GB+), ultra-lightweight |
| smollm2:135m | 271 MB | 8K | Text | Pi 4B (2GB+), fastest responses |
| smollm2:360m | 726 MB | 8K | Text | Pi 4B (4GB+), balanced |
| qwen3:0.6b | 523 MB | 40K | Text | Pi 4B (4GB+), long context |
| gemma3:1b | 815 MB | 32K | Text | Pi 5 (4GB+), good quality |
| deepseek-r1:1.5b | 1.1 GB | 128K | Text | Pi 5 (4GB+), reasoning tasks |
| qwen3:1.7b | 1.4 GB | 40K | Text | Pi 5 (8GB), best quality |
| smollm2:1.7b | 1.8 GB | 8K | Text | Pi 5 (8GB), general use |
| ministral-3b | 3.0 GB | 256K | Text, Image | Pi 5 (8GB), multimodal |
Running GGUF Models with vLLM:
# Download a GGUF model
wget https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
# Run with vLLM (use base model tokenizer for best results)
python -m vllm.entrypoints.openai.api_server \
--model ./gemma-3-1b-it-Q4_K_M.gguf \
--tokenizer google/gemma-3-1b-it \
--device cpu \
--host 0.0.0.0 \
--port 8000
Docker Compose for SBCs:
---
services:
vllm-sbc:
image: mekayelanik/vllm-cpu:noavx512-latest
ports:
- "8000:8000"
volumes:
- ./models:/models
- huggingface-cache:/root/.cache/huggingface
environment:
- OMP_NUM_THREADS=4
- MKL_NUM_THREADS=4
command: [
"--model", "/models/gemma-3-1b-it-Q4_K_M.gguf",
"--tokenizer", "google/gemma-3-1b-it",
"--host", "0.0.0.0",
"--port", "8000",
"--max-model-len", "2048"
]
restart: unless-stopped
deploy:
resources:
limits:
memory: 4G
volumes:
huggingface-cache:
Setup:
# Create models directory and download GGUF
mkdir -p models
wget -O models/gemma-3-1b-it-Q4_K_M.gguf \
https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
# Start the server
docker compose up -d
Why GGUF for SBCs?
- 4-bit/8-bit quantization reduces memory usage significantly
- Smaller model footprint fits in limited RAM (2-8GB)
- Enables running LLMs on devices with no GPU
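The same GGUF file can also be used through the offline Python API instead of the server. A sketch mirroring the server command above (GGUF support remains experimental, and the base model's tokenizer is still recommended):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./gemma-3-1b-it-Q4_K_M.gguf",   # GGUF file downloaded in the Setup step
    tokenizer="google/gemma-3-1b-it",      # base model tokenizer, as above
    device="cpu",
    max_model_len=2048,
)
out = llm.generate(["The Raspberry Pi is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```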
For custom builds or unsupported configurations:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 cmake ninja-build ccache \
libtcmalloc-minimal4 libnuma-dev numactl git jq
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/MekayelAnik/vllm-cpu.git
cd vllm-cpu
# Build specific variant
./build_wheels.sh --variant=vllm-cpu-amxbf16 --vllm-versions=0.12.0
# Build all variants
./build_wheels.sh --variant=all --vllm-versions=0.12.0
# Build using Docker buildx (recommended)
./docker-buildx.sh --variant=vllm-cpu --vllm-version=0.12.0
# Build for specific platform
./docker-buildx.sh --variant=vllm-cpu --platform=linux/arm64 --vllm-version=0.12.0
Not directly. Use WSL2 (Windows Subsystem for Linux) to run the Linux wheels on Windows. Docker Desktop with the WSL2 backend also works.
Depends on the model:
- Small models (125M-1B): 4-8 GB
- Medium models (7B): 16-32 GB
- Large models (13B-70B): 64-256 GB
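These figures cover weights plus KV cache and runtime overhead; a rough back-of-the-envelope for the weights alone (illustrative only):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """~1e9 parameters x N bytes per parameter ~= N GB; 2 bytes for BF16/FP16, ~0.6 for 4-bit."""
    return params_billion * bytes_per_param

print(weight_memory_gb(7))        # ~14 GB in BF16, in line with the 7B guidance above
print(weight_memory_gb(7, 0.6))   # ~4 GB with 4-bit quantization
```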
Yes, but each model loads into memory separately. Consider Multi-LoRA serving to serve multiple variants efficiently from a single base model.
No. These packages are CPU-only. For GPU inference, use the official vLLM package.
pip uninstall vllm-cpu-amxbf16 -y
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
Different CPUs support different instruction sets. Using the optimal package for your CPU provides significant performance improvements. Installing the wrong package (e.g., AMX on a non-AMX CPU) causes illegal instruction crashes.
Recommended Instances:
| Instance Type | vCPUs | RAM | Package | Use Case |
|---|---|---|---|---|
| c7i.4xlarge | 16 | 32 GB | vllm-cpu-amxbf16 | 7B models |
| c7i.8xlarge | 32 | 64 GB | vllm-cpu-amxbf16 | 7B-13B models |
| c7i.16xlarge | 64 | 128 GB | vllm-cpu-amxbf16 | 13B-70B models |
| c7g.4xlarge | 16 | 32 GB | vllm-cpu | ARM64, 7B models |
| c6i.8xlarge | 32 | 64 GB | vllm-cpu-avx512vnni | 7B-13B models |
# Launch and setup (Ubuntu 22.04 AMI)
ssh ubuntu@your-instance
# Install dependencies
sudo apt update && sudo apt install -y python3-pip
pip install vllm-cpu-amxbf16 --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
# Set optimal threading
export OMP_NUM_THREADS=$(nproc --all)
export MKL_NUM_THREADS=$(nproc --all)
# Run server
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--device cpu \
--host 0.0.0.0
Recommended Machine Types:
| Machine Type | vCPUs | RAM | Package |
|---|---|---|---|
| c3-standard-22 | 22 | 88 GB | vllm-cpu-amxbf16 |
| c3-standard-44 | 44 | 176 GB | vllm-cpu-amxbf16 |
| t2a-standard-16 | 16 | 64 GB | vllm-cpu (ARM) |
| n2-standard-32 | 32 | 128 GB | vllm-cpu-avx512vnni |
Recommended VM Sizes:
| VM Size | vCPUs | RAM | Package |
|---|---|---|---|
| Standard_D32s_v5 | 32 | 128 GB | vllm-cpu-avx512vnni |
| Standard_E32s_v5 | 32 | 256 GB | vllm-cpu-avx512vnni |
| Standard_DC32s_v3 | 32 | 256 GB | vllm-cpu-amxbf16 |
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-cpu
labels:
app: vllm
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: mekayelanik/vllm-cpu:amxbf16-latest
ports:
- containerPort: 8000
args:
- "--model"
- "microsoft/phi-2"
- "--host"
- "0.0.0.0"
- "--max-model-len"
- "2048"
env:
- name: OMP_NUM_THREADS
value: "8"
- name: MKL_NUM_THREADS
value: "8"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
resources:
requests:
cpu: "8"
memory: "32Gi"
limits:
cpu: "16"
memory: "64Gi"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 80
targetPort: 8000
type: LoadBalancer
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-cpu
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed",
model="mistralai/Mistral-7B-Instruct-v0.2"
)
response = llm.invoke("Explain machine learning in simple terms")
print(response.content)
from llama_index.llms.openai_like import OpenAILike
llm = OpenAILike(
api_base="http://localhost:8000/v1",
api_key="not-needed",
model="mistralai/Mistral-7B-Instruct-v0.2"
)
response = llm.complete("What is the capital of France?")
print(response.text)
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
kernel = Kernel()
kernel.add_service(OpenAIChatCompletion(
service_id="vllm",
ai_model_id="mistralai/Mistral-7B-Instruct-v0.2",
base_url="http://localhost:8000/v1",
api_key="not-needed"
))
The Docker images run as a non-root user (vllm) by default for security.
# Bind to localhost only (recommended for development)
python -m vllm.entrypoints.openai.api_server \
--model your-model \
--device cpu \
--host 127.0.0.1
# Use reverse proxy (nginx) for production
# Never expose vLLM directly to the internet
vLLM supports optional API key authentication:
python -m vllm.entrypoints.openai.api_server \
--model your-model \
--device cpu \
--api-key your-secret-key
Use a reverse proxy like nginx for rate limiting:
limit_req_zone $binary_remote_addr zone=vllm:10m rate=10r/s;
server {
location /v1/ {
limit_req zone=vllm burst=20 nodelay;
proxy_pass http://localhost:8000;
}
}
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
See GitHub Releases for version history and release notes.
| Resource | Link |
|---|---|
| GitHub Repository | github.com/MekayelAnik/vllm-cpu |
| Docker Hub | hub.docker.com/r/mekayelanik/vllm-cpu |
| GHCR | ghcr.io/mekayelanik/vllm-cpu |
| vLLM Documentation | docs.vllm.ai |
| Upstream vLLM | github.com/vllm-project/vllm |
| Package | PyPI Link |
|---|---|
| vllm-cpu | pypi.org/project/vllm-cpu |
| vllm-cpu-avx512 | pypi.org/project/vllm-cpu-avx512 |
| vllm-cpu-avx512vnni | pypi.org/project/vllm-cpu-avx512vnni |
| vllm-cpu-avx512bf16 | pypi.org/project/vllm-cpu-avx512bf16 |
| vllm-cpu-amxbf16 | pypi.org/project/vllm-cpu-amxbf16 |
- This project: GPL-3.0
- Upstream vLLM: Apache-2.0
Your support encourages me to keep creating and maintaining open-source projects. If you found value in this project, consider buying me a coffee to fuel those sleepless nights.
Originally developed at Sky Computing Lab, UC Berkeley