redhat-et/vllm-cpu-perf-eval

vLLM CPU Performance Evaluation

Comprehensive performance evaluation framework for vLLM on CPU platforms.

This repository provides a complete testing methodology, automation tools, and platform configurations for evaluating vLLM inference performance on CPU-based systems.

Quick Start

1. Configure Environment

Set up your test infrastructure and credentials:

# Configure test hosts
export DUT_HOSTNAME=your-dut-hostname.compute.amazonaws.com
export LOADGEN_HOSTNAME=your-loadgen-hostname.compute.amazonaws.com
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_SSH_KEY=~/.ssh/your-key.pem

# Configure HuggingFace token for model access
export HF_TOKEN=hf_your_token_here
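Before invoking any playbook, it can be worth failing fast if one of these variables is unset. A minimal sketch (the check itself is illustrative; the variable names match the exports above):

```shell
# Fail fast if any required variable is unset or empty.
check_env() {
  missing=0
  for var in DUT_HOSTNAME LOADGEN_HOSTNAME ANSIBLE_SSH_USER ANSIBLE_SSH_KEY HF_TOKEN; do
    eval "val=\${$var}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return $missing
}

check_env && echo "environment OK" || echo "fix the environment first"
```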

2. Run a Benchmark Test

Execute a single LLM benchmark with auto-configured cores:

cd automation/test-execution/ansible

# Run benchmark against a specific model and workload
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat" \
  -e "requested_cores=16" \
  -e "guidellm_profile=concurrent" \
  -e "guidellm_rate=[1,2,4,8,16,32]" \
  -e "guidellm_max_seconds=600"

# Run with variable workload for realistic traffic simulation
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat_var" \
  -e "requested_cores=16"
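To sweep the same workload across several models, the invocation above can be wrapped in a loop. A sketch, printed as a dry run (the 3B model id is illustrative; see models/models.md for the actual list):

```shell
# Dry-run sketch: print the ansible-playbook command for each model
# rather than executing it; drop the leading `echo` to run for real.
models=(
  "meta-llama/Llama-3.2-1B-Instruct"
  "meta-llama/Llama-3.2-3B-Instruct"   # illustrative model id
)

for model in "${models[@]}"; do
  echo ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
    -e "test_model=${model}" \
    -e "workload_type=chat" \
    -e "requested_cores=16"
done
```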

3. View Results

Results are saved locally:

# Results location
ls results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/

# View HTML report
open results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.html
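As the path above shows, the per-model results directory encodes the HuggingFace model id with `/` replaced by `__`; a one-liner to compute it for scripting:

```shell
# Results directories encode the HuggingFace model id with "/" replaced
# by "__", matching the path shown above.
test_model="meta-llama/Llama-3.2-1B-Instruct"
results_dir="results/llm/${test_model//\//__}"
echo "$results_dir"   # → results/llm/meta-llama__Llama-3.2-1B-Instruct
```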

See Ansible Documentation for detailed instructions and advanced usage.

Repository Structure

vllm-cpu-perf-eval/
├── README.md                           # This file
│
├── models/                             # Centralized model definitions
│   ├── models.md                       # Comprehensive model documentation
│   ├── llm-models/                     # LLM model configurations
│   │   ├── model-matrix.yaml          # LLM model test mappings
│   │   └── llm-models.md              # Redirects to models.md
│   └── embedding-models/               # Embedding model configurations
│       └── model-matrix.yaml          # Embedding model test mappings
│
├── tests/                              # Test suites and scenarios
│   ├── tests.md                        # Test suite overview
│   ├── concurrent-load/                # Test Suite 1: Concurrent load testing
│   │   ├── concurrent-load.md         # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions
│   ├── scalability/                    # Test Suite 2: Scalability testing
│   │   ├── scalability.md             # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions
│   ├── resource-contention/            # Test Suite 3: Resource contention
│   │   ├── resource-contention.md     # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions (planned)
│   └── embedding-models/               # Embedding model test scenarios
│       ├── embedding-models.md        # Embedding test documentation
│       ├── baseline-sweep.yaml        # Baseline performance tests
│       └── latency-concurrent.yaml    # Latency tests
│
├── automation/                         # Automation framework
│   ├── automation.md                   # Automation guide
│   ├── test-execution/                 # Test orchestration
│   │   ├── ansible/                   # Ansible playbooks (primary)
│   │   │   ├── inventory/             # Host configurations
│   │   │   ├── playbooks/             # Test execution playbooks
│   │   │   ├── filter_plugins/        # Custom Ansible filters
│   │   │   └── roles/                 # Ansible roles
│   │   ├── bash/                      # Bash automation scripts
│   │   │   └── embedding/             # Embedding test scripts
│   │   └── results/                   # Temporary results (gitignored)
│   ├── platform-setup/                 # Platform configuration
│   │   └── bash/intel/                # Intel platform setup scripts
│   └── utilities/                      # Helper utilities
│       ├── health-checks/             # Health check scripts
│       └── log-monitoring/            # Log analysis tools
│
├── docs/                               # Documentation
│   ├── docs.md                         # Documentation index
│   ├── methodology/                    # Test methodology
│   │   └── overview.md                # Testing approach and metrics
│   └── platform-setup/                 # Platform setup guides
│
├── results/                            # Test results (gitignored)
│   ├── by-suite/                      # Results organized by test suite
│   ├── by-model/                      # Results organized by model
│   ├── by-host/                       # Results organized by test host
│   ├── reports/                       # Generated reports
│   └── metrics/                       # Exported metrics
│
├── utils/                              # Utility scripts and tools
│
└── Configuration Files
    ├── .pre-commit-config.yaml        # Pre-commit hooks configuration
    ├── .yamllint.yaml                 # YAML linting rules
    ├── .markdownlint-cli2.yaml        # Markdown linting rules
    └── .gitignore                     # Git ignore patterns

Key Directories:

  • models/ - Model definitions reused across all test suites
  • tests/ - Test suite definitions organized by testing focus
  • automation/ - Ansible playbooks and bash scripts for test execution
  • docs/ - Comprehensive testing methodology and guides
  • results/ - Local test results (not committed to git)

See individual directory README/markdown files for detailed information.

Key Features

Flexible Container Runtime Support

  • Docker or Podman - Use either runtime
  • Auto-detection - Automatically detects available runtime
  • Rootless support - Full Podman rootless compatibility

Centralized Model Management

  • Define models once, use across all test phases
  • Easy to add new models
  • Model matrix for flexible test configuration

Multi-Platform Support

  • Intel Xeon (Ice Lake, Sapphire Rapids)
  • AMD EPYC
  • ARM64 (planned)

Comprehensive Automation

  • Ansible playbooks for platform setup and test execution
  • Bash scripts for manual operation
  • Docker/Podman Compose for containerized testing
  • Distributed testing across multiple nodes

Multiple Test Suites

  • Concurrent Load - latency and throughput under concurrent request loads
  • Scalability - capacity sweeps and saturation-point discovery
  • Resource Contention - multi-tenant resource sharing (planned)

Enhanced Concurrent Load Testing

  • ⏱️ Time-based testing - Consistent 10-minute tests across CPU types
  • 1️⃣ Single-user baseline - Concurrency=1 for efficiency calculations
  • 📊 Variable workloads - Realistic traffic simulation with statistical variance
  • 🔄 Prefix caching control - Baseline vs production comparison
  • 🎯 3-phase testing - Baseline → Realistic → Production methodology
  • 🚀 Large model support - Added gpt-oss-20b (21B MoE) for scalability testing

See 3-Phase Testing Strategy for details.
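One plausible mapping of the three phases onto the Quick Start variables, printed as a dry run. This is a hypothetical sketch using only the variables shown earlier; the authoritative flags (including prefix-caching control, which is not shown here) are in the 3-Phase Testing Strategy doc:

```shell
# Hypothetical sketch: print the command for each phase rather than run it.
run() { echo ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml "$@"; }

# Phase 1: single-user baseline (concurrency = 1)
run -e "workload_type=chat" -e "guidellm_rate=[1]"
# Phase 2: realistic traffic with statistical variance
run -e "workload_type=chat_var"
# Phase 3: production-style concurrency sweep
run -e "workload_type=chat_var" -e "guidellm_rate=[1,2,4,8,16,32]"
```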

Testing Workflow

1. Platform Setup

Configure your system for deterministic performance testing:

# With Ansible (recommended)
cd automation/platform-setup/ansible
ansible-playbook playbooks/site.yml

# With bash script
cd automation/platform-setup/bash/intel
sudo ./setup-guidellm-platform.sh --apply

See Platform Setup Guide for details.

2. Run Tests

Execute performance tests using Ansible or Docker/Podman:

# Ansible - Run entire test suite
cd automation/test-execution/ansible
ansible-playbook playbooks/run-suite.yml -e "test_suite=concurrent-load"

# Docker/Podman - Run specific test
cd tests/concurrent-load
MODEL_NAME=llama-3.2-1b SCENARIO=concurrent-8 docker compose up

See Test Execution Guide for details.

3. Analyze Results

Generate reports and compare results:

cd automation/analysis
python generate-report.py \
  --input ../../results/concurrent-load/ \
  --format html \
  --output ../../results/reports/concurrent-load.html

See Reporting Guide for details.

Documentation

Full documentation index: docs/docs.md

Test Suites

⚠️ Validation Status:

  • ✅ Concurrent Load - Fully validated and tested
  • 🚧 Scalability - Work in progress; no guarantees
  • 🚧 Resource Contention - Work in progress; no guarantees
  • 🚧 Embedding Models - Work in progress; no guarantees

Only the concurrent load suite has been fully validated. The other suites are works in progress, provided as-is with no guarantee that they will work without modification.

Test Suite: Concurrent Load

Tests model performance under various concurrent request loads.

  • Concurrency levels: 1, 2, 4, 8, 16, 32
  • 8 LLM models + 2 embedding models
  • Focus: P95 latency, TTFT, throughput scaling

Test Suite: Scalability

Characterizes maximum throughput and performance curves.

  • Sweep tests for capacity discovery
  • Synchronous baseline tests
  • Poisson distribution tests
  • Focus: Maximum capacity, saturation points

Test Suite: Resource Contention (Planned)

Multi-tenant and resource sharing scenarios.

Models

Current model coverage:

LLM Models (8 total):

  • Llama-3.2 (1B, 3B) - Prefill-heavy
  • TinyLlama-1.1B - Balanced small-scale
  • OPT (125M, 1.3B) - Decode-heavy legacy baseline
  • Granite-3.2-2B - Balanced enterprise
  • Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced

Embedding Models:

  • granite-embedding-english-r2
  • granite-embedding-278m-multilingual

See models/models.md for complete model definitions, selection rationale, and how to add new models.

Requirements

System Requirements

  • CPU: Intel Xeon (Ice Lake or newer) or AMD EPYC
  • Memory: 64GB+ RAM recommended
  • OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
  • Storage: 500GB+ for models and results

Software Requirements

  • Python 3.10+
  • Docker 24.0+ or Podman 4.0+
  • Ansible 2.14+ (for automation)
  • GuideLLM v0.5.0+
  • vLLM

See docs/getting-started/quick-start.md for installation instructions.
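A small preflight script can report which of the required tools are on the PATH. This is informational only: it locates binaries but does not verify the version minimums listed above.

```shell
# Report which required tools are on PATH; informational only.
preflight() {
  for tool in python3 docker podman ansible-playbook guidellm; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

preflight
```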

Container Runtime Support

This repository supports both Docker and Podman:

  • Docker: Traditional container runtime
  • Podman: Daemonless, rootless-capable alternative
  • Auto-detection: Automatically uses available runtime

Set runtime preference:

# Use Docker
export CONTAINER_RUNTIME=docker

# Use Podman
export CONTAINER_RUNTIME=podman

# Auto-detect (default)
export CONTAINER_RUNTIME=auto
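A plausible sketch of the auto-detection logic (the repository's actual scripts may differ): an explicit CONTAINER_RUNTIME value wins, while auto, or an unset variable, probes the PATH, preferring Docker.

```shell
# Illustrative runtime resolution: explicit value wins; "auto" probes PATH.
detect_runtime() {
  case "${CONTAINER_RUNTIME:-auto}" in
    docker|podman) echo "$CONTAINER_RUNTIME" ;;
    auto)
      if command -v docker >/dev/null 2>&1; then echo docker
      elif command -v podman >/dev/null 2>&1; then echo podman
      else echo "no container runtime found" >&2; return 1
      fi ;;
    *) echo "unknown CONTAINER_RUNTIME: $CONTAINER_RUNTIME" >&2; return 1 ;;
  esac
}

CONTAINER_RUNTIME=docker
detect_runtime   # → docker
```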

See Container Guide for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run pre-commit checks: pre-commit run --all-files
  5. Submit a pull request

Pre-commit Hooks

This repository uses pre-commit to ensure code quality.

# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install
pre-commit install --hook-type commit-msg

# Run manually
pre-commit run --all-files

License

[Add license information]

Support

Acknowledgments

  • vLLM - High-performance LLM inference engine
  • GuideLLM - LLM benchmarking tool
  • Intel and AMD for CPU optimization guidance
