Comprehensive performance evaluation framework for vLLM on CPU platforms.
This repository provides a complete testing methodology, automation tools, and platform configurations for evaluating vLLM inference performance on CPU-based systems.
Set up your test infrastructure and credentials:
```bash
# Configure test hosts
export DUT_HOSTNAME=your-dut-hostname.compute.amazonaws.com
export LOADGEN_HOSTNAME=your-loadgen-hostname.compute.amazonaws.com
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_SSH_KEY=~/.ssh/your-key.pem

# Configure HuggingFace token for model access
export HF_TOKEN=hf_your_token_here
```

Execute a single LLM benchmark with auto-configured cores:
```bash
cd automation/test-execution/ansible

# Run benchmark against a specific model and workload
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat" \
  -e "requested_cores=16" \
  -e "guidellm_profile=concurrent" \
  -e "guidellm_rate=[1,2,4,8,16,32]" \
  -e "guidellm_max_seconds=600"

# Run with variable workload for realistic traffic simulation
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat_var" \
  -e "requested_cores=16"
```

Results are saved locally:
```bash
# Results location
ls results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/

# View HTML report
open results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.html
```

See Ansible Documentation for detailed instructions and advanced usage.
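The `chat-*/` glob matches one directory per run, so after several runs it is handy to pick out only the newest one. A small helper (hypothetical, `latest_run` is not part of the repository; it relies only on `ls -t` mtime sorting) does this:

```bash
# latest_run DIR PREFIX -> newest matching run directory (sorted by mtime).
latest_run() {
  ls -td "$1/$2"*/ 2>/dev/null | head -n 1
}

# Example, using the results layout shown above:
latest_run results/llm/meta-llama__Llama-3.2-1B-Instruct chat-
```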
```text
vllm-cpu-perf-eval/
├── README.md                         # This file
│
├── models/                           # Centralized model definitions
│   ├── models.md                     # Comprehensive model documentation
│   ├── llm-models/                   # LLM model configurations
│   │   ├── model-matrix.yaml         # LLM model test mappings
│   │   └── llm-models.md             # Redirects to models.md
│   └── embedding-models/             # Embedding model configurations
│       └── model-matrix.yaml         # Embedding model test mappings
│
├── tests/                            # Test suites and scenarios
│   ├── tests.md                      # Test suite overview
│   ├── concurrent-load/              # Test Suite 1: Concurrent load testing
│   │   ├── concurrent-load.md        # Suite documentation
│   │   └── *.yaml                    # Test scenario definitions
│   ├── scalability/                  # Test Suite 2: Scalability testing
│   │   ├── scalability.md            # Suite documentation
│   │   └── *.yaml                    # Test scenario definitions
│   ├── resource-contention/          # Test Suite 3: Resource contention
│   │   ├── resource-contention.md    # Suite documentation
│   │   └── *.yaml                    # Test scenario definitions (planned)
│   └── embedding-models/             # Embedding model test scenarios
│       ├── embedding-models.md       # Embedding test documentation
│       ├── baseline-sweep.yaml       # Baseline performance tests
│       └── latency-concurrent.yaml   # Latency tests
│
├── automation/                       # Automation framework
│   ├── automation.md                 # Automation guide
│   ├── test-execution/               # Test orchestration
│   │   ├── ansible/                  # Ansible playbooks (primary)
│   │   │   ├── inventory/            # Host configurations
│   │   │   ├── playbooks/            # Test execution playbooks
│   │   │   ├── filter_plugins/       # Custom Ansible filters
│   │   │   └── roles/                # Ansible roles
│   │   ├── bash/                     # Bash automation scripts
│   │   │   └── embedding/            # Embedding test scripts
│   │   └── results/                  # Temporary results (gitignored)
│   ├── platform-setup/               # Platform configuration
│   │   └── bash/intel/               # Intel platform setup scripts
│   └── utilities/                    # Helper utilities
│       ├── health-checks/            # Health check scripts
│       └── log-monitoring/           # Log analysis tools
│
├── docs/                             # Documentation
│   ├── docs.md                       # Documentation index
│   ├── methodology/                  # Test methodology
│   │   └── overview.md               # Testing approach and metrics
│   └── platform-setup/               # Platform setup guides
│
├── results/                          # Test results (gitignored)
│   ├── by-suite/                     # Results organized by test suite
│   ├── by-model/                     # Results organized by model
│   ├── by-host/                      # Results organized by test host
│   ├── reports/                      # Generated reports
│   └── metrics/                      # Exported metrics
│
├── utils/                            # Utility scripts and tools
│
└── Configuration Files
    ├── .pre-commit-config.yaml       # Pre-commit hooks configuration
    ├── .yamllint.yaml                # YAML linting rules
    ├── .markdownlint-cli2.yaml       # Markdown linting rules
    └── .gitignore                    # Git ignore patterns
```
Key Directories:
- models/ - Model definitions reused across all test suites
- tests/ - Test suite definitions organized by testing focus
- automation/ - Ansible playbooks and bash scripts for test execution
- docs/ - Comprehensive testing methodology and guides
- results/ - Local test results (not committed to git)
See individual directory README/markdown files for detailed information.
- Docker or Podman - Use either runtime
- Auto-detection - Automatically detects available runtime
- Rootless support - Full Podman rootless compatibility
- Define models once, use across all test phases
- Easy to add new models
- Model matrix for flexible test configuration
- Intel Xeon (Ice Lake, Sapphire Rapids)
- AMD EPYC
- ARM64 (planned)
- Ansible playbooks for platform setup and test execution
- Bash scripts for manual operation
- Docker/Podman Compose for containerized testing
- Distributed testing across multiple nodes
- Concurrent Load: Concurrent load testing
- Scalability: Scalability and sweep testing
- Resource Contention: Resource contention testing (planned)
- ⏱️ Time-based testing - Consistent 10-minute tests across CPU types
- 1️⃣ Single-user baseline - Concurrency=1 for efficiency calculations
- 📊 Variable workloads - Realistic traffic simulation with statistical variance
- 🔄 Prefix caching control - Baseline vs production comparison
- 🎯 3-phase testing - Baseline → Realistic → Production methodology
- 🚀 Large model support - Added gpt-oss-20b (21B MoE) for scalability testing
See 3-Phase Testing Strategy for details.
Configure your system for deterministic performance testing:
```bash
# With Ansible (recommended)
cd automation/platform-setup/ansible
ansible-playbook playbooks/site.yml

# With bash script
cd automation/platform-setup/bash/intel
sudo ./setup-guidellm-platform.sh --apply
```

See Platform Setup Guide for details.
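For a sense of what "deterministic performance" tuning typically involves, here is a hedged sketch; the settings below (frequency governor pinned to `performance`, turbo disabled) are common practice on Intel hosts, not a transcript of `setup-guidellm-platform.sh`, which remains authoritative. The `apply_setting` helper is hypothetical and prints its plan unless `APPLY=1` is set, mirroring the script's `--apply` gate:

```bash
# apply_setting VALUE SYSFS_PATH: write with APPLY=1, otherwise just report.
apply_setting() {
  if [ "${APPLY:-0}" = "1" ]; then
    echo "$1" | sudo tee "$2" > /dev/null
  else
    echo "would write '$1' to $2"
  fi
}

# Pin every core's frequency governor so clock speed stays constant.
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  apply_setting performance "$gov"
done

# Disable turbo (intel_pstate) so results don't vary with thermal headroom.
apply_setting 1 /sys/devices/system/cpu/intel_pstate/no_turbo
```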
Execute performance tests using Ansible or Docker/Podman:
```bash
# Ansible - Run entire test suite
cd automation/test-execution/ansible
ansible-playbook playbooks/run-suite.yml -e "test_suite=concurrent-load"

# Docker/Podman - Run specific test
cd tests/concurrent-load
MODEL_NAME=llama-3.2-1b SCENARIO=concurrent-8 docker compose up
```

See Test Execution Guide for details.
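To sweep every concurrency level via the Docker/Podman path, the `SCENARIO` variable can be driven in a loop. This assumes scenario names follow the `concurrent-N` pattern shown above for the suite's 1–32 levels; the `run_scenario` wrapper is hypothetical and prints the command unless `RUN=1` is set, so the sweep can be reviewed first (run it from `tests/concurrent-load/`):

```bash
# run_scenario MODEL SCENARIO: execute with RUN=1, otherwise print the command.
run_scenario() {
  if [ "${RUN:-0}" = "1" ]; then
    MODEL_NAME="$1" SCENARIO="$2" docker compose up
  else
    echo "MODEL_NAME=$1 SCENARIO=$2 docker compose up"
  fi
}

# Sweep the suite's documented concurrency levels.
for n in 1 2 4 8 16 32; do
  run_scenario llama-3.2-1b "concurrent-$n"
done
```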
Generate reports and compare results:
```bash
cd automation/analysis
python generate-report.py \
  --input ../../results/concurrent-load/ \
  --format html \
  --output ../../results/reports/concurrent-load.html
```

See Reporting Guide for details.
- Getting Started - Quick start guides
- Methodology - Testing methodology and metrics
- Platform Setup - Platform configuration guides
- Containers - Docker/Podman guides
- Ansible - Ansible playbook documentation
- Reference - Schema and CLI reference
Full documentation index: docs/docs.md
⚠️ Validation Status:
- ✅ Concurrent Load - Fully validated and tested
- 🚧 Scalability - Work in progress; no guarantees
- 🚧 Resource Contention - Work in progress; no guarantees
- 🚧 Embedding Models - Work in progress; no guarantees
Only the concurrent load test suite has been fully validated. The other suites are works in progress and are provided as-is, with no guarantee they will run without modification.
Tests model performance under various concurrent request loads.
- Concurrency levels: 1, 2, 4, 8, 16, 32
- 8 LLM models + 2 embedding models
- Focus: P95 latency, TTFT, throughput scaling
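The concurrency-1 baseline makes throughput-scaling efficiency easy to compute. A common definition (an assumption here; the suite's exact formula is in the methodology docs) is throughput at concurrency N divided by N times the single-user throughput, expressed as a percentage:

```bash
# scaling_efficiency THROUGHPUT_AT_N N THROUGHPUT_AT_1 -> efficiency in percent.
scaling_efficiency() {
  awk -v tn="$1" -v n="$2" -v t1="$3" 'BEGIN { printf "%.1f\n", 100 * tn / (n * t1) }'
}

# Example (illustrative numbers): 52 req/s at concurrency 8 vs 7.5 req/s single-user.
scaling_efficiency 52 8 7.5
```

A value near 100% means near-linear scaling; values well below it indicate contention setting in at that concurrency level.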
Characterizes maximum throughput and performance curves.
- Sweep tests for capacity discovery
- Synchronous baseline tests
- Poisson distribution tests
- Focus: Maximum capacity, saturation points
Multi-tenant and resource sharing scenarios.
Current model coverage:
LLM Models (8 total):
- Llama-3.2 (1B, 3B) - Prefill-heavy
- TinyLlama-1.1B - Balanced small-scale
- OPT (125M, 1.3B) - Decode-heavy legacy baseline
- Granite-3.2-2B - Balanced enterprise
- Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced
Embedding Models:
- granite-embedding-english-r2
- granite-embedding-278m-multilingual
See models/models.md for complete model definitions, selection rationale, and how to add new models.
- CPU: Intel Xeon (Ice Lake or newer) or AMD EPYC
- Memory: 64GB+ RAM recommended
- OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
- Storage: 500GB+ for models and results
- Python 3.10+
- Docker 24.0+ or Podman 4.0+
- Ansible 2.14+ (for automation)
- GuideLLM v0.5.0+
- vLLM
See docs/getting-started/quick-start.md for installation instructions.
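A quick preflight for the software prerequisites can save a failed run later. The `check_tool` helper below is hypothetical (not shipped in the repository) and only verifies each tool is on `PATH`; it does not compare versions against the minimums listed above:

```bash
# check_tool NAME: report whether the tool is available on PATH.
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for tool in python3 ansible-playbook; do
  check_tool "$tool"
done

# A container runtime: either one satisfies the requirement.
check_tool docker
check_tool podman
```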
This repository supports both Docker and Podman:
- Docker: Traditional container runtime
- Podman: Daemonless, rootless-capable alternative
- Auto-detection: Automatically uses available runtime
Set runtime preference:
```bash
# Use Docker
export CONTAINER_RUNTIME=docker

# Use Podman
export CONTAINER_RUNTIME=podman

# Auto-detect (default)
export CONTAINER_RUNTIME=auto
```

See Container Guide for details.
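The auto-detect behaviour can be approximated as: honor an explicit `CONTAINER_RUNTIME`, otherwise take whichever runtime is installed. This is a sketch of the idea under that assumption, not the repository's actual detection code (including the docker-before-podman preference, which is assumed):

```bash
# Resolve the container runtime: honor CONTAINER_RUNTIME when set to a concrete
# value, otherwise fall back to whichever runtime is on PATH.
detect_runtime() {
  case "${CONTAINER_RUNTIME:-auto}" in
    docker|podman)
      echo "$CONTAINER_RUNTIME" ;;
    *)
      if command -v docker > /dev/null 2>&1; then
        echo docker
      elif command -v podman > /dev/null 2>&1; then
        echo podman
      else
        echo "no container runtime found" >&2
        return 1
      fi ;;
  esac
}
```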
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run pre-commit checks: `pre-commit run --all-files`
- Submit a pull request
This repository uses pre-commit to ensure code quality.
```bash
# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install
pre-commit install --hook-type commit-msg

# Run manually
pre-commit run --all-files
```

[Add license information]
- Documentation: See docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions