redhat-et/vllm-cpu-perf-eval

vLLM CPU Performance Evaluation

Comprehensive performance evaluation framework for vLLM on CPU platforms.

This repository provides a complete testing methodology, automation tools, and platform configurations for evaluating vLLM inference performance on CPU-based systems.

Quick Start

1. Configure Environment

Set up your test infrastructure and credentials:

# Configure test hosts
export DUT_HOSTNAME=your-dut-hostname.compute.amazonaws.com
export LOADGEN_HOSTNAME=your-loadgen-hostname.compute.amazonaws.com
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_SSH_KEY=~/.ssh/your-key.pem

# Configure HuggingFace token for model access
export HF_TOKEN=hf_your_token_here
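Before invoking any playbook, it can be worth failing fast if one of these variables is unset. A minimal sketch (the check itself is illustrative; the variable names match the exports above):

```shell
# Fail fast if any required variable is unset or empty.
check_env() {
  missing=0
  for var in DUT_HOSTNAME LOADGEN_HOSTNAME ANSIBLE_SSH_USER ANSIBLE_SSH_KEY HF_TOKEN; do
    eval "val=\${$var}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return $missing
}

check_env && echo "environment OK" || echo "fix the environment first"
```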

2. Run a Benchmark Test

Execute a single LLM benchmark with auto-configured cores:

cd automation/test-execution/ansible

# Run benchmark against a specific model and workload
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat" \
  -e "requested_cores=16" \
  -e "guidellm_profile=concurrent" \
  -e "guidellm_rate=[1,2,4,8,16,32]" \
  -e "guidellm_max_seconds=600"

# Run with variable workload for realistic traffic simulation
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat_var" \
  -e "requested_cores=16"
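To sweep the same workload across several models, the invocation above can be wrapped in a loop. A sketch, printed as a dry run (the 3B model id is illustrative; see models/models.md for the actual list):

```shell
# Dry-run sketch: print the ansible-playbook command for each model
# rather than executing it; drop the leading `echo` to run for real.
models=(
  "meta-llama/Llama-3.2-1B-Instruct"
  "meta-llama/Llama-3.2-3B-Instruct"   # illustrative model id
)

for model in "${models[@]}"; do
  echo ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
    -e "test_model=${model}" \
    -e "workload_type=chat" \
    -e "requested_cores=16"
done
```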

3. View Results

Results are saved locally:

# Results location
ls results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/

# View HTML report
open results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.html
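As the path above shows, the per-model results directory encodes the HuggingFace model id with `/` replaced by `__`; a one-liner to compute it for scripting:

```shell
# Results directories encode the HuggingFace model id with "/" replaced
# by "__", matching the path shown above.
test_model="meta-llama/Llama-3.2-1B-Instruct"
results_dir="results/llm/${test_model//\//__}"
echo "$results_dir"   # → results/llm/meta-llama__Llama-3.2-1B-Instruct
```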

See Ansible Documentation for detailed instructions and advanced usage.

Repository Structure

vllm-cpu-perf-eval/
├── README.md                           # This file
│
├── models/                             # Centralized model definitions
│   ├── models.md                       # Comprehensive model documentation
│   ├── llm-models/                     # LLM model configurations
│   │   ├── model-matrix.yaml          # LLM model test mappings
│   │   └── llm-models.md              # Redirects to models.md
│   └── embedding-models/               # Embedding model configurations
│       └── model-matrix.yaml          # Embedding model test mappings
│
├── tests/                              # Test suites and scenarios
│   ├── tests.md                        # Test suite overview
│   ├── concurrent-load/                # Test Suite 1: Concurrent load testing
│   │   ├── concurrent-load.md         # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions
│   ├── scalability/                    # Test Suite 2: Scalability testing
│   │   ├── scalability.md             # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions
│   ├── resource-contention/            # Test Suite 3: Resource contention
│   │   ├── resource-contention.md     # Suite documentation
│   │   └── *.yaml                     # Test scenario definitions (planned)
│   └── embedding-models/               # Embedding model test scenarios
│       ├── embedding-models.md        # Embedding test documentation
│       ├── baseline-sweep.yaml        # Baseline performance tests
│       └── latency-concurrent.yaml    # Latency tests
│
├── automation/                         # Automation framework
│   ├── automation.md                   # Automation guide
│   ├── test-execution/                 # Test orchestration
│   │   ├── ansible/                   # Ansible playbooks (primary)
│   │   │   ├── inventory/             # Host configurations
│   │   │   ├── playbooks/             # Test execution playbooks
│   │   │   ├── filter_plugins/        # Custom Ansible filters
│   │   │   └── roles/                 # Ansible roles
│   │   ├── bash/                      # Bash automation scripts
│   │   │   └── embedding/             # Embedding test scripts
│   │   └── results/                   # Temporary results (gitignored)
│   ├── platform-setup/                 # Platform configuration
│   │   └── bash/intel/                # Intel platform setup scripts
│   └── utilities/                      # Helper utilities
│       ├── health-checks/             # Health check scripts
│       └── log-monitoring/            # Log analysis tools
│
├── docs/                               # Documentation
│   ├── docs.md                         # Documentation index
│   ├── methodology/                    # Test methodology
│   │   └── overview.md                # Testing approach and metrics
│   └── platform-setup/                 # Platform setup guides
│
├── results/                            # Test results (gitignored)
│   ├── by-suite/                      # Results organized by test suite
│   ├── by-model/                      # Results organized by model
│   ├── by-host/                       # Results organized by test host
│   ├── reports/                       # Generated reports
│   └── metrics/                       # Exported metrics
│
├── utils/                              # Utility scripts and tools
│
└── Configuration Files
    ├── .pre-commit-config.yaml        # Pre-commit hooks configuration
    ├── .yamllint.yaml                 # YAML linting rules
    ├── .markdownlint-cli2.yaml        # Markdown linting rules
    └── .gitignore                     # Git ignore patterns

Key Directories:

  • models/ - Model definitions reused across all test suites
  • tests/ - Test suite definitions organized by testing focus
  • automation/ - Ansible playbooks and bash scripts for test execution
  • docs/ - Comprehensive testing methodology and guides
  • results/ - Local test results (not committed to git)

See individual directory README/markdown files for detailed information.

Key Features

Flexible Container Runtime Support

  • Docker or Podman - Use either runtime
  • Auto-detection - Automatically detects available runtime
  • Rootless support - Full Podman rootless compatibility

Centralized Model Management

  • Define models once, use across all test phases
  • Easy to add new models
  • Model matrix for flexible test configuration

Multi-Platform Support

  • Intel Xeon (Ice Lake, Sapphire Rapids)
  • AMD EPYC
  • ARM64 (planned)

Comprehensive Automation

  • Ansible playbooks for platform setup and test execution
  • Bash scripts for manual operation
  • Docker/Podman Compose for containerized testing
  • Distributed testing across multiple nodes

Multiple Test Suites

  • Concurrent Load - latency and throughput under concurrent request loads
  • Scalability - capacity sweeps and saturation-point discovery
  • Resource Contention - multi-tenant resource sharing (planned)

Enhanced Concurrent Load Testing

  • ⏱️ Time-based testing - Consistent 10-minute tests across CPU types
  • 1️⃣ Single-user baseline - Concurrency=1 for efficiency calculations
  • 📊 Variable workloads - Realistic traffic simulation with statistical variance
  • 🔄 Prefix caching control - Baseline vs production comparison
  • 🎯 3-phase testing - Baseline → Realistic → Production methodology
  • 🚀 Large model support - Added gpt-oss-20b (21B MoE) for scalability testing

See 3-Phase Testing Strategy for details.
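One plausible mapping of the three phases onto the Quick Start variables, printed as a dry run. This is a hypothetical sketch using only the variables shown earlier; the authoritative flags (including prefix-caching control, which is not shown here) are in the 3-Phase Testing Strategy doc:

```shell
# Hypothetical sketch: print the command for each phase rather than run it.
run() { echo ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml "$@"; }

# Phase 1: single-user baseline (concurrency = 1)
run -e "workload_type=chat" -e "guidellm_rate=[1]"
# Phase 2: realistic traffic with statistical variance
run -e "workload_type=chat_var"
# Phase 3: production-style concurrency sweep
run -e "workload_type=chat_var" -e "guidellm_rate=[1,2,4,8,16,32]"
```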

Testing Workflow

1. Platform Setup

Configure your system for deterministic performance testing:

# With Ansible (recommended)
cd automation/platform-setup/ansible
ansible-playbook playbooks/site.yml

# With bash script
cd automation/platform-setup/bash/intel
sudo ./setup-guidellm-platform.sh --apply

See Platform Setup Guide for details.

2. Run Tests

Execute performance tests using Ansible or Docker/Podman:

# Ansible - Run entire test suite
cd automation/test-execution/ansible
ansible-playbook playbooks/run-suite.yml -e "test_suite=concurrent-load"

# Docker/Podman - Run specific test
cd tests/concurrent-load
MODEL_NAME=llama-3.2-1b SCENARIO=concurrent-8 docker compose up

See Test Execution Guide for details.

3. Analyze Results

Generate reports and compare results:

cd automation/analysis
python generate-report.py \
  --input ../../results/concurrent-load/ \
  --format html \
  --output ../../results/reports/concurrent-load.html

See Reporting Guide for details.

Documentation

Full documentation index: docs/docs.md

Test Suites

⚠️ Validation Status:

  • ✅ Concurrent Load - Fully validated and tested
  • 🚧 Scalability - Work in progress; no guarantees
  • 🚧 Resource Contention - Work in progress; no guarantees
  • 🚧 Embedding Models - Work in progress; no guarantees

Only the concurrent load suite has been fully validated. The other suites are works in progress, provided as-is with no guarantee that they will work without modification.

Test Suite: Concurrent Load

Tests model performance under various concurrent request loads.

  • Concurrency levels: 1, 2, 4, 8, 16, 32
  • 8 LLM models + 2 embedding models
  • Focus: P95 latency, TTFT, throughput scaling

Test Suite: Scalability

Characterizes maximum throughput and performance curves.

  • Sweep tests for capacity discovery
  • Synchronous baseline tests
  • Poisson distribution tests
  • Focus: Maximum capacity, saturation points

Test Suite: Resource Contention (Planned)

Multi-tenant and resource sharing scenarios.

Models

Current model coverage:

LLM Models (8 total):

  • Llama-3.2 (1B, 3B) - Prefill-heavy
  • TinyLlama-1.1B - Balanced small-scale
  • OPT (125M, 1.3B) - Decode-heavy legacy baseline
  • Granite-3.2-2B - Balanced enterprise
  • Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced

Embedding Models:

  • granite-embedding-english-r2
  • granite-embedding-278m-multilingual

See models/models.md for complete model definitions, selection rationale, and how to add new models.

Requirements

System Requirements

  • CPU: Intel Xeon (Ice Lake or newer) or AMD EPYC
  • Memory: 64GB+ RAM recommended
  • OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
  • Storage: 500GB+ for models and results

Software Requirements

  • Python 3.10+
  • Docker 24.0+ or Podman 4.0+
  • Ansible 2.14+ (for automation)
  • GuideLLM v0.5.0+
  • vLLM

See docs/getting-started/quick-start.md for installation instructions.
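A small preflight script can report which of the required tools are on the PATH. This is informational only: it locates binaries but does not verify the version minimums listed above.

```shell
# Report which required tools are on PATH; informational only.
preflight() {
  for tool in python3 docker podman ansible-playbook guidellm; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

preflight
```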

Container Runtime Support

This repository supports both Docker and Podman:

  • Docker: Traditional container runtime
  • Podman: Daemonless, rootless-capable alternative
  • Auto-detection: Automatically uses available runtime

Set runtime preference:

# Use Docker
export CONTAINER_RUNTIME=docker

# Use Podman
export CONTAINER_RUNTIME=podman

# Auto-detect (default)
export CONTAINER_RUNTIME=auto
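A plausible sketch of the auto-detection logic (the repository's actual scripts may differ): an explicit CONTAINER_RUNTIME value wins, while auto, or an unset variable, probes the PATH, preferring Docker.

```shell
# Illustrative runtime resolution: explicit value wins; "auto" probes PATH.
detect_runtime() {
  case "${CONTAINER_RUNTIME:-auto}" in
    docker|podman) echo "$CONTAINER_RUNTIME" ;;
    auto)
      if command -v docker >/dev/null 2>&1; then echo docker
      elif command -v podman >/dev/null 2>&1; then echo podman
      else echo "no container runtime found" >&2; return 1
      fi ;;
    *) echo "unknown CONTAINER_RUNTIME: $CONTAINER_RUNTIME" >&2; return 1 ;;
  esac
}

CONTAINER_RUNTIME=docker
detect_runtime   # → docker
```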

See Container Guide for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run pre-commit checks: pre-commit run --all-files
  5. Submit a pull request

Pre-commit Hooks

This repository uses pre-commit to ensure code quality.

# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install
pre-commit install --hook-type commit-msg

# Run manually
pre-commit run --all-files

License

[Add license information]

Support

Acknowledgments

  • vLLM - High-performance LLM inference engine
  • GuideLLM - LLM benchmarking tool
  • Intel and AMD for CPU optimization guidance
