# Llama Stack Benchmark Suite on Kubernetes

## Motivation

Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.

### Why This Benchmark Suite Exists

**Performance Validation**: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions

**Production Readiness Assessment**: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

**Regression Detection (TODO)**: As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

**Resource Planning**: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
- Kubernetes resource allocation (CPU, memory, GPU)
- Auto-scaling configurations
- Cost optimization strategies

### Key Metrics Captured

The benchmark suite measures critical performance indicators:
- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Error Rates**: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.
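
For intuition, the percentile figures are just order statistics over the recorded per-request latencies. A hypothetical post-processing sketch (assuming a `latencies.txt` file with one latency in milliseconds per line; `benchmark.py` computes these internally):

```bash
# Hypothetical: one latency (ms) per line in latencies.txt.
# Nearest-rank percentile: the value at index ceil(p * N) of the sorted list.
sort -n latencies.txt | awk '
  function pct(p,  i) { i = int(p * NR); if (i < p * NR) i++; return a[i] }
  { a[NR] = $1 }
  END { printf "P50=%sms  P95=%sms  P99=%sms\n", pct(0.50), pct(0.95), pct(0.99) }'
```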

## Setup

**1. Deploy base k8s infrastructure:**
```bash
cd ../k8s
./apply.sh
```

**2. Deploy benchmark components:**
```bash
cd ../k8s-benchmark
./apply.sh
```

**3. Verify deployment:**
```bash
kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
```
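
If the pods are still starting, you can block until they report ready instead of polling `kubectl get pods`. A minimal sketch, assuming the manifests label the benchmark server pods with `app=llama-stack-benchmark` (adjust the selector to whatever labels your deployment actually sets):

```bash
# Wait up to 5 minutes for the benchmark server pods to become Ready.
kubectl wait --for=condition=Ready pod -l app=llama-stack-benchmark --timeout=300s

# If a pod never becomes Ready, inspect its recent logs.
kubectl logs -l app=llama-stack-benchmark --tail=50
```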

## Quick Start

### Basic Benchmarks

**Benchmark Llama Stack (default):**
```bash
cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh
```

**Benchmark vLLM direct:**
```bash
./run-benchmark.sh --target vllm
```

### Custom Configuration

**Extended benchmark with high concurrency:**
```bash
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
```

**Short test run:**
```bash
./run-benchmark.sh --target stack --duration 30 --concurrent 5
```

## Command Reference

### run-benchmark.sh Options

```bash
./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>     Target to benchmark (default: stack)
  -d, --duration <seconds>      Duration in seconds (default: 60)
  -c, --concurrent <users>      Number of concurrent users (default: 10)
  -h, --help                    Show help message

Examples:
  ./run-benchmark.sh --target vllm          # Benchmark vLLM direct
  ./run-benchmark.sh --target stack         # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20   # vLLM with 120s, 20 users
```
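
To compare the two targets under identical settings (the comparison the Motivation section describes), a simple pattern is to run both back to back and capture the output. The loop below uses only the flags documented above; the log file names are arbitrary:

```bash
# Run the same load profile against both targets and keep the reports.
for target in stack vllm; do
  ./run-benchmark.sh --target "$target" --duration 120 --concurrent 20 \
    | tee "benchmark-${target}.log"
done
```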

## Local Testing

### Running Benchmark Locally

For local development without Kubernetes:

**1. Start OpenAI mock server:**
```bash
uv run python openai-mock-server.py --port 8080
```
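
To confirm the mock server is answering before starting a run, you can send a single OpenAI-style request. A quick smoke test, assuming the standard `/v1/chat/completions` route and the `mock-inference` model name used below:

```bash
# One-off request; the mock should return an OpenAI-shaped chat completion.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mock-inference", "messages": [{"role": "user", "content": "Hello"}]}'
```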

**2. Run benchmark against mock server:**
```bash
uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5
```

**3. Test against local vLLM server:**
```bash
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5
```
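
If you don't already have a local server running, one common way to start one is with the vLLM CLI. A sketch, assuming vLLM is installed and you have access to the model weights (flags vary by vLLM version):

```bash
# Serve the model on port 8000 so the benchmark command above can reach it.
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
```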

**4. Profile the running server:**
```bash
./profile_running_server.sh
```

### OpenAI Mock Server

The `openai-mock-server.py` script provides:
- **OpenAI-compatible API** for testing without real models
- **Configurable streaming delay** via the `STREAM_DELAY_SECONDS` env var
- **Consistent responses** for reproducible benchmarks
- **Lightweight testing** without GPU requirements

**Mock server usage:**
```bash
uv run python openai-mock-server.py --port 8080
```
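
To exercise TTFT-sensitive paths, you can slow the simulated token stream via the env var listed above. The `0.1` value here is illustrative; check the script for the exact semantics of the delay:

```bash
# Add an artificial delay between streamed chunks to mimic slower generation.
STREAM_DELAY_SECONDS=0.1 uv run python openai-mock-server.py --port 8080
```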

The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider.
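
To reach the in-cluster mock from your workstation (for example, to run `benchmark.py` locally against it), you can port-forward the service. The service name comes from the deployment above; the local port choice is arbitrary:

```bash
# Forward local port 8080 to the mock service inside the cluster.
kubectl port-forward service/openai-mock-service 8080:8080
```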

## Files in this Directory

- `benchmark.py` - Core benchmark script with async streaming support
- `run-benchmark.sh` - Main script with target selection and configuration
- `openai-mock-server.py` - Mock OpenAI API server for local testing
- `README.md` - This documentation file