
Commit 2c06b24

test: benchmark scripts (llamastack#3160)
# What does this PR do?

1. Add our own benchmark script instead of locust (locust doesn't support measuring streaming latency well).
2. Simplify the k8s deployment.
3. Add a simple profile script for a locally running server.

## Test Plan

❮ ./run-benchmark.sh --target stack --duration 180 --concurrent 10

============================================================
BENCHMARK RESULTS
============================================================
Total time: 180.00s
Concurrent users: 10
Total requests: 1636
Successful requests: 1636
Failed requests: 0
Success rate: 100.0%
Requests per second: 9.09

Response Time Statistics:
  Mean: 1.095s
  Median: 1.721s
  Min: 0.136s
  Max: 3.218s
  Std Dev: 0.762s

Percentiles:
  P50: 1.721s
  P90: 1.751s
  P95: 1.756s
  P99: 1.796s

Time to First Token (TTFT) Statistics:
  Mean: 0.037s
  Median: 0.037s
  Min: 0.023s
  Max: 0.211s
  Std Dev: 0.011s

TTFT Percentiles:
  P50: 0.037s
  P90: 0.040s
  P95: 0.044s
  P99: 0.055s

Streaming Statistics:
  Mean chunks per response: 64.0
  Total chunks received: 104775
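The streaming-latency gap called out above (locust could not measure time to first token well) comes down to timing the first streamed chunk separately from the full response. A minimal, hypothetical sketch of that measurement, assuming the `openai` Python client and a local OpenAI-compatible endpoint, and not taken from the PR's benchmark.py:

```python
# Sketch: time-to-first-token and chunk count for one streaming request.
# Endpoint URL and model name are placeholders, not values from this PR.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
ttft = None
chunks = 0
stream = client.chat.completions.create(
    model="mock-inference",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if ttft is None:
        # Latency until the first streamed chunk arrives.
        ttft = time.perf_counter() - start
    chunks += 1
total = time.perf_counter() - start

print(f"TTFT: {ttft:.3f}s  total: {total:.3f}s  chunks: {chunks}")
```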
1 parent 2114214 commit 2c06b24

13 files changed: +633 −328 lines


docs/source/contributing/index.md

Lines changed: 5 additions & 0 deletions

@@ -23,6 +23,11 @@ new_vector_database
 ```{include} ../../../tests/README.md
 ```

+## Benchmarking
+
+```{include} ../../../docs/source/distributions/k8s-benchmark/README.md
+```
+
 ### Advanced Topics

 For developers who need deeper understanding of the testing system internals:
docs/source/distributions/k8s-benchmark/README.md

Lines changed: 156 additions & 0 deletions (new file)

# Llama Stack Benchmark Suite on Kubernetes

## Motivation

Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.

### Why This Benchmark Suite Exists

**Performance Validation**: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
- Llama Stack inference (with vLLM backend)
- Direct vLLM inference calls
- Both under identical Kubernetes deployment conditions

**Production Readiness Assessment**: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

**Regression Detection (TODO)**: As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

**Resource Planning**: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
- Kubernetes resource allocation (CPU, memory, GPU)
- Auto-scaling configurations
- Cost optimization strategies

### Key Metrics Captured

The benchmark suite measures critical performance indicators:
- **Throughput**: Requests per second under sustained load
- **Latency Distribution**: P50, P95, P99 response times
- **Time to First Token (TTFT)**: Critical for streaming applications
- **Error Rates**: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.
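The summary statistics reported by the suite (mean, median, std dev, P50/P90/P95/P99) can be reproduced from a list of per-request latencies with the standard library alone. The sketch below is illustrative of those metrics, not the benchmark.py implementation:

```python
# Illustrative summary statistics over per-request latencies (seconds).
import statistics

def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank style percentile over an already-sorted list."""
    idx = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[idx]

def summarize(latencies: list[float]) -> dict[str, float]:
    ordered = sorted(latencies)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "std_dev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }

if __name__ == "__main__":
    # Example: per-request latencies in seconds collected during a run.
    sample = [0.136, 0.9, 1.1, 1.7, 1.72, 1.75, 3.218]
    print(summarize(sample))
```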
## Setup

**1. Deploy base k8s infrastructure:**
```bash
cd ../k8s
./apply.sh
```

**2. Deploy benchmark components:**
```bash
cd ../k8s-benchmark
./apply.sh
```

**3. Verify deployment:**
```bash
kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
```
## Quick Start

### Basic Benchmarks

**Benchmark Llama Stack (default):**
```bash
cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh
```

**Benchmark vLLM direct:**
```bash
./run-benchmark.sh --target vllm
```

### Custom Configuration

**Extended benchmark with high concurrency:**
```bash
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
```

**Short test run:**
```bash
./run-benchmark.sh --target stack --duration 30 --concurrent 5
```
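What `--concurrent` and `--duration` mean in the runs above: a fixed pool of workers issues requests back-to-back until the deadline expires, and throughput is completed requests divided by the duration. This is an assumption about the load model, sketched here with asyncio and a fake request rather than the actual benchmark code:

```python
# Assumed load model: N concurrent workers looping until a shared deadline.
import asyncio
import random
import time

async def fake_request() -> float:
    """Stand-in for one inference call; returns its latency in seconds."""
    latency = random.uniform(0.05, 0.2)
    await asyncio.sleep(latency)
    return latency

async def worker(deadline: float, latencies: list[float]) -> None:
    # Each worker issues requests back-to-back until the deadline passes.
    while time.monotonic() < deadline:
        latencies.append(await fake_request())

async def run(concurrent: int, duration: float) -> None:
    latencies: list[float] = []
    deadline = time.monotonic() + duration
    await asyncio.gather(*(worker(deadline, latencies) for _ in range(concurrent)))
    print(f"{len(latencies)} requests in {duration:.0f}s "
          f"-> {len(latencies) / duration:.2f} req/s")

if __name__ == "__main__":
    asyncio.run(run(concurrent=5, duration=3))
```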
## Command Reference

### run-benchmark.sh Options

```bash
./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>    Target to benchmark (default: stack)
  -d, --duration <seconds>     Duration in seconds (default: 60)
  -c, --concurrent <users>     Number of concurrent users (default: 10)
  -h, --help                   Show help message

Examples:
  ./run-benchmark.sh --target vllm            # Benchmark vLLM direct
  ./run-benchmark.sh --target stack           # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20     # vLLM with 120s, 20 users
```
## Local Testing

### Running Benchmark Locally

For local development without Kubernetes:

**1. Start OpenAI mock server:**
```bash
uv run python openai-mock-server.py --port 8080
```

**2. Run benchmark against mock server:**
```bash
uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5
```

**3. Test against local vLLM server:**
```bash
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5
```

**4. Profile the running server:**
```bash
./profile_running_server.sh
```
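The `benchmark.py` invocations in steps 2 and 3 above rely on four flags: `--base-url`, `--model`, `--duration`, and `--concurrent`. A hypothetical argparse skeleton consistent with those flags (the real script's parser may differ) looks like this:

```python
# Hypothetical CLI skeleton matching the flags shown above; not benchmark.py itself.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Streaming inference benchmark")
    parser.add_argument("--base-url", required=True,
                        help="OpenAI-compatible endpoint, e.g. http://localhost:8080/v1")
    parser.add_argument("--model", required=True,
                        help="Model name to request")
    parser.add_argument("--duration", type=int, default=60,
                        help="Benchmark duration in seconds")
    parser.add_argument("--concurrent", type=int, default=10,
                        help="Number of concurrent workers")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```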
### OpenAI Mock Server

The `openai-mock-server.py` provides:
- **OpenAI-compatible API** for testing without real models
- **Configurable streaming delay** via `STREAM_DELAY_SECONDS` env var
- **Consistent responses** for reproducible benchmarks
- **Lightweight testing** without GPU requirements

**Mock server usage:**
```bash
uv run python openai-mock-server.py --port 8080
```

The mock server is also deployed in k8s as `openai-mock-service:8080` and can be used by changing the Llama Stack configuration to use the `mock-vllm-inference` provider.
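For a sense of what an OpenAI-compatible mock with a configurable streaming delay involves, here is a toy standard-library server that emits SSE-style chunks and reads `STREAM_DELAY_SECONDS` from the environment. It mimics the behavior described above but is not the `openai-mock-server.py` shipped in this directory:

```python
# Toy OpenAI-compatible streaming mock (illustrative only, stdlib-only).
import json
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

STREAM_DELAY_SECONDS = float(os.environ.get("STREAM_DELAY_SECONDS", "0.005"))
WORDS = ["This", "is", "a", "mock", "completion."]

class MockHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        # Drain the request body; the mock ignores its contents.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for word in WORDS:
            # One SSE chunk per word, paced by the configured delay.
            chunk = {"choices": [{"delta": {"content": word + " "}}]}
            self.wfile.write(f"data: {json.dumps(chunk)}\n\n".encode())
            self.wfile.flush()
            time.sleep(STREAM_DELAY_SECONDS)
        self.wfile.write(b"data: [DONE]\n\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MockHandler).serve_forever()
```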
## Files in this Directory

- `benchmark.py` - Core benchmark script with async streaming support
- `run-benchmark.sh` - Main script with target selection and configuration
- `openai-mock-server.py` - Mock OpenAI API server for local testing
- `README.md` - This documentation file

docs/source/distributions/k8s-benchmark/apply.sh

Lines changed: 1 addition & 22 deletions

@@ -8,7 +8,6 @@

 # Deploys the benchmark-specific components on top of the base k8s deployment (../k8s/apply.sh).

-export MOCK_INFERENCE_PORT=8080
 export STREAM_DELAY_SECONDS=0.005

 export POSTGRES_USER=llamastack
@@ -20,38 +19,18 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

 export MOCK_INFERENCE_MODEL=mock-inference

-# Use llama-stack-benchmark-service as the benchmark server
-export LOCUST_HOST=http://llama-stack-benchmark-service:8323
-export LOCUST_BASE_PATH=/v1/openai/v1
-
-# Use vllm-service as the benchmark server
-# export LOCUST_HOST=http://vllm-server:8000
-# export LOCUST_BASE_PATH=/v1
-
+export MOCK_INFERENCE_URL=openai-mock-service:8080

 export BENCHMARK_INFERENCE_MODEL=$INFERENCE_MODEL

 set -euo pipefail
 set -x

 # Deploy benchmark-specific components
-# Deploy OpenAI mock server
-kubectl create configmap openai-mock --from-file=openai-mock-server.py \
-  --dry-run=client -o yaml | kubectl apply --validate=false -f -
-
-envsubst < openai-mock-deployment.yaml | kubectl apply --validate=false -f -
-
-# Create configmap with our custom stack config
 kubectl create configmap llama-stack-config --from-file=stack_run_config.yaml \
   --dry-run=client -o yaml > stack-configmap.yaml

 kubectl apply --validate=false -f stack-configmap.yaml

 # Deploy our custom llama stack server (overriding the base one)
 envsubst < stack-k8s.yaml.template | kubectl apply --validate=false -f -
-
-# Deploy Locust load testing
-kubectl create configmap locust-script --from-file=locustfile.py \
-  --dry-run=client -o yaml | kubectl apply --validate=false -f -
-
-envsubst < locust-k8s.yaml | kubectl apply --validate=false -f -
