The vLLM Playground now includes a built-in performance benchmarking tool that measures key metrics of your vLLM server. This feature uses a custom load testing implementation to provide comprehensive performance insights.
The Performance Metrics section is located below the main chat interface in a full-width panel.
Total Requests: Number of requests to send (10-1000)
- Default: 100 requests
- Higher values give more accurate results
Request Rate: Requests per second (1-50)
- Default: 5 req/s
- Controls the load intensity
Prompt Tokens: Input length in tokens (10-2048)
- Default: 100 tokens
- Simulates typical input size
Output Tokens: Response length in tokens (10-2048)
- Default: 100 tokens
- Simulates typical output size
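Taken together, the four parameters amount to a small configuration object. The sketch below is illustrative only; the field names are assumptions, not the playground's actual schema:

```python
# Illustrative benchmark configuration (field names are assumptions,
# not the playground's actual API schema).
benchmark_config = {
    "total_requests": 100,  # 10-1000, default 100
    "request_rate": 5,      # 1-50 req/s, default 5
    "prompt_tokens": 100,   # 10-2048, default 100
    "output_tokens": 100,   # 10-2048, default 100
}
print(benchmark_config)
```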
The benchmark measures and displays 8 key performance indicators:
Throughput
- What: Requests completed per second
- Unit: req/s
- Importance: Overall system capacity
- Good Range: Depends on model size and hardware

Average Latency
- What: Mean response time across all requests
- Unit: milliseconds (ms)
- Importance: Typical user experience
- Good Range: < 500 ms for small models, < 2000 ms for large models

Token Throughput
- What: Output tokens generated per second
- Unit: tok/s
- Importance: Generation speed
- Good Range: 50-500+ tok/s depending on GPU

P50 Latency
- What: 50th percentile latency (median)
- Unit: ms
- Importance: Typical performance
- Good Range: Similar to average latency

P95 Latency
- What: 95th percentile latency
- Unit: ms
- Importance: Worst-case for most users
- Good Range: < 2x average latency

P99 Latency
- What: 99th percentile latency
- Unit: ms
- Importance: Worst-case scenarios
- Good Range: < 3x average latency

Total Tokens
- What: Combined input + output tokens processed
- Unit: tokens
- Importance: Total workload completed

Success Rate
- What: Percentage of successful requests
- Unit: percentage (%)
- Importance: System reliability
- Good Range: > 99%
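The latency metrics above are plain order statistics over the per-request latencies. Assuming a list of measured latencies, they can be computed with numpy (which the dependencies section lists for exactly this purpose); the sample values here are illustrative, not real benchmark output:

```python
import numpy as np

# Illustrative per-request latencies in ms (not real benchmark output).
latencies_ms = [210.0, 225.0, 228.0, 231.0, 240.0, 252.0, 260.0, 275.0, 298.0, 345.0]
successes, total = 10, 10

avg_latency = float(np.mean(latencies_ms))
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
success_rate = 100.0 * successes / total

print(f"Avg: {avg_latency:.1f} ms, P50: {p50:.1f} ms, "
      f"P95: {p95:.1f} ms, P99: {p99:.1f} ms, Success: {success_rate:.0f}%")
```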
Start the server:
1. Configure your model in the left panel
2. Click "Start Server"
3. Wait for the server to be ready (check logs)

Configure the benchmark:
1. Set Total Requests (e.g., 100 for a quick test, 500+ for accurate results)
2. Set Request Rate (e.g., 5 req/s for moderate load, 20+ for a stress test)
3. Set Prompt/Output tokens to match your use case

Run the benchmark:
1. Click "▶️ Run Benchmark"
2. Watch the progress bar and logs
3. Wait for completion (time ≈ requests / rate)

Analyze the results:
1. Review all 8 metrics
2. Compare with expectations
3. Adjust the server config if needed
4. Re-run to verify improvements
Quick test:
Total Requests: 50
Request Rate: 10 req/s
Prompt Tokens: 100
Output Tokens: 100
Duration: ~5 seconds

Standard test:
Total Requests: 100
Request Rate: 5 req/s
Prompt Tokens: 100
Output Tokens: 100
Duration: ~20 seconds

Stress test:
Total Requests: 500
Request Rate: 20 req/s
Prompt Tokens: 200
Output Tokens: 200
Duration: ~25 seconds

Accuracy test:
Total Requests: 1000
Request Rate: 10 req/s
Prompt Tokens: 150
Output Tokens: 150
Duration: ~100 seconds
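The durations in these presets follow from a simple division; a quick sanity check, assuming requests are paced purely by the configured rate:

```python
def estimated_duration_s(total_requests: int, request_rate: float) -> float:
    """Approximate wall-clock time: total requests divided by request rate.
    Ignores the tail latency of the final in-flight responses."""
    return total_requests / request_rate

# The four preset configurations above:
for reqs, rate in [(50, 10), (100, 5), (500, 20), (1000, 10)]:
    print(f"{reqs} req @ {rate} req/s -> ~{estimated_duration_s(reqs, rate):.0f} s")
```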
Located at the top of the metrics section:
┌────────────────────────────────────────────────┐
│ Total Requests | Request Rate | Prompt | Output│
│ [100] | [5] | [100] | [100] │
└────────────────────────────────────────────────┘
Shows real-time progress during benchmark:
┌────────────────────────────────────────────────┐
│ Running benchmark... [████░░] 80% │
└────────────────────────────────────────────────┘
Displays results in a responsive grid:
┌────────┬────────┬────────┬────────┐
│ 🚀 │ ⏱️ │ 📝 │ 📊 │
│ 5.2 │ 245 │ 312 │ 198 │
│ req/s │ ms │ tok/s │ ms │
└────────┴────────┴────────┴────────┘
POST /api/benchmark/start
- Starts a benchmark with given configuration
- Returns immediately, benchmark runs in background
- Logs progress to WebSocket stream
GET /api/benchmark/status
- Returns current benchmark status
- Includes results if completed
- Polled every second by frontend
POST /api/benchmark/stop
- Stops a running benchmark
- Cleans up resources
- Returns immediately
1. Generate a sample prompt of the specified length
2. Create HTTP session
3. For each request:
- Send POST to /v1/chat/completions
- Measure latency
- Collect token counts
- Rate limit between requests
- Update progress every 10%
4. Calculate statistics:
- Mean, percentiles (50, 95, 99)
- Throughput, token rates
- Success rate
5. Return comprehensive results
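The loop above can be sketched in plain asyncio. This is a simplified illustration, not the playground's actual implementation: `send_request` is a stub that sleeps instead of POSTing to `/v1/chat/completions` (a real version would use aiohttp, per the dependencies section), and statistics are computed with numpy as in step 4:

```python
import asyncio
import random
import time

import numpy as np

async def send_request(prompt: str) -> float:
    """Stand-in for a POST to /v1/chat/completions; returns latency in ms.
    Here it just sleeps a random amount -- a real version would use aiohttp."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.01, 0.03))  # simulated server latency
    return (time.perf_counter() - start) * 1000

async def run_benchmark(total_requests: int, request_rate: float) -> dict:
    prompt = "word " * 100         # sample prompt of roughly the configured length
    interval = 1.0 / request_rate  # rate limiting between requests
    latencies = []
    for i in range(total_requests):
        latencies.append(await send_request(prompt))
        if (i + 1) % max(1, total_requests // 10) == 0:
            print(f"progress: {100 * (i + 1) // total_requests}%")
        await asyncio.sleep(interval)
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    return {
        "avg_latency_ms": float(np.mean(latencies)),
        "p50_ms": float(p50), "p95_ms": float(p95), "p99_ms": float(p99),
        "success_rate": 100.0,  # the stub never fails; real code counts errors
    }

results = asyncio.run(run_benchmark(total_requests=20, request_rate=50))
print(results)
```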
- Run for at least 100 requests
- Use representative prompt/output sizes
- Test during idle periods
- Run multiple times and average
- Clear GPU cache between runs
- Increase request rate gradually
- Monitor GPU utilization
- Enable tensor parallelism
- Use prefix caching
- Optimize model parameters
- Use moderate request rate (< 10)
- Test various input lengths
- Check P95/P99 for outliers
- Compare different configurations
- Identify bottlenecks
✅ Success rate > 99%
✅ P95 latency < 2x average
✅ Consistent throughput
✅ Linear scaling with rate
✅ No errors in logs
If throughput is low:
- Increase tensor parallel size
- Enable continuous batching
- Reduce model precision (dtype)
- Increase GPU memory allocation
If latency is high:
- Reduce batch size
- Decrease request rate
- Check GPU utilization
- Verify network latency
If success rate is low:
- Check server logs for errors
- Reduce request rate
- Increase timeout values
- Verify model is loaded
While this implementation doesn't use GuideLLM directly, it provides similar functionality.

The built-in tool offers:
✅ No installation needed
✅ Real-time progress updates
✅ Integration with the WebUI
✅ Simple configuration
✅ Logs streamed to the UI

GuideLLM offers:
✅ More advanced features
✅ Multiple backends
✅ Detailed reports
✅ A CLI interface
✅ Batch testing
Use the built-in tool when:
- Doing quick performance checks
- An integrated workflow matters
- Visual feedback is needed
- A GUI is preferred
- Iterating rapidly

Use GuideLLM when:
- Detailed analysis is needed
- Comparing multiple systems
- Benchmarking for production
- Automating testing
- Generating reports
- index.html - Added metrics section HTML
- style.css - Added ~250 lines of styling
- app.py - Added benchmark endpoints (~150 lines)
- app.js - Added benchmark logic (~150 lines)
- requirements.txt - Added numpy dependency
- numpy - For percentile calculations
- aiohttp - For async HTTP requests
- asyncio - For concurrent execution
- Memory: < 50MB during benchmark
- CPU: Minimal (< 10%)
- Network: Depends on request rate
- GPU: No impact on WebUI (only vLLM server)
Throughput: 12.5 req/s
Avg Latency: 78 ms
Token Throughput: 450 tok/s
P50 Latency: 75 ms
P95 Latency: 95 ms
P99 Latency: 112 ms
Success Rate: 100 %
Throughput: 4.2 req/s
Avg Latency: 235 ms
Token Throughput: 185 tok/s
P50 Latency: 228 ms
P95 Latency: 298 ms
P99 Latency: 345 ms
Success Rate: 100 %
Benchmark won't start:
- Check: Is the vLLM server running?
- Check: Are the parameters valid?
- Solution: Start the server first, verify logs

Requests are failing:
- Check: Server logs for errors
- Check: Network connectivity
- Solution: Reduce request rate, check the server

Results are inconsistent:
- Check: Server was idle during the test
- Check: No other clients connected
- Solution: Re-run the benchmark, increase the sample size

Benchmark appears stuck:
- Check: Server is responding
- Check: No timeout errors
- Solution: Stop and restart the benchmark
Potential improvements:
- Export results to CSV/JSON
- Historical result comparison
- Charting and visualization
- Custom test prompts
- Concurrent request patterns
- Real-time GPU metrics
- Batch size optimization
- Cost per token calculation
Benchmark responsibly! Start with small tests and increase gradually.
Happy benchmarking! 🚀📊