Skip to content

Latest commit

 

History

History
390 lines (315 loc) · 9.9 KB

File metadata and controls

390 lines (315 loc) · 9.9 KB

📊 Performance Metrics & Benchmarking

Overview

The vLLM Playground now includes a built-in performance benchmarking tool that measures key metrics of your vLLM server. This feature uses a custom load testing implementation to provide comprehensive performance insights.

Location

The Performance Metrics section is located below the main chat interface in a full-width panel.

Features

🎯 Configurable Benchmark Parameters

Total Requests: Number of requests to send (10-1000)

  • Default: 100 requests
  • Higher values give more accurate results

Request Rate: Requests per second (1-50)

  • Default: 5 req/s
  • Controls the load intensity

Prompt Tokens: Input length in tokens (10-2048)

  • Default: 100 tokens
  • Simulates typical input size

Output Tokens: Response length in tokens (10-2048)

  • Default: 100 tokens
  • Simulates typical output size

📈 Key Metrics Displayed

The benchmark measures and displays 8 key performance indicators:

1. Throughput 🚀

  • What: Requests completed per second
  • Unit: req/s
  • Importance: Overall system capacity
  • Good Range: Depends on model size and hardware

2. Average Latency ⏱️

  • What: Mean response time across all requests
  • Unit: milliseconds (ms)
  • Importance: Typical user experience
  • Good Range: < 500ms for small models, < 2000ms for large models

3. Token Throughput 📝

  • What: Output tokens generated per second
  • Unit: tok/s
  • Importance: Generation speed
  • Good Range: 50-500+ tok/s depending on GPU

4. P50 Latency 📊

  • What: 50th percentile latency (median)
  • Unit: ms
  • Importance: Typical performance
  • Good Range: Similar to average latency

5. P95 Latency 📈

  • What: 95th percentile latency
  • Unit: ms
  • Importance: Worst-case for most users
  • Good Range: < 2x average latency

6. P99 Latency

  • What: 99th percentile latency
  • Unit: ms
  • Importance: Worst-case scenarios
  • Good Range: < 3x average latency

7. Total Tokens 💬

  • What: Combined input + output tokens processed
  • Unit: token count
  • Importance: Total workload completed

8. Success Rate

  • What: Percentage of successful requests
  • Unit: percentage (%)
  • Importance: System reliability
  • Good Range: > 99%

How to Use

Step 1: Start vLLM Server

1. Configure your model in the left panel
2. Click "Start Server"
3. Wait for server to be ready (check logs)

Step 2: Configure Benchmark

1. Set Total Requests (e.g., 100 for quick test, 500+ for accurate results)
2. Set Request Rate (e.g., 5 req/s for moderate load, 20+ for stress test)
3. Set Prompt/Output tokens to match your use case

Step 3: Run Benchmark

1. Click "▶️ Run Benchmark"
2. Watch progress bar and logs
3. Wait for completion (time = requests / rate)

Step 4: Analyze Results

1. Review all 8 metrics
2. Compare with expectations
3. Adjust server config if needed
4. Re-run to verify improvements

Benchmark Configurations

Quick Test (Fast validation)

Total Requests: 50
Request Rate: 10 req/s
Prompt Tokens: 100
Output Tokens: 100
Duration: ~5 seconds

Standard Test (Typical workload)

Total Requests: 100
Request Rate: 5 req/s
Prompt Tokens: 100
Output Tokens: 100
Duration: ~20 seconds

Stress Test (Maximum load)

Total Requests: 500
Request Rate: 20 req/s
Prompt Tokens: 200
Output Tokens: 200
Duration: ~25 seconds

Production Simulation (Real-world)

Total Requests: 1000
Request Rate: 10 req/s
Prompt Tokens: 150
Output Tokens: 150
Duration: ~100 seconds

UI Components

Benchmark Configuration Panel

Located at the top of the metrics section:

┌────────────────────────────────────────────────┐
│  Total Requests | Request Rate | Prompt | Output│
│      [100]      |     [5]      | [100]  | [100] │
└────────────────────────────────────────────────┘

Progress Indicator

Shows real-time progress during benchmark:

┌────────────────────────────────────────────────┐
│ Running benchmark...              [████░░] 80% │
└────────────────────────────────────────────────┘

Metrics Grid (8 Cards)

Displays results in a responsive grid:

┌────────┬────────┬────────┬────────┐
│   🚀   │   ⏱️   │   📝   │   📊   │
│  5.2   │  245   │  312   │  198   │
│ req/s  │   ms   │ tok/s  │   ms   │
└────────┴────────┴────────┴────────┘

Backend Implementation

API Endpoints

POST /api/benchmark/start

  • Starts a benchmark with given configuration
  • Returns immediately, benchmark runs in background
  • Logs progress to WebSocket stream

GET /api/benchmark/status

  • Returns current benchmark status
  • Includes results if completed
  • Polled every second by frontend

POST /api/benchmark/stop

  • Stops a running benchmark
  • Cleans up resources
  • Returns immediately

Benchmark Algorithm

1. Generate sample prompt of specified length
2. Create HTTP session
3. For each request:
   - Send POST to /v1/chat/completions
   - Measure latency
   - Collect token counts
   - Rate limit between requests
   - Update progress every 10%
4. Calculate statistics:
   - Mean, percentiles (50, 95, 99)
   - Throughput, token rates
   - Success rate
5. Return comprehensive results

Performance Tips

🎯 For Accurate Results

  • Run for at least 100 requests
  • Use representative prompt/output sizes
  • Test during idle periods
  • Run multiple times and average
  • Clear GPU cache between runs

⚡ For Maximum Throughput

  • Increase request rate gradually
  • Monitor GPU utilization
  • Enable tensor parallelism
  • Use prefix caching
  • Optimize model parameters

📊 For Latency Testing

  • Use moderate request rate (< 10)
  • Test various input lengths
  • Check P95/P99 for outliers
  • Compare different configurations
  • Identify bottlenecks

Interpreting Results

Good Performance Indicators

✅ Success rate > 99% ✅ P95 latency < 2x average ✅ Consistent throughput ✅ Linear scaling with rate ✅ No errors in logs

Warning Signs

⚠️ Success rate < 95% ⚠️ P99 >> P95 (high variance) ⚠️ Throughput plateaus early ⚠️ Increasing latency over time ⚠️ GPU memory errors

Optimization Strategies

If throughput is low:

  • Increase tensor parallel size
  • Enable continuous batching
  • Reduce model precision (dtype)
  • Increase GPU memory allocation

If latency is high:

  • Reduce batch size
  • Decrease request rate
  • Check GPU utilization
  • Verify network latency

If success rate is low:

  • Check server logs for errors
  • Reduce request rate
  • Increase timeout values
  • Verify model is loaded

Comparison with GuideLLM

While this implementation doesn't use GuideLLM directly, it provides similar functionality:

Our Implementation

✅ Built-in, no installation needed ✅ Real-time progress updates ✅ Integrated with WebUI ✅ Simple configuration ✅ Logs streamed to UI

GuideLLM

✅ More advanced features ✅ Multiple backends ✅ Detailed reports ✅ CLI interface ✅ Batch testing

Use Our Tool When:

  • Quick performance checks
  • Integrated workflow
  • Visual feedback needed
  • GUI preferred
  • Rapid iteration

Use GuideLLM When:

  • Detailed analysis needed
  • Comparing multiple systems
  • Production benchmarking
  • Automated testing
  • Report generation

Technical Details

Files Modified

  1. index.html - Added metrics section HTML
  2. style.css - Added ~250 lines of styling
  3. app.py - Added benchmark endpoints (~150 lines)
  4. app.js - Added benchmark logic (~150 lines)
  5. requirements.txt - Added numpy dependency

Dependencies

  • numpy - For percentile calculations
  • aiohttp - For async HTTP requests
  • asyncio - For concurrent execution

Performance Impact

  • Memory: < 50MB during benchmark
  • CPU: Minimal (< 10%)
  • Network: Depends on request rate
  • GPU: No impact on WebUI (only vLLM server)

Example Results

Small Model (opt-125m) on A100

Throughput:       12.5 req/s
Avg Latency:      78 ms
Token Throughput: 450 tok/s
P50 Latency:      75 ms
P95 Latency:      95 ms
P99 Latency:      112 ms
Success Rate:     100 %

Large Model (Llama-2-7b) on A100

Throughput:       4.2 req/s
Avg Latency:      235 ms
Token Throughput: 185 tok/s
P50 Latency:      228 ms
P95 Latency:      298 ms
P99 Latency:      345 ms
Success Rate:     100 %

Troubleshooting

Benchmark Won't Start

  • Check: Is vLLM server running?
  • Check: Are parameters valid?
  • Solution: Start server first, verify logs

Benchmark Fails Immediately

  • Check: Server logs for errors
  • Check: Network connectivity
  • Solution: Reduce request rate, check server

Results Seem Wrong

  • Check: Server was idle during test
  • Check: No other clients connected
  • Solution: Re-run benchmark, increase sample size

Progress Stuck

  • Check: Server is responding
  • Check: No timeout errors
  • Solution: Stop and restart benchmark

Future Enhancements

Potential improvements:

  • Export results to CSV/JSON
  • Historical result comparison
  • Charting and visualization
  • Custom test prompts
  • Concurrent request patterns
  • Real-time GPU metrics
  • Batch size optimization
  • Cost per token calculation

Benchmark responsibly! Start with small tests and increase gradually.

Happy benchmarking! 🚀📊