```bash
cd vllm-playground
./scripts/start.sh
```

Or manually:

```bash
pip install -r requirements.txt
python3 run.py
```

Open your browser: http://localhost:7860
- Select or enter a model name
- Configure settings (optional)
- Click "Start Server"
- Wait for "Application startup complete" in logs
- Start chatting!
Model: facebook/opt-125m
Tensor Parallel: 1
GPU Memory: 50%
Perfect for quick testing and development.
Model: meta-llama/Llama-2-7b-chat-hf
Tensor Parallel: 1
GPU Memory: 90%
Enable Prefix Caching: ✓
Model: meta-llama/Llama-2-13b-chat-hf
Tensor Parallel: 2 (or 4)
GPU Memory: 90%
Enable Prefix Caching: ✓
| Setting | Description | Default |
|---|---|---|
| Model | HuggingFace model name or path | facebook/opt-125m |
| Host | Server bind address | 0.0.0.0 |
| Port | Server port | 8000 |
| Tensor Parallel Size | Number of GPUs for parallelism | 1 |
| GPU Memory Utilization | GPU memory fraction (0.0-1.0) | 0.9 |
| Data Type | Model precision | auto |
| Max Model Length | Override max sequence length | auto |
| Option | Description | When to Use |
|---|---|---|
| Trust Remote Code | Execute model code | Models with custom code |
| Enable Prefix Caching | Reuse KV cache | Repeated prompts |
| Disable Log Stats | Skip periodic stats | Cleaner logs |
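The settings in the two tables above map closely onto the flags of vLLM's OpenAI-compatible server. As an illustration only (this helper is a sketch, not code from this repo; the flag names follow vLLM's CLI), the mapping might look like:

```python
def build_vllm_command(cfg):
    """Sketch: translate WebUI settings into vLLM OpenAI-server CLI flags.

    `cfg` keys mirror the configuration panel; missing keys fall back to
    the documented defaults. This function is illustrative, not the
    WebUI's actual implementation.
    """
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", cfg.get("model", "facebook/opt-125m"),
        "--host", cfg.get("host", "0.0.0.0"),
        "--port", str(cfg.get("port", 8000)),
        "--tensor-parallel-size", str(cfg.get("tensor_parallel_size", 1)),
        "--gpu-memory-utilization", str(cfg.get("gpu_memory_utilization", 0.9)),
        "--dtype", cfg.get("dtype", "auto"),
    ]
    # Optional settings only appear when explicitly set.
    if cfg.get("max_model_len"):
        cmd += ["--max-model-len", str(cfg["max_model_len"])]
    if cfg.get("trust_remote_code"):
        cmd.append("--trust-remote-code")
    if cfg.get("enable_prefix_caching"):
        cmd.append("--enable-prefix-caching")
    if cfg.get("disable_log_stats"):
        cmd.append("--disable-log-stats")
    return cmd
```

For example, the 13B recipe above (tensor parallel 2, prefix caching on) would add `--tensor-parallel-size 2 --enable-prefix-caching` to the launch command.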
| Parameter | Range | Description |
|---|---|---|
| Temperature | 0.0 - 2.0 | Higher = more random |
| Max Tokens | 1 - 4096 | Response length limit |
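These two parameters are carried in the body of each chat request. A minimal sketch of an OpenAI-style chat payload (vLLM serves an OpenAI-compatible API; the exact payload this WebUI sends may differ):

```python
import json

# Sketch: chat request body in the OpenAI-compatible format vLLM accepts.
payload = {
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,   # 0.0-2.0; higher = more random (default 0.7)
    "max_tokens": 256,    # upper bound on response length (1-4096 here)
}
body = json.dumps(payload)
```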
| Shortcut | Action |
|---|---|
| Ctrl/Cmd + Enter | Send message |
- Check model name spelling
- Ensure you have access (log in with `huggingface-cli login`)
- Try a different model first
- Reduce GPU memory utilization
- Use a smaller model
- Increase tensor parallel size
- Set max model length
- Check if port 8000 is available
- Look for errors in logs panel
- Verify CUDA/GPU availability
- Check if WebUI is still running
- Refresh the page
- Check browser console for errors
Begin with facebook/opt-125m to verify everything works.
Watch the log panel during startup for any warnings or errors.
- Start with 70-80% if unsure
- Increase gradually based on available memory
- Leave headroom for other processes
- 0.0-0.3: Focused, deterministic
- 0.4-0.8: Balanced (default: 0.7)
- 0.9-1.5: Creative, varied
- 1.5+: Very random (use with caution)
Enable for:
- System prompts
- Few-shot examples
- Repeated context
- Clear chat between different tasks
- Adjust temperature per use case
- Start with lower max tokens for faster responses
Useful commands:

```bash
nvidia-smi                    # check GPU status
watch -n 1 nvidia-smi         # monitor GPU usage live
huggingface-cli scan-cache    # list downloaded models
huggingface-cli login         # authenticate with HuggingFace
```

The WebUI exposes these endpoints:
- `GET /` - Main interface
- `GET /api/status` - Server status
- `POST /api/start` - Start vLLM server
- `POST /api/stop` - Stop vLLM server
- `POST /api/chat` - Send chat message
- `GET /api/models` - List common models
- `WS /ws/logs` - Log streaming WebSocket
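A minimal client sketch for the JSON endpoints. The request and response schemas are not documented here, so treat this as an assumption-laden illustration; `BASE` assumes the default WebUI port:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumed default WebUI address

def api(path, payload=None):
    """Sketch: call a WebUI endpoint (GET when no payload, POST otherwise).

    Response schemas are undocumented here, so the parsed JSON is
    returned as-is. Requires the WebUI to be running.
    """
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        BASE + path, data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the WebUI running):
#   api("/api/status")
#   api("/api/start", {"model": "facebook/opt-125m"})
```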
| Indicator | Meaning |
|---|---|
| 🔴 Disconnected | WebUI not connected |
| 🟢 Connected | WebUI ready |
| 🔵 Server Running | vLLM server active |
To change the WebUI port:

```bash
WEBUI_PORT=8080 python3 run.py
```

To change the vLLM server port, update the "Port" field in the configuration panel.
- WebUI runs on port 7860 (configurable)
- vLLM server runs on port 8000 (configurable)
- Logs are color-coded: Info (blue), Warning (yellow), Error (red)
- Chat history is maintained per session
- Server stops when WebUI is closed
- Check the logs panel for error messages
- Review this guide
- Consult vLLM documentation
- Open an issue on GitHub
Happy chatting! 🚀