A beautiful, dark-themed web UI featuring:
- Responsive Design: Works on desktop, tablet, and mobile
- Real-time Updates: WebSocket-powered live logs
- Smooth Animations: Polished user experience
- Intuitive Layout: Three-panel design for easy navigation
- One-Click Server Control: Start/stop vLLM servers instantly
- Full Configuration: All vLLM parameters accessible
- Status Monitoring: Real-time server status and uptime
- Multiple Models: Easy switching between different models
- Test Your Models: Chat directly with your vLLM server
- Conversation History: Maintains context across messages
- Adjustable Parameters: Temperature and max tokens sliders
- Beautiful Message UI: Clear distinction between user/assistant messages
- Real-time Streaming: See logs as they happen
- Color-Coded: Different colors for info/warning/error
- Auto-scroll: Option to follow newest logs
- Searchable: Easy to find specific log entries
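
The live log viewer is driven by the `/ws/logs` WebSocket (listed in the API table further down). As a rough sketch, the same stream can be tailed from a script; the WebUI address and the message format below are assumptions, so adjust them to your setup:

```python
# Minimal sketch: follow the WebUI's log stream outside the browser.
# Assumes the backend is reachable at localhost:7860 and that each
# WebSocket message is one log entry (plain text or JSON).
import asyncio
import websockets  # pip install websockets

async def follow_logs(url: str = "ws://localhost:7860/ws/logs") -> None:
    async with websockets.connect(url) as ws:
        async for message in ws:  # one broadcast log entry per message
            print(message)

if __name__ == "__main__":
    asyncio.run(follow_logs())
```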
┌─────────────────────────────────────────────────────────┐
│                       Browser UI                         │
│   ┌─────────────┬──────────────┬──────────────┐          │
│   │   Config    │     Chat     │     Logs     │          │
│   │    Panel    │  Interface   │    Viewer    │          │
│   └─────────────┴──────────────┴──────────────┘          │
└────────────────────┬────────────────────────────────────┘
                     │ WebSocket + REST API
┌────────────────────▼────────────────────────────────────┐
│                FastAPI Backend (app.py)                  │
│  ┌──────────────────────────────────────────────────┐   │
│  │  • Server Management                             │   │
│  │  • Process Control                               │   │
│  │  • Log Broadcasting                              │   │
│  │  • Chat Proxy                                    │   │
│  └──────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────┘
                     │ subprocess
┌────────────────────▼────────────────────────────────────┐
│                   vLLM Server Process                    │
│          (OpenAI-compatible API on port 8000)            │
└─────────────────────────────────────────────────────────┘
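
The backend talks to vLLM purely through a child process, which keeps the stack simple. A minimal sketch of that spawn step is below; it uses vLLM's standard OpenAI-compatible entrypoint, but the exact command and flags assembled in `app.py` may differ:

```python
# Sketch of how the backend could launch vLLM (the real logic lives in app.py).
# Flag names are standard vLLM OpenAI-server options; values are examples only.
import subprocess

def start_vllm(model: str, port: int = 8000) -> subprocess.Popen:
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.8",  # ~80% of VRAM (see Tips below)
        "--dtype", "auto",
    ]
    # Merge stderr into stdout so a single pipe can feed the log broadcaster.
    return subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
```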
webui/
├── 📄 app.py                # FastAPI backend server
├── 🌐 index.html            # Main UI interface
├── 🚀 run.py                # Launcher script
├── 📜 start.sh              # Quick start bash script
├── 📦 requirements.txt      # Python dependencies
├── 📖 README.md             # Full documentation
├── 📝 QUICKSTART.md         # Quick reference guide
├── ⚙️ example_configs.json  # Example configurations
├── 🙈 .gitignore            # Git ignore rules
└── 📁 static/
    ├── css/
    │   └── style.css        # Modern dark theme
    └── js/
        └── app.js           # Frontend logic
- Model Selection Dropdown: Popular models pre-loaded
- Custom Model Input: Support for any HuggingFace model
- Server Settings: Host, port, tensor parallelism
- GPU Configuration: Memory utilization slider
- Data Type Selection: auto/float16/bfloat16/float32
- Advanced Options: Trust remote code, prefix caching
- Start/Stop Buttons: Clear visual state
- Chat History Display: Scrollable conversation view
- Message Input: Multi-line textarea with Ctrl+Enter
- Generation Parameters:
  - Temperature slider (0.0 - 2.0)
  - Max tokens slider (1 - 4096)
- Clear Chat Button: Start fresh conversations
- Status Indicators: Shows when server is ready
- Real-time Updates: WebSocket streaming
- Color-Coded Logs:
  - 🔵 Blue: Information
  - 🟡 Yellow: Warnings
  - 🔴 Red: Errors
  - 🟢 Green: Success
- Auto-scroll Toggle: Follow or stay in place
- Clear Logs Button: Clean up the view
- Timestamp: Each log entry timestamped
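
The configuration panel options listed above end up in the request sent to `POST /api/start`. The field names in this sketch are illustrative guesses modeled on the panel labels and vLLM's own flags, and the WebUI address is an assumption; check `app.py` for the exact schema:

```python
# Hypothetical /api/start payload; verify field names against app.py.
import requests

config = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # any HuggingFace model ID
    "host": "0.0.0.0",
    "port": 8000,
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "dtype": "auto",                  # auto / float16 / bfloat16 / float32
    "trust_remote_code": False,
    "enable_prefix_caching": True,
}

resp = requests.post("http://localhost:7860/api/start", json=config)  # assumed WebUI address
print(resp.json())
```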
| Method | Endpoint    | Purpose              |
|--------|-------------|----------------------|
| GET    | /           | Serve main UI        |
| GET    | /api/status | Get server status    |
| POST   | /api/start  | Start vLLM server    |
| POST   | /api/stop   | Stop vLLM server     |
| POST   | /api/chat   | Send chat message    |
| GET    | /api/models | List common models   |
| WS     | /ws/logs    | Log stream WebSocket |
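
As a quick way to exercise these endpoints from a script, the sketch below polls the status endpoint and sends one chat turn. The payload shape mirrors the UI's sliders and chat box, but the exact request schema and the WebUI address are assumptions to confirm against `app.py`:

```python
# Hypothetical API client; adjust host/port and payload keys to match app.py.
import requests

BASE = "http://localhost:7860"  # assumed WebUI address

print(requests.get(f"{BASE}/api/status").json())  # server state and uptime

reply = requests.post(
    f"{BASE}/api/chat",
    json={
        "messages": [{"role": "user", "content": "Hello! Which model are you?"}],
        "temperature": 0.7,  # matches the 0.0 - 2.0 slider
        "max_tokens": 256,   # matches the 1 - 4096 slider
    },
)
print(reply.json())
```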
- Quickly spin up models for testing
- Test different configurations
- Debug issues with live logs
- Prototype chat applications
- Compare different models easily
- Test with various parameters
- Evaluate response quality
- Benchmark performance
- Clean, professional interface
- Easy to show to stakeholders
- Real-time interaction
- No command line needed
- Learn how vLLM works
- Experiment with settings
- See the effects of parameters
- Understand model behavior
For production use, consider:
- Adding authentication
- Using HTTPS
- Limiting network access
- Validating all inputs
- Rate limiting
- Resource quotas
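
As one illustration of the first point, a simple API-key dependency can gate every route in a FastAPI app. This is a generic sketch, not the project's code; the header name and environment variable are invented for the example:

```python
# Generic sketch of API-key authentication for a FastAPI app (illustrative only).
import os

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")  # hypothetical header name

def require_api_key(key: str = Depends(api_key_header)) -> None:
    # Compare against a key supplied via the environment (e.g. WEBUI_API_KEY).
    if key != os.environ.get("WEBUI_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

# Applying the dependency app-wide protects every endpoint, including /api/start.
app = FastAPI(dependencies=[Depends(require_api_key)])
```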
- First Run: Model weights are downloaded the first time a model is loaded (can be slow)
- GPU Memory: Start with 70-80% utilization and increase if needed
- Tensor Parallel: Use for models >13B parameters
- Prefix Caching: Enable for repeated prompts
- Log Stats: Disable for cleaner logs in production
- vLLM Docs: https://docs.vllm.ai/
- HuggingFace Models: https://huggingface.co/models
- FastAPI Docs: https://fastapi.tiangolo.com/
- WebSocket Guide: https://developer.mozilla.org/en-US/docs/Web/API/WebSocket
Want to extend the WebUI? Consider adding:
- Model temperature presets
- Save/load configurations
- Export chat history
- Multiple chat sessions
- System prompt configuration
- Token counter
- Response time metrics
- GPU utilization charts
- Model comparison mode
- API key management
Phase 1 ✅ (Current)
- Basic server management
- Chat interface
- Log streaming
- Configuration panel
Phase 2 (Future)
- Streaming responses
- Multiple sessions
- Configuration presets
- Enhanced metrics
Phase 3 (Future)
- User authentication
- Multi-user support
- Advanced monitoring
- Performance dashboards
Built with ❤️ for the vLLM community