Feature Request
Problem / Use Case
Currently, vLLM Playground manages a single vLLM server instance at a time. The architecture connects one Web UI to one vLLM container, and the CLI (start/stop/status) operates on a single server. While the v0.1.5 remote server feature allows connecting to one external vLLM instance, there is no support for managing multiple models or server instances simultaneously.
For production and development workflows, it is common to need multiple models running concurrently: for example, running a coding assistant model alongside a general-purpose chat model, or A/B testing two versions of the same model.
Proposed Feature
Support for managing multiple vLLM server instances simultaneously, including:
- Multi-instance dashboard - Start, stop, and monitor multiple vLLM servers from a single UI, each serving a different model on different ports/GPUs
- Model registry / catalog - A central view of available models with one-click deploy, showing which are currently running and their resource usage
- Per-model configuration - Independent configuration (GPU assignment, quantization, context length, etc.) for each running instance
- Dynamic model switching - Ability to route chat sessions to different running models without restarting the playground
- Resource-aware scheduling - Visibility into GPU/CPU/memory utilization to help decide which models can be co-located
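To make the registry and scheduling ideas above concrete, here is a minimal sketch of what a resource-aware instance registry could look like. Everything in it is a hypothetical illustration, not existing vLLM Playground code: the class names (`ModelInstance`, `InstanceRegistry`), the per-GPU memory budget, and the placement check are all assumptions about one possible design.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: track running vLLM instances and refuse to
# co-locate a model on a GPU that lacks memory headroom. All names
# and numbers here are illustrative, not part of vLLM Playground.

@dataclass
class ModelInstance:
    name: str          # model identifier, e.g. "coder-7b"
    port: int          # port the vLLM server would listen on
    gpu: int           # GPU index the instance is pinned to
    gpu_mem_gb: float  # estimated GPU memory footprint

@dataclass
class InstanceRegistry:
    gpu_capacity_gb: dict[int, float]            # per-GPU memory budget
    instances: list[ModelInstance] = field(default_factory=list)

    def used_gb(self, gpu: int) -> float:
        # Sum the footprints of instances already placed on this GPU.
        return sum(i.gpu_mem_gb for i in self.instances if i.gpu == gpu)

    def can_place(self, inst: ModelInstance) -> bool:
        # Resource-aware check: only co-locate if the GPU has headroom.
        budget = self.gpu_capacity_gb.get(inst.gpu, 0.0)
        return self.used_gb(inst.gpu) + inst.gpu_mem_gb <= budget

    def deploy(self, inst: ModelInstance) -> bool:
        # One-click deploy from the catalog would go through a check
        # like this before starting a new server container.
        if not self.can_place(inst):
            return False
        self.instances.append(inst)
        return True

# Example: a single 24 GB GPU can host one 16 GB model but not two.
registry = InstanceRegistry(gpu_capacity_gb={0: 24.0})
print(registry.deploy(ModelInstance("coder-7b", 8001, 0, 16.0)))  # fits
print(registry.deploy(ModelInstance("chat-7b", 8002, 0, 16.0)))   # rejected
```

A real implementation would read live utilization from the GPUs rather than static estimates, but even a simple budget check like this would let the dashboard grey out models that cannot currently be co-located.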
Why This Matters
As teams scale their use of local LLM serving, the ability to orchestrate multiple models from a single interface becomes essential. This would complement the existing OpenShift/K8s deployment support and make vLLM Playground a more complete solution for both individual developers and enterprise teams.
Environment
- vLLM Playground v0.1.5
- Feature request (not a bug)