- OpenShift CLI (`oc`) or `kubectl` installed
- Logged into your OpenShift cluster: `oc login <cluster-url>`
- Built and pushed the Web UI container to your registry (one-time setup)
Determine if your cluster has GPU or CPU-only nodes:
```shell
# Check for GPU nodes
oc get nodes -L nvidia.com/gpu.present

# If you see "true" in the nvidia.com/gpu.present column, use GPU mode.
# Otherwise, use CPU mode.
```

Deploy in GPU mode:

```shell
cd openshift/
./deploy.sh --gpu
```

What this does:
- Creates the `vllm-playground` namespace
- Sets up RBAC (ServiceAccount, Role, RoleBinding)
- Creates ConfigMaps for GPU and CPU modes
- Deploys GPU-mode Web UI (1 replica)
- Creates CPU-mode deployment (0 replicas, for easy switching)
- Exposes via Route (OpenShift) or Service (Kubernetes)
vLLM Image Used: `vllm/vllm-openai:v0.12.0` (official community image; v0.12.0+ required for Claude Code)
Deploy in CPU mode:

```shell
cd openshift/
./deploy.sh --cpu
```

vLLM Image Used: `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` (self-built, publicly accessible)
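The GPU check and the two deploy commands above can be folded into a small helper that picks the deploy flag automatically. A sketch — the `pick_mode` function is hypothetical, not part of `deploy.sh`; feed it the output of `oc get nodes -L nvidia.com/gpu.present`:

```shell
# Hypothetical helper: choose the deploy flag based on whether any node
# reports "true" in the nvidia.com/gpu.present label column.
pick_mode() {
  if echo "$1" | grep -q "true"; then
    echo "--gpu"
  else
    echo "--cpu"
  fi
}

# Usage sketch (run from openshift/):
# MODE=$(pick_mode "$(oc get nodes -L nvidia.com/gpu.present)")
# ./deploy.sh "$MODE"
```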
Get the URL:
```shell
# OpenShift (with Routes)
echo "https://$(oc get route vllm-playground -n vllm-playground -o jsonpath='{.spec.host}')"

# Kubernetes (with port-forward)
oc port-forward -n vllm-playground svc/vllm-playground 7860:7860
# Then visit: http://localhost:7860
```

- Open the Web UI URL in your browser
- Select a model (e.g., `facebook/opt-125m` for testing)
- Configure parameters:
  - GPU mode: leave "Enable CPU Mode" unchecked
  - CPU mode: check "Enable CPU Mode"
- Click "Start Server"
- Wait for the vLLM service to start (check logs tab)
- Once ready, go to the Chat tab and start chatting!
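Besides the Chat tab, the dynamically created vLLM pod exposes an OpenAI-compatible API you can call directly. A minimal Python sketch of a chat completion request — the local URL and port are assumptions (vLLM's OpenAI server defaults to port 8000; port-forward the vLLM pod first), and the actual network call is left commented out:

```python
import json
from urllib import request

# Assumed endpoint: reachable after port-forwarding the vLLM service pod,
# e.g. `oc port-forward pod/vllm-service 8000:8000 -n vllm-playground`.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

payload = build_chat_request("facebook/opt-125m", "Hello!")
req = request.Request(
    VLLM_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment once the service is reachable
```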
```shell
oc get all -n vllm-playground
oc get deployments -n vllm-playground

# Example output (GPU mode active):
# NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
# vllm-playground-gpu   1/1     1            1           5m
# vllm-playground-cpu   0/0     0            0           5m
```

```shell
# Web UI logs (GPU mode)
oc logs -f deployment/vllm-playground-gpu -n vllm-playground

# Web UI logs (CPU mode)
oc logs -f deployment/vllm-playground-cpu -n vllm-playground

# vLLM service logs (when running)
oc logs -f vllm-service -n vllm-playground
```

Switch to CPU mode:

```shell
oc scale deployment/vllm-playground-gpu --replicas=0 -n vllm-playground
oc scale deployment/vllm-playground-cpu --replicas=1 -n vllm-playground
```

Switch back to GPU mode:

```shell
oc scale deployment/vllm-playground-cpu --replicas=0 -n vllm-playground
oc scale deployment/vllm-playground-gpu --replicas=1 -n vllm-playground
```

Note: After switching, restart any running vLLM service in the Web UI.
```shell
# GPU mode
oc rollout restart deployment/vllm-playground-gpu -n vllm-playground

# CPU mode
oc rollout restart deployment/vllm-playground-cpu -n vllm-playground
```

To undeploy:

```shell
cd openshift/
./undeploy.sh

# Or force without confirmation
./undeploy.sh --force

# Or manually
oc delete namespace vllm-playground
```

Cause: Cannot pull container images
Solution: Check the image name and registry access:
```shell
# Check pod events for details
oc describe pod <pod-name> -n vllm-playground

# Verify the configured image
oc get configmap vllm-playground-config-gpu -n vllm-playground -o yaml

# The deployment uses community images from Docker Hub/Quay.io;
# no pull secrets are required for these public images.

# If you see ImagePullBackOff, it may be due to:
# 1. A typo in the image name
# 2. Network connectivity issues
# 3. Rate limiting from the registry

# Restart the deployment
oc rollout restart deployment/vllm-playground-gpu -n vllm-playground
```

Cause: Insufficient resources or GPU not available
Solution 1 - Check resources:

```shell
oc describe pod <pod-name> -n vllm-playground
# Look for events like "Insufficient cpu/memory"
```

Solution 2 - If in GPU mode but no GPUs are available:

```shell
# Switch to CPU mode
./deploy.sh --cpu
```

Solution 3 - Check the GPU operator:

```shell
# Ensure the GPU operator is installed (for GPU clusters)
oc get pods -n nvidia-gpu-operator
```

Cause: Running on vanilla Kubernetes (not OpenShift)

Solution: Use port-forward instead:

```shell
oc port-forward -n vllm-playground svc/vllm-playground 7860:7860
# Visit: http://localhost:7860
```

Cause 1: Model not found or download failed
Solution: Check vLLM logs:
```shell
oc logs vllm-service -n vllm-playground
```

Common Error: `LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder`
This error means the vLLM pod cannot download the model from HuggingFace. This can happen because:
- Network access blocked - The pod has no internet access
- HuggingFace authentication required - Some models require a HuggingFace token
Solution A - Enable Internet Access:
Check if the namespace has network policies blocking egress:
```shell
oc get networkpolicies -n vllm-playground
```

If your cluster restricts outbound traffic, you may need to:
- Request network policy changes from your cluster admin
- Use pre-downloaded models (mount as PVC)
- Set up a model mirror/cache in your cluster
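If your admin agrees to open egress for this namespace, a permissive NetworkPolicy would look roughly like the sketch below. The policy name is an assumption, and many clusters enforce egress through other mechanisms (egress firewalls, proxies), so confirm the right approach with your admin first:

```yaml
# Sketch: allow all egress from the vllm-playground namespace
# so pods can reach HuggingFace for model downloads.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress        # hypothetical name
  namespace: vllm-playground
spec:
  podSelector: {}           # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - {}                    # empty rule = allow all outbound traffic
```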
Solution B - Configure HuggingFace Token:
For gated models (Llama, Mistral, etc.), you need a HuggingFace token:
```shell
# Get your token from https://huggingface.co/settings/tokens

# Create a secret with your HuggingFace token
oc create secret generic huggingface-token \
  --from-literal=HF_TOKEN=hf_xxxxxxxxxxxx \
  -n vllm-playground

# Note: Currently, HF tokens must be entered via the Web UI
# Future enhancement: Auto-inject from secret
```

Then enter your token in the Web UI under "Advanced Settings" when starting a server.
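For reference, the auto-inject enhancement mentioned above would likely wire the secret into the vLLM pod spec along these lines. This is a sketch only — it is not implemented yet, and the field placement within the pod spec is an assumption:

```yaml
# Sketch: expose the huggingface-token secret as an env var
# in the vLLM pod spec (not currently wired up).
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: huggingface-token
        key: HF_TOKEN
```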
Cause 2: Insufficient memory
Solution: Use a smaller model or increase the limits in `kubernetes_container_manager.py`
Cause 3: Wrong mode (GPU model on CPU deployment)
Solution:
- For GPU models: use the GPU deployment (`./deploy.sh --gpu`)
- For CPU models: check "Enable CPU Mode" in the Web UI
```
User Browser
     │
     ↓
OpenShift Route (HTTPS)
     │
     ↓
Service: vllm-playground
     │
     ↓
Deployment: vllm-playground-gpu (active)
         or vllm-playground-cpu (active)
     │
     ├─→ Web UI Container
     │    (FastAPI + React)
     │         │
     │         └─→ Kubernetes API
     │              (Creates/Deletes vLLM Pods)
     │
     └─→ vLLM Service Pod (created dynamically)
          (vLLM OpenAI API Server)
```
- `vllm-playground-config-gpu`: GPU mode configuration
- `vllm-playground-config-cpu`: CPU mode configuration
Key settings:
- `VLLM_IMAGE`: Container image for the vLLM service
- `KUBERNETES_NAMESPACE`: Namespace for vLLM pods
- `USE_PERSISTENT_CACHE`: Enable PVC for model caching
- `MODEL_CACHE_PVC`: PVC name for the model cache
Set in the deployment manifest (`04-webui-deployment.yaml`):

- `KUBERNETES_NAMESPACE`: Where to create vLLM pods
- `VLLM_IMAGE`: Image to use for the vLLM service
- `USE_PERSISTENT_CACHE`: Enable persistent model cache
- `MODEL_CACHE_PVC`: PVC for the model cache (if enabled)
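Put together, the env section of `04-webui-deployment.yaml` would look roughly like this. The values shown are illustrative assumptions (including the PVC name), not the shipped defaults:

```yaml
# Sketch of the Web UI container's env wiring; adjust values to your cluster.
env:
  - name: KUBERNETES_NAMESPACE
    value: vllm-playground
  - name: VLLM_IMAGE
    value: vllm/vllm-openai:v0.12.0
  - name: USE_PERSISTENT_CACHE
    value: "true"
  - name: MODEL_CACHE_PVC
    value: vllm-model-cache   # hypothetical PVC name
```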
- `README.md` - Architecture overview and deployment details
```shell
# Deploy GPU mode
./deploy.sh --gpu

# Deploy CPU mode
./deploy.sh --cpu

# Get Web UI URL
echo "https://$(oc get route vllm-playground -n vllm-playground -o jsonpath='{.spec.host}')"

# View all resources
oc get all -n vllm-playground

# View Web UI logs (GPU)
oc logs -f deployment/vllm-playground-gpu -n vllm-playground

# View vLLM logs
oc logs -f vllm-service -n vllm-playground

# Switch to CPU
oc scale deployment/vllm-playground-gpu --replicas=0 -n vllm-playground && \
oc scale deployment/vllm-playground-cpu --replicas=1 -n vllm-playground

# Switch to GPU
oc scale deployment/vllm-playground-cpu --replicas=0 -n vllm-playground && \
oc scale deployment/vllm-playground-gpu --replicas=1 -n vllm-playground

# Undeploy
./undeploy.sh
```