
Deploying vLLM Playground to OpenShift/Kubernetes

Architecture Options

Option 1: Static Deployment (Simplest) ⭐ RECOMMENDED FOR PRODUCTION

Architecture:

┌─────────────────┐         ┌──────────────────┐
│  Web UI Pod     │────────>│  vLLM Service    │
│  (FastAPI)      │  HTTP   │  (Always Running)│
└─────────────────┘         └──────────────────┘

Pros:

  • Simple, reliable, production-ready
  • No special permissions needed
  • Standard Kubernetes patterns
  • Easy to scale and monitor

Cons:

  • Can't dynamically change models via UI
  • Configuration changes require redeployment
  • Always consuming resources even when idle

Best for: Production environments, single model use case
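
In this option the web UI needs nothing more than an HTTP client pointed at the always-running vLLM Service. A minimal sketch, assuming a Service named vllm-service on port 8000 exposing vLLM's OpenAI-compatible API (the Service name and model id are placeholders, not taken from the manifests):

# Sketch: web UI → vLLM Service over plain HTTP (OpenAI-compatible API).
# Service name, port, and model id are assumed placeholders.
import requests

VLLM_URL = "http://vllm-service:8000"  # assumed in-cluster Service DNS name

def chat(prompt: str) -> str:
    """Send a prompt to vLLM's OpenAI-compatible chat endpoint."""
    resp = requests.post(
        f"{VLLM_URL}/v1/chat/completions",
        json={
            "model": "my-model",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]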


Option 2: Dynamic Pod Management (Most Flexible) ⭐ RECOMMENDED & IMPLEMENTED HERE

Architecture:

┌─────────────────┐   K8s API   ┌──────────────────┐
│  Web UI Pod     │────────────>│  Creates/Deletes │
│  (FastAPI +     │             │  vLLM Pods       │
│   K8s Client)   │             │  Dynamically     │
└─────────────────┘             └──────────────────┘

Pros:

  • Keep your existing UI workflow ("Start Server" button)
  • Dynamic model switching
  • Resource efficient (only run when needed)
  • Similar to your local Podman setup

Cons:

  • Requires ServiceAccount with pod creation permissions (RBAC)
  • More complex than static deployment
  • Needs proper cleanup on failures

Best for: Development, experimentation, multi-model testing


Option 3: Kubernetes Job Pattern (Good Middle Ground)

Architecture:

┌─────────────────┐  Create Job  ┌──────────────────────┐
│  Web UI Pod     │─────────────>│  vLLM Job/Pod        │
│  (K8s Client)   │              │  (Runs to Completion)│
└─────────────────┘              └──────────────────────┘

Pros:

  • Automatic cleanup
  • Job tracking and retry logic
  • Good for batch/benchmark workloads

Cons:

  • Jobs are meant for completion, vLLM is long-running
  • Not ideal for interactive servers

Best for: Benchmark workloads, batch inference
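
If you adopt this pattern for benchmarks anyway, a Job could be submitted with the Kubernetes Python client roughly as follows (namespace, image tag, and benchmark command are illustrative placeholders):

# Sketch: submit a one-shot vLLM benchmark run as a Kubernetes Job.
from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() outside the cluster
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="vllm-benchmark"),
    spec=client.V1JobSpec(
        backoff_limit=2,                  # retry a failed run up to twice
        ttl_seconds_after_finished=600,   # auto-clean the Job after it finishes
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="vllm",
                    image="vllm/vllm-openai:v0.12.0",
                    # Illustrative placeholder; substitute your benchmark entrypoint.
                    command=["vllm", "bench", "throughput"],
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace="vllm-playground", body=job)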


✅ Implemented: Option 2 (Dynamic Pod Management)

Status: COMPLETE & VERIFIED

This implementation maintains your current workflow while leveraging OpenShift's orchestration.

Implementation Overview

  1. The Web UI container runs in OpenShift
  2. It uses the Kubernetes Python client instead of Podman
  3. A ServiceAccount grants permission to create/delete pods
  4. Same Web UI: users click "Start Server" and a vLLM pod is created

Key Changes from Local Setup

Local (Podman)           OpenShift/K8s
podman run               client.create_namespaced_pod()
podman stop              client.delete_namespaced_pod()
podman logs -f           client.read_namespaced_pod_log()
Container name           Pod name
Port mapping             Service + ClusterIP
Volume mounts            PVCs or hostPath
container_manager.py     kubernetes_container_manager.py
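
Concretely, the mapping above amounts to a thin wrapper around the Kubernetes Python client. A condensed sketch with simplified names, not the verbatim contents of kubernetes_container_manager.py:

# Sketch of the Podman → Kubernetes API mapping (simplified, not verbatim).
from kubernetes import client, config

config.load_incluster_config()  # the web UI pod authenticates via its ServiceAccount
v1 = client.CoreV1Api()
NAMESPACE = "vllm-playground"  # assumed namespace

def start_server(pod_name: str, image: str, args: list[str]) -> None:
    """Replacement for `podman run`: create a vLLM pod."""
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=pod_name, labels={"app": "vllm"}),
        spec=client.V1PodSpec(containers=[
            client.V1Container(name="vllm", image=image, args=args,
                               ports=[client.V1ContainerPort(container_port=8000)]),
        ]),
    )
    v1.create_namespaced_pod(namespace=NAMESPACE, body=pod)

def stop_server(pod_name: str) -> None:
    """Replacement for `podman stop`: delete the pod."""
    v1.delete_namespaced_pod(name=pod_name, namespace=NAMESPACE)

def stream_logs(pod_name: str):
    """Replacement for `podman logs -f`: follow the pod log."""
    return v1.read_namespaced_pod_log(
        name=pod_name, namespace=NAMESPACE, follow=True,
        _preload_content=False,  # return a stream instead of a single string
    )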

How It Works

File Substitution at Build Time:

# openshift/Containerfile (line 38)
COPY openshift/kubernetes_container_manager.py ${HOME}/vllm-playground/container_manager.py

  • Locally: app.py imports container_manager.py (Podman CLI)
  • In OpenShift: app.py imports the substituted file (Kubernetes API)
  • Same interface: Both managers implement identical methods
  • Same UX: Users see no difference!

No Podman in OpenShift - Only Kubernetes API
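
One way to guarantee the drop-in substitution works is a shared contract that both managers satisfy. A hypothetical sketch of that interface (the real method names in the two files may differ):

# Hypothetical sketch of the common interface both managers implement;
# the actual method names in container_manager.py may differ.
from abc import ABC, abstractmethod
from typing import Iterator

class ContainerManager(ABC):
    """Contract shared by the Podman and Kubernetes managers."""

    @abstractmethod
    def start_server(self, name: str, image: str, args: list[str]) -> None: ...

    @abstractmethod
    def stop_server(self, name: str) -> None: ...

    @abstractmethod
    def stream_logs(self, name: str) -> Iterator[str]: ...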


📚 Documentation

For a condensed step-by-step guide, see QUICK_START.md; this README covers the architecture and deployment options.


🚀 Quick Deployment

GPU Clusters (Default) ⭐

# 1. Clone repo
git clone https://github.com/micytao/vllm-playground.git

# 2. Build and push Web UI image
cd vllm-playground
podman build -f openshift/Containerfile -t your-registry/vllm-playground:latest .
podman push your-registry/vllm-playground:latest

# 3. Update image in manifest
vim openshift/manifests/04-webui-deployment.yaml  # Update image reference

# 4. Deploy to OpenShift (GPU mode)
cd openshift/
./deploy.sh --gpu  # Uses vllm/vllm-openai:v0.12.0

# 5. Get Web UI URL
echo "https://$(oc get route vllm-playground -n vllm-playground -o jsonpath='{.spec.host}')"

CPU Clusters

# Same steps 1-3 as above, then:

# 4. Deploy to OpenShift (CPU mode)
cd openshift/
./deploy.sh --cpu  # Uses quay.io/rh_ee_micyang/vllm-cpu:v0.11.0 (self-built, publicly accessible)

# 5. Get Web UI URL
echo "https://$(oc get route vllm-playground -n vllm-playground -o jsonpath='{.spec.host}')"

🗑️ Undeployment

# Quick undeploy (deletes namespace and all resources)
cd openshift/
./undeploy.sh

# OR force undeploy without confirmation
./undeploy.sh --force

# OR detailed undeploy (deletes resources individually)
./undeploy-detailed.sh

✅ Verification

The implementation has been verified for interface compatibility:

# Run verification script
python3 openshift/verify_interface.py

Result: ✅ All required methods present and signatures match!
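
The idea behind such a check can be expressed with the standard inspect module. This sketch illustrates the approach, not necessarily how verify_interface.py is actually written:

# Sketch: compare public method names and signatures of the two managers.
import inspect

def interfaces_match(local_cls: type, k8s_cls: type) -> bool:
    """True if every public method of local_cls exists on k8s_cls with the same signature."""
    for name, method in inspect.getmembers(local_cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # only compare the public API
        other = getattr(k8s_cls, name, None)
        if other is None or inspect.signature(method) != inspect.signature(other):
            return False
    return True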


🔒 Security Considerations for OpenShift

  1. RBAC: minimal permissions (pod create/delete and log access only, scoped to a single namespace); see the access check sketch below
  2. ServiceAccount: a dedicated SA for the web UI (vllm-playground-sa)
  3. SecurityContextConstraints (SCC): OpenShift's pod security layer
  4. ⚙️ ResourceQuotas: can cap how many vLLM pods may be created
  5. ⚙️ NetworkPolicies: can restrict pod-to-pod communication
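
As a runtime sanity check, the web UI can ask the API server whether its ServiceAccount is actually allowed to create pods, using a SelfSubjectAccessReview (the namespace name below is assumed):

# Sketch: verify at startup that our ServiceAccount may create pods
# in the target namespace.
from kubernetes import client, config

config.load_incluster_config()
auth = client.AuthorizationV1Api()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="vllm-playground",  # assumed namespace
            verb="create",
            resource="pods",
        )
    )
)
result = auth.create_self_subject_access_review(body=review)
if not result.status.allowed:
    raise RuntimeError("ServiceAccount lacks permission to create pods")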

📋 Files in This Directory

File                             Purpose
kubernetes_container_manager.py  K8s-based manager (replaces Podman)
Containerfile                    Builds the Web UI image for OpenShift
requirements-k8s.txt             Python dependencies (includes the kubernetes client)
manifests/                       Kubernetes manifests for deployment
deploy.sh                        🚀 Automated deployment script (supports --gpu/--cpu)
undeploy.sh                      🗑️ Automated undeployment script (fast)
undeploy-detailed.sh             🗑️ Detailed undeployment script
README.md                        This file: architecture overview
QUICK_START.md                   Quick deployment guide
verify_interface.py              Interface compatibility test script

🎮 GPU Support

GPU support is fully enabled!

The deployment automatically detects and uses GPUs when:

  • CPU mode is disabled in the Web UI
  • GPU nodes are available in the cluster

Features (a pod-spec sketch follows this list):

  • ✅ Automatic GPU resource requests
  • ✅ GPU node targeting via node selector
  • ✅ Multi-GPU support (tensor parallelism)
  • ✅ Falls back to CPU-only execution when CPU mode is enabled
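
Here is how these features might translate into a pod spec built with the Kubernetes Python client; the node selector label, tensor-parallel flag usage, and function shape are illustrative assumptions, not the verbatim manager code:

# Sketch: requesting GPUs and targeting GPU nodes when building the vLLM pod spec.
from kubernetes import client

def vllm_pod_spec(image: str, gpu_count: int, cpu_mode: bool) -> client.V1PodSpec:
    args = []
    if gpu_count > 1:
        args = ["--tensor-parallel-size", str(gpu_count)]  # multi-GPU inference
    container = client.V1Container(name="vllm", image=image, args=args)
    if cpu_mode:
        return client.V1PodSpec(containers=[container])  # CPU fallback: no GPU request
    container.resources = client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": str(gpu_count)}  # schedules onto a GPU-capable node
    )
    return client.V1PodSpec(
        containers=[container],
        node_selector={"nvidia.com/gpu.present": "true"},  # assumed GPU node label
    )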

🖥️ CPU vs GPU Deployment

The deployment supports both CPU-only and GPU-enabled clusters:

Mode           Container Image                         Use Case
GPU (default)  vllm/vllm-openai:v0.12.0                Production workloads on GPU clusters (official vLLM image; v0.12.0+ for Claude Code)
CPU            quay.io/rh_ee_micyang/vllm-cpu:v0.11.0  Development/testing on CPU-only clusters (self-built, optimized)

Container Strategy:

  • GPU: Uses official community vLLM image (no authentication needed)
  • CPU: Uses self-built optimized image (publicly accessible on Quay.io)
  • No Pull Secrets: Both images are publicly accessible, no registry authentication required

Features:

  • ✅ Easy switching between CPU and GPU modes
  • ✅ Dedicated ConfigMaps for each mode
  • ✅ Separate deployments (one active at a time)
  • ✅ Single command deployment: ./deploy.sh --gpu or ./deploy.sh --cpu