
Conversation

yossiovadia (Collaborator)

Overview

Adds complete OpenShift deployment infrastructure with NVIDIA GPU support for running the semantic router with specialized LLM containers. Provides zero-touch deployment automation with comprehensive validation.

Architecture

```
Pod: semantic-router (4 containers)
├── semantic-router - Main ExtProc service (ports 50051, 8080)
├── model-a - Math specialist (port 8000, GPU 0)
├── model-b - Coding specialist (port 8001, GPU 1)
└── envoy-proxy - HTTP gateway (port 8801)
```
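
The same layout, expressed as a deployment manifest sketch; this is illustrative only, with placeholder images, labels, and resource values rather than the shipped deployment.yaml:

```yaml
# Illustrative 4-container pod sketch; images and values are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-router
spec:
  replicas: 1
  selector:
    matchLabels: {app: semantic-router}
  template:
    metadata:
      labels: {app: semantic-router}
    spec:
      containers:
        - name: semantic-router               # ExtProc gRPC + HTTP API
          image: semantic-router:latest       # placeholder
          ports:
            - containerPort: 50051            # ExtProc gRPC
            - containerPort: 8080             # HTTP API
        - name: model-a                       # math specialist
          image: llm-katan:latest             # placeholder
          ports:
            - containerPort: 8000
          resources:
            limits: {nvidia.com/gpu: "1"}     # GPU 0 via CDI annotation
        - name: model-b                       # coding specialist
          image: llm-katan:latest             # placeholder
          ports:
            - containerPort: 8001
          resources:
            limits: {nvidia.com/gpu: "1"}     # GPU 1 via CDI annotation
        - name: envoy-proxy                   # HTTP gateway
          image: envoyproxy/envoy:v1.31.0     # placeholder tag
          ports:
            - containerPort: 8801
```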

Quick Start

```bash
# Login to OpenShift
oc login https://your-cluster:6443

# Deploy (auto-detects OpenShift server if logged in)
./deploy/openshift/deploy-to-openshift.sh

# Validate deployment
./deploy/openshift/validate-deployment.sh

# Cleanup
./deploy/openshift/cleanup-openshift.sh
```

OpenWebUI Integration

The deployment provides an OpenWebUI-compatible endpoint:

http://envoy-http-./v1

Configure this URL in OpenWebUI settings to use the semantic router as your LLM backend with automatic category-based routing and security guardrails.
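
For a quick check outside OpenWebUI, the same endpoint accepts OpenAI-style requests. A hedged example, with a placeholder route hostname and a model name that depends on your router config:

```bash
# Substitute the actual route hostname printed by the deploy script
curl -s http://<envoy-route-host>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "auto",
        "messages": [{"role": "user", "content": "What is 12 * 34?"}]
      }'
# The router classifies the prompt (math here) and forwards it to the
# matching specialist; "auto" is a placeholder model name.
```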


netlify bot commented Oct 8, 2025

Deploy Preview for vllm-semantic-router ready!

- 🔨 Latest commit: f938960
- 🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68e6cda91573c00008746460
- 😎 Deploy Preview: https://deploy-preview-376--vllm-semantic-router.netlify.app

yossiovadia and others added 2 commits October 8, 2025 13:46
This commit adds comprehensive OpenShift deployment support with GPU-enabled
specialist model containers, providing a complete automation solution for
deploying the semantic router to OpenShift clusters.

**Core Deployment:**
- deployment.yaml: Kubernetes deployment manifest with GPU support
  * 4-container pod: semantic-router, model-a, model-b, envoy-proxy
  * CDI annotations for GPU device injection (gpu=0, gpu=1)
  * GPU node selection and tolerations
  * PVC mounts for models and cache
  * Production log levels (INFO for containers, info for Envoy)
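
The GPU-related pieces of that manifest reduce to roughly the fragment below; this is a sketch, and the exact CDI annotation keys, node labels, and PVC names depend on the cluster's NVIDIA setup, so treat them as assumptions:

```yaml
# Sketch of the GPU/storage fragment; keys and names are assumptions
spec:
  template:
    metadata:
      annotations:
        cdi.k8s.io/model-a: nvidia.com/gpu=0   # inject GPU 0 into model-a
        cdi.k8s.io/model-b: nvidia.com/gpu=1   # inject GPU 1 into model-b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"         # assumed GPU node label
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: semantic-router-models  # placeholder PVC name
        - name: cache
          persistentVolumeClaim:
            claimName: semantic-router-cache   # placeholder PVC name
```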

- deploy-to-openshift.sh: Main deployment automation script (826 lines)
  * Auto-detection of OpenShift server and existing login
  * Enhanced deployment method with llm-katan specialists
  * Alternative methods: kustomize, template
  * Configurable resources, storage, logging
  * Automatic namespace creation
  * Inline Dockerfile build for llm-katan image
  * Service and route creation
  * Optional port forwarding (disabled by default)
  * Displays OpenWebUI endpoint at completion

- cleanup-openshift.sh: Cleanup automation script (494 lines)
  * Auto-detection of cluster and namespace
  * Graceful cleanup with confirmation
  * Port forwarding cleanup
  * Comprehensive resource deletion

**Configuration:**
- config-openshift.yaml: Semantic router config for OpenShift
  * Math-specialist and coding-specialist endpoints
  * Category-to-specialist routing
  * PII and jailbreak detection configuration
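
The category-to-specialist mapping follows the router's config schema; a trimmed sketch, with category names and scores as illustrative values rather than the shipped config-openshift.yaml:

```yaml
# Illustrative routing sketch; category names and scores are examples
categories:
  - name: math
    model_scores:
      - model: "Model-A"   # math specialist on 127.0.0.1:8000
        score: 1.0
  - name: coding
    model_scores:
      - model: "Model-B"   # coding specialist on 127.0.0.1:8001
        score: 1.0
default_model: "Model-A"
```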

- envoy-openshift.yaml: Envoy proxy configuration
  * HTTP listener on port 8801
  * External processing filter
  * Specialist model routing
  * /v1/models aggregation
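
Structurally, that Envoy config is an HTTP listener whose connection manager runs the ext_proc filter (pointed at the semantic-router gRPC service) ahead of the router filter. A trimmed sketch, with cluster names and the default route as illustrative choices:

```yaml
# Trimmed Envoy sketch; cluster names and defaults are illustrative
static_resources:
  listeners:
    - name: http_8801
      address:
        socket_address: {address: 0.0.0.0, port_value: 8801}
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                http_filters:
                  # semantic-router inspects and mutates requests via ExtProc
                  - name: envoy.filters.http.ext_proc
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
                      grpc_service:
                        envoy_grpc: {cluster_name: ext_proc}
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: all
                      domains: ["*"]
                      routes:
                        - match: {prefix: "/"}
                          route: {cluster: model_a}  # default; ExtProc re-routes per category
  clusters:
    - name: ext_proc   # semantic-router ExtProc, gRPC over HTTP/2
      type: STATIC
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config: {http2_protocol_options: {}}
      load_assignment:
        cluster_name: ext_proc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: {address: 127.0.0.1, port_value: 50051}
    - name: model_a    # one static cluster per specialist (model_b analogous)
      type: STATIC
      load_assignment:
        cluster_name: model_a
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: {address: 127.0.0.1, port_value: 8000}
```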

**Container Image:**
- Dockerfile.llm-katan: GPU-enabled specialist container image
  * Python 3.10-slim base
  * PyTorch with CUDA 12.1 support
  * llm-katan, transformers, accelerate packages
  * HuggingFace caching configuration
  * Health check endpoint
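
Put together, the image builds along these lines; a sketch under stated assumptions, since the pinned versions and the exact llm-katan launch flags are not shown here:

```dockerfile
# Sketch of Dockerfile.llm-katan; versions and launch flags are assumptions
FROM python:3.10-slim

# PyTorch with CUDA 12.1 wheels, then the serving stack
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121 \
    && pip install --no-cache-dir llm-katan transformers accelerate

# Point HuggingFace downloads at a writable cache (PVC-mounted in OpenShift)
ENV HF_HOME=/cache/huggingface

EXPOSE 8000
# Placeholder flags; the real entrypoint and healthcheck live in the actual file
CMD ["llm-katan", "--port", "8000"]
```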

**Alternative Deployment Methods:**
- kustomization.yaml: Kustomize deployment option
- template.yaml: OpenShift template with parameters

**Documentation & Validation:**
- README.md: Comprehensive deployment documentation
- validate-deployment.sh: 12-test validation script
  * Namespace, deployment, container readiness
  * GPU detection in both specialist containers
  * Model loading verification
  * PVC, service, route checks
  * GPU node scheduling confirmation

- Makefile: Add include for tools/make/openshift.mk
- tools/make/openshift.mk: Optional make targets for OpenShift operations
  * openshift-deploy, openshift-cleanup, openshift-status
  * openshift-logs, openshift-routes, openshift-test
  * Port forwarding helpers
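
With the include in place, day-to-day operations can go through make; the target names are the ones listed above, and the comments are assumptions about what each wraps:

```bash
make openshift-deploy    # presumably wraps deploy-to-openshift.sh
make openshift-status    # pods, services, routes at a glance
make openshift-logs      # tail router/container logs
make openshift-cleanup   # presumably wraps cleanup-openshift.sh
```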

**Key Features:**

1. **GPU Support**: Full NVIDIA GPU support via CDI device injection
2. **Specialist Models**: Real llm-katan containers for math/coding tasks
3. **Zero-Touch Deployment**: Auto-detection of cluster, automatic builds
4. **Production Ready**: Production log levels, proper health checks
5. **Validation**: Comprehensive 12-test validation suite
6. **UX Enhancements**: OpenWebUI endpoint display, optional port forwarding
7. **Clean Separation**: Only touches deploy/openshift/ (plus minimal Makefile)

```
Pod: semantic-router
├── semantic-router (main ExtProc service, port 50051)
├── model-a (llm-katan math specialist, port 8000, GPU 0)
├── model-b (llm-katan coding specialist, port 8001, GPU 1)
└── envoy-proxy (gateway, port 8801)
```

Validated on OpenShift with NVIDIA L4 GPUs:
- All 4 containers running
- GPUs detected in both specialist containers
- Models loaded on CUDA
- PVCs bound
- Services and routes accessible
- Streaming functionality working

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
Routes are created without TLS termination by default, so URLs should
use http:// not https://. This fixes the quick test commands shown at
deployment completion.

Tested and verified:
- curl http://semantic-router-api.../health works
- curl -X POST http://semantic-router-api.../api/v1/classify/intent works

Signed-off-by: Yossi Ovadia <[email protected]>

github-actions bot commented Oct 8, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 deploy

Owners: @rootfs, @Xunzhuo
Files changed:

  • deploy/openshift/Dockerfile.llm-katan
  • deploy/openshift/README.md
  • deploy/openshift/cleanup-openshift.sh
  • deploy/openshift/config-openshift.yaml
  • deploy/openshift/deploy-to-openshift.sh
  • deploy/openshift/deployment.yaml
  • deploy/openshift/envoy-openshift.yaml
  • deploy/openshift/kustomization.yaml
  • deploy/openshift/template.yaml
  • deploy/openshift/validate-deployment.sh

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/make/openshift.mk

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • Makefile


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

rootfs requested a review from Copilot, October 8, 2025 20:50

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds comprehensive OpenShift deployment infrastructure for the semantic router with NVIDIA GPU support and specialized LLM containers. It provides zero-touch deployment automation with validation scripts and supports a 4-container pod architecture: the semantic-router service, two GPU-enabled model specialists (model-a and model-b), and an envoy-proxy HTTP gateway.

  • Complete OpenShift deployment manifests with GPU scheduling and security contexts
  • Automated deployment and validation scripts with error handling and port forwarding
  • OpenShift-specific configurations for Envoy proxy and model routing

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Summary per file:

| File | Description |
| --- | --- |
| tools/make/openshift.mk | Makefile targets for OpenShift operations including login, deploy, status, and debugging |
| deploy/openshift/validate-deployment.sh | Comprehensive validation script testing all 4 containers, GPU detection, and model loading |
| deploy/openshift/template.yaml | OpenShift template for parameterized deployment with security contexts |
| deploy/openshift/kustomization.yaml | Kustomize configuration for OpenShift deployment with labels and annotations |
| deploy/openshift/envoy-openshift.yaml | OpenShift-specific Envoy configuration using static clusters for pod networking |
| deploy/openshift/deployment.yaml | Main deployment manifest with 4-container pod, GPU scheduling, and init container |
| deploy/openshift/deploy-to-openshift.sh | Automated deployment script with login detection, build management, and port forwarding |
| deploy/openshift/config-openshift.yaml | OpenShift-specific router configuration with localhost endpoints and model policies |
| deploy/openshift/cleanup-openshift.sh | Comprehensive cleanup script with multiple cleanup levels and safety confirmations |
| deploy/openshift/README.md | Documentation for OpenShift deployment with troubleshooting and monitoring guides |
| deploy/openshift/Dockerfile.llm-katan | Dockerfile for building llm-katan specialist containers with CUDA support |
| Makefile | Include openshift.mk in the main Makefile |
Comments suppressed due to low confidence (2, both identical)

deploy/openshift/deployment.yaml:1

  • Hardcoded storage class 'gp3-csi' may not be available on all OpenShift clusters. Consider making this configurable or using the cluster's default storage class.
apiVersion: apps/v1



echo "$(RED)[ERROR]$(NC) OPENSHIFT_PASSWORD is required"; \
exit 1; \
fi
@oc login -u $(OPENSHIFT_USER) -p $(OPENSHIFT_PASSWORD) $(OPENSHIFT_SERVER) --insecure-skip-tls-verify
Copilot AI commented Oct 8, 2025:

Using --insecure-skip-tls-verify bypasses SSL certificate validation, which poses a security risk. Consider making this configurable or documenting the security implications.


fi

log "INFO" "Logging into OpenShift at $OPENSHIFT_SERVER as $OPENSHIFT_USER"
if ! oc login -u "$OPENSHIFT_USER" -p "$OPENSHIFT_PASSWORD" "$OPENSHIFT_SERVER" --insecure-skip-tls-verify; then
Copilot AI commented Oct 8, 2025:

Using --insecure-skip-tls-verify bypasses SSL certificate validation. This should be configurable or at least documented as a security consideration for production deployments.


return 0
fi

if ! oc login -u "$OPENSHIFT_USER" -p "$OPENSHIFT_PASSWORD" "$OPENSHIFT_SERVER" --insecure-skip-tls-verify; then
Copilot AI commented Oct 8, 2025:

Using --insecure-skip-tls-verify bypasses SSL certificate validation. Consider making this configurable for production environments.


Comment on lines +378 to +385
# Wait for python imagestream to be ready
log "INFO" "Waiting for python imagestream to be ready..."
sleep 5
while ! oc get istag python:3.10-slim -n "$NAMESPACE" &> /dev/null; do
sleep 2
done
log "SUCCESS" "Python imagestream ready"

Copilot AI commented Oct 8, 2025:

The script waits for 'python:3.10-slim' imagestream tag but the Dockerfile uses 'python:3.10-slim' as a base image. OpenShift may not automatically create an imagestream for external images, potentially causing an infinite loop.

Suggested change (removing the unbounded wait):

```diff
-# Wait for python imagestream to be ready
-log "INFO" "Waiting for python imagestream to be ready..."
-sleep 5
-while ! oc get istag python:3.10-slim -n "$NAMESPACE" &> /dev/null; do
-    sleep 2
-done
-log "SUCCESS" "Python imagestream ready"
```


# Test 4: Check GPU detection in model-a
log "INFO" "Test 4: Checking GPU detection in model-a container..."
if [[ -n "$POD_NAME" ]]; then
GPU_CHECK=$(oc exec -n "$NAMESPACE" "$POD_NAME" -c model-a -- python3 -c "import torch; print('CUDA' if torch.cuda.is_available() else 'CPU')" 2>/dev/null || echo "ERROR")
Copilot AI commented Oct 8, 2025:

The GPU detection logic is duplicated for model-a and model-b containers. Consider extracting this into a reusable function to reduce code duplication.


jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json"

# vLLM Endpoints Configuration
# IMPORTANT: Using localhost since containers are in same pod
Copilot AI commented Oct 8, 2025:

[nitpick] While the comments explain the localhost usage, consider adding more context about why this differs from the standard Kubernetes configuration and the pod networking implications.

Suggested change:

```diff
-# IMPORTANT: Using localhost since containers are in same pod
+# IMPORTANT: Using localhost (127.0.0.1) here because all vLLM containers and the orchestrator are running within the same Kubernetes pod.
+# In standard Kubernetes configurations, containers in different pods cannot communicate via localhost; they must use service names or pod IPs over the pod network.
+# If you move any of these containers to separate pods, you must update the address fields below to use the appropriate service name or pod IP instead of 127.0.0.1.
```

Comment on lines +30 to +39
# IMPORTANT: Using localhost since containers are in same pod
vllm_endpoints:
  - name: "model-a-endpoint"
    address: "127.0.0.1" # localhost in same pod
    port: 8000
    models:
      - "Model-A"
    weight: 1
  - name: "model-b-endpoint"
    address: "127.0.0.1" # localhost in same pod
Copilot AI commented Oct 8, 2025:

[nitpick] While the comments explain the localhost usage, consider adding more context about why this differs from the standard Kubernetes configuration and the pod networking implications.

Suggested change:

```diff
-# IMPORTANT: Using localhost since containers are in same pod
+# IMPORTANT: Using localhost (127.0.0.1) here because all containers are running in the same pod and thus share the same network namespace.
+# This differs from standard Kubernetes practice, where services typically communicate over the pod network using service names or pod IPs.
+# If these containers are ever split into separate pods, using localhost will break communication; in that case, update the address to use the appropriate service name or pod IP.
 vllm_endpoints:
   - name: "model-a-endpoint"
-    address: "127.0.0.1" # localhost in same pod
+    address: "127.0.0.1" # localhost; only works because containers share the pod's network namespace
     port: 8000
     models:
       - "Model-A"
     weight: 1
   - name: "model-b-endpoint"
-    address: "127.0.0.1" # localhost in same pod
+    address: "127.0.0.1" # localhost; only works because containers share the pod's network namespace
```

cpu: "2"
# Real LLM specialist containers using llm-katan
- name: model-a
image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest
Copilot AI commented Oct 8, 2025:

The image reference is hardcoded with the namespace. This creates a dependency between the image registry path and the deployment namespace, making it less portable.

Suggested change:

```diff
-image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest
+image: image-registry.openshift-image-registry.svc:5000/$(NAMESPACE)/llm-katan:latest
```

cpu: "1"
nvidia.com/gpu: "1"
- name: model-b
image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest
Copilot AI commented Oct 8, 2025:

The image reference is hardcoded with the namespace. This creates a dependency between the image registry path and the deployment namespace, making it less portable.

Suggested change:

```diff
-image: image-registry.openshift-image-registry.svc:5000/vllm-semantic-router-system/llm-katan:latest
+image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/llm-katan:latest
```

yossiovadia (Collaborator, Author) commented Oct 8, 2025 via email

rootfs merged commit ab92fac into vllm-project:main on Oct 8, 2025
9 checks passed
JaredforReal (Contributor) commented:

@yossiovadia I can help you set up the o11y stack (Grafana + Prometheus) in OpenShift once you have it working.

yossiovadia (Collaborator, Author) commented:

@JaredforReal - really appreciate it! I have already started (actually, finished) working on it. It will likely require some enhancement and testing.
