A production-ready demonstration of deploying vLLM-compatible LLM inference on Kubernetes using CPU-only resources.
- vLLM-compatible API running on Kubernetes
- CPU-only inference (no GPU required)
- OpenAI-compatible endpoints (`/v1/completions`, `/v1/chat/completions`)
- Production patterns: auto-scaling, health checks, load balancing
- Clean architecture: Separate Docker image, readable Kubernetes manifests
This project provides a vLLM-compatible API server built with FastAPI + Transformers rather than the official vLLM engine. This approach was chosen because:
- Official vLLM Docker images don't support CPU-only environments
- Building vLLM from source for CPU takes 30-60 minutes
- This demonstrates the same Kubernetes deployment patterns you'd use with real vLLM
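The core of such a shim is wrapping generated text in OpenAI's response schema so existing clients work unchanged. A minimal sketch of the idea (pure stdlib; the `make_completion_response` helper is illustrative, not the actual app.py code, and the whitespace-based token counting is a simplification):

```python
import time
import uuid


def make_completion_response(model: str, prompt: str, text: str) -> dict:
    """Wrap generated text in the OpenAI /v1/completions response shape."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {"text": text, "index": 0, "logprobs": None, "finish_reason": "stop"}
        ],
        "usage": {
            # Real servers count tokenizer tokens; splitting on whitespace
            # is only a stand-in for this sketch.
            "prompt_tokens": len(prompt.split()),
            "completion_tokens": len(text.split()),
            "total_tokens": len(prompt.split()) + len(text.split()),
        },
    }


resp = make_completion_response("gpt2", "The future of AI is", " bright and automated.")
print(resp["object"])  # text_completion
```

The FastAPI endpoint then only has to run the Transformers pipeline and return this dict as JSON.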
Internet → LoadBalancer → Service → Pod(s) → FastAPI Server → GPT-2 Model
- Docker Image: Custom-built with FastAPI + Transformers
- Kubernetes Service: LoadBalancer for external access
- Horizontal Pod Autoscaler: Auto-scaling based on CPU/memory
- Health Checks: Readiness and liveness probes
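The readiness and liveness probes typically point at the server's `/health` endpoint. An illustrative container-spec fragment (the actual paths, ports, and thresholds live in k8s/deployment.yaml and may differ):

```yaml
# Illustrative probe configuration (see k8s/deployment.yaml for the real values)
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30   # model loading on CPU takes a while
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 30
```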
```
├── app.py               # FastAPI server (vLLM-compatible API)
├── requirements.txt     # Python dependencies
├── Dockerfile           # Multi-stage Docker build
├── build-and-push.sh    # Build and push Docker image
├── deploy.sh            # One-click Kubernetes deployment
├── test-api.sh          # API testing script
├── k8s/
│   ├── deployment.yaml  # Kubernetes deployment
│   ├── service.yaml     # LoadBalancer service
│   └── hpa.yaml         # Horizontal Pod Autoscaler
└── DIGITALOCEAN.md      # DigitalOcean-specific guide
```
- Kubernetes cluster (DigitalOcean, local, or cloud)
- kubectl configured
- Docker (for building image)
- Docker Hub account
```bash
# 1. Deploy to Kubernetes
./deploy.sh

# 2. Test the API
./test-api.sh
```

To build and push the image yourself first:

```bash
# 1. Login to Docker Hub
docker login

# 2. Build and push image (supports Mac M1 → AMD64)
./build-and-push.sh

# 3. Deploy to Kubernetes
./deploy.sh

# 4. Test the deployment
./test-api.sh
```

Check the health endpoint:

```bash
curl http://YOUR-EXTERNAL-IP:8000/health
```

Request a completion:

```bash
curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "prompt": "The future of AI is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Request a chat completion:

```bash
curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Scale manually:

```bash
kubectl scale deployment vllm-cpu --replicas=3
```

The HPA automatically scales based on:
- CPU usage > 70%
- Memory usage > 80%
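Expressed as an `autoscaling/v2` HorizontalPodAutoscaler, those thresholds look roughly like this (illustrative fragment; the real settings, including the replica bounds, are in k8s/hpa.yaml):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-cpu
  minReplicas: 1        # placeholder bounds; see k8s/hpa.yaml
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```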
```bash
# Check pods
kubectl get pods -l app=vllm-cpu

# View logs
kubectl logs -l app=vllm-cpu -f

# Check resource usage
kubectl top pods -l app=vllm-cpu
```

For DigitalOcean Kubernetes:

```bash
# Connect to your cluster
doctl kubernetes cluster kubeconfig save your-cluster-name

# Deploy
./deploy.sh

# Your API will be available at the LoadBalancer IP
```

See DIGITALOCEAN.md for detailed instructions.
In production with GPU nodes, you would:
- Use the official vLLM image: `vllm/vllm-openai:latest`
- Add GPU resources: `nvidia.com/gpu: 1`
- Use larger models: Llama-2, CodeLlama, etc.
- Same Kubernetes patterns: the deployment structure remains identical
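Concretely, the container spec changes would be roughly the following (illustrative; the model ID is one example of a larger model, and vLLM's server flags may differ by version):

```yaml
# Illustrative GPU container spec; not part of this repo's manifests
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args: ["--model", "meta-llama/Llama-2-7b-chat-hf"]
    resources:
      limits:
        nvidia.com/gpu: 1
```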
- Resource limits: Configured for CPU/memory constraints
- Health checks: Readiness and liveness probes
- Non-root user: Docker image runs as non-root
- Horizontal scaling: Multiple replicas for high availability
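The CPU/memory constraints mentioned above are typically expressed as container resource requests and limits. The values below are placeholders; the actual numbers are in k8s/deployment.yaml:

```yaml
# Placeholder values; see k8s/deployment.yaml for the real configuration
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```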
- Fork the repository
- Create a feature branch
- Make your changes
- Test with `./test-api.sh`
- Submit a pull request
MIT License - see LICENSE file for details.
- Issues: Use GitHub Issues for bug reports
- Discussions: Use GitHub Discussions for questions
- Documentation: Check DIGITALOCEAN.md for platform-specific guides