vLLM CPU Inference on Kubernetes

A demonstration of production-style deployment patterns for vLLM-compatible LLM inference on Kubernetes, using CPU-only resources.

🎯 What This Demonstrates

  • vLLM-compatible API running on Kubernetes
  • CPU-only inference (no GPU required)
  • OpenAI-compatible endpoints (/v1/completions, /v1/chat/completions)
  • Production patterns: Auto-scaling, health checks, load balancing
  • Clean architecture: Separate Docker image, readable Kubernetes manifests

🚨 Important Note

This project provides a vLLM-compatible API server built with FastAPI + Transformers, not the official vLLM software. This approach is used because:

  • Official vLLM Docker images don't support CPU-only environments
  • Building vLLM from source for CPU takes 30-60 minutes
  • This demonstrates the same Kubernetes deployment patterns you'd use with real vLLM

🏗️ Architecture

Internet → LoadBalancer → Service → Pod(s) → FastAPI Server → GPT-2 Model
  • Docker Image: Custom-built with FastAPI + Transformers
  • Kubernetes Service: LoadBalancer for external access
  • Horizontal Pod Autoscaler: Auto-scaling based on CPU/memory
  • Health Checks: Readiness and liveness probes

📁 Project Structure

├── app.py                    # FastAPI server (vLLM-compatible API)
├── requirements.txt          # Python dependencies
├── Dockerfile               # Multi-stage Docker build
├── build-and-push.sh        # Build and push Docker image
├── deploy.sh                # One-click Kubernetes deployment
├── test-api.sh              # API testing script
├── k8s/
│   ├── deployment.yaml      # Kubernetes deployment
│   ├── service.yaml         # LoadBalancer service
│   └── hpa.yaml             # Horizontal Pod Autoscaler
└── DIGITALOCEAN.md          # DigitalOcean specific guide

🚀 Quick Start

Prerequisites

  • Kubernetes cluster (DigitalOcean, local, or cloud)
  • kubectl configured
  • Docker (for building image)
  • Docker Hub account

Option 1: Using Pre-built Image

# 1. Deploy to Kubernetes
./deploy.sh

# 2. Test the API
./test-api.sh

Option 2: Build Your Own Image

# 1. Login to Docker Hub
docker login

# 2. Build and push the image (cross-builds AMD64 from Apple Silicon)
./build-and-push.sh

# 3. Deploy to Kubernetes
./deploy.sh

# 4. Test the deployment
./test-api.sh

🔧 API Usage

Health Check

curl http://YOUR-EXTERNAL-IP:8000/health

Text Completion

curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "prompt": "The future of AI is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Chat Completion

curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

📊 Scaling and Monitoring

Manual Scaling

kubectl scale deployment vllm-cpu --replicas=3

Auto-scaling

The HPA automatically scales based on:

  • CPU usage > 70%
  • Memory usage > 80%
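Those thresholds correspond to an `autoscaling/v2` HPA spec along these lines; this is a sketch, and the exact names and replica bounds live in k8s/hpa.yaml:

```yaml
# Sketch of k8s/hpa.yaml -- replica bounds here are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-cpu
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Note the HPA compares utilization against the pod's resource *requests*, so the deployment's requests must be set for it to act.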

Monitoring

# Check pods
kubectl get pods -l app=vllm-cpu

# View logs
kubectl logs -l app=vllm-cpu -f

# Check resource usage
kubectl top pods -l app=vllm-cpu

🌐 DigitalOcean Deployment

For DigitalOcean Kubernetes:

# Connect to your cluster
doctl kubernetes cluster kubeconfig save your-cluster-name

# Deploy
./deploy.sh

# Your API will be available at the LoadBalancer IP

See DIGITALOCEAN.md for detailed instructions.

🔄 Production Considerations

For Real vLLM Deployment

In production with GPU nodes, you would:

  1. Use official vLLM image: vllm/vllm-openai:latest
  2. Add GPU resources: nvidia.com/gpu: 1
  3. Use larger models: Llama-2, CodeLlama, etc.
  4. Same Kubernetes patterns: The deployment structure remains identical
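Concretely, only the container spec changes; a sketch of what the GPU variant might look like (the model name is an example, and the rest of the Deployment stays as-is):

```yaml
# Illustrative container spec for official vLLM on a GPU node.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args: ["--model", "meta-llama/Llama-2-7b-chat-hf"]
    ports:
      - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
```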

Security & Performance

  • Resource limits: Configured for CPU/memory constraints
  • Health checks: Readiness and liveness probes
  • Non-root user: Docker image runs as non-root
  • Horizontal scaling: Multiple replicas for high availability
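In the deployment manifest, these patterns appear roughly as follows. This is a sketch with illustrative values (including the image name); see k8s/deployment.yaml for the actual configuration:

```yaml
# Sketch of the pod-spec patterns listed above -- all values illustrative.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
containers:
  - name: vllm-cpu
    image: your-dockerhub-user/vllm-cpu:latest   # placeholder image name
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 30
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 30
```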

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with ./test-api.sh
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🙋‍♂️ Support

  • Issues: Use GitHub Issues for bug reports
  • Discussions: Use GitHub Discussions for questions
  • Documentation: Check DIGITALOCEAN.md for platform-specific guides
