A production-ready demonstration of deploying vLLM-compatible LLM inference on Kubernetes using CPU-only resources.
- vLLM-compatible API running on Kubernetes
- CPU-only inference (no GPU required)
- OpenAI-compatible endpoints (`/v1/completions`, `/v1/chat/completions`)
- Production patterns: auto-scaling, health checks, load balancing
- Clean architecture: Separate Docker image, readable Kubernetes manifests
This project provides a vLLM-compatible API server built with FastAPI + Transformers rather than the official vLLM engine. This approach was chosen because:
- Official vLLM Docker images don't support CPU-only environments
- Building vLLM from source for CPU takes 30-60 minutes
- This demonstrates the same Kubernetes deployment patterns you'd use with real vLLM
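The core of such a shim is wrapping generated text in OpenAI's response schema so existing clients work unchanged. A minimal sketch of the idea (pure stdlib; the `make_completion_response` helper is illustrative, not the actual app.py code, and the whitespace-based token counting is a simplification):

```python
import time
import uuid


def make_completion_response(model: str, prompt: str, text: str) -> dict:
    """Wrap generated text in the OpenAI /v1/completions response shape."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {"text": text, "index": 0, "logprobs": None, "finish_reason": "stop"}
        ],
        "usage": {
            # Real servers count tokenizer tokens; splitting on whitespace
            # is only a stand-in for this sketch.
            "prompt_tokens": len(prompt.split()),
            "completion_tokens": len(text.split()),
            "total_tokens": len(prompt.split()) + len(text.split()),
        },
    }


resp = make_completion_response("gpt2", "The future of AI is", " bright and automated.")
print(resp["object"])  # text_completion
```

The FastAPI endpoint then only has to run the Transformers pipeline and return this dict as JSON.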
Internet → LoadBalancer → Service → Pod(s) → FastAPI Server → GPT-2 Model
- Docker Image: Custom-built with FastAPI + Transformers
- Kubernetes Service: LoadBalancer for external access
- Horizontal Pod Autoscaler: Auto-scaling based on CPU/memory
- Health Checks: Readiness and liveness probes
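The readiness and liveness probes typically point at the server's `/health` endpoint. An illustrative container-spec fragment (the actual paths, ports, and thresholds live in k8s/deployment.yaml and may differ):

```yaml
# Illustrative probe configuration (see k8s/deployment.yaml for the real values)
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30   # model loading on CPU takes a while
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 30
```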
```
├── app.py               # FastAPI server (vLLM-compatible API)
├── requirements.txt     # Python dependencies
├── Dockerfile           # Multi-stage Docker build
├── build-and-push.sh    # Build and push Docker image
├── deploy.sh            # One-click Kubernetes deployment
├── test-api.sh          # API testing script
├── k8s/
│   ├── deployment.yaml  # Kubernetes deployment
│   ├── service.yaml     # LoadBalancer service
│   └── hpa.yaml         # Horizontal Pod Autoscaler
└── DIGITALOCEAN.md      # DigitalOcean-specific guide
```
- Kubernetes cluster (DigitalOcean, local, or cloud)
- kubectl configured
- Docker (for building image)
- Docker Hub account
```bash
# 1. Deploy to Kubernetes
./deploy.sh

# 2. Test the API
./test-api.sh
```

To build and push the image yourself first:

```bash
# 1. Login to Docker Hub
docker login

# 2. Build and push image (supports Mac M1 → AMD64)
./build-and-push.sh

# 3. Deploy to Kubernetes
./deploy.sh

# 4. Test the deployment
./test-api.sh
```

Check the health endpoint:

```bash
curl http://YOUR-EXTERNAL-IP:8000/health
```

Request a completion:

```bash
curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "prompt": "The future of AI is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Request a chat completion:

```bash
curl -X POST "http://YOUR-EXTERNAL-IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt2",
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Scale manually:

```bash
kubectl scale deployment vllm-cpu --replicas=3
```

The HPA automatically scales based on:
- CPU usage > 70%
- Memory usage > 80%
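Expressed as an `autoscaling/v2` HorizontalPodAutoscaler, those thresholds look roughly like this (illustrative fragment; the real settings, including the replica bounds, are in k8s/hpa.yaml):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-cpu
  minReplicas: 1        # placeholder bounds; see k8s/hpa.yaml
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```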
```bash
# Check pods
kubectl get pods -l app=vllm-cpu

# View logs
kubectl logs -l app=vllm-cpu -f

# Check resource usage
kubectl top pods -l app=vllm-cpu
```

For DigitalOcean Kubernetes:

```bash
# Connect to your cluster
doctl kubernetes cluster kubeconfig save your-cluster-name

# Deploy
./deploy.sh

# Your API will be available at the LoadBalancer IP
```

See DIGITALOCEAN.md for detailed instructions.
In production with GPU nodes, you would:
- Use the official vLLM image: `vllm/vllm-openai:latest`
- Add GPU resources: `nvidia.com/gpu: 1`
- Use larger models: Llama-2, CodeLlama, etc.
- Same Kubernetes patterns: the deployment structure remains identical
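Concretely, the container spec changes would be roughly the following (illustrative; the model ID is one example of a larger model, and vLLM's server flags may differ by version):

```yaml
# Illustrative GPU container spec; not part of this repo's manifests
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args: ["--model", "meta-llama/Llama-2-7b-chat-hf"]
    resources:
      limits:
        nvidia.com/gpu: 1
```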
- Resource limits: Configured for CPU/memory constraints
- Health checks: Readiness and liveness probes
- Non-root user: Docker image runs as non-root
- Horizontal scaling: Multiple replicas for high availability
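The CPU/memory constraints mentioned above are typically expressed as container resource requests and limits. The values below are placeholders; the actual numbers are in k8s/deployment.yaml:

```yaml
# Placeholder values; see k8s/deployment.yaml for the real configuration
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```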
- Fork the repository
- Create a feature branch
- Make your changes
- Test with `./test-api.sh`
- Submit a pull request
MIT License - see LICENSE file for details.
- Issues: Use GitHub Issues for bug reports
- Discussions: Use GitHub Discussions for questions
- Documentation: Check DIGITALOCEAN.md for platform-specific guides