# AI Inference with vLLM on Kubernetes

## Purpose / What You'll Learn

This example demonstrates how to deploy an AI inference server using [vLLM](https://docs.vllm.ai/en/latest/) on Kubernetes. You'll learn how to:

- Set up a vLLM inference server with a model downloaded from [Hugging Face](https://huggingface.co/).
- Expose the inference endpoint using a Kubernetes `Service`.
- Set up port forwarding from your local machine to the inference `Service` in the Kubernetes cluster.
- Send a sample prediction request to the server using `curl`.
---

## 📚 Table of Contents

- [Prerequisites](#prerequisites)
- [Detailed Steps & Explanation](#detailed-steps--explanation)
- [Verification / Seeing it Work](#verification--seeing-it-work)
- [Configuration Customization](#configuration-customization)
- [Cleanup](#cleanup)
- [Further Reading / Next Steps](#further-reading--next-steps)

---

## Prerequisites

- A Kubernetes cluster with access to NVIDIA GPUs. This example was tested on GKE, but it can be adapted for other cloud providers such as EKS and AKS by ensuring you have a GPU-enabled node pool and have deployed the NVIDIA device plugin.
- A Hugging Face access token with read access to the model (this example uses `google/gemma-3-1b-it`, which requires accepting the model's terms on Hugging Face).
- `kubectl` configured to communicate with your cluster and available in your `PATH`.
- The `curl` binary available in your `PATH`.

**Note for GKE users:** To target specific GPU types, you can uncomment the GKE-specific `nodeSelector` in `vllm-deployment.yaml`.
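
The selector itself is not reproduced in this guide. As a rough sketch, assuming GKE's `cloud.google.com/gke-accelerator` node label and an L4 GPU (the label value in the actual manifest may differ), it would sit under the pod template's `spec` and look something like this:

```yaml
# Hypothetical example -- adjust the accelerator value to the GPU type in your node pool
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```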

---

## Detailed Steps & Explanation

1. Create a Kubernetes `Secret` with your Hugging Face token so the cluster can download the model:

```bash
# The HF_TOKEN environment variable holds your Hugging Face access token
kubectl create secret generic hf-secret \
    --from-literal=hf_token=$HF_TOKEN
```
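
The deployment reads the token from this secret. The exact wiring depends on `vllm-deployment.yaml`, which is not reproduced here; a typical pattern (the variable name `HUGGING_FACE_HUB_TOKEN` is an assumption, the manifest may use a different one) looks like this in the container spec:

```yaml
# Hypothetical excerpt from the container spec in vllm-deployment.yaml
env:
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret   # the Secret created above
      key: hf_token     # the key used in --from-literal
```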

2. Deploy the vLLM server:

```bash
kubectl apply -f vllm-deployment.yaml
```

 - Wait for the deployment to become available, creating the vLLM pod(s); the first start can take several minutes while the model downloads:

```bash
# Block until the deployment reports Available (up to 15 minutes)
kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
# Optionally watch the pods come up (Ctrl+C to stop watching)
kubectl get pods -l app=gemma-server -w
```

 - View the vLLM pod logs:

```bash
kubectl logs -f -l app=gemma-server
```

Expected output:

```
 INFO: Automatically detected platform cuda.
 ...
 INFO [launcher.py:34] Route: /v1/chat/completions, Methods: POST
 ...
 INFO: Started server process [13]
 INFO: Waiting for application startup.
 INFO: Application startup complete.
 Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-1b-it-1" on port 8080.
...
```
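
If the pods never become ready, the most common cause is that no node with an allocatable GPU is available. Standard `kubectl` commands help diagnose this (the second one follows the pattern from the Kubernetes GPU scheduling docs):

```bash
# Show events for the vLLM pod (look for FailedScheduling / insufficient nvidia.com/gpu)
kubectl describe pod -l app=gemma-server

# Show how many GPUs each node exposes to the scheduler
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```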

3. Create the Service:

```bash
# Creates a ClusterIP Service on port 8080 in front of the vLLM deployment
kubectl apply -f vllm-service.yaml
```
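
The manifest itself is not reproduced in this guide; based on the names used above (service `vllm-service`, pod label `app=gemma-server`, port 8080), it is likely along these lines:

```yaml
# Hypothetical sketch of vllm-service.yaml -- the actual file may differ in detail
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: ClusterIP
  selector:
    app: gemma-server
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
```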

---

## Verification / Seeing it Work

1. Forward local requests to the vLLM `Service`:

```bash
# Forward a local port (8080) to the service port (8080)
kubectl port-forward service/vllm-service 8080:8080
```
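
`kubectl port-forward` runs in the foreground, so either leave it running in a separate terminal or background it; one way to do the latter:

```bash
# Optional: background the port-forward so the same terminal can send requests
kubectl port-forward service/vllm-service 8080:8080 >/dev/null 2>&1 &
PF_PID=$!
# When finished, stop it with: kill "$PF_PID"
```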

2. Send a request to the forwarded local port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }'
```

Expected output (or similar):

```json
{"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let’s break down quantum computing in a way that’s hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day – laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does – from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
```
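
You can also confirm which model the server is serving via the OpenAI-compatible models endpoint:

```bash
# Lists the served model(s); the response should include "google/gemma-3-1b-it"
curl http://localhost:8080/v1/models
```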

---

## Configuration Customization

- Update `MODEL_ID` in the deployment manifest to serve a different model (make sure your Hugging Face access token has access to that model).
- Change the number of vLLM pod replicas in the deployment manifest, or scale imperatively as shown below.
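
For example, a quick way to scale without editing the manifest (note that each replica requests its own GPU):

```bash
# Scale the vLLM deployment to 2 replicas
kubectl scale deployment/vllm-gemma-deployment --replicas=2
```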

---

## Cleanup

```bash
kubectl delete -f vllm-service.yaml
kubectl delete -f vllm-deployment.yaml
# The secret was created imperatively, so delete it by name
kubectl delete secret hf-secret
```
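
Optionally, confirm that everything has been removed (each command should report `NotFound` once deletion finishes):

```bash
kubectl get deployment vllm-gemma-deployment
kubectl get service vllm-service
kubectl get secret hf-secret
```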

---

## Further Reading / Next Steps

- [vLLM AI Inference Server](https://docs.vllm.ai/en/latest/)
- [Hugging Face Security Tokens](https://huggingface.co/docs/hub/en/security-tokens)