Commit 7021833

Merge pull request #566 from seans3/vllm-ai-example
vLLM AI inference serving example
2 parents 0598f07 + 15d9100 commit 7021833

3 files changed: 205 additions & 0 deletions

ai/vllm-deployment/README.md

Lines changed: 134 additions & 0 deletions
# AI Inference with vLLM on Kubernetes

## Purpose / What You'll Learn

This example demonstrates how to deploy a server for AI inference using [vLLM](https://docs.vllm.ai/en/latest/) on Kubernetes. You'll learn how to:

- Set up a vLLM inference server with a model downloaded from [Hugging Face](https://huggingface.co/).
- Expose the inference endpoint using a Kubernetes `Service`.
- Set up port forwarding from your local machine to the inference `Service` in the Kubernetes cluster.
- Send a sample prediction request to the server using `curl`.

---

## 📚 Table of Contents

- [Prerequisites](#prerequisites)
- [Detailed Steps & Explanation](#detailed-steps--explanation)
- [Verification / Seeing it Work](#verification--seeing-it-work)
- [Configuration Customization](#configuration-customization)
- [Cleanup](#cleanup)
- [Further Reading / Next Steps](#further-reading--next-steps)

---

## Prerequisites

- A Kubernetes cluster with access to NVIDIA GPUs. This example was tested on GKE, but it can be adapted for other cloud providers such as EKS and AKS by ensuring you have a GPU-enabled node pool and have deployed the NVIDIA device plugin (a quick check is shown below).
- A Hugging Face account token with read access to the example model (`google/gemma-3-1b-it`).
- `kubectl` configured to communicate with your cluster and available in your `PATH`.
- The `curl` binary in your `PATH`.

**Note for GKE users:** To target specific GPU types, you can uncomment the GKE-specific `nodeSelector` in `vllm-deployment.yaml`.

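If you are unsure whether the device plugin is working, you can check that your nodes advertise an allocatable `nvidia.com/gpu` resource (a minimal sanity check; the column names are arbitrary):

```bash
# Each GPU node should report a non-zero GPU count
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```
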
---

## Detailed Steps & Explanation

1. Create a Kubernetes `Secret` with your Hugging Face token, which the server uses to download the model:

   ```bash
   # Env var HF_TOKEN contains your Hugging Face account token
   kubectl create secret generic hf-secret \
     --from-literal=hf_token=$HF_TOKEN
   ```

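   To confirm the `Secret` was created before moving on:

   ```bash
   kubectl get secret hf-secret
   ```
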
2. Deploy the vLLM server:

   ```bash
   kubectl apply -f vllm-deployment.yaml
   ```

   - Wait for the deployment to reconcile, creating the vLLM pod(s):

     ```bash
     kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
     kubectl get pods -l app=gemma-server -w
     ```

   - View the vLLM pod logs:

     ```bash
     kubectl logs -f -l app=gemma-server
     ```

     Expected output:

     ```
     INFO: Automatically detected platform cuda.
     ...
     INFO [launcher.py:34] Route: /v1/chat/completions, Methods: POST
     ...
     INFO: Started server process [13]
     INFO: Waiting for application startup.
     INFO: Application startup complete.
     Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-1b-it-1" on port 8080.
     ...
     ```

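   If the pod instead stays in `Pending`, it is usually a scheduling problem (for example, no node with an allocatable `nvidia.com/gpu`); the pod events explain why:

   ```bash
   kubectl describe pod -l app=gemma-server
   ```
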
3. Create the service:

   ```bash
   # A ClusterIP service on port 8080 in front of the vLLM deployment
   kubectl apply -f vllm-service.yaml
   ```

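   To confirm the `Service` has selected the vLLM pod, check its endpoints (the address shown is the pod's cluster IP):

   ```bash
   kubectl get endpoints vllm-service
   ```
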
## Verification / Seeing it Work

1. Forward local requests to the vLLM service:

   ```bash
   # Forward a local port (e.g., 8080) to the service port (e.g., 8080)
   kubectl port-forward service/vllm-service 8080:8080
   ```

2. Send a request to the locally forwarded port:

   ```bash
   curl -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "google/gemma-3-1b-it",
       "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
       "max_tokens": 100
     }'
   ```

   Expected output (or similar):

   ```json
   {"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let’s break down quantum computing in a way that’s hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day – laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does – from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
   ```

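   Since vLLM exposes an OpenAI-compatible API, you can also list the served model(s) as a quick sanity check (output shape may vary across vLLM versions):

   ```bash
   # The "data" array should include google/gemma-3-1b-it
   curl http://localhost:8080/v1/models
   ```
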
---

## Configuration Customization

- Update `MODEL_ID` in the deployment manifest to serve a different model (ensure your Hugging Face access token has read access to that model).
- Change the number of vLLM pod replicas in the deployment manifest; a `kubectl scale` alternative is sketched below.

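For example, to scale out without editing the manifest (each replica requests its own GPU, and re-applying the manifest will reset this change):

```bash
kubectl scale deployment/vllm-gemma-deployment --replicas=2
```
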
---

## Cleanup

```bash
kubectl delete -f vllm-service.yaml
kubectl delete -f vllm-deployment.yaml
kubectl delete secret hf-secret
```

---

## Further Reading / Next Steps

- [vLLM AI Inference Server](https://docs.vllm.ai/en/latest/)
- [Hugging Face Security Tokens](https://huggingface.co/docs/hub/en/security-tokens)

ai/vllm-deployment/vllm-deployment.yaml

Lines changed: 59 additions & 0 deletions

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        # Labels for better functionality within GKE.
        # ai.gke.io/model: gemma-3-1b-it
        # ai.gke.io/inference-server: vllm
        # examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        # vllm/vllm-openai:v0.10.0
        image: vllm/vllm-openai@sha256:05a31dc4185b042e91f4d2183689ac8a87bd845713d5c3f987563c5899878271
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8080
        env:
        # 1 billion parameter model (smallest gemma model)
        - name: MODEL_ID
          value: google/gemma-3-1b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      # GKE specific node selectors to ensure a particular (NVIDIA L4) GPU.
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-l4
      #   cloud.google.com/gke-gpu-driver-version: latest
```

ai/vllm-deployment/vllm-service.yaml

Lines changed: 12 additions & 0 deletions

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
```

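Because the `Service` is of type `ClusterIP`, it is reachable only from inside the cluster; the port-forwarding step above bridges that for local testing. As an alternative in-cluster check (a sketch, assuming the public `curlimages/curl` image is allowed in your environment):

```bash
# Run a throwaway pod and query the service through cluster DNS
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl http://vllm-service:8080/v1/models
```
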
