@@ -36,31 +36,40 @@ This example demonstrates how to deploy a server for AI inference using [vLLM](h
## Detailed Steps & Explanation
- 1. Ensure Hugging Face permissions to retrieve model:
+ 1. Create a namespace. This example uses `vllm-example`, but you can choose any name:
+
+ ```bash
+ kubectl create namespace vllm-example
+ ```
+
+ 2. Ensure Hugging Face permissions to retrieve the model:

```bash
# Env var HF_TOKEN contains Hugging Face account token
- kubectl create secret generic hf-secret \
+ # Make sure to use the same namespace as in the previous step
+ kubectl create secret generic hf-secret -n vllm-example \
--from-literal=hf_token=$HF_TOKEN
```
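
Optionally, confirm the secret landed in the right namespace before moving on. This is just a sanity check using the names from the steps above:

```bash
# Should list hf-secret in the vllm-example namespace
kubectl get secret hf-secret -n vllm-example
```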

- 2. Apply vLLM server:
+
+ 3. Apply vLLM server:

```bash
- kubectl apply -f vllm-deployment.yaml
+ # Make sure to use the same namespace as in the previous steps
+ kubectl apply -f vllm-deployment.yaml -n vllm-example
```

- Wait for deployment to reconcile, creating vLLM pod(s):

```bash
- kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment
- kubectl get pods -l app=gemma-server -w
+ kubectl wait --for=condition=Available --timeout=900s deployment/vllm-gemma-deployment -n vllm-example
+ kubectl get pods -l app=gemma-server -w -n vllm-example
```

- View vLLM pod logs:

```bash
- kubectl logs -f -l app=gemma-server
+ kubectl logs -f -l app=gemma-server -n vllm-example
```

Expected output:
@@ -77,11 +86,12 @@ Expected output:
...
```

- 3. Create service:
+ 4. Create service:

```bash
# ClusterIP service on port 8080 in front of vllm deployment
- kubectl apply -f vllm-service.yaml
+ # Make sure to use the same namespace as in the previous steps
+ kubectl apply -f vllm-service.yaml -n vllm-example
```
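
Before port-forwarding, it can also help to check that the Service has picked up the Ready vLLM pod. A quick check, assuming the service and namespace names used above:

```bash
# The ENDPOINTS column should show the vLLM pod's address once it is Ready
kubectl get endpoints vllm-service -n vllm-example
```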
## Verification / Seeing it Work
@@ -90,18 +100,19 @@ kubectl apply -f vllm-service.yaml

```bash
# Forward a local port (e.g., 8080) to the service port (e.g., 8080)
- kubectl port-forward service/vllm-service 8080:8080
+ # Make sure to use the same namespace as in the previous steps
+ kubectl port-forward service/vllm-service 8080:8080 -n vllm-example
```
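
Note that `kubectl port-forward` runs in the foreground, so keep it running and send the request from a second terminal, or append `&` to run it in the background:

```bash
# Optional: run the port-forward in the background instead of using a second terminal
kubectl port-forward service/vllm-service 8080:8080 -n vllm-example &
```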

2. Send a request to the local forwarding port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "google/gemma-3-1b-it",
  "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
  "max_tokens": 100
}'
```
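
vLLM serves an OpenAI-compatible chat completions API, so the response is JSON with the generated text under `choices[0].message.content`. If you have `jq` installed, a variant of the same request that prints only the reply (a convenience sketch, not part of the original steps):

```bash
# Same request as above, but pipe through jq to print only the generated text
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'
```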
@@ -151,9 +162,11 @@ Node selectors make sure vLLM pods land on Nodes with the correct GPU, and they
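
As a quick way to see that scheduling in action, you can compare your nodes' GPU labels with where the vLLM pod actually landed. The label key below is an assumption (GKE's accelerator label); substitute whatever label your cluster uses for GPU nodes:

```bash
# List nodes with their GPU label (cloud.google.com/gke-accelerator is GKE-specific; adjust for your cluster)
kubectl get nodes -L cloud.google.com/gke-accelerator

# Show which node the vLLM pod was scheduled on
kubectl get pods -l app=gemma-server -n vllm-example -o wide
```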
## Cleanup

```bash
- kubectl delete -f vllm-service.yaml
- kubectl delete -f vllm-deployment.yaml
- kubectl delete -f secret/hf_secret
+ # Make sure to use the same namespace as in the previous steps
+ kubectl delete -f vllm-service.yaml -n vllm-example
+ kubectl delete -f vllm-deployment.yaml -n vllm-example
+ kubectl delete secret hf-secret -n vllm-example
+ kubectl delete namespace vllm-example
```
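
Namespace deletion can take a moment to finish; once it does, the namespace should no longer be listed:

```bash
# After cleanup completes, this should report that the namespace is not found
kubectl get namespace vllm-example
```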
---