Commit 2404918

Authored by YuhanLiu11 <yliu738@wisc.edu>
[Feat] Adding a tutorial for using vLLM v1 in production stack (#390)
* Adding vLLM v1 tutorial
* Bump helm chart version
* Fixing yaml file format
* Fixing yaml file format
* Fix yaml format

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
1 parent 0210014 commit 2404918

File tree: 5 files changed, +189 −1 lines

helm/Chart.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.1
+version: 0.1.2

 maintainers:
 - name: apostac
```

helm/templates/deployment-vllm-multi.yaml

Lines changed: 12 additions & 0 deletions
```diff
@@ -118,8 +118,18 @@ spec:
 {{- end }}
 {{- if $modelSpec.lmcacheConfig }}
 {{- if $modelSpec.lmcacheConfig.enabled }}
+{{- if hasKey $modelSpec.vllmConfig "v1" }}
+{{- if eq (toString $modelSpec.vllmConfig.v1) "1" }}
+- "--kv-transfer-config"
+- '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
+{{- else }}
 - "--kv-transfer-config"
 - '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
+{{- end }}
+{{- else }}
+- "--kv-transfer-config"
+- '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
+{{- end }}
 {{- end }}
 {{- end }}
 {{- if $modelSpec.chatTemplate }}
@@ -139,6 +149,8 @@ spec:
             value: /tmp
           {{- end }}
           {{- with $modelSpec.vllmConfig}}
+          - name: LMCACHE_LOG_LEVEL
+            value: "DEBUG"
           {{- if hasKey . "v1" }}
           - name: VLLM_USE_V1
             value: {{ default 0 $modelSpec.vllmConfig.v1 | quote }}
```
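In plain terms, the nested conditionals above select `LMCacheConnectorV1` only when `vllmConfig.v1` is set to `1`, and fall back to the original `LMCacheConnector` otherwise. The same decision, sketched in Python (the helper name is mine, not part of the chart):

```python
import json

def kv_transfer_args(vllm_config: dict) -> list[str]:
    # Mirrors the Helm conditionals: the V1 connector is chosen only
    # when vllmConfig has "v1" and its string value equals "1".
    if str(vllm_config.get("v1")) == "1":
        connector = "LMCacheConnectorV1"
    else:
        connector = "LMCacheConnector"
    return [
        "--kv-transfer-config",
        json.dumps({"kv_connector": connector, "kv_role": "kv_both"}),
    ]
```

Either way, the engine is started with a `--kv-transfer-config` flag; only the connector name changes.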

tutorials/14-vllm-v1.md

Lines changed: 104 additions & 0 deletions
# Tutorial: Running vLLM with v1 Configuration

## Introduction

This tutorial demonstrates how to deploy vLLM with the v1 configuration enabled. The v1 configuration uses LMCacheConnectorV1 for KV cache management, which provides improved performance and stability for certain workloads.

## Prerequisites

- A Kubernetes cluster with GPU support
- Helm installed on your local machine
- Completion of the following tutorials:
  - [00-install-kubernetes-env.md](00-install-kubernetes-env.md)
  - [01-minimal-helm-installation.md](01-minimal-helm-installation.md)

## Step 1: Understanding the Configuration

The configuration file `values-14-vllm-v1.yaml` includes several important settings:

1. Model configuration:
   - Uses the Llama-3.1-8B-Instruct model
   - Single replica deployment
   - Resource requirements: 6 CPU, 16Gi memory, 1 GPU
   - 50Gi persistent storage

2. vLLM configuration:
   - v1 mode enabled (`v1: 1`)
   - bfloat16 precision
   - Maximum sequence length of 4096 tokens
   - GPU memory utilization set to 80%

3. LMCache configuration:
   - KV cache offloading enabled
   - 20GB CPU offloading buffer size

4. Cache server configuration:
   - Single replica cache server
   - Naive serialization/deserialization
   - Resource limits: 2 CPU, 10Gi memory

Feel free to change the above parameters to fit your own scenario.
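The list above maps onto the values file added in this commit; an abridged excerpt of the key entries (indentation reconstructed from the chart's values layout):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"
      replicaCount: 1
      vllmConfig:
        v1: 1                  # enables vLLM v1 / LMCacheConnectorV1
        dtype: "bfloat16"
        maxModelLen: 4096
        extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "20"   # GB of CPU offloading buffer
```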
## Step 2: Deploying the Stack

1. First, ensure you're in the correct directory:

   ```bash
   cd production-stack
   ```

2. Deploy the stack using Helm:

   ```bash
   helm install vllm helm/ -f tutorials/assets/values-14-vllm-v1.yaml
   ```

3. Verify the deployment:

   ```bash
   kubectl get pods
   ```

   You should see:

   - A vLLM pod for the Llama model
   - A cache server pod

## Step 3: Verifying the Configuration

1. Check the vLLM pod logs to verify the v1 configuration:

   ```bash
   kubectl logs -f <vllm-pod-name>
   ```

   Look for the following log message:

   ```log
   INFO 04-29 12:12:25 [factory.py:64] Creating v1 connector with name: LMCacheConnectorV1
   ```

2. Forward the router service port:

   ```bash
   kubectl port-forward svc/vllm-router-service 30080:80
   ```

## Step 4: Testing the Deployment

Send a test request to verify the deployment:

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain the benefits of using v1 configuration in vLLM.",
    "max_tokens": 100
  }'
```
Note that you need to send a prompt longer than 256 tokens (the chunk size configured in LMCache) in order to reuse the KV cache.
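To exercise this, a small Python sketch (helper names are mine; it assumes the Step 3 port-forward to `localhost:30080` is still active) builds a prompt comfortably over 256 tokens and sends it twice, so the second request can hit the LMCache-backed KV cache:

```python
import json
import time
import urllib.request

def build_long_prompt(min_words: int = 400) -> str:
    # ~400 words comfortably exceeds 256 tokens for typical tokenizers
    filler = "The quick brown fox jumps over the lazy dog. "
    words: list[str] = []
    while len(words) < min_words:
        words.extend(filler.split())
    return " ".join(words[:min_words])

def timed_completion(prompt: str,
                     url: str = "http://localhost:30080/v1/completions"):
    """Send one completion request and return (response, elapsed seconds)."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start

# Usage against a live stack:
#   prompt = build_long_prompt()
#   _, cold = timed_completion(prompt)  # first pass fills the KV cache
#   _, warm = timed_completion(prompt)  # prefix can now come from LMCache
```

If KV reuse is working, the second (warm) request should show a noticeably shorter time-to-first-token for the shared prefix.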
## Conclusion

This tutorial demonstrated how to deploy vLLM with the v1 configuration enabled. The v1 configuration provides improved KV cache management through LMCacheConnectorV1, which can lead to better performance for certain workloads. You can adjust the configuration parameters in the values file to optimize for your specific use case.

tutorials/assets/values-06-shared-storage.yaml

Lines changed: 11 additions & 0 deletions
```diff
@@ -55,3 +55,14 @@ cacheserverSpec:
   labels:
     environment: "cacheserver"
     release: "cacheserver"
+
+routerSpec:
+  resources:
+    requests:
+      cpu: "1"
+      memory: "2G"
+    limits:
+      cpu: "1"
+      memory: "2G"
+  routingLogic: "session"
+  sessionKey: "x-user-id"
```
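The `routerSpec` above enables session-sticky routing keyed on the `x-user-id` header. A minimal Python sketch (helper name is mine; endpoint and model reuse the tutorial's port-forward setup) of how requests would carry the session key:

```python
import json
import urllib.request

def session_request(prompt: str, user_id: str,
                    url: str = "http://localhost:30080/v1/completions"):
    """Build a completion request carrying the router's session key."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 50,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # sessionKey: "x-user-id" -- same value => same backend pod
            "x-user-id": user_id,
        },
    )

# Requests sharing a user id are routed to the same serving engine,
# so that engine's KV cache can be reused across the session.
first = session_request("Hello!", "alice")
second = session_request("Hello again!", "alice")
# urllib.request.urlopen(first) would send it (requires the running stack)
```

Pinning a user's requests to one engine is what lets the per-engine KV cache pay off across a conversation.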
tutorials/assets/values-14-vllm-v1.yaml

Lines changed: 61 additions & 0 deletions (new file)

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3"
      repository: "lmcache/vllm-openai"
      tag: "2025-04-18"
      modelURL: "meta-llama/Llama-3.1-8B-Instruct"
      replicaCount: 1

      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1

      pvcStorage: "50Gi"
      pvcAccessMode:
        - ReadWriteOnce

      vllmConfig:
        enableChunkedPrefill: false
        enablePrefixCaching: false
        maxModelLen: 4096
        dtype: "bfloat16"
        v1: 1
        extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]

      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "20"

      hf_token: <your-hf-token>

cacheserverSpec:
  # -- Number of replicas
  replicaCount: 1

  # -- Container port
  containerPort: 8080

  # -- Service port
  servicePort: 81

  # -- Serializer/Deserializer type
  serde: "naive"

  # -- Cache server image (reusing the vllm image)
  repository: "lmcache/vllm-openai"
  tag: "2025-04-18"

  # TODO (Jiayi): please adjust this once we have evictor
  # -- router resource requests and limits
  resources:
    requests:
      cpu: "2"
      memory: "8G"
    limits:
      cpu: "2"
      memory: "10G"

  # -- Customized labels for the cache server deployment
  labels:
    environment: "cacheserver"
    release: "cacheserver"
```
