
Commit 6ce81a9

committed
docs: add P/D disaggregation example in manifests/disaggregation
Signed-off-by: googs1025 <[email protected]>
1 parent b3f93d6 commit 6ce81a9

File tree

3 files changed

+165
-0
lines changed


README.md

Lines changed: 3 additions & 0 deletions
@@ -362,3 +362,6 @@ curl -X POST http://localhost:8000/v1/chat/completions \
  ]
}'
```

### Prefill/Decode (P/D) Separation Example

An example configuration for P/D (Prefill/Decode) disaggregation deployment can be found in [manifests/disaggregation](manifests/disaggregation).

manifests/disaggregation/README.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.

The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) routes client requests to dedicated Prefill and Decode simulation services, enabling validation of disaggregated inference workflows.

### Quick Start

1. Deploy the Application

Apply the provided manifest (e.g., `vllm-sim-pd.yaml`) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (`vllm-sim-p` for Prefill, `vllm-sim-d` for Decode) and two ClusterIP Services, one per role.

2. Verify Pods Are Ready

Check that all pods are running:

```bash
kubectl get pods -l app=my-llm-pool
```

Expected output:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
vllm-sim-d-685b57d694-d6qxg   2/2     Running   0          12m
vllm-sim-p-7b768565d9-79j97   1/1     Running   0          12m
```

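Instead of polling `kubectl get pods`, readiness can also be awaited in a single command. A minimal sketch using the standard `kubectl wait` subcommand (the label selector comes from the manifest in this directory):

```shell
# Block until every pod in the pool reports Ready, or fail after the timeout.
kubectl wait --for=condition=Ready pod -l app=my-llm-pool --timeout=120s
```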
### Send a Disaggregated Request Using kubectl port-forward

To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port

Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 from `vllm-sim-d-service` to your local machine's port 8000.

#### Test the Disaggregated Flow

Now, send a request to the forwarded Decode service port with the necessary headers:

```bash
curl -v http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 32
  }'
```

> Critical header:
> ```
> x-prefiller-host-port: vllm-sim-p-service:8000
> ```
> This header tells the routing sidecar where to send the prefill request: the Prefill pods are exposed internally as `vllm-sim-p-service` on port 8000, so that host:port pair is specified here.
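A successful request returns an OpenAI-style chat completion. A minimal sketch for pulling the generated text out of a response body, using `python3` for portable JSON parsing (the sample payload below is illustrative, not actual simulator output; field names follow the OpenAI chat completions schema):

```shell
# Illustrative response body in the OpenAI chat completions shape.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello from the simulator"}}]}'

# Extract the assistant reply (stdlib only, no jq required).
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# → Hello from the simulator
```

In practice, pipe the `curl` output into the same one-liner instead of the `RESPONSE` variable.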
manifests/disaggregation/vllm-sim-pd.yaml

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
```yaml
---
# Prefill Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-p
  labels:
    app: my-llm-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-llm-pool
  template:
    metadata:
      labels:
        app: my-llm-pool
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8000
---
# Decode Deployment (with routing-sidecar + vLLM simulator)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-d
  labels:
    app: my-llm-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-llm-pool
  template:
    metadata:
      labels:
        app: my-llm-pool
        llm-d.ai/role: decode
    spec:
      containers:
        - name: routing-sidecar
          image: ghcr.io/llm-d/llm-d-routing-sidecar:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--vllm-port=8200"
            - "--connector=lmcache"
            - "--secure-proxy=false"
          ports:
            - containerPort: 8000
        - name: vllm-decode
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8200"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8200
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-p-service
spec:
  selector:
    app: my-llm-pool
    llm-d.ai/role: prefill
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-d-service
spec:
  selector:
    app: my-llm-pool
    llm-d.ai/role: decode
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
```
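When finished, the example resources can be torn down with the same manifest (assuming the filename used at deploy time):

```shell
# Delete the two Deployments and two Services created by the example manifest.
kubectl delete -f vllm-sim-pd.yaml
```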
