
Commit 842d281

docs: add P/D disaggregation example in manifests/disaggregation
Signed-off-by: googs1025 <[email protected]>
1 parent b3f93d6 commit 842d281

3 files changed: +157 −0 lines changed
README.md

Lines changed: 3 additions & 0 deletions
@@ -362,3 +362,6 @@ curl -X POST http://localhost:8000/v1/chat/completions \
]
}'
```

### Prefill/Decode (P/D) Separation Example

An example configuration for P/D (Prefill/Decode) disaggregation deployment can be found in [manifests/disaggregation](manifests/disaggregation).

manifests/disaggregation/README.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the llm-d inference simulator (`llm-d-inference-sim`) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.
The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) routes client requests to dedicated Prefill and Decode simulation services, enabling validation of disaggregated inference workflows.

### Quick Start

1. Deploy the Application

Apply the provided manifest (`vllm-sim-pd.yaml`) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (`vllm-sim-p` for Prefill, `vllm-sim-d` for Decode) and two ClusterIP Services that expose them inside the cluster.
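
To confirm that all four objects were created, you can list them by name (the names below come from `vllm-sim-pd.yaml`):

```bash
kubectl get deployments vllm-sim-p vllm-sim-d
kubectl get services vllm-sim-p-service vllm-sim-d-service
```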

2. Verify Pods Are Ready

Check that all pods are running:

```bash
kubectl get pods -l 'llm-d.ai/role in (prefill,decode)'
```

Expected output:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
vllm-sim-d-685b57d694-d6qxg   2/2     Running   0          12m
vllm-sim-p-7b768565d9-79j97   1/1     Running   0          12m
```
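
The Decode pod reports `2/2` because it runs two containers, the routing sidecar and the vLLM simulator, while the Prefill pod runs only the simulator. One way to confirm the container names (assuming the labels above) is:

```bash
kubectl get pod -l llm-d.ai/role=decode \
  -o jsonpath='{.items[0].spec.containers[*].name}'
```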

### Send a Disaggregated Request Using kubectl port-forward

To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port

Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 of `vllm-sim-d-service` to port 8000 on your local machine.
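
Before testing the full P/D flow, you can sanity-check the tunnel. Assuming the simulator serves the standard OpenAI-compatible model listing endpoint, this should return the configured model:

```bash
curl http://localhost:8000/v1/models
```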
43+
44+
#### Test the Disaggregated Flow
45+
46+
Now, send a request to the forwarded Decode service port with the necessary headers:
47+
48+
```bash
49+
curl -v http://localhost:8000/v1/chat/completions \
50+
-H "Content-Type: application/json" \
51+
-H "x-prefiller-host-port: vllm-sim-p-service:8000" \
52+
-d '{
53+
"model": "meta-llama/Llama-3.1-8B-Instruct",
54+
"messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
55+
"max_tokens": 32
56+
}'
57+
```

> Critical Header:
> ```
> x-prefiller-host-port: vllm-sim-p-service:8000
> ```
> This header tells the sidecar where to send the prefill request. Since the Prefill service is exposed as `vllm-sim-p-service` on port 8000, that is the value to specify here.
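
To confirm that the sidecar actually forwarded a prefill request to `vllm-sim-p-service`, you can inspect its logs after sending the request above (the exact log lines depend on the sidecar version):

```bash
kubectl logs deployment/vllm-sim-d -c routing-sidecar
```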

manifests/disaggregation/vllm-sim-pd.yaml

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
---
# Prefill Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-p
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8000
---
# Decode Deployment (with routing-sidecar + vLLM simulator)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-d
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        llm-d.ai/role: decode
    spec:
      containers:
        - name: routing-sidecar
          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.3.1-rc.1
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--vllm-port=8200"
            - "--connector=lmcache"
            - "--secure-proxy=false"
          ports:
            - containerPort: 8000
        - name: vllm-decode
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8200"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8200
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-p-service
spec:
  selector:
    llm-d.ai/role: prefill
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-d-service
spec:
  selector:
    llm-d.ai/role: decode
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP