Commit 5c58b12

docs: add P/D disaggregation example in manifests/disaggregation (#253)
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: CYJiang <[email protected]>
1 parent c5bfb72 commit 5c58b12

File tree

3 files changed, +179 −0 lines changed

README.md

Lines changed: 3 additions & 0 deletions
### Prefill/Decode (P/D) Separation Example

An example configuration for P/D (Prefill/Decode) disaggregation deployment can be found in [manifests/disaggregation](manifests/disaggregation).

## Response generation

The `/v1/completions` and `/v1/chat/completions` endpoints produce responses based on simulator configurations and the specific request parameters.

manifests/disaggregation/README.md

Lines changed: 83 additions & 0 deletions
## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.

The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) runs alongside the Decode service and acts as a reverse proxy: it receives client requests, forwards the prefill phase to a dedicated Prefill service (based on the `x-prefiller-host-port` header), and then handles the decode phase locally.

> This is a standalone simulation setup, intended for testing and validating P/D workflows without requiring the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler).
> It uses standard Kubernetes Services for internal communication between components.
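A rough sketch of the request flow (ports taken from the manifest later in this commit):

```
client ──▶ routing-sidecar (decode pod, :8000)
             ├─ 1. prefill ──▶ vllm-sim-p-service:8000  (target read from x-prefiller-host-port)
             └─ 2. decode  ──▶ vllm-decode :8200        (same pod)
```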
### Quick Start

1. Deploy the Application

Apply the provided manifest (e.g., vllm-sim-pd.yaml) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (vllm-sim-p for Prefill, vllm-sim-d for Decode) and two Services for internal and external communication.
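Optionally, you can block until both rollouts complete before moving on (a convenience step using the Deployment names defined in the manifest):

```bash
# Wait for each Deployment to finish rolling out.
kubectl rollout status deployment/vllm-sim-p
kubectl rollout status deployment/vllm-sim-d
```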
2. Verify Pods Are Ready

Check that all pods are running:

```bash
kubectl get pods -l 'llm-d.ai/role in (prefill,decode)'
```

Expected output:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
vllm-sim-d-685b57d694-d6qxg   2/2     Running   0          12m
vllm-sim-p-7b768565d9-79j97   1/1     Running   0          12m
```

> The decode pod reports `2/2` because it runs two containers (the `routing-sidecar` and the decode simulator), while the prefill pod runs only the simulator.
### Send a Disaggregated Request Using kubectl port-forward

To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port

Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 of the `vllm-sim-d-service` to port 8000 on your local machine.
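Before exercising the P/D path, you can confirm the tunnel works with a read-only request (a minimal check, assuming the simulator exposes vLLM's OpenAI-compatible `/v1/models` endpoint):

```bash
# Quick connectivity check through the port-forward;
# should return a JSON model list including meta-llama/Llama-3.1-8B-Instruct.
curl -s http://localhost:8000/v1/models
```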
#### Test the Disaggregated Flow

Now, send a request to the forwarded Decode service port with the necessary headers:

```bash
curl -v http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 32
  }'
```
> Critical Header:
>
> ```
> x-prefiller-host-port: vllm-sim-p-service:8000
> ```
>
> This header must be provided by the client in standalone mode. It tells the `routing-sidecar` where to send the prefill request. The value should be a Kubernetes Service name plus port (or any resolvable `host:port` reachable from the sidecar pod).
>
> In production deployments using `llm-d-inference-scheduler`, this header is typically injected automatically by the scheduler or gateway, but in this standalone simulator the client must set it explicitly.
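Any resolvable `host:port` works as the value; for example, the Service's fully qualified in-cluster DNS name (shown here assuming the manifest was applied in the `default` namespace):

```bash
# Same request, addressing the Prefill service by its fully qualified
# in-cluster DNS name instead of the short Service name.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service.default.svc.cluster.local:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello again!"}],
    "max_tokens": 16
  }'
```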
#### Realistic Config

This example already configures non-zero latency parameters to reflect real-world P/D disaggregation behavior:

```yaml
- "--prefill-time-per-token=200"  # ~200 ms per input token for prefill computation
- "--prefill-time-std-dev=3"      # ±3 ms jitter to simulate system noise
```

Parameter meanings:

- `prefill-time-per-token`: Average time (in milliseconds) to process each prompt token during the prefill phase. Higher values emphasize the cost of large prompts.
- `prefill-time-std-dev`: Standard deviation (in ms) of prefill latency, introducing realistic variation across requests.
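Under these settings, simulated prefill time grows linearly with prompt length: a prompt of roughly 50 tokens should spend about 50 × 200 ms ≈ 10 s in prefill. You can observe this through the forwarded port (an illustrative sketch; the token count and expected timing are assumptions, not measured values):

```bash
# Time one request end-to-end; total latency should scale roughly
# linearly with the number of prompt tokens (~200 ms each).
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 8
  }' > /dev/null
```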
manifests/disaggregation/vllm-sim-pd.yaml

Lines changed: 93 additions & 0 deletions
---
# Prefill Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-p
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
            - "--prefill-time-per-token=200"  # ~200 ms per input token
            - "--prefill-time-std-dev=3"      # ±3 ms jitter
          ports:
            - containerPort: 8000
---
# Decode Deployment (with routing-sidecar + vLLM simulator)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-d
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        llm-d.ai/role: decode
    spec:
      containers:
        - name: routing-sidecar
          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.3.1-rc.1
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"       # sidecar listens here for client traffic
            - "--vllm-port=8200"  # local decode simulator in the same pod
            - "--connector=nixlv2"
            - "--secure-proxy=false"
          ports:
            - containerPort: 8000
        - name: vllm-decode
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8200"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8200
---
# Service fronting the Prefill pods (target of x-prefiller-host-port)
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-p-service
spec:
  selector:
    llm-d.ai/role: prefill
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
# Service fronting the Decode pods (routing-sidecar entry point)
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-d-service
spec:
  selector:
    llm-d.ai/role: decode
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
