Commit 5c58b12

docs: add P/D disaggregation example in manifests/disaggregation (#253)
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: CYJiang <[email protected]>
1 parent c5bfb72 commit 5c58b12

File tree

3 files changed, +179 −0 lines changed

README.md

Lines changed: 3 additions & 0 deletions
### Prefill/Decode (P/D) Separation Example

An example configuration for P/D (Prefill/Decode) disaggregation deployment can be found in [manifests/disaggregation](manifests/disaggregation).

## Response generation

The `/v1/completions` and `/v1/chat/completions` endpoints produce responses based on simulator configurations and the specific request parameters.

manifests/disaggregation/README.md

Lines changed: 83 additions & 0 deletions
## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.

The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) runs alongside the Decode service and acts as a reverse proxy: it receives client requests, forwards the prefill phase to a dedicated Prefill service (based on the `x-prefiller-host-port` header), and then handles the decode phase locally.

> This is a standalone simulation setup, intended for testing and validating P/D workflows without requiring the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler).
> It uses standard Kubernetes Services for internal communication between components.
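A rough sketch of the request flow (ports taken from the manifest later in this commit):

```
client ──▶ routing-sidecar (decode pod, :8000)
             ├─ 1. prefill ──▶ vllm-sim-p-service:8000  (target read from x-prefiller-host-port)
             └─ 2. decode  ──▶ vllm-decode :8200        (same pod)
```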
### Quick Start

1. Deploy the Application

Apply the provided manifest (e.g., vllm-sim-pd.yaml) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (vllm-sim-p for Prefill, vllm-sim-d for Decode) and two Services for internal and external communication.
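Optionally, you can block until both rollouts complete before moving on (a convenience step using the Deployment names defined in the manifest):

```bash
# Wait for each Deployment to finish rolling out.
kubectl rollout status deployment/vllm-sim-p
kubectl rollout status deployment/vllm-sim-d
```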
2. Verify Pods Are Ready

Check that all pods are running:

```bash
kubectl get pods -l 'llm-d.ai/role in (prefill,decode)'
```

Expected output:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
vllm-sim-d-685b57d694-d6qxg   2/2     Running   0          12m
vllm-sim-p-7b768565d9-79j97   1/1     Running   0          12m
```

> The decode pod reports `2/2` because it runs two containers (the `routing-sidecar` and the decode simulator), while the prefill pod runs only the simulator.
### Send a Disaggregated Request Using kubectl port-forward

To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port

Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 of the `vllm-sim-d-service` to port 8000 on your local machine.
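Before exercising the P/D path, you can confirm the tunnel works with a read-only request (a minimal check, assuming the simulator exposes vLLM's OpenAI-compatible `/v1/models` endpoint):

```bash
# Quick connectivity check through the port-forward;
# should return a JSON model list including meta-llama/Llama-3.1-8B-Instruct.
curl -s http://localhost:8000/v1/models
```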
#### Test the Disaggregated Flow

Now, send a request to the forwarded Decode service port with the necessary headers:

```bash
curl -v http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 32
  }'
```
> Critical Header:
>
> ```
> x-prefiller-host-port: vllm-sim-p-service:8000
> ```
>
> This header must be provided by the client in standalone mode. It tells the `routing-sidecar` where to send the prefill request. The value should be a Kubernetes Service name plus port (or any resolvable `host:port` reachable from the sidecar pod).
>
> In production deployments using `llm-d-inference-scheduler`, this header is typically injected automatically by the scheduler or gateway, but in this standalone simulator the client must set it explicitly.
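Any resolvable `host:port` works as the value; for example, the Service's fully qualified in-cluster DNS name (shown here assuming the manifest was applied in the `default` namespace):

```bash
# Same request, addressing the Prefill service by its fully qualified
# in-cluster DNS name instead of the short Service name.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service.default.svc.cluster.local:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello again!"}],
    "max_tokens": 16
  }'
```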
#### Realistic Config

This example already configures non-zero latency parameters to reflect real-world P/D disaggregation behavior:

```yaml
- "--prefill-time-per-token=200"  # ~200 ms per input token for prefill computation
- "--prefill-time-std-dev=3"      # ±3 ms jitter to simulate system noise
```

Parameter meanings:

- `prefill-time-per-token`: Average time (in milliseconds) to process each prompt token during the prefill phase. Higher values emphasize the cost of large prompts.
- `prefill-time-std-dev`: Standard deviation (in ms) of prefill latency, introducing realistic variation across requests.
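Under these settings, simulated prefill time grows linearly with prompt length: a prompt of roughly 50 tokens should spend about 50 × 200 ms ≈ 10 s in prefill. You can observe this through the forwarded port (an illustrative sketch; the token count and expected timing are assumptions, not measured values):

```bash
# Time one request end-to-end; total latency should scale roughly
# linearly with the number of prompt tokens (~200 ms each).
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 8
  }' > /dev/null
```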
manifests/disaggregation/vllm-sim-pd.yaml

Lines changed: 93 additions & 0 deletions
---
# Prefill Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-p
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: prefill
  template:
    metadata:
      labels:
        llm-d.ai/role: prefill
    spec:
      containers:
        - name: vllm-prefill
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
            - "--prefill-time-per-token=200"  # ~200 ms per input token
            - "--prefill-time-std-dev=3"      # ±3 ms jitter
          ports:
            - containerPort: 8000
---
# Decode Deployment (with routing-sidecar + vLLM simulator)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim-d
spec:
  replicas: 1
  selector:
    matchLabels:
      llm-d.ai/role: decode
  template:
    metadata:
      labels:
        llm-d.ai/role: decode
    spec:
      containers:
        - name: routing-sidecar
          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.3.1-rc.1
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8000"       # sidecar listens here for client traffic
            - "--vllm-port=8200"  # local decode simulator in the same pod
            - "--connector=nixlv2"
            - "--secure-proxy=false"
          ports:
            - containerPort: 8000
        - name: vllm-decode
          image: ghcr.io/llm-d/llm-d-inference-sim:latest
          imagePullPolicy: IfNotPresent
          args:
            - "--v=4"
            - "--port=8200"
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--data-parallel-size=1"
          ports:
            - containerPort: 8200
---
# Service fronting the Prefill pods (target of x-prefiller-host-port)
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-p-service
spec:
  selector:
    llm-d.ai/role: prefill
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
# Service fronting the Decode pods (routing-sidecar entry point)
apiVersion: v1
kind: Service
metadata:
  name: vllm-sim-d-service
spec:
  selector:
    llm-d.ai/role: decode
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
