3 changes: 3 additions & 0 deletions README.md
@@ -362,3 +362,6 @@ curl -X POST http://localhost:8000/v1/chat/completions \
]
}'
```

### Prefill/Decode (P/D) Separation Example
An example configuration for a disaggregated Prefill/Decode (P/D) deployment can be found in [manifests/disaggregation](manifests/disaggregation).
63 changes: 63 additions & 0 deletions manifests/disaggregation/README.md
@@ -0,0 +1,63 @@
## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.
The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) intelligently routes client requests to dedicated Prefill and Decode simulation services, enabling validation of disaggregated inference workflows.

### Quick Start

1. Deploy the Application
Apply the provided manifest (e.g., `vllm-sim-pd.yaml`) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (`vllm-sim-p` for Prefill, `vllm-sim-d` for Decode) and two Services for internal and external communication.

2. Verify Pods Are Ready
Check that all pods are running:

```bash
kubectl get pods -l 'llm-d.ai/role in (prefill,decode)'
```

Expected output:

```bash
NAME READY STATUS RESTARTS AGE
vllm-sim-d-685b57d694-d6qxg 2/2 Running 0 12m
vllm-sim-p-7b768565d9-79j97 1/1 Running 0 12m
```
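Instead of polling `kubectl get pods`, you can block until the simulators are ready. A convenience sketch using `kubectl wait` (the label selector matches the manifests in this directory):

```bash
# Block until every prefill/decode pod reports Ready, or time out after 2 minutes.
kubectl wait pod \
  -l 'llm-d.ai/role in (prefill,decode)' \
  --for=condition=Ready \
  --timeout=120s
```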

### Send a Disaggregated Request Using kubectl port-forward
To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port
Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 from the `vllm-sim-d-service` to your local machine's port 8000.

#### Test the Disaggregated Flow

Now, send a request to the forwarded Decode service port with the necessary headers:

```bash
curl -v http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-prefiller-host-port: vllm-sim-p-service:8000" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
"max_tokens": 32
}'
```

> Critical Header:
>```
>x-prefiller-host-port: vllm-sim-p-service:8000
>```
>This header tells the routing sidecar where to forward the prefill portion of the request; here it points at the Prefill Service, `vllm-sim-p-service`, on port 8000.
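The same disaggregated request can also be issued from code. A minimal Python sketch using only the standard library (it assumes the port-forward above is active; the URL, model name, and header mirror the curl example):

```python
import json
import urllib.request

# Build the same chat-completions request the curl example sends.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Tells the routing sidecar where to send the prefill request.
        "x-prefiller-host-port": "vllm-sim-p-service:8000",
    },
    method="POST",
)

# Uncomment once the port-forward is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```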
91 changes: 91 additions & 0 deletions manifests/disaggregation/vllm-sim-pd.yaml
@@ -0,0 +1,91 @@
---
# Prefill Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-sim-p
spec:
replicas: 1
selector:
matchLabels:
llm-d.ai/role: prefill
template:
metadata:
labels:
llm-d.ai/role: prefill
spec:
containers:
- name: vllm-prefill
image: ghcr.io/llm-d/llm-d-inference-sim:latest
imagePullPolicy: IfNotPresent
args:
- "--v=4"
- "--port=8000"
- "--model=meta-llama/Llama-3.1-8B-Instruct"
- "--data-parallel-size=1"
ports:
- containerPort: 8000
---
# Decode Deployment (with routing-sidecar + vLLM simulator)
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-sim-d
spec:
replicas: 1
selector:
matchLabels:
llm-d.ai/role: decode
template:
metadata:
labels:
llm-d.ai/role: decode
spec:
containers:
- name: routing-sidecar
image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.3.1-rc.1
imagePullPolicy: IfNotPresent
args:
- "--v=4"
- "--port=8000"
- "--vllm-port=8200"
- "--connector=nixlv2"
- "--secure-proxy=false"
ports:
- containerPort: 8000
- name: vllm-decode
image: ghcr.io/llm-d/llm-d-inference-sim:latest
imagePullPolicy: IfNotPresent
args:
- "--v=4"
- "--port=8200"
- "--model=meta-llama/Llama-3.1-8B-Instruct"
- "--data-parallel-size=1"
ports:
- containerPort: 8200
---
apiVersion: v1
kind: Service
metadata:
name: vllm-sim-p-service
spec:
selector:
llm-d.ai/role: prefill
ports:
- protocol: TCP
port: 8000
targetPort: 8000
type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
name: vllm-sim-d-service
spec:
selector:
llm-d.ai/role: decode
ports:
- protocol: TCP
port: 8000
targetPort: 8000
type: ClusterIP
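Before applying, the manifest can be checked without touching the cluster. A quick sketch; note that `--dry-run=client` only verifies that the YAML parses into valid Kubernetes objects, not that the images or labels are correct:

```bash
# Parse and validate the manifest locally without creating any resources.
kubectl apply --dry-run=client -f vllm-sim-pd.yaml
```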