## Prefill/Decode Disaggregation Deployment Guide

This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture.

The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) runs alongside the Decode service and acts as a reverse proxy: it receives client requests, forwards the prefill phase to a dedicated Prefill service (based on the `x-prefiller-host-port` header), and then handles the decode phase locally.

> This is a standalone simulation setup, intended for testing and validating P/D workflows without requiring the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler).
> It uses standard Kubernetes Services for internal communication between components.

### Quick Start

1. Deploy the Application
   Apply the provided manifest (e.g., `vllm-sim-pd.yaml`) to your Kubernetes cluster:

```bash
kubectl apply -f vllm-sim-pd.yaml
```

> This manifest defines two Deployments (`vllm-sim-p` for Prefill, `vllm-sim-d` for Decode) and two Services for internal and external communication.

2. Verify Pods Are Ready
   Check that all pods are running:

```bash
kubectl get pods -l 'llm-d.ai/role in (prefill,decode)'
```

Expected output:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
vllm-sim-d-685b57d694-d6qxg   2/2     Running   0          12m
vllm-sim-p-7b768565d9-79j97   1/1     Running   0          12m
```
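
Rather than re-running `kubectl get pods` by hand, a script can block until both pods report Ready. The following is a minimal sketch using `kubectl wait` with the same label selector as above; the helper name `wait_for_pd_pods` and the 120-second default timeout are illustrative assumptions, not part of the provided manifest.

```bash
#!/usr/bin/env bash
# Hypothetical helper: block until the Prefill and Decode pods are Ready.
# The label selector is the one used in the `kubectl get pods` step of this guide.
wait_for_pd_pods() {
  local timeout="${1:-120s}"   # assumed default; adjust for your cluster
  kubectl wait --for=condition=Ready pod \
    -l 'llm-d.ai/role in (prefill,decode)' \
    --timeout="${timeout}"
}

# Example: wait_for_pd_pods 180s
```

`kubectl wait` exits with a non-zero status if the pods are not Ready within the timeout, so the helper can gate later steps in a test script.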

### Send a Disaggregated Request Using kubectl port-forward
To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to your localhost.

#### Forward the Decode Service Port
Open a terminal and run:

```bash
kubectl port-forward svc/vllm-sim-d-service 8000:8000
```

This command forwards port 8000 of `vllm-sim-d-service` to port 8000 on your local machine. Leave it running and use a second terminal for the request in the next step.

#### Test the Disaggregated Flow

Now, send a request to the forwarded Decode service port with the necessary headers:

```bash
curl -v http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-prefiller-host-port: vllm-sim-p-service:8000" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
    "max_tokens": 32
  }'
```

> Critical Header:
> ```
> x-prefiller-host-port: vllm-sim-p-service:8000
> ```
> This header must be provided by the client in standalone mode. It tells the `routing-sidecar` where to send the prefill request. The value should be a Kubernetes Service name and port (or any resolvable `host:port` reachable from the sidecar pod).
>
> In production deployments using `llm-d-inference-scheduler`, this header is typically injected automatically by the scheduler or gateway, but in this standalone simulator the client must set it explicitly.
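
A short Service name such as `vllm-sim-p-service` resolves only when the sidecar pod and the Prefill Service share a namespace. If they do not, the fully qualified Service DNS name can be used as the header value instead. The sketch below builds that value; the helper name `prefiller_target`, the `default` namespace, and the `cluster.local` DNS domain are assumptions to adjust for your cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a fully qualified host:port for the Prefill Service.
# Assumes the default cluster DNS domain `cluster.local`.
prefiller_target() {
  local service="$1" namespace="$2" port="$3"
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example value for a Service in the `default` namespace (assumed):
prefiller_target vllm-sim-p-service default 8000
# prints vllm-sim-p-service.default.svc.cluster.local:8000
```

The printed value can be passed to `curl` as `-H "x-prefiller-host-port: $(prefiller_target vllm-sim-p-service default 8000)"`.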

#### Realistic Config

The example manifest already configures non-zero latency parameters to reflect real-world P/D disaggregation behavior:

```yaml
- "--prefill-time-per-token=200"  # ~200 ms per input token for prefill computation
- "--prefill-time-std-dev=3"      # 3 ms standard deviation to simulate system noise
```

Parameter meanings:
- `prefill-time-per-token`: Average time (in milliseconds) to process each prompt token during the prefill phase. Higher values emphasize the cost of large prompts.
- `prefill-time-std-dev`: Standard deviation (in ms) of prefill latency, introducing realistic variation across requests.
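
To get a feel for what these values imply, the mean prefill latency can be estimated as prompt tokens multiplied by the per-token time. The sketch below assumes the simulator scales linearly with prompt length, as the parameter description suggests, and ignores jitter; the helper name `expected_prefill_ms` is hypothetical.

```bash
#!/usr/bin/env bash
# Rough estimate: mean prefill latency = prompt tokens x per-token time.
# Assumes linear scaling with prompt length; the std-dev jitter is ignored.
expected_prefill_ms() {
  local prompt_tokens="$1" per_token_ms="$2"
  echo $(( prompt_tokens * per_token_ms ))
}

expected_prefill_ms 50 200    # prints 10000 (about 10 seconds)
expected_prefill_ms 500 200   # prints 100000 (about 100 seconds)
```

One way to compare the estimate with observed behavior is to time a request with `curl -s -o /dev/null -w '%{time_total}\n'`, keeping in mind that the measured total also includes decode time and network overhead.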