| 1 | +## Prefill/Decode Disaggregation Deployment Guide |
| 2 | + |
| 3 | +This guide demonstrates how to deploy the LLM Disaggregation Simulator (llm-d-sim) in a Kubernetes cluster using a separated Prefill and Decode (P/D) architecture. |
| 4 | +The [`routing-sidecar`](https://github.com/llm-d/llm-d-routing-sidecar) intelligently routes client requests to dedicated Prefill and Decode simulation services, enabling validation of disaggregated inference workflows. |
| 5 | + |
| 6 | +### Quick Start |
| 7 | + |
| 8 | +1. Deploy the Application |
| 9 | + Apply the provided manifest (e.g., vllm-sim-pd.yaml) to your Kubernetes cluster: |
| 10 | + |
| 11 | +```bash |
| 12 | +kubectl apply -f vllm-sim-pd.yaml |
| 13 | +``` |
| 14 | + |
| 15 | +> This manifest defines two Deployments (vllm-sim-p for Prefill, vllm-sim-d for Decode) and two Services for internal and external communication. |
| 16 | + |
| 17 | +2. Verify Pods Are Ready |
| 18 | + Check that all pods are running: |
| 19 | + |
| 20 | +```bash |
| 21 | +kubectl get pods -l 'llm-d.ai/role in (prefill,decode)' |
| 22 | +``` |
| 23 | + |
| 24 | +Expected output: |
| 25 | + |
| 26 | +```text |
| 27 | +NAME READY STATUS RESTARTS AGE |
| 28 | +vllm-sim-d-685b57d694-d6qxg 2/2 Running 0 12m |
| 29 | +vllm-sim-p-7b768565d9-79j97 1/1 Running 0 12m |
| 30 | +``` |
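
For scripted setups, polling `kubectl get pods` by hand can be replaced with a blocking check. The sketch below is an illustration rather than part of the guide's manifest: `wait_for_pd_pods` is a hypothetical helper name, the 120s default timeout is an arbitrary choice, and it assumes `kubectl wait` accepts the same set-based label selector used with `kubectl get pods` above.

```shell
# Hypothetical helper: block until every Prefill and Decode pod reports Ready.
# $1 is an optional timeout in kubectl duration syntax (default: 120s, an assumption).
wait_for_pd_pods() {
  pd_timeout="${1:-120s}"
  kubectl wait --for=condition=Ready pod \
    -l 'llm-d.ai/role in (prefill,decode)' \
    --timeout="${pd_timeout}"
}
```

Calling `wait_for_pd_pods 60s` returns a non-zero exit status if any matching pod is still not Ready after 60 seconds, which makes it usable as a gate before sending requests.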
| 31 | + |
| 32 | +### Send a Disaggregated Request Using kubectl port-forward |
| 33 | +To access the Decode service from your local machine, use `kubectl port-forward` to forward its port to localhost. |
| 34 | + |
| 35 | +#### Forward the Decode Service Port |
| 36 | +Open a terminal and run: |
| 37 | + |
| 38 | +```bash |
| 39 | +kubectl port-forward svc/vllm-sim-d-service 8000:8000 |
| 40 | +``` |
| 41 | + |
| 42 | +This command forwards port 8000 on your local machine to port 8000 of `vllm-sim-d-service`. It runs in the foreground, so keep this terminal open and send requests from a second one. |
| 43 | + |
| 44 | +#### Test the Disaggregated Flow |
| 45 | + |
| 46 | +Now, send a request to the forwarded Decode service port with the necessary headers: |
| 47 | + |
| 48 | +```bash |
| 49 | +curl -v http://localhost:8000/v1/chat/completions \ |
| 50 | + -H "Content-Type: application/json" \ |
| 51 | + -H "x-prefiller-host-port: vllm-sim-p-service:8000" \ |
| 52 | + -d '{ |
| 53 | + "model": "meta-llama/Llama-3.1-8B-Instruct", |
| 54 | + "messages": [{"role": "user", "content": "Hello from P/D architecture!"}], |
| 55 | + "max_tokens": 32 |
| 56 | + }' |
| 57 | +``` |
| 58 | + |
| 59 | +> Critical Header: |
| 60 | +>``` |
| 61 | +>x-prefiller-host-port: vllm-sim-p-service:8000 |
| 62 | +>``` |
| 63 | +>This header tells the sidecar where to send the prefill request. Because the Prefill service is reachable inside the cluster at `vllm-sim-p-service:8000`, that is the value used here. |
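
To make the role of the header easier to experiment with, the request can be wrapped in a small function that takes the Prefill endpoint as a parameter. This is only a sketch: `send_pd_request` is a hypothetical name, and the URL, model, and payload are copied from the `curl` example above, with the Decode port assumed to be forwarded to `localhost:8000`.

```shell
# Hypothetical helper: send one chat completion through the forwarded Decode port.
# $1 is the host:port placed in x-prefiller-host-port
# (default: the Prefill service name used in this guide).
send_pd_request() {
  pd_prefiller="${1:-vllm-sim-p-service:8000}"
  curl -sS http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "x-prefiller-host-port: ${pd_prefiller}" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [{"role": "user", "content": "Hello from P/D architecture!"}],
      "max_tokens": 32
    }'
}
```

Running `send_pd_request` with no argument reproduces the request above; passing a different `host:port` changes only the header value, which is a quick way to check how the sidecar reacts to another Prefill target.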