Conversation

kobe0938
Contributor


PD (disaggregated prefill) support from LMCache with NIXL: https://docs.lmcache.ai/disaggregated_prefill/nixl/1p1d.html

FIX (P0) XpYd support in #640


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

- Modified router service templates and configs
- Updated vLLM router application logic
- Enhanced request parsing and service discovery
- Added custom Dockerfile for production deployment
- Updated disaggregated prefill configuration

Signed-off-by: Kobe Chen <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @kobe0938, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the latest Prefill Disaggregation (PD) features using LMCache with NIXL, addressing the XpYd support. The changes span across the router's Docker image, Helm chart configurations, and core request processing logic. A new Dockerfile is added for building the router with necessary dependencies. Helm charts are updated to expose new ports and allow detailed configuration of NIXL peer and proxy settings. The router's request handling is significantly refactored to orchestrate a multi-stage prefill and decode process, leveraging ZMQ for efficient NIXL communication. These updates enable a more robust and configurable disaggregated prefill architecture.

Highlights

  • New Dockerfile for Prefill Disaggregation (PD): A new Dockerfile (docker/Dockerfile.pd) has been introduced to build a vllm-router image specifically configured with lmcache and semantic_cache dependencies, streamlining the deployment process for PD-enabled environments.
  • Enhanced NIXL Parameter Integration in Helm Charts: The Helm charts (deployment-router.yaml, deployment-vllm-multi.yaml) have been updated to support extensive configuration of NIXL peer and proxy hosts/ports, allowing for flexible deployment of disaggregated prefill components within Kubernetes.
  • New Port Exposure for PD Communication: Several new ports (7100-7500) are now exposed in both the router and vLLM service deployments, enabling the necessary communication channels for the NIXL protocol to function correctly.
  • ZMQ Integration for NIXL in Router: The vllm_router now integrates a ZMQ pull server, managed via FastAPI's lifespan events, to efficiently receive NIXL messages and track the completion of prefill requests, which is crucial for the disaggregated prefill workflow.
  • Enhanced Disaggregated Prefill Request Flow: The route_disaggregated_prefill_request function has been significantly refactored to implement a multi-step process for handling disaggregated prefill requests. This includes initial tokenization, sending a prefill request with NIXL-specific KV transfer parameters, and then preparing and streaming the decode response after waiting for KV readiness.
  • Updated Prefill Configuration Example: The values-16-disagg-prefill.yaml tutorial asset has been updated to reflect the latest NIXL and XpYd configurations, including new image tags and specific port assignments, providing a current and accurate example for deploying disaggregated prefill setups.
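
The multi-step flow described in the highlights (prefill, wait for KV readiness, then decode) can be sketched with stdlib asyncio. This is an illustrative stand-in, not the actual vllm_router implementation: the function names, signatures, and the stubbed network calls below are hypothetical.

```python
import asyncio

# Hypothetical sketch of the three-step disaggregated prefill flow.
finished_reqs: set = set()  # in the real router, the ZMQ pull server fills this

async def send_prefill(req_id: str, prompt: str) -> None:
    # Stand-in for POSTing the prefill request with NIXL kv_transfer_params.
    await asyncio.sleep(0)
    finished_reqs.add(req_id)  # the prefill node's KV-ready signal, simulated

async def wait_decode_kv_ready(req_id: str) -> None:
    # Poll until the prefill node signals that the KV cache is transferable.
    while req_id not in finished_reqs:
        await asyncio.sleep(0.0001)
    finished_reqs.remove(req_id)

async def send_decode(req_id: str, prompt: str) -> str:
    # Stand-in for streaming the decode response back to the client.
    await asyncio.sleep(0)
    return f"completion:{req_id}"

async def route_disaggregated(req_id: str, prompt: str) -> str:
    await send_prefill(req_id, prompt)        # step 1: prefill
    await wait_decode_kv_ready(req_id)        # step 2: wait for KV readiness
    return await send_decode(req_id, prompt)  # step 3: decode and respond

print(asyncio.run(route_disaggregated("req-1", "hello")))  # → completion:req-1
```

In the real flow the KV-ready signal arrives over ZMQ from the prefill node rather than being set locally, and the decode step streams tokens instead of returning a string.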
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for disaggregated prefill with LMCache and NIXL, a significant feature involving changes to Docker configurations, Helm charts, and core request routing logic. The implementation adds a multi-step request process (tokenize, prefill, decode) coordinated via ZMQ. My review focuses on improving the robustness, performance, and maintainability of these changes. Key suggestions include fixing a potential crash in the new ZMQ server, optimizing request handling to avoid redundant operations, enhancing the efficiency of the prefill/decode synchronization, and ensuring configurations are clean and maintainable.

Comment on lines 66 to 80
try:
    socket = zmq_ctx.socket(zmq.PULL)
    try:
        from vllm_router.app import app

        proxy_host = app.state.args.nixl_proxy_host
        proxy_port = app.state.args.nixl_proxy_port
    except Exception as e:
        logger.error(f"Failed to get proxy host and port from app state: {e}")
    proxy_url = f"{proxy_host}:{proxy_port}"
    socket.bind(f"tcp://{proxy_url}")
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to bind ZMQ socket to {proxy_url}: {e}")
    socket.close()
    return


critical

There's a potential NameError in zmq_pull_server. If retrieving proxy_host or proxy_port from app.state.args fails, proxy_url will not be defined, but it's used in the except block on line 78 for logging, which will cause a crash. The logic should be restructured to ensure proxy_url is defined only after its components are successfully retrieved and to handle initialization failures gracefully.

Suggested change:

socket = zmq_ctx.socket(zmq.PULL)
try:
    from vllm_router.app import app

    proxy_host = app.state.args.nixl_proxy_host
    proxy_port = app.state.args.nixl_proxy_port
    proxy_url = f"tcp://{proxy_host}:{proxy_port}"
    socket.bind(proxy_url)
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to start ZMQ pull server: {e}")
    socket.close()
    return

Comment on lines 237 to +238
request_body = await request.body()
request_json = json.loads(request_body)
request_json = await request.json() # TODO (ApostaC): merge two awaits into one


high

The request body is being awaited twice: first with request.body() and then internally by request.json(). This is inefficient as it reads the entire request stream twice. You can remove the await request.body() call since the request_body variable is not used elsewhere.

Suggested change:

request_json = await request.json()  # TODO (ApostaC): merge two awaits into one

Comment on lines +431 to +435
async def wait_decode_kv_ready(req_id: str):
    while req_id not in finished_reqs:
        await asyncio.sleep(0.0001)  # sleep for 0.1 ms
    logger.debug(f"Prefill node signaled kv ready for req {req_id}")
    finished_reqs.remove(req_id)


high

The wait_decode_kv_ready function uses a busy-wait loop with a very short sleep time (0.1ms). This can lead to high CPU consumption. A more efficient approach is to use an asyncio.Event for each request. The waiting coroutine can await the event, and the ZMQ server can set the event upon message arrival, avoiding polling entirely.

Example of a more efficient implementation:

# In a shared location, e.g., at the top of the file
request_events: Dict[str, asyncio.Event] = {}

# In zmq_pull_server, when a message is received
req_id = msg.req_id
if req_id in request_events:
    request_events[req_id].set()

# In route_disaggregated_prefill_request, before waiting
request_events[request_id] = asyncio.Event()

# New wait_decode_kv_ready implementation
async def wait_decode_kv_ready(req_id: str):
    if req_id in request_events:
        await request_events[req_id].wait()
        logger.debug(f"Prefill node signaled kv ready for req {req_id}")
        del request_events[req_id] # Clean up

Comment on lines +85 to +86
repository: "xiaokunchen/vllm-router"
tag: "08-27-v8"


high

The routerSpec.repository is set to a personal Docker Hub account (xiaokunchen/vllm-router). For official examples and tutorials, this should point to an official project repository to ensure users are pulling a trusted and maintained image. Please update this to the official image repository before merging.

RUN . /opt/venv/bin/activate && \
uv pip install --upgrade --no-cache-dir pip setuptools_scm && \
uv pip install --no-cache-dir . && \
uv pip install zmq msgspec


medium

For better dependency management and clarity, zmq and msgspec should be declared as optional dependencies in pyproject.toml instead of being installed via a separate pip install command. You could add a new group, for example [project.optional-dependencies.pd], and then install it with uv pip install --no-cache-dir .[pd]. This keeps all project dependencies declared in one place.

Comment on lines +158 to +167
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500


medium

The pd-port-* container ports are hardcoded. This could be made more flexible and maintainable by using a range loop in the Helm template. This would allow configuring the number of ports through values.yaml instead of modifying the template for future changes.

For example, you could introduce routerSpec.pdPortCount and routerSpec.pdStartPort in values.yaml and loop like this:

{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
          - name: pd-port-{{ add1 $i }}
            containerPort: {{ add $pdStartPort $i }}
{{- end }}

Comment on lines +384 to +393
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500


medium

Similar to deployment-router.yaml, the pd-port-* container ports are hardcoded. Using a range loop with values from values.yaml would make this more flexible and easier to maintain.

For example, you could introduce servingEngineSpec.pdPortCount and servingEngineSpec.pdStartPort in values.yaml and loop like this:

{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
            - name: pd-port-{{ add1 $i }}
              containerPort: {{ add $pdStartPort $i }}
{{- end }}

Comment on lines +23 to +42
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP


medium

The pd-port-* definitions are hardcoded. This can be made more maintainable by using a range loop. This would allow configuring the number of ports through values.yaml instead of modifying the template.

For example:

{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
    - name: pd-port-{{ add1 $i }}
      port: {{ add $pdStartPort $i }}
      targetPort: pd-port-{{ add1 $i }}
      protocol: TCP
{{- end }}

Comment on lines +26 to +45
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP


medium

The pd-port-* definitions are hardcoded. This can be made more maintainable by using a range loop. This would allow configuring the number of ports through values.yaml instead of modifying the template.

For example:

{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
    - name: pd-port-{{ add1 $i }}
      port: {{ add $pdStartPort $i }}
      targetPort: pd-port-{{ add1 $i }}
      protocol: TCP
{{- end }}

Comment on lines +450 to +474
# # Check if client sessions are initialized, if not, try to initialize them
# if not hasattr(request.app.state, 'prefill_client') or request.app.state.prefill_client is None:
#     logger.warning("prefill_client not initialized, attempting to initialize client sessions")
#     try:
#         from vllm_router.service_discovery import get_service_discovery
#         service_discovery = get_service_discovery()
#         if hasattr(service_discovery, '_reinitialize_client_sessions'):
#             logger.info("In route_disaggregated_prefill_request: Calling _reinitialize_client_sessions")
#             await service_discovery._reinitialize_client_sessions()
#             logger.info("Successfully initialized client sessions")
#         else:
#             logger.error("Service discovery does not have _reinitialize_client_sessions method")
#     except Exception as e:
#         logger.error(f"Failed to initialize client sessions: {e}")
#         return JSONResponse(
#             status_code=500,
#             content={
#                 "error": {
#                     "message": "Failed to initialize client sessions",
#                     "type": "initialization_error",
#                     "code": 500,
#                 }
#             },
#             headers={"X-Request-Id": request_id},
#         )


medium

This large block of commented-out code should be removed to improve code clarity. If it's intended for debugging, it should not be part of the final pull request.

@Vivo50E

Vivo50E commented Aug 28, 2025

Hi Kobe, should I consider the load balancing for PD disagg #640 as already handled by introducing the NIXL proxy layer, as done in this PR?

@kobe0938
Contributor Author

kobe0938 commented Sep 4, 2025

Holding off a bit until the /chat/completion fix is merged in LMCache:

LMCache/LMCache#1511
