Conversation

kobe0938
Contributor


PD (disaggregated prefill) support from LMCache with NIXL: https://docs.lmcache.ai/disaggregated_prefill/nixl/1p1d.html

FIX (P0) XpYd support in #640


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

- Modified router service templates and configs
- Updated vLLM router application logic
- Enhanced request parsing and service discovery
- Added custom Dockerfile for production deployment
- Updated disaggregated prefill configuration

Signed-off-by: Kobe Chen <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @kobe0938, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the latest Prefill Disaggregation (PD) features using LMCache with NIXL, addressing the XpYd support. The changes span across the router's Docker image, Helm chart configurations, and core request processing logic. A new Dockerfile is added for building the router with necessary dependencies. Helm charts are updated to expose new ports and allow detailed configuration of NIXL peer and proxy settings. The router's request handling is significantly refactored to orchestrate a multi-stage prefill and decode process, leveraging ZMQ for efficient NIXL communication. These updates enable a more robust and configurable disaggregated prefill architecture.

Highlights

  • New Dockerfile for Prefill Disaggregation (PD): A new Dockerfile (docker/Dockerfile.pd) has been introduced to build a vllm-router image specifically configured with lmcache and semantic_cache dependencies, streamlining the deployment process for PD-enabled environments.
  • Enhanced NIXL Parameter Integration in Helm Charts: The Helm charts (deployment-router.yaml, deployment-vllm-multi.yaml) have been updated to support extensive configuration of NIXL peer and proxy hosts/ports, allowing for flexible deployment of disaggregated prefill components within Kubernetes.
  • New Port Exposure for PD Communication: Several new ports (7100-7500) are now exposed in both the router and vLLM service deployments, enabling the necessary communication channels for the NIXL protocol to function correctly.
  • ZMQ Integration for NIXL in Router: The vllm_router now integrates a ZMQ pull server, managed via FastAPI's lifespan events, to efficiently receive NIXL messages and track the completion of prefill requests, which is crucial for the disaggregated prefill workflow.
  • Enhanced Disaggregated Prefill Request Flow: The route_disaggregated_prefill_request function has been significantly refactored to implement a multi-step process for handling disaggregated prefill requests. This includes initial tokenization, sending a prefill request with NIXL-specific KV transfer parameters, and then preparing and streaming the decode response after waiting for KV readiness.
  • Updated Prefill Configuration Example: The values-16-disagg-prefill.yaml tutorial asset has been updated to reflect the latest NIXL and XpYd configurations, including new image tags and specific port assignments, providing a current and accurate example for deploying disaggregated prefill setups.
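
The multi-step flow described in the highlights (prefill, wait for KV readiness, then decode) can be sketched with stdlib asyncio. This is an illustrative stand-in, not the actual vllm_router implementation: the function names, signatures, and the stubbed network calls below are hypothetical.

```python
import asyncio

# Hypothetical sketch of the three-step disaggregated prefill flow.
finished_reqs: set = set()  # in the real router, the ZMQ pull server fills this

async def send_prefill(req_id: str, prompt: str) -> None:
    # Stand-in for POSTing the prefill request with NIXL kv_transfer_params.
    await asyncio.sleep(0)
    finished_reqs.add(req_id)  # the prefill node's KV-ready signal, simulated

async def wait_decode_kv_ready(req_id: str) -> None:
    # Poll until the prefill node signals that the KV cache is transferable.
    while req_id not in finished_reqs:
        await asyncio.sleep(0.0001)
    finished_reqs.remove(req_id)

async def send_decode(req_id: str, prompt: str) -> str:
    # Stand-in for streaming the decode response back to the client.
    await asyncio.sleep(0)
    return f"completion:{req_id}"

async def route_disaggregated(req_id: str, prompt: str) -> str:
    await send_prefill(req_id, prompt)        # step 1: prefill
    await wait_decode_kv_ready(req_id)        # step 2: wait for KV readiness
    return await send_decode(req_id, prompt)  # step 3: decode and respond

print(asyncio.run(route_disaggregated("req-1", "hello")))  # → completion:req-1
```

In the real flow the KV-ready signal arrives over ZMQ from the prefill node rather than being set locally, and the decode step streams tokens instead of returning a string.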
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for disaggregated prefill with LMCache and NIXL, a significant feature involving changes to Docker configurations, Helm charts, and core request routing logic. The implementation adds a multi-step request process (tokenize, prefill, decode) coordinated via ZMQ. My review focuses on improving the robustness, performance, and maintainability of these changes. Key suggestions include fixing a potential crash in the new ZMQ server, optimizing request handling to avoid redundant operations, enhancing the efficiency of the prefill/decode synchronization, and ensuring configurations are clean and maintainable.

Comment on lines 66 to 80
try:
    socket = zmq_ctx.socket(zmq.PULL)
    try:
        from vllm_router.app import app

        proxy_host = app.state.args.nixl_proxy_host
        proxy_port = app.state.args.nixl_proxy_port
    except Exception as e:
        logger.error(f"Failed to get proxy host and port from app state: {e}")
    proxy_url = f"{proxy_host}:{proxy_port}"
    socket.bind(f"tcp://{proxy_url}")
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to bind ZMQ socket to {proxy_url}: {e}")
    socket.close()
    return


critical

There's a potential NameError in zmq_pull_server. If retrieving proxy_host or proxy_port from app.state.args fails, proxy_url will not be defined, but it's used in the except block on line 78 for logging, which will cause a crash. The logic should be restructured to ensure proxy_url is defined only after its components are successfully retrieved and to handle initialization failures gracefully.

Suggested change:

socket = zmq_ctx.socket(zmq.PULL)
try:
    from vllm_router.app import app

    proxy_host = app.state.args.nixl_proxy_host
    proxy_port = app.state.args.nixl_proxy_port
    proxy_url = f"tcp://{proxy_host}:{proxy_port}"
    socket.bind(proxy_url)
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to start ZMQ pull server: {e}")
    socket.close()
    return

Comment on lines 237 to +238
request_body = await request.body()
request_json = json.loads(request_body)
request_json = await request.json() # TODO (ApostaC): merge two awaits into one


high

The request body is being awaited twice: first with request.body() and then internally by request.json(). This is inefficient as it reads the entire request stream twice. You can remove the await request.body() call since the request_body variable is not used elsewhere.

Suggested change:

request_json = await request.json()  # TODO (ApostaC): merge two awaits into one

Comment on lines +431 to +435
async def wait_decode_kv_ready(req_id: str):
    while req_id not in finished_reqs:
        await asyncio.sleep(0.0001)  # sleep for 0.1 ms
    logger.debug(f"Prefill node signaled kv ready for req {req_id}")
    finished_reqs.remove(req_id)


high

The wait_decode_kv_ready function uses a busy-wait loop with a very short sleep time (0.1ms). This can lead to high CPU consumption. A more efficient approach is to use an asyncio.Event for each request. The waiting coroutine can await the event, and the ZMQ server can set the event upon message arrival, avoiding polling entirely.

Example of a more efficient implementation:

# In a shared location, e.g., at the top of the file
request_events: Dict[str, asyncio.Event] = {}

# In zmq_pull_server, when a message is received
req_id = msg.req_id
if req_id in request_events:
    request_events[req_id].set()

# In route_disaggregated_prefill_request, before waiting
request_events[request_id] = asyncio.Event()

# New wait_decode_kv_ready implementation
async def wait_decode_kv_ready(req_id: str):
    if req_id in request_events:
        await request_events[req_id].wait()
        logger.debug(f"Prefill node signaled kv ready for req {req_id}")
        del request_events[req_id] # Clean up

Comment on lines +85 to +86
repository: "xiaokunchen/vllm-router"
tag: "08-27-v8"


high

The routerSpec.repository is set to a personal Docker Hub account (xiaokunchen/vllm-router). For official examples and tutorials, this should point to an official project repository to ensure users are pulling a trusted and maintained image. Please update this to the official image repository before merging.

RUN . /opt/venv/bin/activate && \
uv pip install --upgrade --no-cache-dir pip setuptools_scm && \
uv pip install --no-cache-dir . && \
uv pip install zmq msgspec


medium

For better dependency management and clarity, zmq and msgspec should be declared as optional dependencies in pyproject.toml instead of being installed via a separate pip install command. You could add a new group, for example [project.optional-dependencies.pd], and then install it with uv pip install --no-cache-dir .[pd]. This keeps all project dependencies declared in one place.

Comment on lines +158 to +167
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500


medium

The pd-port-* container ports are hardcoded. This could be made more flexible and maintainable by using a range loop in the Helm template. This would allow configuring the number of ports through values.yaml instead of modifying the template for future changes.

For example, you could introduce routerSpec.pdPortCount and routerSpec.pdStartPort in values.yaml and loop like this:

{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
          - name: pd-port-{{ add1 $i }}
            containerPort: {{ add $pdStartPort $i }}
{{- end }}

Comment on lines +384 to +393
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500


medium

Similar to deployment-router.yaml, the pd-port-* container ports are hardcoded. Using a range loop with values from values.yaml would make this more flexible and easier to maintain.

For example, you could introduce servingEngineSpec.pdPortCount and servingEngineSpec.pdStartPort in values.yaml and loop like this:

{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
            - name: pd-port-{{ add1 $i }}
              containerPort: {{ add $pdStartPort $i }}
{{- end }}

Comment on lines +23 to +42
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP


medium

The pd-port-* definitions are hardcoded. This can be made more maintainable by using a range loop. This would allow configuring the number of ports through values.yaml instead of modifying the template.

For example:

{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
    - name: pd-port-{{ add1 $i }}
      port: {{ add $pdStartPort $i }}
      targetPort: pd-port-{{ add1 $i }}
      protocol: TCP
{{- end }}

Comment on lines +26 to +45
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP


medium

The pd-port-* definitions are hardcoded. This can be made more maintainable by using a range loop. This would allow configuring the number of ports through values.yaml instead of modifying the template.

For example:

{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- range $i := until (int $pdPortCount) }}
    - name: pd-port-{{ add1 $i }}
      port: {{ add $pdStartPort $i }}
      targetPort: pd-port-{{ add1 $i }}
      protocol: TCP
{{- end }}

Comment on lines +450 to +474
# # Check if client sessions are initialized, if not, try to initialize them
# if not hasattr(request.app.state, 'prefill_client') or request.app.state.prefill_client is None:
#     logger.warning("prefill_client not initialized, attempting to initialize client sessions")
#     try:
#         from vllm_router.service_discovery import get_service_discovery
#         service_discovery = get_service_discovery()
#         if hasattr(service_discovery, '_reinitialize_client_sessions'):
#             logger.info("In route_disaggregated_prefill_request: Calling _reinitialize_client_sessions")
#             await service_discovery._reinitialize_client_sessions()
#             logger.info("Successfully initialized client sessions")
#         else:
#             logger.error("Service discovery does not have _reinitialize_client_sessions method")
#     except Exception as e:
#         logger.error(f"Failed to initialize client sessions: {e}")
#         return JSONResponse(
#             status_code=500,
#             content={
#                 "error": {
#                     "message": "Failed to initialize client sessions",
#                     "type": "initialization_error",
#                     "code": 500,
#                 }
#             },
#             headers={"X-Request-Id": request_id},
#         )


medium

This large block of commented-out code should be removed to improve code clarity. If it's intended for debugging, it should not be part of the final pull request.

@Vivo50E

Vivo50E commented Aug 28, 2025

Hi Kobe, should I consider the load balancing for PD disagg #640 as already handled by introducing the NIXL proxy layer, as done in this PR?

@kobe0938
Contributor Author

kobe0938 commented Sep 4, 2025

Holding off a bit until the /chat/completion fix is merged in LMCache:

LMCache/LMCache#1511
