[Feat][PD] Latest PD support from LMCache with NIXL #669
base: main
Conversation
- Modified router service templates and configs
- Updated vLLM router application logic
- Enhanced request parsing and service discovery
- Added custom Dockerfile for production deployment
- Updated disaggregated prefill configuration

Signed-off-by: Kobe Chen <[email protected]>
Summary of Changes
Hello @kobe0938, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the latest Prefill Disaggregation (PD) features using LMCache with NIXL, addressing XpYd support. The changes span the router's Docker image, Helm chart configurations, and core request-processing logic. A new Dockerfile is added for building the router with the necessary dependencies. Helm charts are updated to expose new ports and allow detailed configuration of NIXL peer and proxy settings. The router's request handling is significantly refactored to orchestrate a multi-stage prefill and decode process, leveraging ZMQ for efficient NIXL communication. These updates enable a more robust and configurable disaggregated prefill architecture.
Highlights

- New Dockerfile for Prefill Disaggregation (PD): A new Dockerfile (`docker/Dockerfile.pd`) has been introduced to build a `vllm-router` image specifically configured with `lmcache` and `semantic_cache` dependencies, streamlining the deployment process for PD-enabled environments.
- Enhanced NIXL Parameter Integration in Helm Charts: The Helm charts (`deployment-router.yaml`, `deployment-vllm-multi.yaml`) have been updated to support extensive configuration of NIXL peer and proxy hosts/ports, allowing for flexible deployment of disaggregated prefill components within Kubernetes.
- New Port Exposure for PD Communication: Several new ports (7100-7500) are now exposed in both the router and vLLM service deployments, enabling the necessary communication channels for the NIXL protocol to function correctly.
- ZMQ Integration for NIXL in Router: The `vllm_router` now integrates a ZMQ pull server, managed via FastAPI's lifespan events, to efficiently receive NIXL messages and track the completion of prefill requests, which is crucial for the disaggregated prefill workflow.
- Enhanced Disaggregated Prefill Request Flow: The `route_disaggregated_prefill_request` function has been significantly refactored to implement a multi-step process for handling disaggregated prefill requests: initial tokenization, sending a prefill request with NIXL-specific KV transfer parameters, and then preparing and streaming the decode response after waiting for KV readiness (see the sketch after this list).
- Updated Prefill Configuration Example: The `values-16-disagg-prefill.yaml` tutorial asset has been updated to reflect the latest NIXL and XpYd configurations, including new image tags and specific port assignments, providing a current and accurate example for deploying disaggregated prefill setups.
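To make that flow concrete, here is a minimal, hypothetical sketch of the tokenize → prefill → wait → decode sequence; the endpoint URLs, the `finished_reqs` set, the `wait_decode_kv_ready` helper, and the shape of `kv_transfer_params` are illustrative assumptions rather than the exact code in this PR:

```python
import asyncio
import uuid

import httpx  # assumed client for this sketch; the router uses its own sessions

# Hypothetical endpoints; real hosts/ports come from the NIXL peer/proxy Helm values.
PREFILL_URL = "http://prefiller:8000/v1/completions"
DECODE_URL = "http://decoder:8000/v1/completions"

# Populated by the ZMQ pull server when a prefill node signals KV readiness.
finished_reqs: set[str] = set()


async def wait_decode_kv_ready(req_id: str) -> None:
    # Polls the shared set, mirroring the PR's approach; the review below
    # suggests replacing this with per-request asyncio.Event objects.
    while req_id not in finished_reqs:
        await asyncio.sleep(0.001)
    finished_reqs.remove(req_id)


async def route_disaggregated_prefill_request(request_json: dict) -> dict:
    # (The real flow first tokenizes the prompt; omitted here for brevity.)
    req_id = str(uuid.uuid4())
    async with httpx.AsyncClient() as client:
        # Step 1: prefill-only request carrying NIXL KV-transfer parameters.
        prefill_body = {
            **request_json,
            "max_tokens": 1,  # prefill only; decoding happens on the decode node
            "kv_transfer_params": {"req_id": req_id, "role": "prefill"},
        }
        await client.post(PREFILL_URL, json=prefill_body)

        # Step 2: block until the ZMQ pull server records the KV cache as ready.
        await wait_decode_kv_ready(req_id)

        # Step 3: send the decode request; the real router streams this response
        # back to the client instead of buffering it.
        decode_body = {
            **request_json,
            "kv_transfer_params": {"req_id": req_id, "role": "decode"},
        }
        response = await client.post(DECODE_URL, json=decode_body)
        return response.json()
```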
Code Review
This pull request introduces support for disaggregated prefill with LMCache and NIXL, a significant feature involving changes to Docker configurations, Helm charts, and core request routing logic. The implementation adds a multi-step request process (tokenize, prefill, decode) coordinated via ZMQ. My review focuses on improving the robustness, performance, and maintainability of these changes. Key suggestions include fixing a potential crash in the new ZMQ server, optimizing request handling to avoid redundant operations, enhancing the efficiency of the prefill/decode synchronization, and ensuring configurations are clean and maintainable.
```python
try:
    socket = zmq_ctx.socket(zmq.PULL)
    try:
        from vllm_router.app import app

        proxy_host = app.state.args.nixl_proxy_host
        proxy_port = app.state.args.nixl_proxy_port
    except Exception as e:
        logger.error(f"Failed to get proxy host and port from app state: {e}")
    proxy_url = f"{proxy_host}:{proxy_port}"
    socket.bind(f"tcp://{proxy_url}")
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to bind ZMQ socket to {proxy_url}: {e}")
    socket.close()
    return
```
There's a potential `NameError` in `zmq_pull_server`. If retrieving `proxy_host` or `proxy_port` from `app.state.args` fails, `proxy_url` will not be defined, but it's used for logging in the outer `except` block, which will cause a crash. The logic should be restructured so that `proxy_url` is defined only after its components are successfully retrieved, and so that initialization failures are handled gracefully.
Suggested change:

```python
socket = zmq_ctx.socket(zmq.PULL)
try:
    from vllm_router.app import app

    proxy_host = app.state.args.nixl_proxy_host
    proxy_port = app.state.args.nixl_proxy_port
    proxy_url = f"tcp://{proxy_host}:{proxy_port}"
    socket.bind(proxy_url)
    logger.info(f"ZMQ proxy server started on {proxy_url}")
except Exception as e:
    logger.error(f"Failed to start ZMQ pull server: {e}")
    socket.close()
    return
```
```python
request_body = await request.body()
request_json = json.loads(request_body)
request_json = await request.json()  # TODO (ApostaC): merge two awaits into one
```
The request body is being awaited twice: first with `request.body()` and then internally by `request.json()`. This is inefficient, as it reads the entire request stream twice. You can remove the `await request.body()` call, since the `request_body` variable is not used elsewhere.
Suggested change:

```python
request_json = await request.json()  # TODO (ApostaC): merge two awaits into one
```
```python
async def wait_decode_kv_ready(req_id: str):
    while req_id not in finished_reqs:
        await asyncio.sleep(0.0001)  # sleep for 0.1 ms
    logger.debug(f"Prefill node signaled kv ready for req {req_id}")
    finished_reqs.remove(req_id)
```
The `wait_decode_kv_ready` function uses a busy-wait loop with a very short sleep time (0.1 ms). This can lead to high CPU consumption. A more efficient approach is to use an `asyncio.Event` for each request: the waiting coroutine can `await` the event, and the ZMQ server can `set` it upon message arrival, avoiding polling entirely.
Example of a more efficient implementation:

```python
# In a shared location, e.g., at the top of the file
request_events: Dict[str, asyncio.Event] = {}

# In zmq_pull_server, when a message is received
req_id = msg.req_id
if req_id in request_events:
    request_events[req_id].set()

# In route_disaggregated_prefill_request, before waiting
request_events[request_id] = asyncio.Event()

# New wait_decode_kv_ready implementation
async def wait_decode_kv_ready(req_id: str):
    if req_id in request_events:
        await request_events[req_id].wait()
        logger.debug(f"Prefill node signaled kv ready for req {req_id}")
        del request_events[req_id]  # Clean up
```
```yaml
repository: "xiaokunchen/vllm-router"
tag: "08-27-v8"
```
The `routerSpec.repository` is set to a personal Docker Hub account (`xiaokunchen/vllm-router`). For official examples and tutorials, this should point to an official project repository to ensure users are pulling a trusted and maintained image. Please update this to the official image repository before merging.
```dockerfile
RUN . /opt/venv/bin/activate && \
    uv pip install --upgrade --no-cache-dir pip setuptools_scm && \
    uv pip install --no-cache-dir . && \
    uv pip install zmq msgspec
```
For better dependency management and clarity, `zmq` and `msgspec` should be declared as optional dependencies in `pyproject.toml` instead of being installed via a separate `pip install` command. You could add a new group, for example a `pd` extra under `[project.optional-dependencies]`, and then install it with `uv pip install --no-cache-dir .[pd]`. This keeps all project dependencies declared in one place.
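As a sketch of that suggestion (the `pd` extra name is hypothetical, and note that the Python ZMQ binding is published on PyPI as `pyzmq`), the `pyproject.toml` addition could look like:

```toml
[project.optional-dependencies]
# Hypothetical extra for prefill-disaggregation (PD) deployments.
pd = [
    "pyzmq",    # Python ZMQ binding ("zmq" on PyPI is a legacy alias)
    "msgspec",  # fast serialization for NIXL messages
]
```

The Dockerfile line would then become `uv pip install --no-cache-dir ".[pd]"`.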
```yaml
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500
```
The `pd-port-*` container ports are hardcoded. This could be made more flexible and maintainable by using a `range` loop in the Helm template, which would allow configuring the number of ports through `values.yaml` instead of modifying the template for future changes.

For example, you could introduce `routerSpec.pdPortCount` and `routerSpec.pdStartPort` in `values.yaml` and loop like this:
```yaml
{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- /* ports step by 100 to match the existing 7100-7500 layout */ -}}
{{- range $i := until (int $pdPortCount) }}
- name: pd-port-{{ add1 $i }}
  containerPort: {{ add $pdStartPort (mul $i 100) }}
{{- end }}
```
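If the chart adopted this loop, the corresponding `values.yaml` entries could look like the following sketch (`pdStartPort` and `pdPortCount` are hypothetical keys introduced by this suggestion):

```yaml
routerSpec:
  pdStartPort: 7100  # first PD communication port
  pdPortCount: 5     # generates pd-port-1..5 on 7100, 7200, ..., 7500
```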
```yaml
- name: pd-port-1
  containerPort: 7100
- name: pd-port-2
  containerPort: 7200
- name: pd-port-3
  containerPort: 7300
- name: pd-port-4
  containerPort: 7400
- name: pd-port-5
  containerPort: 7500
```
Similar to `deployment-router.yaml`, the `pd-port-*` container ports are hardcoded. Using a `range` loop with values from `values.yaml` would make this more flexible and easier to maintain.

For example, you could introduce `servingEngineSpec.pdPortCount` and `servingEngineSpec.pdStartPort` in `values.yaml` and loop like this:
```yaml
{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- /* ports step by 100 to match the existing 7100-7500 layout */ -}}
{{- range $i := until (int $pdPortCount) }}
- name: pd-port-{{ add1 $i }}
  containerPort: {{ add $pdStartPort (mul $i 100) }}
{{- end }}
```
```yaml
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP
```
The `pd-port-*` definitions are hardcoded. This can be made more maintainable by using a `range` loop, which would allow configuring the number of ports through `values.yaml` instead of modifying the template.

For example:
```yaml
{{- $pdStartPort := .Values.routerSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.routerSpec.pdPortCount | default 5 -}}
{{- /* ports step by 100 to match the existing 7100-7500 layout */ -}}
{{- range $i := until (int $pdPortCount) }}
- name: pd-port-{{ add1 $i }}
  port: {{ add $pdStartPort (mul $i 100) }}
  targetPort: pd-port-{{ add1 $i }}
  protocol: TCP
{{- end }}
```
```yaml
- name: pd-port-1
  port: 7100
  targetPort: pd-port-1
  protocol: TCP
- name: pd-port-2
  port: 7200
  targetPort: pd-port-2
  protocol: TCP
- name: pd-port-3
  port: 7300
  targetPort: pd-port-3
  protocol: TCP
- name: pd-port-4
  port: 7400
  targetPort: pd-port-4
  protocol: TCP
- name: pd-port-5
  port: 7500
  targetPort: pd-port-5
  protocol: TCP
```
The `pd-port-*` definitions are hardcoded here as well. This can be made more maintainable by using a `range` loop, which would allow configuring the number of ports through `values.yaml` instead of modifying the template.

For example:
```yaml
{{- $pdStartPort := .Values.servingEngineSpec.pdStartPort | default 7100 -}}
{{- $pdPortCount := .Values.servingEngineSpec.pdPortCount | default 5 -}}
{{- /* ports step by 100 to match the existing 7100-7500 layout */ -}}
{{- range $i := until (int $pdPortCount) }}
- name: pd-port-{{ add1 $i }}
  port: {{ add $pdStartPort (mul $i 100) }}
  targetPort: pd-port-{{ add1 $i }}
  protocol: TCP
{{- end }}
```
```python
# # Check if client sessions are initialized, if not, try to initialize them
# if not hasattr(request.app.state, 'prefill_client') or request.app.state.prefill_client is None:
#     logger.warning("prefill_client not initialized, attempting to initialize client sessions")
#     try:
#         from vllm_router.service_discovery import get_service_discovery
#         service_discovery = get_service_discovery()
#         if hasattr(service_discovery, '_reinitialize_client_sessions'):
#             logger.info("In route_disaggregated_prefill_request: Calling _reinitialize_client_sessions")
#             await service_discovery._reinitialize_client_sessions()
#             logger.info("Successfully initialized client sessions")
#         else:
#             logger.error("Service discovery does not have _reinitialize_client_sessions method")
#     except Exception as e:
#         logger.error(f"Failed to initialize client sessions: {e}")
#         return JSONResponse(
#             status_code=500,
#             content={
#                 "error": {
#                     "message": "Failed to initialize client sessions",
#                     "type": "initialization_error",
#                     "code": 500,
#                 }
#             },
#             headers={"X-Request-Id": request_id},
#         )
```
Hi Kobe, should I consider the load balancing for PD disagg (#640) as already handled by the NIXL proxy layer introduced in this PR?
Hold on a bit until the /chat/completion fix is merged in LMCache.
PD support from LMCache with NIXL https://docs.lmcache.ai/disaggregated_prefill/nixl/1p1d.html
FIX (P0) XpYd support in #640
- Sign off your commits by using `-s` when doing `git commit`.
- Classify the PR title with a prefix such as `[Bugfix]`, `[Feat]`, and `[CI]`.

Detailed Checklist
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality

The PR needs to meet the following code quality standards:

- Use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by
for installation.DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a
Signed-off-by:
header which certifies agreement with the terms of the DCO.Using
-s
withgit commit
will automatically add this header.What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11
, Shaoting-Feng or ApostaC.