WIP: Support for vLLM Data parallel #1663
base: main
Conversation
Pods are composed of one or more containers. Will each vLLM engine instance run as a separate container in the vLLM pod?
if p.ModelServerMetricsPort == 0 {
	p.ModelServerMetricsPort = targetPortNumber
}
I see that GetMetricsPort() does not implement this default behavior, which makes sense now that targetPortNumber is a list. We need to note this as a breaking change in the PR description.
There is no breaking change. See the code in pkg/epp/datastore/datastore.go, lines 242-244. If there is only one targetPort in the InferencePool and the ModelServerMetricsPort from the command line is not zero, it is used to fill the metricsPort in the PodInfo struct. GetMetricsPort() simply returns what was placed in the struct earlier.
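For illustration, a minimal Go sketch of the defaulting rule described above; the function and parameter names are hypothetical and do not reproduce the actual datastore.go code:

```go
package datastore

// resolveMetricsPort is a sketch (not the actual datastore.go code) of the
// defaulting rule described above: when the InferencePool has a single
// targetPort and a non-zero metrics port was given on the command line, that
// port is used; otherwise each rank's metrics are scraped on its own serving port.
func resolveMetricsPort(targetPorts []int32, cliMetricsPort int32, rank int) int32 {
	if len(targetPorts) == 1 && cliMetricsPort != 0 {
		return cliMetricsPort
	}
	return targetPorts[rank]
}
```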
@shmuelk please create a tracker issue for adding a conformance test that includes multiple InferencePool targetPorts.
As far as I know, they run as separate processes in the same container. Data parallelism in a real vLLM deployment is enabled by a parameter passed to vLLM, and I don't think a parameter can add containers to the pod.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds support for the vLLM Data Parallel feature, which causes the vLLM "launcher" to start multiple vLLM instances in the same Pod, each listening on a different port.
The InferencePool CRD has already been changed to support this by allowing up to eight TargetPorts to be specified. It is assumed that all pods in the InferencePool are configured identically with respect to Data Parallelism.
To minimize the amount of code change, the datastore has been modified to create "virtual pods" from the real pods found by the pod reconciler. Each virtual pod is named after the real pod with the suffix "-rank-N" appended, where N is a number from zero to seven (see the sketch below). The term rank is used because that is what each of the separate vLLM "servers" in a Data Parallel configuration is called.
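A minimal sketch of how the virtual pod names could be derived, assuming one rank per InferencePool targetPort; the function name and signature are illustrative rather than the actual datastore code:

```go
package datastore

import "fmt"

// virtualPodNames illustrates how virtual pod names could be derived from a
// real pod and the InferencePool's targetPorts; the name and signature are
// hypothetical, not the PR's actual implementation.
func virtualPodNames(realPodName string, targetPorts []int32) []string {
	names := make([]string, 0, len(targetPorts))
	for rank := range targetPorts {
		// One virtual pod per data-parallel rank, e.g. "vllm-abc123-rank-0".
		names = append(names, fmt.Sprintf("%s-rank-%d", realPodName, rank))
	}
	return names
}
```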
The former code assumed globally known ports for inference and metrics scraping. That assumption has been removed; instead, inference port and metrics port fields have been added to the PodInfo struct. In addition, a PodName field was added that contains the name of the real pod used to create the "virtual pod".
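A sketch of what such a PodInfo might look like; the field names here are assumptions, and only the GetPort() and GetMetricsPort() accessors are mentioned in the PR discussion:

```go
package datastore

// PodInfo is a sketch of the per-virtual-pod information described above;
// the actual struct and field names in the PR may differ.
type PodInfo struct {
	PodName     string // name of the real pod this virtual pod was created from
	Port        int32  // inference (serving) port for this rank
	MetricsPort int32  // port to scrape metrics from for this rank
}

// GetPort returns the inference port that PreRequest extensions should use.
func (p *PodInfo) GetPort() int32 { return p.Port }

// GetMetricsPort returns the metrics scrape port recorded for this virtual pod.
func (p *PodInfo) GetMetricsPort() int32 { return p.MetricsPort }
```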
Lastly, the API of the PreRequest extension point has been changed: the inference port parameter has been removed. Any PreRequest extension must now get the inference port of the pod(s) in question from the PodInfo's GetPort() API, as sketched below.
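A hedged sketch of how a PreRequest extension might obtain the port after this change; the extension signature and the Address() helper are assumptions made for illustration, not the real extension-point API:

```go
package prerequest

import "fmt"

// PodInfo mirrors the GetPort() accessor mentioned above; Address() is an
// assumed helper returning the pod's IP and is not part of the real API.
type PodInfo interface {
	GetPort() int32
	Address() string
}

// examplePreRequest sketches how a PreRequest extension might build a target
// endpoint now that the inference port parameter has been removed from the
// extension-point API; the signature is illustrative, not the real one.
func examplePreRequest(target PodInfo) string {
	// The port now comes from the selected pod itself rather than a global value.
	return fmt.Sprintf("%s:%d", target.Address(), target.GetPort())
}
```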
Which issue(s) this PR fixes:
Fixes #1519
Does this PR introduce a user-facing change?: