Add docker model k8s command with vLLM deployment guides #245
Conversation
Reviewer's Guide

This PR integrates a new Kubernetes deployment experience into the Docker Model Runner CLI and project docs by adding a `docker model k8s` command group and a set of vLLM deployment guides.

Class diagram for new k8s command integration in Docker Model CLI

```mermaid
classDiagram
class RootCmd {
+NewRootCmd(cli)
+AddCommand(...)
}
class K8sCmd {
+newK8sCmd()
}
class K8sDeployCmd {
+newK8sDeployCmd()
-namespace: string
-config: string
-model: string
-replicas: int
+RunE(cmd, args)
}
class K8sListConfigsCmd {
+newK8sListConfigsCmd()
+RunE(cmd, args)
}
class K8sGuideCmd {
+newK8sGuideCmd()
+RunE(cmd, args)
}
RootCmd --> K8sCmd
K8sCmd --> K8sDeployCmd
K8sCmd --> K8sListConfigsCmd
K8sCmd --> K8sGuideCmd
```

Class diagram for Helm chart values structure for PD disaggregation and inference scheduling

```mermaid
classDiagram
class ModelServiceValues {
+modelArtifacts: uri, name, size, authSecretName
+routing: servicePort, proxy, inferencePool, httpRoute, epp
+decode: create, replicas, containers[], monitoring, volumes
+prefill: create, replicas, containers[], monitoring, volumes
+accelerator: type
+multinode: bool
}
class ContainerSpec {
+name: string
+image: string
+modelCommand: string
+args: []string
+env: []EnvVar
+ports: []Port
+resources: ResourceSpec
+mountModelVolume: bool
+volumeMounts: []VolumeMount
}
ModelServiceValues "1" o-- "*" ContainerSpec
ModelServiceValues "1" o-- "*" VolumeMount
ModelServiceValues "1" o-- "*" Volume
ContainerSpec o-- ResourceSpec
ContainerSpec o-- EnvVar
ContainerSpec o-- Port
```

Class diagram for Gateway provider Helmfile structure

```mermaid
classDiagram
class Helmfile {
+releases: []Release
}
class Release {
+name: string
+chart: string
+namespace: string
+version: string
+installed: bool
+needs: []string
+values: []ValueOverride
+labels: map[string]string
}
Helmfile "1" o-- "*" Release
Release "1" o-- "*" ValueOverride
Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Kubernetes deployment experience for vLLM inference servers by introducing a dedicated CLI command and a suite of detailed guides. It aims to provide production-ready configurations and tools for advanced LLM serving features, making it easier to deploy and manage vLLM across various cloud and hardware environments.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `k8s/guides/prereq/client-setup/install-deps.sh:164-170` </location>
<code_context>
+########################################
+# Helm diff plugin
+########################################
+if ! helm plugin list | grep -q diff; then
+ echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
+ helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
+fi
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Helm plugin installation does not check for plugin version.
Currently, the script installs the plugin only if it's missing, but does not ensure the required version is present. Please add logic to check the installed version and upgrade if it differs from ${HELMDIFF_VERSION}.
```suggestion
########################################
# Helm diff plugin
########################################
HELM_DIFF_INSTALLED_VERSION=$(helm plugin list | awk '/diff/ {print $2}' | tr -d 'v')
REQUIRED_DIFF_VERSION=$(echo "${HELMDIFF_VERSION}" | tr -d 'v')
if ! helm plugin list | grep -q diff; then
echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
elif [ "${HELM_DIFF_INSTALLED_VERSION}" != "${REQUIRED_DIFF_VERSION}" ]; then
echo "📦 helm-diff plugin version (${HELM_DIFF_INSTALLED_VERSION}) does not match required (${REQUIRED_DIFF_VERSION}). Reinstalling..."
helm plugin uninstall diff
helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
fi
```
</issue_to_address>
### Comment 2
<location> `k8s/guides/prereq/client-setup/install-deps.sh:144-149` </location>
<code_context>
+ echo "Installing kubectl..."
+ K8S_URL="https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)"
+ curl -sLO "${K8S_URL}/bin/${OS}/${ARCH}/kubectl"
+ if [[ "$OS" == "darwin" ]]; then
+ sudo install -m 0755 kubectl /usr/local/bin/kubectl
+ else
+ sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
</code_context>
<issue_to_address>
**suggestion:** Use of 'sudo' for installing binaries may fail in environments without sudo.
Consider checking for 'sudo' availability or offering a non-sudo installation option to support environments where 'sudo' is not present or permitted.
```suggestion
if command -v sudo &> /dev/null; then
if [[ "$OS" == "darwin" ]]; then
sudo install -m 0755 kubectl /usr/local/bin/kubectl
echo "kubectl installed to /usr/local/bin/kubectl"
else
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
echo "kubectl installed to /usr/local/bin/kubectl"
fi
else
# Fallback: install to ~/.local/bin (create if needed)
mkdir -p "$HOME/.local/bin"
install -m 0755 kubectl "$HOME/.local/bin/kubectl"
echo "sudo not found. kubectl installed to $HOME/.local/bin/kubectl"
echo "Make sure $HOME/.local/bin is in your PATH."
fi
rm kubectl
```
</issue_to_address>
### Comment 3
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/values_tpu.yaml:46` </location>
<code_context>
+ interval: "30s"
+ containers:
+ - name: "vllm"
+ image: "vllm/vllm-tpu:e92694b6fe264a85371317295bca6643508034ef"
+ modelCommand: vllmServe
+ args:
</code_context>
<issue_to_address>
**suggestion:** Hardcoded image tag may hinder future upgrades.
Consider using a parameterized image tag or referencing a versioned release to simplify future upgrades and maintenance.
Suggested implementation:
```
image: "vllm/vllm-tpu:{{ .Values.vllm.image.tag }}"
```
You should also add the following to your Helm values file (e.g., `values.yaml`) if it does not already exist:
```yaml
vllm:
image:
tag: "e92694b6fe264a85371317295bca6643508034ef" # Default tag, can be overridden
```
This allows you to easily update the image tag in one place for future upgrades.
</issue_to_address>
### Comment 4
<location> `k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh:43` </location>
<code_context>
+GATEWAY_API_CRD_REF="?ref=${GATEWAY_API_CRD_REVISION}"
+### Base CRDs
+log_success "📜 Base CRDs: ${LOG_ACTION_NAME}..."
+kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF} || true
+
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Suppressing errors with '|| true' may mask installation failures.
Explicit error handling is recommended to avoid silent failures and ensure users are informed of any issues during CRD installation or deletion.
```suggestion
if ! kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF}; then
log_error "Failed to ${LOG_ACTION_NAME} Gateway API CRDs. Please check your cluster and try again."
exit 1
fi
```
</issue_to_address>
### Comment 5
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/digitalocean-values.yaml:82-85` </location>
<code_context>
+ - name: torch-compile-cache
+ emptyDir: {}
+ # IMPORTANT: DigitalOcean GPU node tolerations
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+
+# Prefill disabled in inference-scheduling scenario
</code_context>
<issue_to_address>
**suggestion:** Tolerations are set for GPU scheduling but nodeSelector is not present.
Consider adding a nodeSelector for GPU node labels to ensure pods are scheduled only on GPU nodes, which is especially important in mixed-node environments.
Suggested implementation:
```
# IMPORTANT: DigitalOcean GPU node tolerations
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
# Ensure pods are scheduled only on GPU nodes
nodeSelector:
accelerator: nvidia
```
- If your GPU nodes use a different label, replace `accelerator: nvidia` with the correct label key and value for your cluster (e.g., `kubernetes.io/instance-type: gpu` or another custom label).
- Make sure the indentation matches the rest of your YAML file.
</issue_to_address>
### Comment 6
<location> `k8s/guides/pd-disaggregation/README.md:146` </location>
<code_context>
+ hashBlockSize: 5
+```
+
+Some examples in which you might want to do selective PD might include:
+- When the prompt is short enough that the amount of work split inference into prefill and decode phases, and then open a kv transfer between those two GPUs is greater than the amount of work to do both phases on the same decode inference worker.
+- When Prefill units are at full capacity.
+
</code_context>
<issue_to_address>
**issue (typo):** Typo: 'dissagregation' should be 'disaggregation' in this section.
Please update all instances of 'dissagregation' to 'disaggregation'.
```suggestion
Selective PD is a feature in the `inference-scheduler` within the context of prefill-decode disaggregation, although it is disabled by default. This features enables routing to just decode even with the P/D deployed. To enable it, you will need to set `threshold` value for the `pd-profile-handler` plugin, in the [GAIE values file](./gaie-pd/values.yaml). You can see the value of this here:
```
</issue_to_address>
### Comment 7
<location> `k8s/guides/predicted-latency-based-scheduling/README.md:316` </location>
<code_context>
+**Current limitations**
+- Percentile: only **p90** supported.
+- Training: only **streaming mode** supported.
+- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
+
+---
</code_context>
<issue_to_address>
**issue (typo):** Typo: 'obsevability' should be 'observability'.
Update the spelling to 'observability'.
```suggestion
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.
```
</issue_to_address>
### Comment 8
<location> `k8s/guides/prereq/infrastructure/README.md:72` </location>
<code_context>
+
+## Installing on a well-lit infrastructure provider
+
+The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers is expected to access accelerators.
+
+* [DigitalOcean Kubernetes (DOKS)](../../../docs/infra-providers/digitalocean/README.md)
</code_context>
<issue_to_address>
**issue (typo):** Grammar: 'model servers is expected' should be 'model servers are expected'.
Update the sentence to use 'are' instead of 'is' for correct grammar.
```suggestion
The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers are expected to access accelerators.
```
</issue_to_address>
Code Review
This pull request introduces a significant and valuable feature: comprehensive Kubernetes deployment support for vLLM, accessible via a new docker model k8s CLI command. The scope is impressive, covering multiple deployment scenarios (inference scheduling, P/D disaggregation), various hardware backends (CUDA, XPU, TPU), and cloud providers. The inclusion of detailed guides, Helm charts, and Dockerfiles makes this a robust solution for production deployments. My review identifies a few key areas for improvement, including fixing a critical bug in a Dockerfile, correcting a typo in a file path, improving the user experience of the new CLI command, and enhancing the maintainability of the deployment configurations.
```go
c := &cobra.Command{
	Use:   "deploy",
	Short: "Deploy vLLM on Kubernetes",
	Long:  "Deploy vLLM inference server on Kubernetes with the specified configuration",
	RunE: func(cmd *cobra.Command, args []string) error {
		if config == "" {
			return fmt.Errorf("--config is required. Use 'docker model k8s list-configs' to see available configurations")
		}

		// Get the path to the k8s resources
		resourcesPath, err := getK8sResourcesPath()
		if err != nil {
			return err
		}

		configPath := filepath.Join(resourcesPath, "configs", config)
		if _, err := os.Stat(configPath); os.IsNotExist(err) {
			return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
		}

		cmd.Printf("Deploying vLLM with configuration: %s\n", config)
		cmd.Printf("Namespace: %s\n", namespace)
		if model != "" {
			cmd.Printf("Model: %s\n", model)
		}
		cmd.Printf("Replicas: %d\n", replicas)

		// Check if kubectl is available
		if _, err := exec.LookPath("kubectl"); err != nil {
			return fmt.Errorf("kubectl not found in PATH. Please install kubectl to deploy to Kubernetes")
		}

		// Check if helm is available for more complex deployments
		if _, err := exec.LookPath("helm"); err != nil {
			cmd.PrintErrln("Warning: helm not found in PATH. Some deployment options may not be available.")
		}

		cmd.Println("\nDeployment instructions:")
		cmd.Printf("1. Ensure your kubectl context is set to the correct cluster\n")
		cmd.Printf("2. Create namespace if it doesn't exist: kubectl create namespace %s\n", namespace)
		cmd.Printf("3. Apply the configuration: kubectl apply -f %s -n %s\n", configPath, namespace)
		cmd.Printf("\nFor detailed deployment guides, run: docker model k8s guide\n")

		return nil
	},
```
The `deploy` subcommand's name is misleading, as it doesn't perform a deployment: it only prints the kubectl commands for the user to run manually. This can be confusing for users who would expect a `deploy` command to execute the deployment. To improve clarity and user experience, consider renaming the command to something like `show-deploy-command` or `guide-deploy`. Alternatively, you could implement the logic to execute the kubectl commands directly, which would be more powerful.
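For illustration only, a minimal sketch of the "execute directly" alternative. The helper below is hypothetical and not part of this PR; it assumes `os/exec`, `fmt`, and cobra are already imported in `k8s.go`.

```go
// runKubectlApply is a hypothetical helper (not in this PR): it applies the
// resolved configuration to the target namespace and streams kubectl output
// to the command's own writers.
func runKubectlApply(cmd *cobra.Command, configPath, namespace string) error {
	apply := exec.Command("kubectl", "apply", "-f", configPath, "-n", namespace)
	apply.Stdout = cmd.OutOrStdout()
	apply.Stderr = cmd.ErrOrStderr()
	if err := apply.Run(); err != nil {
		return fmt.Errorf("kubectl apply failed: %w", err)
	}
	return nil
}
```

The RunE body above could then call `runKubectlApply(cmd, configPath, namespace)` after the kubectl availability check instead of printing step 3.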
cmd/cli/commands/k8s.go (Outdated)

```go
	}

	// If not found, we'll create a minimal structure
	homePath := filepath.Join(os.Getenv("HOME"), ".docker", "modelr", "k8s")
```
There is a typo in the directory path being created for fallback resources. The code uses .docker/modelr/k8s, which is inconsistent with the path checked on line 157 (.docker/model/k8s). This should be corrected to model to ensure the command behaves as expected.
```suggestion
	homePath := filepath.Join(os.Getenv("HOME"), ".docker", "model", "k8s")
```
```dockerfile
    else \
        VLLM_COMMIT="$(git merge-base HEAD origin/main)"; \
        VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_${VLLM_WHEEL_ARCH}.whl"; \
        VLLM_USE_PRECOMPILED=1 uv pip install --editable .; \
    fi; \
```
The logic for installing vllm in this `else` block appears to be incorrect. The `VLLM_PRECOMPILED_WHEEL_LOCATION` variable is defined but then not used, as the next command installs vllm from source via `--editable .`. This misses the performance benefit of using precompiled components and makes the build process slower than intended. The logic should be updated to correctly use the precompiled wheel if it's available, similar to the implementation in `k8s/docker/Dockerfile.cuda`.
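A hedged sketch of one possible fix, based only on the snippet above: make the wheel-location assignment a prefix of the install command so the build actually sees it (whether this matches the `Dockerfile.cuda` approach exactly is an assumption).

```dockerfile
    # Sketch only: both variables are passed on the same command line, so the
    # precompiled wheel location is no longer assigned and then silently dropped.
    else \
        VLLM_COMMIT="$(git merge-base HEAD origin/main)"; \
        VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_${VLLM_WHEEL_ARCH}.whl" \
        VLLM_USE_PRECOMPILED=1 uv pip install --editable .; \
    fi; \
```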
⚠️ **Important - For Intel BMG GPU Users**: Before running `helmfile apply`, you must update the GPU resource type in `ms-pd/values_xpu.yaml`:

```yaml
# Edit ms-pd/values_xpu.yaml
accelerator:
  type: intel
  resources:
    intel: "gpu.intel.com/xe"  # Add gpu.intel.com/xe

# Also update decode and prefill resource specifications:
decode:
  containers:
    - name: "vllm"
      resources:
        limits:
          gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
        requests:
          gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe

prefill:
  containers:
    - name: "vllm"
      resources:
        limits:
          gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
        requests:
          gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
```
The guide instructs users to manually edit the `ms-pd/values_xpu.yaml` file to switch between different Intel GPU types (`i915` vs. `xe`). This manual step is error-prone and hinders automation. It would be more robust to handle this through a separate helmfile environment (e.g., `xpu-bmg`) or a value passed to the `helmfile apply` command. This would make the deployment process more declarative and user-friendly.
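As a rough illustration of the "separate helmfile environment" idea, the per-environment values selection could grow one more branch. The environment name `xpu-bmg` and the override file below are hypothetical and not defined anywhere in this PR.

```yaml
# Hypothetical helmfile.yaml.gotmpl fragment: BMG users select a dedicated
# environment instead of hand-editing values_xpu.yaml.
  values:
    - ms-pd/values_xpu.yaml
    {{- if eq .Environment.Name "xpu-bmg" }}
    - ms-pd/values_xpu_bmg.yaml  # overrides gpu.intel.com/i915 with gpu.intel.com/xe
    {{- end }}
```

Deployment would then be `helmfile -e xpu-bmg apply`, with no manual edits.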
Powered by llm-d. Signed-off-by: Eric Curtin <[email protected]>
Force-pushed from a96e9f2 to d0f1092
Pull Request Overview
Add Kubernetes deployment support and guides for vLLM via a new CLI command and curated Helm-based recipes.
- Introduces docker model k8s CLI command (deploy/list-configs/guide)
- Adds production-ready Kubernetes guides (inference scheduling, P/D disaggregation, precise prefix cache awareness, simulator), plus Dockerfiles for CUDA/AWS/GKE/XPU images
- Includes gateway provider prerequisite helmfiles and client setup scripts
Reviewed Changes
Copilot reviewed 52 out of 52 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| cmd/cli/commands/k8s.go | Adds k8s command group with deploy/list-configs/guide subcommands |
| cmd/cli/commands/root.go | Wires newK8sCmd into the root command |
| k8s/guides/* | Adds detailed Helmfile-based guides for multiple deployment patterns |
| k8s/guides/prereq/* | Adds gateway provider and client setup prerequisites |
| k8s/docker/Dockerfile.cuda | CUDA-based build/runtime image for vLLM |
| k8s/docker/Dockerfile.aws | AWS-optimized build/runtime image with UCX/NVSHMEM/NIXL |
| k8s/docker/Dockerfile.gke | GKE-optimized image with DeepEP/DeepGEMM/FlashInfer |
| k8s/docker/Dockerfile.xpu | Intel XPU image for vLLM |
| k8s/README.md | Top-level K8s usage/readme aligned with new CLI |
Comments suppressed due to low confidence (1)
cmd/cli/commands/k8s.go:1
- The list contains 'wide-ep' which doesn't correspond to a guide directory added in this PR, while 'precise-prefix-cache-aware' exists but isn't listed. Replace 'wide-ep' with an available guide name (e.g., 'precise-prefix-cache-aware') or ensure the referenced guide is actually present.
package commands
```dockerfile
    git_clone_and_cd https://github.com/vllm-project/vllm.git /app/vllm releases/v0.11.0 && \
    VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/f71952c1c49fb86686b0b300b727b26282362bf4/vllm-0.11.0%2Bcu129-cp38-abi3-manylinux1_x86_64.whl VLLM_USE_PRECOMPILED=1 upip .

ENTRYPOINT ["/app/code/venv/bin/vllm", "serve"]
```
Copilot AI (Oct 15, 2025)
The venv path in ENTRYPOINT is incorrect; earlier logic creates the virtualenv under /app/venv, not /app/code/venv. This will fail at runtime with 'no such file or directory'. Replace with /app/venv/bin/vllm.
```suggestion
ENTRYPOINT ["/app/venv/bin/vllm", "serve"]
```
```dockerfile
RUN
```
Copilot AI (Oct 15, 2025)
This empty RUN instruction is invalid Dockerfile syntax and will fail the build. Remove this line or add the intended command.
```suggestion
```
```dockerfile
RUN mkdir -p /wheels

# Copy patches before build
COPY patches/ /tmp/patches/
```
Copilot AI (Oct 15, 2025)
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.
```dockerfile
    wget https://developer.download.nvidia.com/compute/redist/nvshmem/${NVSHMEM_VERSION}/source/nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz -O nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
    tar -xf nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
    cd nvshmem_src && \
    git apply /tmp/patches/cks_nvshmem${NVSHMEM_VERSION}.patch && \
```
Copilot AI (Oct 15, 2025)
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.
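A minimal sketch of the "guard behind a build arg" option; `APPLY_NVSHMEM_PATCH` is a hypothetical argument (declared earlier in the stage), not something this PR defines.

```dockerfile
    # Sketch only: apply the patch when requested and actually present, so builds
    # whose context has no patches/ directory do not hard-fail at this step.
    cd nvshmem_src && \
    if [ "${APPLY_NVSHMEM_PATCH}" = "true" ] && [ -f "/tmp/patches/cks_nvshmem${NVSHMEM_VERSION}.patch" ]; then \
        git apply "/tmp/patches/cks_nvshmem${NVSHMEM_VERSION}.patch"; \
    fi && \
```

Note that the earlier `COPY patches/ /tmp/patches/` step would still need the directory to exist (even if empty) in the build context.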
```go
	configPath := filepath.Join(resourcesPath, "configs", config)
	if _, err := os.Stat(configPath); os.IsNotExist(err) {
		return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
	}
```
Copilot AI (Oct 15, 2025)
The deploy command checks a non-existent `configs` directory. The repository organizes deployment options under `k8s/guides/<config>/` with `helmfile.yaml.gotmpl`. Update it to look under `guides/` and verify the helmfile exists.
```yaml
inferenceExtension:
  replicas: 1
  image:
    # both downstream infernece-scheduler and upstream epp image can support simulated-accelerators example
```
Copilot AI (Oct 15, 2025)
Correct 'infernece-scheduler' to 'inference-scheduler'.
```suggestion
    # both downstream inference-scheduler and upstream epp image can support simulated-accelerators example
```
```yaml
inferenceExtension:
  replicas: 1
  image:
    # both downstream infernece-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
```
Copilot AI (Oct 15, 2025)
Correct 'infernece-scheduler' to 'inference-scheduler'.
```suggestion
    # both downstream inference-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
```
```
**Current limitations**
- Percentile: only **p90** supported.
- Training: only **streaming mode** supported.
- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
```
Copilot AI (Oct 15, 2025)
Correct 'obsevability' to 'observability'.
```suggestion
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.
```
```go
	return c
}

func getK8sResourcesPath() (string, error) {
```
Copilot AI (Oct 15, 2025)
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating that a helmfile exists (e.g., `k8s/guides/<config>/helmfile.yaml.gotmpl`). Example fix: replace the 'configs' check with something like: `configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }`.
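Rendered as a block for readability, the check sketched in this comment would look roughly like the following; the `guides` layout and error wording come from the comment itself, not from the PR's current code.

```go
// Sketch of the suggested validation: resolve the guide directory and require
// its helmfile to exist before continuing with deployment instructions.
configPath := filepath.Join(resourcesPath, "guides", config)
helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl")
if _, err := os.Stat(helmfilePath); os.IsNotExist(err) {
	return fmt.Errorf("guide '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
}
```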
| return "", fmt.Errorf("failed to create k8s resources directory: %w", err) | ||
| } | ||
|
|
||
| return homePath, nil |
Copilot AI (Oct 15, 2025)
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating that a helmfile exists (e.g., `k8s/guides/<config>/helmfile.yaml.gotmpl`). Example fix: replace the 'configs' check with something like: `configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }`.
Powered by llm-d.
Summary by Sourcery
Introduce Kubernetes deployment support for vLLM via the Docker Model CLI and comprehensive production-ready guides.
New Features:
- `docker model k8s` CLI command with `list-configs`, `deploy`, and `guide` subcommands for managing vLLM deployments on Kubernetes
- `k8s/guides` directory with Helm charts, manifests, scripts, and step-by-step guides covering multiple deployment scenarios and hardware backends (inference scheduling, PD disaggregation, wide expert parallelism, simulated accelerators, and more)

Documentation: