
Conversation


@ericcurtin ericcurtin commented Oct 15, 2025

Powered by llm-d.

Summary by Sourcery

Introduce Kubernetes deployment support for vLLM via the Docker Model CLI and comprehensive production-ready guides.

New Features:

  • Add docker model k8s CLI command with list-configs, deploy, and guide subcommands for managing vLLM deployments on Kubernetes
  • Extend top-level README and CLI documentation with a Kubernetes Deployment section, quickstart commands, and prerequisites
  • Include a new k8s/guides directory with Helm charts, manifests, scripts, and step-by-step guides covering multiple deployment scenarios and hardware backends (inference scheduling, PD disaggregation, wide expert parallelism, simulated accelerators, and more)

Documentation:

  • Add detailed Kubernetes deployment guides and configuration examples for various use cases and hardware environments
  • Update CLI README to document new k8s commands
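Based on the subcommands and flags described above, a typical session might look like the following sketch (the configuration name is illustrative, and the flag names are inferred from the deploy command's definition; exact defaults may differ):

```bash
# List the deployment configurations bundled with the CLI
docker model k8s list-configs

# Print deployment instructions for one configuration
# (--config is required; --namespace and --replicas are optional)
docker model k8s deploy --config inference-scheduling --namespace llm-d --replicas 2

# Show the step-by-step deployment guides
docker model k8s guide
```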

Copilot AI review requested due to automatic review settings October 15, 2025 21:08

sourcery-ai bot commented Oct 15, 2025

Reviewer's Guide

This PR integrates a new Kubernetes deployment experience into the Docker Model Runner CLI and project docs by adding a docker model k8s command set, enriching the top-level README with deployment quick-start and features, and shipping a full k8s directory containing well-lit path guides, Helm manifests, and prerequisite scripts for running vLLM on Kubernetes.

Class diagram for new k8s command integration in Docker Model CLI

classDiagram
    class RootCmd {
      +NewRootCmd(cli)
      +AddCommand(...)
    }
    class K8sCmd {
      +newK8sCmd()
    }
    class K8sDeployCmd {
      +newK8sDeployCmd()
      -namespace: string
      -config: string
      -model: string
      -replicas: int
      +RunE(cmd, args)
    }
    class K8sListConfigsCmd {
      +newK8sListConfigsCmd()
      +RunE(cmd, args)
    }
    class K8sGuideCmd {
      +newK8sGuideCmd()
      +RunE(cmd, args)
    }
    RootCmd --> K8sCmd
    K8sCmd --> K8sDeployCmd
    K8sCmd --> K8sListConfigsCmd
    K8sCmd --> K8sGuideCmd

Class diagram for Helm chart values structure for PD disaggregation and inference scheduling

classDiagram
    class ModelServiceValues {
      +modelArtifacts: uri, name, size, authSecretName
      +routing: servicePort, proxy, inferencePool, httpRoute, epp
      +decode: create, replicas, containers[], monitoring, volumes
      +prefill: create, replicas, containers[], monitoring, volumes
      +accelerator: type
      +multinode: bool
    }
    class ContainerSpec {
      +name: string
      +image: string
      +modelCommand: string
      +args: []string
      +env: []EnvVar
      +ports: []Port
      +resources: ResourceSpec
      +mountModelVolume: bool
      +volumeMounts: []VolumeMount
    }
    ModelServiceValues "1" o-- "*" ContainerSpec
    ModelServiceValues "1" o-- "*" VolumeMount
    ModelServiceValues "1" o-- "*" Volume
    ContainerSpec o-- ResourceSpec
    ContainerSpec o-- EnvVar
    ContainerSpec o-- Port

Class diagram for Gateway provider Helmfile structure

classDiagram
    class Helmfile {
      +releases: []Release
    }
    class Release {
      +name: string
      +chart: string
      +namespace: string
      +version: string
      +installed: bool
      +needs: []string
      +values: []ValueOverride
      +labels: map[string]string
    }
    Helmfile "1" o-- "*" Release
    Release "1" o-- "*" ValueOverride

File-Level Changes

Introduce k8s subcommands in the CLI
  • Register new k8s command in root CLI
  • Implement list-configs, guide, and deploy subcommands
  • Update CLI help docs to document Kubernetes workflows
Files: cmd/cli/commands/k8s.go, cmd/cli/commands/root.go, cmd/cli/README.md

Enhance top-level README with Kubernetes deployment guide
  • Rename and expand Kubernetes section to “Kubernetes Deployment”
  • Add feature bullets (inference scheduling, P/D disaggregation, wide-EP, multi-GPU)
  • Insert Quick Start commands and prerequisites
Files: README.md

Add k8s resource directory with detailed guides
  • Create k8s/README.md outlining CLI usage
  • Add k8s/guides for multiple well-lit paths (inference-scheduling, P/D disaggregation, precise prefix cache, simulated accelerators)
  • Include HTTPRoute manifests, Helmfile templates, and values.yaml examples
Files: k8s/README.md, k8s/guides/**

Provide prerequisite scripts for Kubernetes deployment
  • Add install-deps.sh to install client tools (kubectl, helm, yq, etc.)
  • Add install-gateway-provider-dependencies.sh to manage Gateway API CRDs
  • Include README docs for each script under prereq folders (see the run sketch below)
Files: k8s/guides/prereq/client-setup/install-deps.sh, k8s/guides/prereq/client-setup/README.md, k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh, k8s/guides/prereq/gateway-provider/README.md
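A minimal sketch of running these prerequisites in order (paths from the file list above; the run order and the absence of required arguments are assumptions):

```bash
# Install client-side tooling (kubectl, helm, yq, etc.)
./k8s/guides/prereq/client-setup/install-deps.sh

# Install or update the Gateway API CRDs and provider dependencies
./k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh
```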


@gemini-code-assist

Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Kubernetes deployment experience for vLLM inference servers by introducing a dedicated CLI command and a suite of detailed guides. It aims to provide production-ready configurations and tools for advanced LLM serving features, making it easier to deploy and manage vLLM across various cloud and hardware environments.

Highlights

  • New Kubernetes CLI Command: Introduced a new docker model k8s command with subcommands (list-configs, deploy, guide) to simplify and standardize vLLM deployments on Kubernetes.
  • Comprehensive Deployment Guides: Added extensive documentation and guides for deploying vLLM on Kubernetes, covering various advanced features and hardware configurations.
  • Advanced vLLM Features Support: The new Kubernetes deployment capabilities support intelligent inference scheduling, prefill/decode disaggregation, wide expert-parallelism, and multi-GPU setups (NVIDIA, AMD, Google TPU, Intel XPU).
  • Specialized Dockerfiles: New Dockerfiles have been added for building vLLM images optimized for different environments, including AWS, CUDA, GKE, and Intel XPU, ensuring compatibility and performance across diverse infrastructures.
  • Precise Prefix Cache Aware Routing: A new feature and guide for precise prefix cache aware routing is included, leveraging vLLM KV-Events data to improve cache hit rates and optimize load balancing.

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `k8s/guides/prereq/client-setup/install-deps.sh:164-170` </location>
<code_context>
+########################################
+#  Helm diff plugin
+########################################
+if ! helm plugin list | grep -q diff; then
+  echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
+  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
+fi
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Helm plugin installation does not check for plugin version.

Currently, the script installs the plugin only if it's missing, but does not ensure the required version is present. Please add logic to check the installed version and upgrade if it differs from ${HELMDIFF_VERSION}.

```suggestion
########################################
#  Helm diff plugin
########################################
HELM_DIFF_INSTALLED_VERSION=$(helm plugin list | awk '/diff/ {print $2}' | tr -d 'v')
REQUIRED_DIFF_VERSION=$(echo "${HELMDIFF_VERSION}" | tr -d 'v')

if ! helm plugin list | grep -q diff; then
  echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
elif [ "${HELM_DIFF_INSTALLED_VERSION}" != "${REQUIRED_DIFF_VERSION}" ]; then
  echo "📦 helm-diff plugin version (${HELM_DIFF_INSTALLED_VERSION}) does not match required (${REQUIRED_DIFF_VERSION}). Reinstalling..."
  helm plugin uninstall diff
  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
fi
```
</issue_to_address>

### Comment 2
<location> `k8s/guides/prereq/client-setup/install-deps.sh:144-149` </location>
<code_context>
+  echo "Installing kubectl..."
+  K8S_URL="https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)"
+  curl -sLO "${K8S_URL}/bin/${OS}/${ARCH}/kubectl"
+  if [[ "$OS" == "darwin" ]]; then
+    sudo install -m 0755 kubectl /usr/local/bin/kubectl
+  else
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
</code_context>

<issue_to_address>
**suggestion:** Use of 'sudo' for installing binaries may fail in environments without sudo.

Consider checking for 'sudo' availability or offering a non-sudo installation option to support environments where 'sudo' is not present or permitted.

```suggestion
  if command -v sudo &> /dev/null; then
    if [[ "$OS" == "darwin" ]]; then
      sudo install -m 0755 kubectl /usr/local/bin/kubectl
      echo "kubectl installed to /usr/local/bin/kubectl"
    else
      sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
      echo "kubectl installed to /usr/local/bin/kubectl"
    fi
  else
    # Fallback: install to ~/.local/bin (create if needed)
    mkdir -p "$HOME/.local/bin"
    install -m 0755 kubectl "$HOME/.local/bin/kubectl"
    echo "sudo not found. kubectl installed to $HOME/.local/bin/kubectl"
    echo "Make sure $HOME/.local/bin is in your PATH."
  fi
  rm kubectl
```
</issue_to_address>

### Comment 3
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/values_tpu.yaml:46` </location>
<code_context>
+      interval: "30s"
+  containers:
+  - name: "vllm"
+    image: "vllm/vllm-tpu:e92694b6fe264a85371317295bca6643508034ef"
+    modelCommand: vllmServe
+    args:
</code_context>

<issue_to_address>
**suggestion:** Hardcoded image tag may hinder future upgrades.

Consider using a parameterized image tag or referencing a versioned release to simplify future upgrades and maintenance.

Suggested implementation:

```
    image: "vllm/vllm-tpu:{{ .Values.vllm.image.tag }}"

```

You should also add the following to your Helm values file (e.g., `values.yaml`) if it does not already exist:

```yaml
vllm:
  image:
    tag: "e92694b6fe264a85371317295bca6643508034ef"  # Default tag, can be overridden
```

This allows you to easily update the image tag in one place for future upgrades.
</issue_to_address>

### Comment 4
<location> `k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh:43` </location>
<code_context>
+GATEWAY_API_CRD_REF="?ref=${GATEWAY_API_CRD_REVISION}"
+### Base CRDs
+log_success "📜 Base CRDs: ${LOG_ACTION_NAME}..."
+kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF} || true
+
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Suppressing errors with '|| true' may mask installation failures.

Explicit error handling is recommended to avoid silent failures and ensure users are informed of any issues during CRD installation or deletion.

```suggestion
if ! kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF}; then
  log_error "Failed to ${LOG_ACTION_NAME} Gateway API CRDs. Please check your cluster and try again."
  exit 1
fi
```
</issue_to_address>

### Comment 5
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/digitalocean-values.yaml:82-85` </location>
<code_context>
+  - name: torch-compile-cache
+    emptyDir: {}
+  # IMPORTANT: DigitalOcean GPU node tolerations
+  tolerations:
+  - key: "nvidia.com/gpu"
+    operator: "Exists"
+    effect: "NoSchedule"
+
+# Prefill disabled in inference-scheduling scenario
</code_context>

<issue_to_address>
**suggestion:** Tolerations are set for GPU scheduling but nodeSelector is not present.

Consider adding a nodeSelector for GPU node labels to ensure pods are scheduled only on GPU nodes, which is especially important in mixed-node environments.

Suggested implementation:

```
  # IMPORTANT: DigitalOcean GPU node tolerations
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  # Ensure pods are scheduled only on GPU nodes
  nodeSelector:
    accelerator: nvidia

```

- If your GPU nodes use a different label, replace `accelerator: nvidia` with the correct label key and value for your cluster (e.g., `kubernetes.io/instance-type: gpu` or another custom label).
- Make sure the indentation matches the rest of your YAML file.
</issue_to_address>

### Comment 6
<location> `k8s/guides/pd-disaggregation/README.md:146` </location>
<code_context>
+  hashBlockSize: 5
+```
+
+Some examples in which you might want to do selective PD might include:
+- When the prompt is short enough that the amount of work split inference into prefill and decode phases, and then open a kv transfer between those two GPUs is greater than the amount of work to do both phases on the same decode inference worker.
+- When Prefill units are at full capacity.
+
</code_context>

<issue_to_address>
**issue (typo):** Typo: 'dissagregation' should be 'disaggregation' in this section.

Please update all instances of 'dissagregation' to 'disaggregation'.

```suggestion
Selective PD is a feature in the `inference-scheduler` within the context of prefill-decode disaggregation, although it is disabled by default. This features enables routing to just decode even with the P/D deployed. To enable it, you will need to set `threshold` value for the `pd-profile-handler` plugin, in the [GAIE values file](./gaie-pd/values.yaml). You can see the value of this here:
```
</issue_to_address>

### Comment 7
<location> `k8s/guides/predicted-latency-based-scheduling/README.md:316` </location>
<code_context>
+**Current limitations**
+- Percentile: only **p90** supported.  
+- Training: only **streaming mode** supported.  
+- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.  
+
+---
</code_context>

<issue_to_address>
**issue (typo):** Typo: 'obsevability' should be 'observability'.

Update the spelling to 'observability'.

```suggestion
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.  
```
</issue_to_address>

### Comment 8
<location> `k8s/guides/prereq/infrastructure/README.md:72` </location>
<code_context>
+
+## Installing on a well-lit infrastructure provider
+
+The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers is expected to access accelerators.
+
+* [DigitalOcean Kubernetes (DOKS)](../../../docs/infra-providers/digitalocean/README.md)
</code_context>

<issue_to_address>
**issue (typo):** Grammar: 'model servers is expected' should be 'model servers are expected'.

Update the sentence to use 'are' instead of 'is' for correct grammar.

```suggestion
The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers are expected to access accelerators.
```
</issue_to_address>


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable feature: comprehensive Kubernetes deployment support for vLLM, accessible via a new docker model k8s CLI command. The scope is impressive, covering multiple deployment scenarios (inference scheduling, P/D disaggregation), various hardware backends (CUDA, XPU, TPU), and cloud providers. The inclusion of detailed guides, Helm charts, and Dockerfiles makes this a robust solution for production deployments. My review identifies a few key areas for improvement, including fixing a critical bug in a Dockerfile, correcting a typo in a file path, improving the user experience of the new CLI command, and enhancing the maintainability of the deployment configurations.

COPY --from=builder /usr/local/lib/libgdrapi.so.2.* /usr/local/lib/
COPY --from=builder /usr/local/lib/libgdrapi.so* /usr/local/lib/

RUN
critical

This empty RUN instruction is invalid Dockerfile syntax and will cause the image build to fail. It must be removed.

Comment on lines +41 to +85
```go
c := &cobra.Command{
	Use:   "deploy",
	Short: "Deploy vLLM on Kubernetes",
	Long:  "Deploy vLLM inference server on Kubernetes with the specified configuration",
	RunE: func(cmd *cobra.Command, args []string) error {
		if config == "" {
			return fmt.Errorf("--config is required. Use 'docker model k8s list-configs' to see available configurations")
		}

		// Get the path to the k8s resources
		resourcesPath, err := getK8sResourcesPath()
		if err != nil {
			return err
		}

		configPath := filepath.Join(resourcesPath, "configs", config)
		if _, err := os.Stat(configPath); os.IsNotExist(err) {
			return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
		}

		cmd.Printf("Deploying vLLM with configuration: %s\n", config)
		cmd.Printf("Namespace: %s\n", namespace)
		if model != "" {
			cmd.Printf("Model: %s\n", model)
		}
		cmd.Printf("Replicas: %d\n", replicas)

		// Check if kubectl is available
		if _, err := exec.LookPath("kubectl"); err != nil {
			return fmt.Errorf("kubectl not found in PATH. Please install kubectl to deploy to Kubernetes")
		}

		// Check if helm is available for more complex deployments
		if _, err := exec.LookPath("helm"); err != nil {
			cmd.PrintErrln("Warning: helm not found in PATH. Some deployment options may not be available.")
		}

		cmd.Println("\nDeployment instructions:")
		cmd.Printf("1. Ensure your kubectl context is set to the correct cluster\n")
		cmd.Printf("2. Create namespace if it doesn't exist: kubectl create namespace %s\n", namespace)
		cmd.Printf("3. Apply the configuration: kubectl apply -f %s -n %s\n", configPath, namespace)
		cmd.Printf("\nFor detailed deployment guides, run: docker model k8s guide\n")

		return nil
	},
```
high

The deploy subcommand's name is misleading as it doesn't perform a deployment. It only prints the kubectl commands for the user to run manually. This can be confusing for users who would expect a deploy command to execute the deployment. To improve clarity and user experience, consider renaming the command to something like show-deploy-command or guide-deploy. Alternatively, you could implement the logic to execute the kubectl commands directly, which would be more powerful.
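For context, a sketch of the manual flow the current implementation prints (assembled from the Printf calls above; <namespace> and <configPath> stand for the values the command substitutes):

```bash
# Steps `docker model k8s deploy` currently asks the user to run by hand
kubectl create namespace <namespace>
kubectl apply -f <configPath> -n <namespace>
```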

}

// If not found, we'll create a minimal structure
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "modelr", "k8s")
high

There is a typo in the directory path being created for fallback resources. The code uses .docker/modelr/k8s, which is inconsistent with the path checked on line 157 (.docker/model/k8s). This should be corrected to model to ensure the command behaves as expected.

Suggested change
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "modelr", "k8s")
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "model", "k8s")

Comment on lines +439 to +443
else \
VLLM_COMMIT="$(git merge-base HEAD origin/main)"; \
VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_${VLLM_WHEEL_ARCH}.whl"; \
VLLM_USE_PRECOMPILED=1 uv pip install --editable .; \
fi; \
high

The logic for installing vllm in this else block appears to be incorrect. The VLLM_PRECOMPILED_WHEEL_LOCATION variable is defined but then not used, as the next command installs vllm from source via --editable .. This misses the performance benefit of using precompiled components and makes the build process slower than intended. The logic should be updated to correctly use the precompiled wheel if it's available, similar to the implementation in k8s/docker/Dockerfile.cuda.

Comment on lines +101 to +106
{{- else if eq .Environment.Name "xpu" }}
- ms-inference-scheduling/values_xpu.yaml
{{- else if eq .Environment.Name "digitalocean" }}
- ms-inference-scheduling/digitalocean-values.yaml
{{- else if eq .Environment.Name "xpu" }}
- ms-inference-scheduling/values_xpu.yaml
medium

The template contains a duplicate else if condition for eq .Environment.Name "xpu". The second check is redundant and should be removed to clean up the code.

{{- else if eq .Environment.Name "xpu" }}
      - ms-inference-scheduling/values_xpu.yaml

Comment on lines +146 to +173
⚠️ **Important - For Intel BMG GPU Users**: Before running `helmfile apply`, you must update the GPU resource type in `ms-pd/values_xpu.yaml`:

```yaml
# Edit ms-pd/values_xpu.yaml
accelerator:
  type: intel
  resources:
    intel: "gpu.intel.com/xe"  # Add gpu.intel.com/xe

# Also update decode and prefill resource specifications:
decode:
  containers:
  - name: "vllm"
    resources:
      limits:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
      requests:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe

prefill:
  containers:
  - name: "vllm"
    resources:
      limits:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
      requests:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
```
medium

The guide instructs users to manually edit the ms-pd/values_xpu.yaml file to switch between different Intel GPU types (i915 vs. xe). This manual step is error-prone and hinders automation. It would be more robust to handle this through a separate helmfile environment (e.g., xpu-bmg) or a value passed to the helmfile apply command. This would make the deployment process more declarative and user-friendly.
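A hypothetical invocation if such an environment were added (the environment name xpu-bmg is illustrative and not part of this PR):

```bash
# Select the Intel BMG (gpu.intel.com/xe) values via a dedicated helmfile
# environment instead of hand-editing ms-pd/values_xpu.yaml
helmfile apply -e xpu-bmg
```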

@ericcurtin ericcurtin force-pushed the create-docker-model-k8s branch from a96e9f2 to d0f1092 Compare October 15, 2025 21:11
Copilot AI left a comment

Pull Request Overview

Add Kubernetes deployment support and guides for vLLM via a new CLI command and curated Helm-based recipes.

  • Introduces docker model k8s CLI command (deploy/list-configs/guide)
  • Adds production-ready Kubernetes guides (inference scheduling, P/D disaggregation, precise prefix cache awareness, simulator), plus Dockerfiles for CUDA/AWS/GKE/XPU images
  • Includes gateway provider prerequisite helmfiles and client setup scripts
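As a quick orientation to the new image files, a plausible local build of the CUDA variant might look like this (the tag is illustrative; required build args, if any, are not shown):

```bash
# Build the CUDA-based vLLM build/runtime image added under k8s/docker/
docker build -f k8s/docker/Dockerfile.cuda -t vllm-cuda:dev .
```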

Reviewed Changes

Copilot reviewed 52 out of 52 changed files in this pull request and generated 16 comments.

  • cmd/cli/commands/k8s.go: Adds k8s command group with deploy/list-configs/guide subcommands
  • cmd/cli/commands/root.go: Wires newK8sCmd into the root command
  • k8s/guides/*: Adds detailed Helmfile-based guides for multiple deployment patterns
  • k8s/guides/prereq/*: Adds gateway provider and client setup prerequisites
  • k8s/docker/Dockerfile.cuda: CUDA-based build/runtime image for vLLM
  • k8s/docker/Dockerfile.aws: AWS-optimized build/runtime image with UCX/NVSHMEM/NIXL
  • k8s/docker/Dockerfile.gke: GKE-optimized image with DeepEP/DeepGEMM/FlashInfer
  • k8s/docker/Dockerfile.xpu: Intel XPU image for vLLM
  • k8s/README.md: Top-level K8s usage/readme aligned with new CLI
Comments suppressed due to low confidence (1)

cmd/cli/commands/k8s.go:1

  • The list contains 'wide-ep' which doesn't correspond to a guide directory added in this PR, while 'precise-prefix-cache-aware' exists but isn't listed. Replace 'wide-ep' with an available guide name (e.g., 'precise-prefix-cache-aware') or ensure the referenced guide is actually present.
package commands


git_clone_and_cd https://github.com/vllm-project/vllm.git /app/vllm releases/v0.11.0 && \
VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/f71952c1c49fb86686b0b300b727b26282362bf4/vllm-0.11.0%2Bcu129-cp38-abi3-manylinux1_x86_64.whl VLLM_USE_PRECOMPILED=1 upip .

ENTRYPOINT ["/app/code/venv/bin/vllm", "serve"]
Copilot AI Oct 15, 2025

The venv path in ENTRYPOINT is incorrect; earlier logic creates the virtualenv under /app/venv, not /app/code/venv. This will fail at runtime with 'no such file or directory'. Replace with /app/venv/bin/vllm.

Suggested change
ENTRYPOINT ["/app/code/venv/bin/vllm", "serve"]
ENTRYPOINT ["/app/venv/bin/vllm", "serve"]

Comment on lines +380 to +381
RUN

Copilot AI Oct 15, 2025
This empty RUN instruction is invalid Dockerfile syntax and will fail the build. Remove this line or add the intended command.

Suggested change
RUN

RUN mkdir -p /wheels

# Copy patches before build
COPY patches/ /tmp/patches/
Copilot AI Oct 15, 2025
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.

wget https://developer.download.nvidia.com/compute/redist/nvshmem/${NVSHMEM_VERSION}/source/nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz -O nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
tar -xf nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
cd nvshmem_src && \
git apply /tmp/patches/cks_nvshmem${NVSHMEM_VERSION}.patch && \
Copilot AI Oct 15, 2025
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.

Comment on lines +56 to +59
configPath := filepath.Join(resourcesPath, "configs", config)
if _, err := os.Stat(configPath); os.IsNotExist(err) {
return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
}
Copilot AI Oct 15, 2025
The deploy command checks a non-existent 'configs' directory. The repository organizes deployment options under k8s/guides/<guide>/ with helmfile.yaml.gotmpl. Update to look under guides/ and verify the helmfile exists.

inferenceExtension:
replicas: 1
image:
# both downstream infernece-scheduler and upstream epp image can support simulated-accelerators example
Copilot AI Oct 15, 2025
Correct 'infernece-scheduler' to 'inference-scheduler'.

Suggested change
# both downstream infernece-scheduler and upstream epp image can support simulated-accelerators example
# both downstream inference-scheduler and upstream epp image can support simulated-accelerators example

inferenceExtension:
replicas: 1
image:
# both downstream infernece-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
Copilot AI Oct 15, 2025
Correct 'infernece-scheduler' to 'inference-scheduler'.

Suggested change
# both downstream infernece-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
# both downstream inference-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here

**Current limitations**
- Percentile: only **p90** supported.
- Training: only **streaming mode** supported.
- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
Copilot AI Oct 15, 2025
Correct 'obsevability' to 'observability'.

Suggested change
- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.

return c
}

func getK8sResourcesPath() (string, error) {
Copilot AI Oct 15, 2025
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating a helmfile exists (e.g., k8s/guides/<guide>/helmfile.yaml.gotmpl). Example fix: replace the 'configs' check with something like: configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }.

return "", fmt.Errorf("failed to create k8s resources directory: %w", err)
}

return homePath, nil
Copilot AI Oct 15, 2025
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating a helmfile exists (e.g., k8s/guides/<guide>/helmfile.yaml.gotmpl). Example fix: replace the 'configs' check with something like: configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }.

@ericcurtin ericcurtin closed this Oct 15, 2025
@ericcurtin ericcurtin deleted the create-docker-model-k8s branch October 15, 2025 21:17