
Conversation


@ericcurtin ericcurtin commented Oct 15, 2025

Powered by llm-d.

Summary by Sourcery

Introduce Kubernetes deployment support for vLLM via the Docker Model CLI and comprehensive production-ready guides.

New Features:

  • Add docker model k8s CLI command with list-configs, deploy, and guide subcommands for managing vLLM deployments on Kubernetes
  • Extend top-level README and CLI documentation with a Kubernetes Deployment section, quickstart commands, and prerequisites
  • Include a new k8s/guides directory with Helm charts, manifests, scripts, and step-by-step guides covering multiple deployment scenarios and hardware backends (inference scheduling, PD disaggregation, wide expert parallelism, simulated accelerators, and more)

Documentation:

  • Add detailed Kubernetes deployment guides and configuration examples for various use cases and hardware environments
  • Update CLI README to document new k8s commands
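Based on the subcommands and flags described above, a typical session might look like the following sketch (the configuration name is illustrative, and the flag names are inferred from the deploy command's definition; exact defaults may differ):

```bash
# List the deployment configurations bundled with the CLI
docker model k8s list-configs

# Print deployment instructions for one configuration
# (--config is required; --namespace and --replicas are optional)
docker model k8s deploy --config inference-scheduling --namespace llm-d --replicas 2

# Show the step-by-step deployment guides
docker model k8s guide
```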

Copilot AI review requested due to automatic review settings October 15, 2025 21:08

sourcery-ai bot commented Oct 15, 2025

Reviewer's Guide

This PR integrates a new Kubernetes deployment experience into the Docker Model Runner CLI and project docs by adding a docker model k8s command set, enriching the top-level README with deployment quick-start and features, and shipping a full k8s directory containing well-lit path guides, Helm manifests, and prerequisite scripts for running vLLM on Kubernetes.

Class diagram for new k8s command integration in Docker Model CLI

classDiagram
    class RootCmd {
      +NewRootCmd(cli)
      +AddCommand(...)
    }
    class K8sCmd {
      +newK8sCmd()
    }
    class K8sDeployCmd {
      +newK8sDeployCmd()
      -namespace: string
      -config: string
      -model: string
      -replicas: int
      +RunE(cmd, args)
    }
    class K8sListConfigsCmd {
      +newK8sListConfigsCmd()
      +RunE(cmd, args)
    }
    class K8sGuideCmd {
      +newK8sGuideCmd()
      +RunE(cmd, args)
    }
    RootCmd --> K8sCmd
    K8sCmd --> K8sDeployCmd
    K8sCmd --> K8sListConfigsCmd
    K8sCmd --> K8sGuideCmd

Class diagram for Helm chart values structure for PD disaggregation and inference scheduling

classDiagram
    class ModelServiceValues {
      +modelArtifacts: uri, name, size, authSecretName
      +routing: servicePort, proxy, inferencePool, httpRoute, epp
      +decode: create, replicas, containers[], monitoring, volumes
      +prefill: create, replicas, containers[], monitoring, volumes
      +accelerator: type
      +multinode: bool
    }
    class ContainerSpec {
      +name: string
      +image: string
      +modelCommand: string
      +args: []string
      +env: []EnvVar
      +ports: []Port
      +resources: ResourceSpec
      +mountModelVolume: bool
      +volumeMounts: []VolumeMount
    }
    ModelServiceValues "1" o-- "*" ContainerSpec
    ModelServiceValues "1" o-- "*" VolumeMount
    ModelServiceValues "1" o-- "*" Volume
    ContainerSpec o-- ResourceSpec
    ContainerSpec o-- EnvVar
    ContainerSpec o-- Port

Class diagram for Gateway provider Helmfile structure

classDiagram
    class Helmfile {
      +releases: []Release
    }
    class Release {
      +name: string
      +chart: string
      +namespace: string
      +version: string
      +installed: bool
      +needs: []string
      +values: []ValueOverride
      +labels: map[string]string
    }
    Helmfile "1" o-- "*" Release
    Release "1" o-- "*" ValueOverride

File-Level Changes

Introduce k8s subcommands in the CLI
  • Register new k8s command in root CLI
  • Implement list-configs, guide, and deploy subcommands
  • Update CLI help docs to document Kubernetes workflows
Files: cmd/cli/commands/k8s.go, cmd/cli/commands/root.go, cmd/cli/README.md

Enhance top-level README with Kubernetes deployment guide
  • Rename and expand Kubernetes section to “Kubernetes Deployment”
  • Add feature bullets (inference scheduling, P/D disaggregation, wide-EP, multi-GPU)
  • Insert Quick Start commands and prerequisites
Files: README.md

Add k8s resource directory with detailed guides
  • Create k8s/README.md outlining CLI usage
  • Add k8s/guides for multiple well-lit paths (inference-scheduling, P/D disaggregation, precise prefix cache, simulated accelerators)
  • Include HTTPRoute manifests, Helmfile templates, and values.yaml examples
Files: k8s/README.md, k8s/guides/**

Provide prerequisite scripts for Kubernetes deployment
  • Add install-deps.sh to install client tools (kubectl, helm, yq, etc.)
  • Add install-gateway-provider-dependencies.sh to manage Gateway API CRDs
  • Include README docs for each script under prereq folders (see the run sketch below)
Files: k8s/guides/prereq/client-setup/install-deps.sh, k8s/guides/prereq/client-setup/README.md, k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh, k8s/guides/prereq/gateway-provider/README.md
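A minimal sketch of running these prerequisites in order (paths from the file list above; the run order and the absence of required arguments are assumptions):

```bash
# Install client-side tooling (kubectl, helm, yq, etc.)
./k8s/guides/prereq/client-setup/install-deps.sh

# Install or update the Gateway API CRDs and provider dependencies
./k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh
```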


@gemini-code-assist

Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Kubernetes deployment experience for vLLM inference servers by introducing a dedicated CLI command and a suite of detailed guides. It aims to provide production-ready configurations and tools for advanced LLM serving features, making it easier to deploy and manage vLLM across various cloud and hardware environments.

Highlights

  • New Kubernetes CLI Command: Introduced a new docker model k8s command with subcommands (list-configs, deploy, guide) to simplify and standardize vLLM deployments on Kubernetes.
  • Comprehensive Deployment Guides: Added extensive documentation and guides for deploying vLLM on Kubernetes, covering various advanced features and hardware configurations.
  • Advanced vLLM Features Support: The new Kubernetes deployment capabilities support intelligent inference scheduling, prefill/decode disaggregation, wide expert-parallelism, and multi-GPU setups (NVIDIA, AMD, Google TPU, Intel XPU).
  • Specialized Dockerfiles: New Dockerfiles have been added for building vLLM images optimized for different environments, including AWS, CUDA, GKE, and Intel XPU, ensuring compatibility and performance across diverse infrastructures.
  • Precise Prefix Cache Aware Routing: A new feature and guide for precise prefix cache aware routing is included, leveraging vLLM KV-Events data to improve cache hit rates and optimize load balancing.

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `k8s/guides/prereq/client-setup/install-deps.sh:164-170` </location>
<code_context>
+########################################
+#  Helm diff plugin
+########################################
+if ! helm plugin list | grep -q diff; then
+  echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
+  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
+fi
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Helm plugin installation does not check for plugin version.

Currently, the script installs the plugin only if it's missing, but does not ensure the required version is present. Please add logic to check the installed version and upgrade if it differs from ${HELMDIFF_VERSION}.

```suggestion
########################################
#  Helm diff plugin
########################################
HELM_DIFF_INSTALLED_VERSION=$(helm plugin list | awk '/diff/ {print $2}' | tr -d 'v')
REQUIRED_DIFF_VERSION=$(echo "${HELMDIFF_VERSION}" | tr -d 'v')

if ! helm plugin list | grep -q diff; then
  echo "📦 helm-diff plugin not found. Installing ${HELMDIFF_VERSION}..."
  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
elif [ "${HELM_DIFF_INSTALLED_VERSION}" != "${REQUIRED_DIFF_VERSION}" ]; then
  echo "📦 helm-diff plugin version (${HELM_DIFF_INSTALLED_VERSION}) does not match required (${REQUIRED_DIFF_VERSION}). Reinstalling..."
  helm plugin uninstall diff
  helm plugin install --version "${HELMDIFF_VERSION}" https://github.com/databus23/helm-diff
fi
```
</issue_to_address>

### Comment 2
<location> `k8s/guides/prereq/client-setup/install-deps.sh:144-149` </location>
<code_context>
+  echo "Installing kubectl..."
+  K8S_URL="https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)"
+  curl -sLO "${K8S_URL}/bin/${OS}/${ARCH}/kubectl"
+  if [[ "$OS" == "darwin" ]]; then
+    sudo install -m 0755 kubectl /usr/local/bin/kubectl
+  else
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
</code_context>

<issue_to_address>
**suggestion:** Use of 'sudo' for installing binaries may fail in environments without sudo.

Consider checking for 'sudo' availability or offering a non-sudo installation option to support environments where 'sudo' is not present or permitted.

```suggestion
  if command -v sudo &> /dev/null; then
    if [[ "$OS" == "darwin" ]]; then
      sudo install -m 0755 kubectl /usr/local/bin/kubectl
      echo "kubectl installed to /usr/local/bin/kubectl"
    else
      sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
      echo "kubectl installed to /usr/local/bin/kubectl"
    fi
  else
    # Fallback: install to ~/.local/bin (create if needed)
    mkdir -p "$HOME/.local/bin"
    install -m 0755 kubectl "$HOME/.local/bin/kubectl"
    echo "sudo not found. kubectl installed to $HOME/.local/bin/kubectl"
    echo "Make sure $HOME/.local/bin is in your PATH."
  fi
  rm kubectl
```
</issue_to_address>

### Comment 3
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/values_tpu.yaml:46` </location>
<code_context>
+      interval: "30s"
+  containers:
+  - name: "vllm"
+    image: "vllm/vllm-tpu:e92694b6fe264a85371317295bca6643508034ef"
+    modelCommand: vllmServe
+    args:
</code_context>

<issue_to_address>
**suggestion:** Hardcoded image tag may hinder future upgrades.

Consider using a parameterized image tag or referencing a versioned release to simplify future upgrades and maintenance.

Suggested implementation:

```
    image: "vllm/vllm-tpu:{{ .Values.vllm.image.tag }}"

```

You should also add the following to your Helm values file (e.g., `values.yaml`) if it does not already exist:

```yaml
vllm:
  image:
    tag: "e92694b6fe264a85371317295bca6643508034ef"  # Default tag, can be overridden
```

This allows you to easily update the image tag in one place for future upgrades.
</issue_to_address>

### Comment 4
<location> `k8s/guides/prereq/gateway-provider/install-gateway-provider-dependencies.sh:43` </location>
<code_context>
+GATEWAY_API_CRD_REF="?ref=${GATEWAY_API_CRD_REVISION}"
+### Base CRDs
+log_success "📜 Base CRDs: ${LOG_ACTION_NAME}..."
+kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF} || true
+
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Suppressing errors with '|| true' may mask installation failures.

Explicit error handling is recommended to avoid silent failures and ensure users are informed of any issues during CRD installation or deletion.

```suggestion
if ! kubectl $MODE -k https://github.com/kubernetes-sigs/gateway-api/config/crd/${GATEWAY_API_CRD_REF}; then
  log_error "Failed to ${LOG_ACTION_NAME} Gateway API CRDs. Please check your cluster and try again."
  exit 1
fi
```
</issue_to_address>

### Comment 5
<location> `k8s/guides/inference-scheduling/ms-inference-scheduling/digitalocean-values.yaml:82-85` </location>
<code_context>
+  - name: torch-compile-cache
+    emptyDir: {}
+  # IMPORTANT: DigitalOcean GPU node tolerations
+  tolerations:
+  - key: "nvidia.com/gpu"
+    operator: "Exists"
+    effect: "NoSchedule"
+
+# Prefill disabled in inference-scheduling scenario
</code_context>

<issue_to_address>
**suggestion:** Tolerations are set for GPU scheduling but nodeSelector is not present.

Consider adding a nodeSelector for GPU node labels to ensure pods are scheduled only on GPU nodes, which is especially important in mixed-node environments.

Suggested implementation:

```
  # IMPORTANT: DigitalOcean GPU node tolerations
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  # Ensure pods are scheduled only on GPU nodes
  nodeSelector:
    accelerator: nvidia

```

- If your GPU nodes use a different label, replace `accelerator: nvidia` with the correct label key and value for your cluster (e.g., `kubernetes.io/instance-type: gpu` or another custom label).
- Make sure the indentation matches the rest of your YAML file.
</issue_to_address>

### Comment 6
<location> `k8s/guides/pd-disaggregation/README.md:146` </location>
<code_context>
+  hashBlockSize: 5
+```
+
+Some examples in which you might want to do selective PD might include:
+- When the prompt is short enough that the amount of work split inference into prefill and decode phases, and then open a kv transfer between those two GPUs is greater than the amount of work to do both phases on the same decode inference worker.
+- When Prefill units are at full capacity.
+
</code_context>

<issue_to_address>
**issue (typo):** Typo: 'dissagregation' should be 'disaggregation' in this section.

Please update all instances of 'dissagregation' to 'disaggregation'.

```suggestion
Selective PD is a feature in the `inference-scheduler` within the context of prefill-decode disaggregation, although it is disabled by default. This features enables routing to just decode even with the P/D deployed. To enable it, you will need to set `threshold` value for the `pd-profile-handler` plugin, in the [GAIE values file](./gaie-pd/values.yaml). You can see the value of this here:
```
</issue_to_address>

### Comment 7
<location> `k8s/guides/predicted-latency-based-scheduling/README.md:316` </location>
<code_context>
+**Current limitations**
+- Percentile: only **p90** supported.  
+- Training: only **streaming mode** supported.  
+- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.  
+
+---
</code_context>

<issue_to_address>
**issue (typo):** Typo: 'obsevability' should be 'observability'.

Update the spelling to 'observability'.

```suggestion
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.  
```
</issue_to_address>

### Comment 8
<location> `k8s/guides/prereq/infrastructure/README.md:72` </location>
<code_context>
+
+## Installing on a well-lit infrastructure provider
+
+The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers is expected to access accelerators.
+
+* [DigitalOcean Kubernetes (DOKS)](../../../docs/infra-providers/digitalocean/README.md)
</code_context>

<issue_to_address>
**issue (typo):** Grammar: 'model servers is expected' should be 'model servers are expected'.

Update the sentence to use 'are' instead of 'is' for correct grammar.

```suggestion
The following documentation describes llm-d tested setup for cluster infrastructure providers as well as specific deployment settings that will impact how model servers are expected to access accelerators.
```
</issue_to_address>


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable feature: comprehensive Kubernetes deployment support for vLLM, accessible via a new docker model k8s CLI command. The scope is impressive, covering multiple deployment scenarios (inference scheduling, P/D disaggregation), various hardware backends (CUDA, XPU, TPU), and cloud providers. The inclusion of detailed guides, Helm charts, and Dockerfiles makes this a robust solution for production deployments. My review identifies a few key areas for improvement, including fixing a critical bug in a Dockerfile, correcting a typo in a file path, improving the user experience of the new CLI command, and enhancing the maintainability of the deployment configurations.

COPY --from=builder /usr/local/lib/libgdrapi.so.2.* /usr/local/lib/
COPY --from=builder /usr/local/lib/libgdrapi.so* /usr/local/lib/

RUN
critical

This empty RUN instruction is invalid Dockerfile syntax and will cause the image build to fail. It must be removed.

Comment on lines +41 to +85
```go
c := &cobra.Command{
	Use:   "deploy",
	Short: "Deploy vLLM on Kubernetes",
	Long:  "Deploy vLLM inference server on Kubernetes with the specified configuration",
	RunE: func(cmd *cobra.Command, args []string) error {
		if config == "" {
			return fmt.Errorf("--config is required. Use 'docker model k8s list-configs' to see available configurations")
		}

		// Get the path to the k8s resources
		resourcesPath, err := getK8sResourcesPath()
		if err != nil {
			return err
		}

		configPath := filepath.Join(resourcesPath, "configs", config)
		if _, err := os.Stat(configPath); os.IsNotExist(err) {
			return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
		}

		cmd.Printf("Deploying vLLM with configuration: %s\n", config)
		cmd.Printf("Namespace: %s\n", namespace)
		if model != "" {
			cmd.Printf("Model: %s\n", model)
		}
		cmd.Printf("Replicas: %d\n", replicas)

		// Check if kubectl is available
		if _, err := exec.LookPath("kubectl"); err != nil {
			return fmt.Errorf("kubectl not found in PATH. Please install kubectl to deploy to Kubernetes")
		}

		// Check if helm is available for more complex deployments
		if _, err := exec.LookPath("helm"); err != nil {
			cmd.PrintErrln("Warning: helm not found in PATH. Some deployment options may not be available.")
		}

		cmd.Println("\nDeployment instructions:")
		cmd.Printf("1. Ensure your kubectl context is set to the correct cluster\n")
		cmd.Printf("2. Create namespace if it doesn't exist: kubectl create namespace %s\n", namespace)
		cmd.Printf("3. Apply the configuration: kubectl apply -f %s -n %s\n", configPath, namespace)
		cmd.Printf("\nFor detailed deployment guides, run: docker model k8s guide\n")

		return nil
	},
```
high

The deploy subcommand's name is misleading as it doesn't perform a deployment. It only prints the kubectl commands for the user to run manually. This can be confusing for users who would expect a deploy command to execute the deployment. To improve clarity and user experience, consider renaming the command to something like show-deploy-command or guide-deploy. Alternatively, you could implement the logic to execute the kubectl commands directly, which would be more powerful.
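For context, a sketch of the manual flow the current implementation prints (assembled from the Printf calls above; <namespace> and <configPath> stand for the values the command substitutes):

```bash
# Steps `docker model k8s deploy` currently asks the user to run by hand
kubectl create namespace <namespace>
kubectl apply -f <configPath> -n <namespace>
```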

}

// If not found, we'll create a minimal structure
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "modelr", "k8s")
high

There is a typo in the directory path being created for fallback resources. The code uses .docker/modelr/k8s, which is inconsistent with the path checked on line 157 (.docker/model/k8s). This should be corrected to model to ensure the command behaves as expected.

Suggested change
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "modelr", "k8s")
homePath := filepath.Join(os.Getenv("HOME"), ".docker", "model", "k8s")

Comment on lines +439 to +443
else \
VLLM_COMMIT="$(git merge-base HEAD origin/main)"; \
VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_${VLLM_WHEEL_ARCH}.whl"; \
VLLM_USE_PRECOMPILED=1 uv pip install --editable .; \
fi; \
high

The logic for installing vllm in this else block appears to be incorrect. The VLLM_PRECOMPILED_WHEEL_LOCATION variable is defined but then not used, as the next command installs vllm from source via --editable .. This misses the performance benefit of using precompiled components and makes the build process slower than intended. The logic should be updated to correctly use the precompiled wheel if it's available, similar to the implementation in k8s/docker/Dockerfile.cuda.

Comment on lines +101 to +106
{{- else if eq .Environment.Name "xpu" }}
- ms-inference-scheduling/values_xpu.yaml
{{- else if eq .Environment.Name "digitalocean" }}
- ms-inference-scheduling/digitalocean-values.yaml
{{- else if eq .Environment.Name "xpu" }}
- ms-inference-scheduling/values_xpu.yaml
medium

The template contains a duplicate else if condition for eq .Environment.Name "xpu". The second check is redundant and should be removed to clean up the code.

{{- else if eq .Environment.Name "xpu" }}
      - ms-inference-scheduling/values_xpu.yaml

Comment on lines +146 to +173
⚠️ **Important - For Intel BMG GPU Users**: Before running `helmfile apply`, you must update the GPU resource type in `ms-pd/values_xpu.yaml`:

```yaml
# Edit ms-pd/values_xpu.yaml
accelerator:
  type: intel
  resources:
    intel: "gpu.intel.com/xe"  # Add gpu.intel.com/xe

# Also update decode and prefill resource specifications:
decode:
  containers:
  - name: "vllm"
    resources:
      limits:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
      requests:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe

prefill:
  containers:
  - name: "vllm"
    resources:
      limits:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
      requests:
        gpu.intel.com/xe: 1  # Change from gpu.intel.com/i915 to gpu.intel.com/xe
```
medium

The guide instructs users to manually edit the ms-pd/values_xpu.yaml file to switch between different Intel GPU types (i915 vs. xe). This manual step is error-prone and hinders automation. It would be more robust to handle this through a separate helmfile environment (e.g., xpu-bmg) or a value passed to the helmfile apply command. This would make the deployment process more declarative and user-friendly.
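A hypothetical invocation if such an environment were added (the environment name xpu-bmg is illustrative and not part of this PR):

```bash
# Select the Intel BMG (gpu.intel.com/xe) values via a dedicated helmfile
# environment instead of hand-editing ms-pd/values_xpu.yaml
helmfile apply -e xpu-bmg
```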

@ericcurtin ericcurtin force-pushed the create-docker-model-k8s branch from a96e9f2 to d0f1092 Compare October 15, 2025 21:11
Copilot AI left a comment

Pull Request Overview

Add Kubernetes deployment support and guides for vLLM via a new CLI command and curated Helm-based recipes.

  • Introduces docker model k8s CLI command (deploy/list-configs/guide)
  • Adds production-ready Kubernetes guides (inference scheduling, P/D disaggregation, precise prefix cache awareness, simulator), plus Dockerfiles for CUDA/AWS/GKE/XPU images
  • Includes gateway provider prerequisite helmfiles and client setup scripts
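As a quick orientation to the new image files, a plausible local build of the CUDA variant might look like this (the tag is illustrative; required build args, if any, are not shown):

```bash
# Build the CUDA-based vLLM build/runtime image added under k8s/docker/
docker build -f k8s/docker/Dockerfile.cuda -t vllm-cuda:dev .
```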

Reviewed Changes

Copilot reviewed 52 out of 52 changed files in this pull request and generated 16 comments.

  • cmd/cli/commands/k8s.go: Adds k8s command group with deploy/list-configs/guide subcommands
  • cmd/cli/commands/root.go: Wires newK8sCmd into the root command
  • k8s/guides/*: Adds detailed Helmfile-based guides for multiple deployment patterns
  • k8s/guides/prereq/*: Adds gateway provider and client setup prerequisites
  • k8s/docker/Dockerfile.cuda: CUDA-based build/runtime image for vLLM
  • k8s/docker/Dockerfile.aws: AWS-optimized build/runtime image with UCX/NVSHMEM/NIXL
  • k8s/docker/Dockerfile.gke: GKE-optimized image with DeepEP/DeepGEMM/FlashInfer
  • k8s/docker/Dockerfile.xpu: Intel XPU image for vLLM
  • k8s/README.md: Top-level K8s usage/readme aligned with new CLI
Comments suppressed due to low confidence (1)

cmd/cli/commands/k8s.go:1

  • The list contains 'wide-ep' which doesn't correspond to a guide directory added in this PR, while 'precise-prefix-cache-aware' exists but isn't listed. Replace 'wide-ep' with an available guide name (e.g., 'precise-prefix-cache-aware') or ensure the referenced guide is actually present.
package commands


git_clone_and_cd https://github.com/vllm-project/vllm.git /app/vllm releases/v0.11.0 && \
VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/f71952c1c49fb86686b0b300b727b26282362bf4/vllm-0.11.0%2Bcu129-cp38-abi3-manylinux1_x86_64.whl VLLM_USE_PRECOMPILED=1 upip .

ENTRYPOINT ["/app/code/venv/bin/vllm", "serve"]
Copilot AI Oct 15, 2025

The venv path in ENTRYPOINT is incorrect; earlier logic creates the virtualenv under /app/venv, not /app/code/venv. This will fail at runtime with 'no such file or directory'. Replace with /app/venv/bin/vllm.

Suggested change
ENTRYPOINT ["/app/code/venv/bin/vllm", "serve"]
ENTRYPOINT ["/app/venv/bin/vllm", "serve"]

Comment on lines +380 to +381
RUN

Copilot AI Oct 15, 2025
This empty RUN instruction is invalid Dockerfile syntax and will fail the build. Remove this line or add the intended command.

Suggested change
RUN

RUN mkdir -p /wheels

# Copy patches before build
COPY patches/ /tmp/patches/
Copilot AI Oct 15, 2025
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.

wget https://developer.download.nvidia.com/compute/redist/nvshmem/${NVSHMEM_VERSION}/source/nvshmem_src_cuda12-all-all-${NVSHMEM_VERSION}.tar.gz -O nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
tar -xf nvshmem_src_cuda${CUDA_MAJOR}.tar.gz && \
cd nvshmem_src && \
git apply /tmp/patches/cks_nvshmem${NVSHMEM_VERSION}.patch && \
Copilot AI Oct 15, 2025
The Dockerfile assumes a patches directory and a cks_nvshmem${NVSHMEM_VERSION}.patch file in the build context, but no patches/ directory was added in this PR. The build will fail. Either add the patches/ directory (with a version-matching patch filename), or guard this step behind a build arg and skip when patches are not present.

Comment on lines +56 to +59
configPath := filepath.Join(resourcesPath, "configs", config)
if _, err := os.Stat(configPath); os.IsNotExist(err) {
return fmt.Errorf("configuration '%s' not found. Use 'docker model k8s list-configs' to see available configurations", config)
}
Copilot AI Oct 15, 2025
The deploy command checks a non-existent 'configs' directory. The repository organizes deployment options under k8s/guides/<guide>/ with helmfile.yaml.gotmpl. Update to look under guides/ and verify the helmfile exists.

inferenceExtension:
replicas: 1
image:
# both downstream infernece-scheduler and upstream epp image can support simulated-accelerators example
Copilot AI Oct 15, 2025
Correct 'infernece-scheduler' to 'inference-scheduler'.

Suggested change
# both downstream infernece-scheduler and upstream epp image can support simulated-accelerators example
# both downstream inference-scheduler and upstream epp image can support simulated-accelerators example

inferenceExtension:
replicas: 1
image:
# both downstream infernece-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
Copilot AI Oct 15, 2025
Correct 'infernece-scheduler' to 'inference-scheduler'.

Suggested change
# both downstream infernece-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here
# both downstream inference-scheduler and upstream epp images can support precise KV Cache awareness based on the configurations here

**Current limitations**
- Percentile: only **p90** supported.
- Training: only **streaming mode** supported.
- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
Copilot AI Oct 15, 2025
Correct 'obsevability' to 'observability'.

Suggested change
- TPOT sampling: for obsevability, every 200th token is logged and compared with predictions.
- TPOT sampling: for observability, every 200th token is logged and compared with predictions.

return c
}

func getK8sResourcesPath() (string, error) {
Copilot AI Oct 15, 2025
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating a helmfile exists (e.g., k8s/guides/<guide>/helmfile.yaml.gotmpl). Example fix: replace the 'configs' check with something like: configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }.

return "", fmt.Errorf("failed to create k8s resources directory: %w", err)
}

return homePath, nil
Copilot AI Oct 15, 2025
[nitpick] Consider improving deploy UX by resolving a concrete guide path and validating a helmfile exists (e.g., k8s/guides/<guide>/helmfile.yaml.gotmpl). Example fix: replace the 'configs' check with something like: configPath := filepath.Join(resourcesPath, "guides", config); helmfilePath := filepath.Join(configPath, "helmfile.yaml.gotmpl"); if _, err := os.Stat(helmfilePath); os.IsNotExist(err) { return fmt.Errorf("guide '%s' not found...", config) }.

@ericcurtin ericcurtin closed this Oct 15, 2025
@ericcurtin ericcurtin deleted the create-docker-model-k8s branch October 15, 2025 21:17