dynamo/docs/kubernetes/installation-guide.md at main · drivenets/dynamo

title
Detailed Installation Guide

Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.

Before You Start

Determine your cluster environment:

Shared/Multi-Tenant Cluster (K8s cluster with existing Dynamo artifacts):

CRDs already installed cluster-wide - skip CRD installation step
A cluster-wide Dynamo operator is likely already running
Do NOT install another operator - use the existing cluster-wide operator
Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)

Dedicated Cluster (full cluster admin access):

You install CRDs yourself
Can use cluster-wide operator (default)

Local Development (Minikube, testing):

See Minikube Setup first, then follow installation steps below

To check if CRDs already exist:

kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed

To check if a cluster-wide operator already exists:

# Check for cluster-wide operator and show its namespace
kubectl get clusterrolebinding -o json | \
  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
  "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'

# If a cluster-wide operator exists: Do NOT install another operator
# Only install namespace-restricted mode if you specifically need namespace isolation

Installation Paths

Platform is installed using Dynamo Kubernetes Platform helm chart.

Path A: Pre-built Artifacts

Use case: Production deployment, shared or dedicated clusters
Source: NGC published Helm charts
Time: ~10 minutes
Jump to: Path A

Path B: Custom Build from Source

Use case: Contributing to Dynamo, using latest features from main branch, customization
Requirements: Docker build environment
Time: ~30 minutes
Jump to: Path B

All helm install commands could be overridden by either setting the values.yaml file or by passing in your own values.yaml:

helm install ...
  -f your-values.yaml

and/or setting values as flags to the helm install command, as follows:

helm install ...
  --set "your-value=your-value"

Prerequisites

Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:

Required Tools

Tool	Minimum Version	Description	Installation
kubectl	v1.24+	Kubernetes command-line tool	Install kubectl
Helm	v3.0+	Kubernetes package manager	Install Helm
Docker	Latest	Container runtime (Path B only)	Install Docker

Cluster and Access Requirements

Kubernetes cluster v1.24+ with admin or namespace-scoped access
Cluster type determined (shared vs dedicated) — see Before You Start
CRD status checked if on a shared cluster
NGC credentials (optional) — required only if pulling NVIDIA images from NGC

Verify Installation

Run the following to confirm your tools are correctly installed:

# Verify tools and versions
kubectl version --client  # Should show v1.24+
helm version              # Should show v3.0+
docker version            # Required for Path B only

# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

Pre-Deployment Checks

Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:

./deploy/pre-deployment/pre-deployment-check.sh

This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See Pre-Deployment Checks for details.

No cluster? See Minikube Setup for local development.

Estimated installation time: 5-30 minutes depending on path

Path A: Production Install

Install from NGC published artifacts.

# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install Platform (CRDs are automatically installed by the chart)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

Warning

v0.9.0 Helm Chart Issue: The initial v0.9.0 dynamo-platform Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use RELEASE_VERSION=0.9.0-post1 or add --set dynamo-operator.controllerManager.manager.image.tag=0.9.0 to your helm install command.

For Shared/Multi-Tenant Clusters:

If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:

# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true

Note: Use the full path dynamo-operator.namespaceRestriction.enabled=true (not just namespaceRestriction.enabled=true).

If you see this validation error, you need namespace restriction:

VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...

Tip

For multinode deployments, you need to install multinode orchestration components:

Option 1 (Recommended): Grove + KAI Scheduler

For production environments, Grove and KAI Scheduler should be installed separately from the dynamo-platform chart. This allows independent lifecycle management, version pinning, and upgrade control.

Compatibility Matrix:

dynamo-platform	kai-scheduler	Grove
1.0.x	>= v0.13.0	>= v0.1.0-alpha.6

After installing them separately, enable Dynamo integration:

--set "global.kai-scheduler.enabled=true"
--set "global.grove.enabled=true"

For development/testing only, you can install them as bundled subcharts:

--set "global.grove.install=true"
--set "global.kai-scheduler.install=true"

Note: global.kai-scheduler.install / global.grove.install control whether the bundled subcharts are deployed. When set, integration is automatically enabled. global.kai-scheduler.enabled / global.grove.enabled can be set independently when using externally-managed installations.

Option 2: LeaderWorkerSet (LWS) + Volcano

If using LWS for multinode deployments, you must also install Volcano (required dependency):
- LWS Installation
- Volcano Installation (required for gang scheduling with LWS)
These must be installed manually before deploying multinode workloads with LWS.

See the Multinode Deployment Guide for details on orchestrator selection.

Tip

By default, Model Express Server is not used. If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:

--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"

Tip

By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces. If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), you can set the namespaceRestriction.enabled to true. You can also change the restricted namespace by setting the targetNamespace property.

--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional

GPU Discovery for DynamoGraphDeploymentRequests with Namespace-Scoped Operators

GPU discovery is enabled by default for namespace-scoped operators. The Helm chart automatically provisions a ClusterRole/ClusterRoleBinding granting the operator read-only access to node GPU labels.

To disable GPU discovery (if your installer lacks ClusterRole creation permissions):

helm install dynamo-platform ... --set dynamo-operator.gpuDiscovery.enabled=false

When GPU discovery is disabled, you must provide hardware configuration manually in each DynamoGraphDeploymentRequest:

spec:
  hardware:
    numGpusPerNode: 8
    gpuSku: "H100-SXM5-80GB"
    vramMb: 81920

Note: If GPU discovery is disabled and no hardware config is provided, the DGDR will be rejected at admission time with a clear error message.

→ Verify Installation

Path B: Custom Build from Source

Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.

Note: This gives you access to the latest unreleased features and fixes on the main branch.

# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/  # or your registry
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}

# 2. Build operator
cd deploy/operator

# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .

# 2.2 Alternative 2 : Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG

cd -

# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=${DOCKER_SERVER} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --namespace=${NAMESPACE}

cd deploy/helm/charts

# 4. Install Platform (CRDs are automatically installed by the chart)
helm dep build ./platform/

# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.

NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
helm install dynamo-platform ./platform/ \
  --namespace "${NAMESPACE}" \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
  ${NS_RESTRICT_FLAGS}

→ Verify Installation

Verify Installation

# Check CRDs
kubectl get crd | grep dynamo

# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running

Next Steps

Deploy Model/Workflow

# Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Port forward and test
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models

Explore Backend Guides
Optional:
- Set up Prometheus & Grafana
- SLA Planner Guide (for SLA-aware scheduling and autoscaling)

Troubleshooting

"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"

VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...

Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.

Solution: Add namespace restriction to your installation:

--set dynamo-operator.namespaceRestriction.enabled=true

Note: Use the full path dynamo-operator.namespaceRestriction.enabled=true (not just namespaceRestriction.enabled=true).

CRDs already exist

Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

Solution: Skip step 2 (CRD installation), proceed directly to platform installation.

To check if CRDs exist:

kubectl get crd | grep dynamo

Pods not starting?

kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}

HuggingFace model access?

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}

Bitnami etcd "unrecognized" image?

ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.

This error that you might encounter during helm install is due to bitnami changing their docker repository to a secure one.

just add the following to the helm install command:

--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"

Clean uninstall?

To uninstall the platform, you can run the following command:

helm uninstall dynamo-platform --namespace ${NAMESPACE}

To uninstall the CRDs, follow these steps:

Get all of the dynamo CRDs installed in your cluster:

kubectl get crd | grep "dynamo.*nvidia.com"

You should see something like this:

dynamocomponentdeployments.nvidia.com               2025-10-21T14:49:52Z
dynamocomponents.nvidia.com                         2025-10-25T05:16:10Z
dynamographdeploymentrequests.nvidia.com            2025-11-24T05:26:04Z
dynamographdeployments.nvidia.com                   2025-09-04T20:56:40Z
dynamographdeploymentscalingadapters.nvidia.com     2025-12-09T21:05:59Z
dynamomodels.nvidia.com                             2025-11-07T00:19:43Z

Delete each CRD one by one:

kubectl delete crd <crd-name>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Before You Start

Installation Paths

Prerequisites

Required Tools

Cluster and Access Requirements

Verify Installation

Pre-Deployment Checks

Path A: Production Install

GPU Discovery for DynamoGraphDeploymentRequests with Namespace-Scoped Operators

Path B: Custom Build from Source

Verify Installation

Next Steps

Troubleshooting

Advanced Options

FilesExpand file tree

installation-guide.md

Latest commit

History

installation-guide.md

File metadata and controls

Before You Start

Installation Paths

Prerequisites

Required Tools

Cluster and Access Requirements

Verify Installation

Pre-Deployment Checks

Path A: Production Install

GPU Discovery for DynamoGraphDeploymentRequests with Namespace-Scoped Operators

Path B: Custom Build from Source

Verify Installation

Next Steps

Troubleshooting

Advanced Options