- NVIDIA GPU Operator installed with DCGM HostEngine enabled.
- NVIDIA Datacenter Driver major version 510 or newer on the cluster nodes.
- DCGM HostEngine 4.2.3 or newer.
- A DCGM service endpoint reachable from the cluster (defaults to `nvidia-dcgm.gpu-operator.svc:5555`).
- Access to GitHub Container Registry (`ghcr.io`) from your cluster/network.
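As a quick pre-flight sanity check, the driver-version requirement can be verified with a small helper. This is a sketch; `DRIVER_VERSION` below is a hypothetical example value, and in practice you would read it from `nvidia-smi` or the node labels. It relies on `sort -V` (available in GNU coreutils and recent BSD sort):

```bash
# version_ge A B -> succeeds if version A >= version B (uses sort -V)
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

DRIVER_VERSION="535.104.05"  # hypothetical example; query the node in practice
if version_ge "$DRIVER_VERSION" "510"; then
  echo "driver OK"
else
  echo "driver too old"
fi
```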
Set shared variables once for the examples below:
```bash
# Namespace (override if needed)
NS=fleet-intelligence
CHART_VERSION='<version>'  # e.g. 0.3.2 or 0.3.2-rc.1

# DCGM endpoint (usually the default is correct)
DCGM_URL='nvidia-dcgm.gpu-operator.svc:5555'

# Enrollment configuration - go to the Fleet Intelligence UI to:
#   1. Generate an enrollment token (ENROLL_TOKEN)
#   2. Get the enrollment endpoint URL (ENROLL_ENDPOINT)
ENROLL_ENDPOINT='<enroll-endpoint>'
ENROLL_TOKEN='<enroll-token>'
ENROLL_TOKEN_SECRET_NAME='fleet-intelligence-enroll-token'  # Recommended secret name
```

Create the namespace:

```bash
kubectl create namespace "$NS" || true
```

If you need to enroll nodes, create the token Secret. The secret name should match the `ENROLL_TOKEN_SECRET_NAME` variable set above:
```bash
kubectl create secret generic "$ENROLL_TOKEN_SECRET_NAME" \
  --namespace "$NS" \
  --from-literal=token="$ENROLL_TOKEN"
```

Install (with enrollment):
```bash
helm install fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS" \
  --set enroll.enabled=true \
  --set enroll.endpoint="$ENROLL_ENDPOINT" \
  --set enroll.tokenSecretName="$ENROLL_TOKEN_SECRET_NAME"
```

Install (no enrollment):
```bash
helm install fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS"
```

Upgrade (with enrollment):
```bash
helm upgrade fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS" \
  --set enroll.enabled=true \
  --set enroll.endpoint="$ENROLL_ENDPOINT" \
  --set enroll.tokenSecretName="$ENROLL_TOKEN_SECRET_NAME"
```

Upgrade (no enrollment):
```bash
helm upgrade fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS"
```

Upgrade and explicitly remove persisted enrollment metadata:
```bash
helm upgrade fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS" \
  --set enroll.enabled=false \
  --set enroll.unenroll=true
```

Note: `enroll.enabled` and `enroll.unenroll` are mutually exclusive. Setting both to `true` causes Helm template rendering to fail.
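A guard of roughly this shape in the chart templates is what makes rendering fail; this is an illustrative sketch of the pattern, not the chart's actual source:

```yaml
{{- if and .Values.enroll.enabled .Values.enroll.unenroll }}
{{- fail "enroll.enabled and enroll.unenroll are mutually exclusive" }}
{{- end }}
```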
To use a different image registry/repository, add:
```bash
  --set image.repository="<custom-image-repo>"
```

If DCGM is exposed at a different service name or port, set `env.DCGM_URL`:

```bash
  --set env.DCGM_URL="$DCGM_URL"
```

After installation, verify the agent is running correctly:
```bash
# Check DaemonSet status
kubectl get daemonset fleet-intelligence-agent -n "$NS"

# Check pods (should be one per GPU node)
kubectl get pods -n "$NS" -l app.kubernetes.io/name=fleet-intelligence-agent

# View pod logs
kubectl logs -n "$NS" -l app.kubernetes.io/name=fleet-intelligence-agent --tail=50

# Watch rollout status
kubectl rollout status daemonset/fleet-intelligence-agent -n "$NS"
```

Check a specific pod in detail:
```bash
# Get a pod name
POD_NAME=$(kubectl get pods -n "$NS" -l app.kubernetes.io/name=fleet-intelligence-agent -o jsonpath='{.items[0].metadata.name}')

# Describe the pod
kubectl describe pod -n "$NS" "$POD_NAME"

# View full logs
kubectl logs -n "$NS" "$POD_NAME" --follow
```

Pods not starting:
```bash
# Check pod events
kubectl describe pod -n "$NS" -l app.kubernetes.io/name=fleet-intelligence-agent
```

Common issues:

- ImagePullBackOff: Verify nodes can reach `ghcr.io` and that the image tag exists.
- Pending: Check that node labels match the `nodeSelector` (default: `nvidia.com/gpu.present=true`).
- CrashLoopBackOff: Check the logs for errors.
Enrollment failures:
```bash
# Check init container logs
kubectl logs -n "$NS" "$POD_NAME" -c enroll

# Verify enrollment secret exists
kubectl get secret "$ENROLL_TOKEN_SECRET_NAME" -n "$NS"

# Check secret content (verify token is not empty)
kubectl get secret "$ENROLL_TOKEN_SECRET_NAME" -n "$NS" -o jsonpath='{.data.token}' | base64 -d | wc -c
```

DCGM connection issues:
```bash
# Verify DCGM service is accessible
kubectl get svc -n gpu-operator nvidia-dcgm

# Test DCGM connectivity from a pod
kubectl exec -n "$NS" "$POD_NAME" -- curl -v telnet://nvidia-dcgm.gpu-operator.svc:5555

# Check DCGM URL environment variable
kubectl get pods -n "$NS" "$POD_NAME" -o jsonpath='{.spec.containers[0].env[?(@.name=="DCGM_URL")].value}'
```

If DCGM is at a different location, update the URL:
```bash
helm upgrade fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS" \
  --reuse-values \
  --set env.DCGM_URL="<dcgm-service>:<port>"
```

By default, the agent automatically deploys only to GPU nodes using the nodeSelector:

```yaml
nodeSelector:
  nvidia.com/gpu.present: "true"
```

This label is automatically set by the NVIDIA GPU Operator or Device Plugin, so no manual node labeling is required.
If you need a different node selector or tolerations for GPU taints, you can override them.
Using --set (quote the tolerations for zsh, and escape dots in the label key):
```bash
helm upgrade --install fleet-intelligence-agent oci://ghcr.io/nvidia/charts/fleet-intelligence-agent \
  --version "$CHART_VERSION" \
  --namespace "$NS" \
  --set-string nodeSelector.nvidia\\.com/gpu\\.deploy\\.dcgm=true \
  --set 'tolerations[0].key=nvidia.com/gpu' \
  --set 'tolerations[0].operator=Exists' \
  --set 'tolerations[0].effect=NoSchedule'
```

Using a values file:
```yaml
nodeSelector:
  nvidia.com/gpu.deploy.dcgm: "true"
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
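A note on the doubled backslashes in the `--set` form above: the shell consumes one level of escaping, so Helm receives `nvidia\.com/gpu\.deploy\.dcgm` and treats each `\.` as a literal dot rather than a key separator. You can confirm what the shell actually passes through with a plain `printf`:

```bash
# Prints the argument exactly as helm would receive it
printf '%s\n' nodeSelector.nvidia\\.com/gpu\\.deploy\\.dcgm=true
# -> nodeSelector.nvidia\.com/gpu\.deploy\.dcgm=true
```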