
Adding upgrade install test with ADDITIONAL_NAMESPACES #959

Closed

visheshtanksale wants to merge 1 commit into NVIDIA:main from visheshtanksale:upgrade-test

Conversation

@visheshtanksale

Fixes #683

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>

copy-pr-bot bot commented Mar 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Copilot AI left a comment


Pull request overview

Adds a new Bats regression test covering Helm upgrade/install when the chart is installed into the gpu-operator namespace and configured with ADDITIONAL_NAMESPACES, matching the GPU Operator-style “namespace flip” scenario described in issue #683.

Changes:

  • Add a new Bats test that uninstalls from nvidia-dra-driver-gpu, installs last-stable into gpu-operator with ADDITIONAL_NAMESPACES, upgrades to current-dev, then restores baseline state.
  • Extend iupgrade_wait to accept an optional namespace argument (defaulting to nvidia-dra-driver-gpu) and log pods from that namespace.
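
The second bullet can be sketched as follows. This is an illustrative reconstruction, not the repo's actual helper: only the argument order is taken from the test's call site (`iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs gpu-operator`), and the `echo` stands in for the real helm upgrade and wait logic.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the signature change to iupgrade_wait: an optional
# fourth namespace argument defaulting to nvidia-dra-driver-gpu. The body is
# a stand-in (echo) for the real `helm upgrade --install` + `kubectl wait`.
iupgrade_wait() {
  local chart_repo="$1"
  local chart_version="$2"
  local iargs_name="$3"   # name of a caller-side array with extra --set flags
  local namespace="${4:-nvidia-dra-driver-gpu}"   # optional, with default
  echo "upgrading ${chart_repo}@${chart_version} in namespace ${namespace}"
}

_iargs=()
iupgrade_wait repo 1.2.3 _iargs               # default namespace
iupgrade_wait repo 1.2.3 _iargs gpu-operator  # explicit namespace
```

Existing call sites that omit the fourth argument keep their current behavior, which is what makes the parameter backwards-compatible.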

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/bats/test_basics.bats Adds upgrade test exercising install/upgrade in gpu-operator with ADDITIONAL_NAMESPACES, plus restore steps.
tests/bats/helpers.sh Enhances iupgrade_wait with an optional namespace parameter and updates pod logging accordingly.


Comment on lines +103 to +109
# This test covers a GPU Operator-style deployment where the DRA driver runs
# in the gpu-operator namespace and watches nvidia-dra-driver-gpu for
# workloads via the ADDITIONAL_NAMESPACES env var.
local _iargs=(
    "--set" "controller.containers.computeDomain.env[0].name=ADDITIONAL_NAMESPACES"
    "--set" "controller.containers.computeDomain.env[0].value=nvidia-dra-driver-gpu"
)

Copilot AI Mar 20, 2026


The test description says it covers “watches nvidia-dra-driver-gpu for workloads via the ADDITIONAL_NAMESPACES env var”, but the test never asserts that the controller in gpu-operator actually has ADDITIONAL_NAMESPACES set (or that cross-namespace watching works). Add at least one concrete assertion after install/upgrade (e.g., check the controller pod/deployment in gpu-operator contains the env var/value, and/or run a minimal workload/CR in nvidia-dra-driver-gpu that requires the controller to observe that namespace).

Contributor


This seems like a legit ask. The test currently sets ADDITIONAL_NAMESPACES but is not verifying the controller is actually watching the additional namespace.
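
A minimal shape for such an assertion might look like the following. The deployment label selector and the jsonpath are assumptions about what the chart renders, and `kubectl` is stubbed here so the sketch runs without a cluster; in the real Bats test the stub is dropped and the query hits the live gpu-operator namespace.

```shell
#!/usr/bin/env bash
# Stand-in for kubectl so this sketch runs without a cluster: it returns the
# value a correctly configured controller deployment would report.
kubectl() { echo "nvidia-dra-driver-gpu"; }

# Assumed label selector; adjust to whatever labels the chart actually sets.
env_value="$(kubectl get deploy -n gpu-operator \
  -l app.kubernetes.io/name=nvidia-dra-driver-gpu \
  -o jsonpath='{.items[0].spec.template.spec.containers[*].env[?(@.name=="ADDITIONAL_NAMESPACES")].value}')"

if [ "${env_value}" = "nvidia-dra-driver-gpu" ]; then
  echo "controller has ADDITIONAL_NAMESPACES set"
else
  echo "ADDITIONAL_NAMESPACES missing or wrong: '${env_value}'" >&2
  exit 1
fi
```

A stronger variant would additionally apply a small ComputeDomain CR in nvidia-dra-driver-gpu and assert the controller reconciles it, which exercises the cross-namespace watch rather than just the env var.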

Comment on lines 86 to 92
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=15s

# Again, log current state.
kubectl get pods -n nvidia-dra-driver-gpu
kubectl get pods -n "${NAMESPACE}"

# That one should be obvious now, but make that guarantee explicit for
# consuming tests.

Copilot AI Mar 20, 2026


iupgrade_wait now accepts a namespace, but it still waits on pods cluster-wide (kubectl wait ... -A) using only nvidia-dra-driver-gpu-component=... selectors. This can match pods from other namespaces/releases and either hang or report readiness for the wrong install. Consider scoping waits to the selected namespace and/or adding app.kubernetes.io/instance=${TEST_HELM_RELEASE_NAME} to the selectors (for both kubelet-plugin and controller waits) so the function reliably waits for the pods it just installed.
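
A sketch of the scoped variant: the component label key is taken from the diff, the instance label is the addition suggested above, and the final command is echoed as a dry run rather than executed (no cluster needed).

```shell
#!/usr/bin/env bash
# Compose a wait scoped to one namespace and one Helm release, instead of the
# cluster-wide `kubectl wait ... -A` in the current diff.
namespace="${NAMESPACE:-nvidia-dra-driver-gpu}"
release="${TEST_HELM_RELEASE_NAME:-nvidia-dra-driver-gpu}"

selector="nvidia-dra-driver-gpu-component=kubelet-plugin"
selector="${selector},app.kubernetes.io/instance=${release}"

# Dry run: print the command this sketch would run against a real cluster.
echo kubectl wait --for=condition=READY pods \
  -n "${namespace}" -l "${selector}" --timeout=15s
```

The same `-n`/instance-label scoping would apply to the controller wait in `iupgrade_wait`, so each wait only matches pods from the release the function just installed.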

Comment on lines +111 to +113
# Uninstall from nvidia-dra-driver-gpu namespace (state from previous tests).
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n nvidia-dra-driver-gpu --wait --timeout=30s
kubectl wait --for=delete pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s

Copilot AI Mar 20, 2026


This test assumes the release currently exists in nvidia-dra-driver-gpu. If someone runs only this test (or a prior test failed before install), helm uninstall will fail and kubectl wait --for=delete ... can error when there are no matching pods. To reduce flakiness, make the cleanup idempotent (e.g., ignore uninstall errors when the release isn’t present and only kubectl wait if pods exist / use --ignore-not-found).
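
An idempotent cleanup could look roughly like this. `helm` and `kubectl` are stubbed (pretending the release is absent and no pods match) so the sketch runs without a cluster; the function body shows the guard logic the comment asks for.

```shell
#!/usr/bin/env bash
# Stand-ins so the sketch runs without a cluster; drop them in the real test.
helm()    { if [ "$1" = "status" ]; then return 1; fi; return 0; }
kubectl() { :; }

cleanup_release() {
  local release="$1" ns="$2"
  # Only uninstall if the release actually exists in this namespace.
  if helm status "${release}" -n "${ns}" >/dev/null 2>&1; then
    helm uninstall "${release}" -n "${ns}" --wait --timeout=30s
  else
    echo "release ${release} not present in ${ns}; skipping uninstall"
  fi
  # Only wait for deletion if matching pods still exist.
  local pods
  pods="$(kubectl get pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu \
    --no-headers 2>/dev/null || true)"
  if [ -n "${pods}" ]; then
    kubectl wait --for=delete pods -A \
      -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s
  fi
}

cleanup_release my-release nvidia-dra-driver-gpu
```

With guards like these the test can run standalone, or after an earlier test failed mid-install, without the cleanup step itself erroring out.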

iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs gpu-operator

# Verify CRD is present
kubectl get crd computedomains.resource.nvidia.com
Collaborator

@shivamerla shivamerla Mar 23, 2026


@visheshtanksale I believe the goal of the original issue is to ensure that driver installation can be switched across namespaces while still allowing management of previously deployed DaemonSets (e.g., IMEX, MPS).

To achieve this, we should add tests that cover the prepare/unprepare flow using RCTs with IMEX and MPS, and verify that resources in the previous namespace are properly cleaned up during unprepare.

Additionally, for the IMEX use case, we should ensure that the IMEX daemon controller in the previous namespace can correctly create and update ComputeDomainClique statuses, and that the CD controller is able to properly sync with those statuses. @jgehrcke please add if there are additional use cases we need to test.

kubectl get crd computedomains.resource.nvidia.com

# Restore state: uninstall from gpu-operator, reinstall to nvidia-dra-driver-gpu.
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n gpu-operator --wait --timeout=30s
Contributor


Where is this getting set? Does it make sense to check whether the release exists before uninstalling?


@visheshtanksale
Author

Closing this for now based on the comments in the attached issue


Successfully merging this pull request may close these issues.

tests: cover upgrade across namespaces

4 participants