
Adding upgrade install test with ADDITIONAL_NAMESPACES #959

Closed

visheshtanksale wants to merge 1 commit into NVIDIA:main from visheshtanksale:upgrade-test

Conversation

@visheshtanksale

Fixes #683

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>

copy-pr-bot bot commented Mar 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Copilot AI left a comment


Pull request overview

Adds a new Bats regression test covering Helm upgrade/install when the chart is installed into the gpu-operator namespace and configured with ADDITIONAL_NAMESPACES, matching the GPU Operator-style “namespace flip” scenario described in issue #683.

Changes:

  • Add a new Bats test that uninstalls from nvidia-dra-driver-gpu, installs last-stable into gpu-operator with ADDITIONAL_NAMESPACES, upgrades to current-dev, then restores baseline state.
  • Extend iupgrade_wait to accept an optional namespace argument (defaulting to nvidia-dra-driver-gpu) and log pods from that namespace.
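
The second bullet can be sketched as follows. This is an illustrative reconstruction, not the repo's actual helper: only the argument order is taken from the test's call site (`iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs gpu-operator`), and the `echo` stands in for the real helm upgrade and wait logic.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the signature change to iupgrade_wait: an optional
# fourth namespace argument defaulting to nvidia-dra-driver-gpu. The body is
# a stand-in (echo) for the real `helm upgrade --install` + `kubectl wait`.
iupgrade_wait() {
  local chart_repo="$1"
  local chart_version="$2"
  local iargs_name="$3"   # name of a caller-side array with extra --set flags
  local namespace="${4:-nvidia-dra-driver-gpu}"   # optional, with default
  echo "upgrading ${chart_repo}@${chart_version} in namespace ${namespace}"
}

_iargs=()
iupgrade_wait repo 1.2.3 _iargs               # default namespace
iupgrade_wait repo 1.2.3 _iargs gpu-operator  # explicit namespace
```

Existing call sites that omit the fourth argument keep their current behavior, which is what makes the parameter backwards-compatible.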

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/bats/test_basics.bats Adds upgrade test exercising install/upgrade in gpu-operator with ADDITIONAL_NAMESPACES, plus restore steps.
tests/bats/helpers.sh Enhances iupgrade_wait with an optional namespace parameter and updates pod logging accordingly.


Comment on lines +103 to +109
# This test covers a GPU Operator-style deployment where the DRA driver runs
# in the gpu-operator namespace and watches nvidia-dra-driver-gpu for
# workloads via the ADDITIONAL_NAMESPACES env var.
local _iargs=(
    "--set" "controller.containers.computeDomain.env[0].name=ADDITIONAL_NAMESPACES"
    "--set" "controller.containers.computeDomain.env[0].value=nvidia-dra-driver-gpu"
)

Copilot AI Mar 20, 2026


The test description says it covers “watches nvidia-dra-driver-gpu for workloads via the ADDITIONAL_NAMESPACES env var”, but the test never asserts that the controller in gpu-operator actually has ADDITIONAL_NAMESPACES set (or that cross-namespace watching works). Add at least one concrete assertion after install/upgrade (e.g., check the controller pod/deployment in gpu-operator contains the env var/value, and/or run a minimal workload/CR in nvidia-dra-driver-gpu that requires the controller to observe that namespace).

Contributor


This seems like a legit ask. The test currently sets ADDITIONAL_NAMESPACES but is not verifying the controller is actually watching the additional namespace.
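
A minimal shape for such an assertion might look like the following. The deployment label selector and the jsonpath are assumptions about what the chart renders, and `kubectl` is stubbed here so the sketch runs without a cluster; in the real Bats test the stub is dropped and the query hits the live gpu-operator namespace.

```shell
#!/usr/bin/env bash
# Stand-in for kubectl so this sketch runs without a cluster: it returns the
# value a correctly configured controller deployment would report.
kubectl() { echo "nvidia-dra-driver-gpu"; }

# Assumed label selector; adjust to whatever labels the chart actually sets.
env_value="$(kubectl get deploy -n gpu-operator \
  -l app.kubernetes.io/name=nvidia-dra-driver-gpu \
  -o jsonpath='{.items[0].spec.template.spec.containers[*].env[?(@.name=="ADDITIONAL_NAMESPACES")].value}')"

if [ "${env_value}" = "nvidia-dra-driver-gpu" ]; then
  echo "controller has ADDITIONAL_NAMESPACES set"
else
  echo "ADDITIONAL_NAMESPACES missing or wrong: '${env_value}'" >&2
  exit 1
fi
```

A stronger variant would additionally apply a small ComputeDomain CR in nvidia-dra-driver-gpu and assert the controller reconciles it, which exercises the cross-namespace watch rather than just the env var.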

Comment on lines 86 to 92
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=15s

# Again, log current state.
kubectl get pods -n nvidia-dra-driver-gpu
kubectl get pods -n "${NAMESPACE}"

# That one should be obvious now, but make that guarantee explicit for
# consuming tests.

Copilot AI Mar 20, 2026


iupgrade_wait now accepts a namespace, but it still waits on pods cluster-wide (kubectl wait ... -A) using only nvidia-dra-driver-gpu-component=... selectors. This can match pods from other namespaces/releases and either hang or report readiness for the wrong install. Consider scoping waits to the selected namespace and/or adding app.kubernetes.io/instance=${TEST_HELM_RELEASE_NAME} to the selectors (for both kubelet-plugin and controller waits) so the function reliably waits for the pods it just installed.
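
A sketch of the scoped variant: the component label key is taken from the diff, the instance label is the addition suggested above, and the final command is echoed as a dry run rather than executed (no cluster needed).

```shell
#!/usr/bin/env bash
# Compose a wait scoped to one namespace and one Helm release, instead of the
# cluster-wide `kubectl wait ... -A` in the current diff.
namespace="${NAMESPACE:-nvidia-dra-driver-gpu}"
release="${TEST_HELM_RELEASE_NAME:-nvidia-dra-driver-gpu}"

selector="nvidia-dra-driver-gpu-component=kubelet-plugin"
selector="${selector},app.kubernetes.io/instance=${release}"

# Dry run: print the command this sketch would run against a real cluster.
echo kubectl wait --for=condition=READY pods \
  -n "${namespace}" -l "${selector}" --timeout=15s
```

The same `-n`/instance-label scoping would apply to the controller wait in `iupgrade_wait`, so each wait only matches pods from the release the function just installed.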

Comment on lines +111 to +113
# Uninstall from nvidia-dra-driver-gpu namespace (state from previous tests).
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n nvidia-dra-driver-gpu --wait --timeout=30s
kubectl wait --for=delete pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s

Copilot AI Mar 20, 2026


This test assumes the release currently exists in nvidia-dra-driver-gpu. If someone runs only this test (or a prior test failed before install), helm uninstall will fail and kubectl wait --for=delete ... can error when there are no matching pods. To reduce flakiness, make the cleanup idempotent (e.g., ignore uninstall errors when the release isn’t present and only kubectl wait if pods exist / use --ignore-not-found).
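
An idempotent cleanup could look roughly like this. `helm` and `kubectl` are stubbed (pretending the release is absent and no pods match) so the sketch runs without a cluster; the function body shows the guard logic the comment asks for.

```shell
#!/usr/bin/env bash
# Stand-ins so the sketch runs without a cluster; drop them in the real test.
helm()    { if [ "$1" = "status" ]; then return 1; fi; return 0; }
kubectl() { :; }

cleanup_release() {
  local release="$1" ns="$2"
  # Only uninstall if the release actually exists in this namespace.
  if helm status "${release}" -n "${ns}" >/dev/null 2>&1; then
    helm uninstall "${release}" -n "${ns}" --wait --timeout=30s
  else
    echo "release ${release} not present in ${ns}; skipping uninstall"
  fi
  # Only wait for deletion if matching pods still exist.
  local pods
  pods="$(kubectl get pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu \
    --no-headers 2>/dev/null || true)"
  if [ -n "${pods}" ]; then
    kubectl wait --for=delete pods -A \
      -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s
  fi
}

cleanup_release my-release nvidia-dra-driver-gpu
```

With guards like these the test can run standalone, or after an earlier test failed mid-install, without the cleanup step itself erroring out.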

iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs gpu-operator

# Verify CRD is present
kubectl get crd computedomains.resource.nvidia.com
Collaborator

@shivamerla shivamerla Mar 23, 2026


@visheshtanksale I believe the goal of the original issue is to ensure that driver installation can be switched across namespaces while still allowing management of previously deployed DaemonSets (e.g., IMEX, MPS).

To achieve this, we should add tests that cover the prepare/unprepare flow using RCTs with IMEX and MPS, and verify that resources in the previous namespace are properly cleaned up during unprepare.

Additionally, for the IMEX use case, we should ensure that the IMEX daemon controller in the previous namespace can correctly create and update ComputeDomainClique statuses, and that the CD controller is able to properly sync with those statuses. @jgehrcke please add if there are additional use cases we need to test.

kubectl get crd computedomains.resource.nvidia.com

# Restore state: uninstall from gpu-operator, reinstall to nvidia-dra-driver-gpu.
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n gpu-operator --wait --timeout=30s
Contributor


Where is this getting set? Does it make sense to check whether the release exists before uninstalling?


@visheshtanksale
Author

Closing this for now based on the comments in the attached issue


Successfully merging this pull request may close these issues.

tests: cover upgrade across namespaces

4 participants