Adding upgrade install test with ADDITIONAL_NAMESPACES #959
visheshtanksale wants to merge 1 commit into NVIDIA:main
Conversation
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Pull request overview
Adds a new Bats regression test covering Helm upgrade/install when the chart is installed into the gpu-operator namespace and configured with ADDITIONAL_NAMESPACES, matching the GPU Operator-style “namespace flip” scenario described in issue #683.
Changes:
- Add a new Bats test that uninstalls from nvidia-dra-driver-gpu, installs last-stable into gpu-operator with ADDITIONAL_NAMESPACES, upgrades to current-dev, then restores baseline state.
- Extend `iupgrade_wait` to accept an optional namespace argument (defaulting to nvidia-dra-driver-gpu) and log pods from that namespace.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/bats/test_basics.bats | Adds upgrade test exercising install/upgrade in gpu-operator with ADDITIONAL_NAMESPACES, plus restore steps. |
| tests/bats/helpers.sh | Enhances iupgrade_wait with an optional namespace parameter and updates pod logging accordingly. |
```shell
# This test covers a GPU Operator-style deployment where the DRA driver runs
# in the gpu-operator namespace and watches nvidia-dra-driver-gpu for
# workloads via the ADDITIONAL_NAMESPACES env var.
local _iargs=(
    "--set" "controller.containers.computeDomain.env[0].name=ADDITIONAL_NAMESPACES"
    "--set" "controller.containers.computeDomain.env[0].value=nvidia-dra-driver-gpu"
)
```
The test description says it covers “watches nvidia-dra-driver-gpu for workloads via the ADDITIONAL_NAMESPACES env var”, but the test never asserts that the controller in gpu-operator actually has ADDITIONAL_NAMESPACES set (or that cross-namespace watching works). Add at least one concrete assertion after install/upgrade (e.g., check the controller pod/deployment in gpu-operator contains the env var/value, and/or run a minimal workload/CR in nvidia-dra-driver-gpu that requires the controller to observe that namespace).
This seems like a legit ask. The test currently sets ADDITIONAL_NAMESPACES but is not verifying the controller is actually watching the additional namespace.
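A minimal post-install assertion could look like the sketch below (hedged: the helper name `assert_additional_namespaces`, the `nvidia-dra-driver-gpu-component=controller` label selector, and the container index are assumptions about the chart's pod spec, not confirmed by this PR):

```shell
# Hypothetical Bats-style helper: assert that the controller pod in the given
# namespace carries ADDITIONAL_NAMESPACES with the expected value.
assert_additional_namespaces() {
  local ns="$1" expected="$2"
  local actual
  # Label selector and container index are assumptions about the chart.
  actual="$(kubectl get pods -n "${ns}" \
    -l nvidia-dra-driver-gpu-component=controller \
    -o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="ADDITIONAL_NAMESPACES")].value}')"
  [ "${actual}" = "${expected}" ]
}
```

In the test this would run right after the install/upgrade settles, e.g. `assert_additional_namespaces gpu-operator nvidia-dra-driver-gpu`.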
```shell
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=15s

# Again, log current state.
kubectl get pods -n nvidia-dra-driver-gpu
kubectl get pods -n "${NAMESPACE}"

# That one should be obvious now, but make that guarantee explicit for
# consuming tests.
```
iupgrade_wait now accepts a namespace, but it still waits on pods cluster-wide (kubectl wait ... -A) using only nvidia-dra-driver-gpu-component=... selectors. This can match pods from other namespaces/releases and either hang or report readiness for the wrong install. Consider scoping waits to the selected namespace and/or adding app.kubernetes.io/instance=${TEST_HELM_RELEASE_NAME} to the selectors (for both kubelet-plugin and controller waits) so the function reliably waits for the pods it just installed.
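A sketch of such a scoped wait (hedged: `wait_for_release_pods` is a hypothetical helper name, and the `app.kubernetes.io/instance` label assumes the chart applies Helm's conventional instance label to its pods):

```shell
# Hypothetical helper: wait only for pods of the given component that belong
# to this release, in the namespace it was just installed into.
wait_for_release_pods() {
  local namespace="$1" release="$2" component="$3"
  kubectl wait --for=condition=READY pods \
    -n "${namespace}" \
    -l "nvidia-dra-driver-gpu-component=${component},app.kubernetes.io/instance=${release}" \
    --timeout=15s
}
```

Inside `iupgrade_wait`, the existing cluster-wide wait would then become something like `wait_for_release_pods "${namespace}" "${TEST_HELM_RELEASE_NAME}" kubelet-plugin`, and likewise for the controller wait.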
```shell
# Uninstall from nvidia-dra-driver-gpu namespace (state from previous tests).
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n nvidia-dra-driver-gpu --wait --timeout=30s
kubectl wait --for=delete pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s
```
This test assumes the release currently exists in nvidia-dra-driver-gpu. If someone runs only this test (or a prior test failed before install), helm uninstall will fail and kubectl wait --for=delete ... can error when there are no matching pods. To reduce flakiness, make the cleanup idempotent (e.g., ignore uninstall errors when the release isn’t present and only kubectl wait if pods exist / use --ignore-not-found).
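One way to make the cleanup idempotent, as a sketch (the helper name `cleanup_release` is hypothetical; `helm status` exits non-zero when the release is absent, and the trailing `|| true` tolerates the "no matching resources" error from `kubectl wait`):

```shell
# Hypothetical idempotent cleanup helper: skip uninstall when the release
# does not exist, and tolerate a wait that finds no matching pods.
cleanup_release() {
  local release="$1" namespace="$2"
  if helm status "${release}" -n "${namespace}" >/dev/null 2>&1; then
    helm uninstall "${release}" -n "${namespace}" --wait --timeout=30s
  fi
  kubectl wait --for=delete pods -n "${namespace}" \
    -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s || true
}
```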
```shell
iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs gpu-operator

# Verify CRD is present
kubectl get crd computedomains.resource.nvidia.com
```
@visheshtanksale I believe the goal of the original issue is to ensure that driver installation can be switched across namespaces while still allowing management of previously deployed DaemonSets (e.g., IMEX, MPS).
To achieve this, we should add tests that cover the prepare/unprepare flow using ResourceClaimTemplates (RCTs) with IMEX and MPS, and verify that resources in the previous namespace are properly cleaned up during unprepare.
Additionally, for the IMEX use case, we should ensure that the IMEX daemon controller in the previous namespace can correctly create and update ComputeDomainClique statuses, and that the CD controller is able to properly sync with those statuses. @jgehrcke please add if there are additional use cases we need to test.
```shell
kubectl get crd computedomains.resource.nvidia.com

# Restore state: uninstall from gpu-operator, reinstall to nvidia-dra-driver-gpu.
helm uninstall "${TEST_HELM_RELEASE_NAME}" -n gpu-operator --wait --timeout=30s
```
Where is this getting set? Does it make sense to check if the release exists before uninstalling?
```shell
# Verify CRD is present
kubectl get crd computedomains.resource.nvidia.com

# Restore state: uninstall from gpu-operator, reinstall to nvidia-dra-driver-gpu.
```
Should there be a cleanup step, like this: https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/tests/bats/test_gpu_updowngrade.bats#L33
Closing this for now, based on the comments in the attached issue.
Fixes #683