
Conversation

@ChughShilpa
Contributor

@ChughShilpa ChughShilpa commented Jan 23, 2026

Resolves RHOAIENG-41638

Description

Add upgrade tests for Trainer v2

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added multiple end-to-end upgrade test suites covering Kueue, trainer sleep-based jobs, and TrainingRuntime upgrade scenarios with setup/verify flows.
  • Bug Fixes / Improvements

    • Improved test namespace handling and capture of namespace logs/events during teardown for better diagnostics.
  • Refactor

    • Promoted namespace helper to a shared exported utility and consolidated sleep-image lookup into common test utilities.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci

openshift-ci bot commented Jan 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szaher for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Moves GetSleepImage to a common package, adds exported namespace helpers and namespace log storage, and introduces multiple trainer upgrade integration tests (Kueue, sleep TrainJob, TrainingRuntime) with supporting helper functions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Environment helpers: tests/common/environment.go, tests/kfto/environment.go | Added GetSleepImage() in tests/common/environment.go; removed the duplicate GetSleepImage() from tests/kfto/environment.go. |
| Namespace helpers & logs: tests/common/support/namespace.go | Added exported CreateOrGetTestNamespaceWithName and StoreNamespaceLogs; DeleteTestNamespace now calls StoreNamespaceLogs; imported k8s API error utilities for NotFound checks. |
| Upgrade namespace usage updates: tests/kfto/kfto_kueue_mnist_upgrade_training_test.go, tests/kfto/kfto_upgrade_sleep_test.go | Replaced calls to the unexported helper with CreateOrGetTestNamespaceWithName; removed the local helper createOrGetUpgradeTestNamespace. |
| Trainer Kueue upgrade tests: tests/trainer/trainer_kueue_upgrade_training_test.go | New tests TestSetupUpgradeTrainJob and TestRunUpgradeTrainJob covering the ResourceFlavor, ClusterQueue, LocalQueue, and TrainJob lifecycle for upgrade scenarios. |
| Trainer sleep upgrade tests: tests/trainer/trainer_upgrade_sleep_test.go | New tests TestSetupSleepTrainJob and TestVerifySleepTrainJob; added createSleepTrainJob, sleepPodRestartCount, and podPhase helpers. |
| TrainingRuntime upgrade tests: tests/trainer/trainer_trainingruntime_upgrade_test.go | New tests TestSetupTrainingRuntime, TestVerifyTrainingRuntime, TestSetupAllTrainingRuntimes, TestVerifyAllTrainingRuntimes; added helpers to create a TrainingRuntime and store/verify runtime names in a ConfigMap. |

Sequence Diagram(s)

(Skipped — changes are test additions and helper relocations without a new multi-component runtime control flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through tests both new and wide,
Namespaces kept and sleep images tied.
Helpers exported, logs tucked in a sack,
Upgrade jobs waiting — I peeked and came back.
A carrot, a commit, then a quiet nap.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Add upgrade tests for Trainer v2" clearly and accurately summarizes the main change: adding multiple new upgrade test files and test functions for the Trainer v2 component. |
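
Regarding the docstring-coverage warning above, a minimal example of the kind of doc comment the check expects, applied to one of the helpers added in this PR (shown with the bounds check suggested later in this review; assumes the corev1 import already used throughout these tests):

```go
// sleepPodRestartCount returns the restart count of the pod's first container,
// or 0 when container statuses have not been populated yet.
func sleepPodRestartCount(pod corev1.Pod) int {
	if len(pod.Status.ContainerStatuses) == 0 {
		return 0
	}
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```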

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_upgrade_sleep_test.go`:
- Around line 60-77: The test TestVerifySleepTrainJob currently only checks pod
count and restart count (via GetPods and sleepPodRestartCount) which can miss
non-Running phases; update the assertion to also verify the pod's phase is
Running (use the same GetPods call and add a
ContainElement(WithTransform(sleepPodPhase, Equal("Running"))) or equivalent
matcher referencing sleepPodPhase) so the test fails if the pod completed or
entered a different phase while still having zero restarts.
🧹 Nitpick comments (4)
tests/trainer/trainer_kueue_upgrade_training_test.go (3)

19-35: Consider reformatting imports per openshift-goimports convention.

Based on learnings, imports should have blank lines between different third-party groups (e.g., separating github.com, k8s.io, and sigs.k8s.io groups).

♻️ Suggested import organization
```diff
 import (
 	"testing"
 
 	trainerv1alpha1 "github.com/kubeflow/trainer/v2/pkg/apis/trainer/v1alpha1"
 	. "github.com/onsi/gomega"
+
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/api/resource"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
 	kueuev1beta1 "sigs.k8s.io/kueue/apis/kueue/v1beta1"
 	kueueacv1beta1 "sigs.k8s.io/kueue/client-go/applyconfiguration/kueue/v1beta1"
 
 	. "github.com/opendatahub-io/distributed-workloads/tests/common"
 	. "github.com/opendatahub-io/distributed-workloads/tests/common/support"
 	trainerutils "github.com/opendatahub-io/distributed-workloads/tests/trainer/utils"
 )
```

110-114: Minor: FieldManager value inconsistency.

The FieldManager is "application/apply-patch" here but "setup-TrainJob" in TestSetupUpgradeTrainJob. Consider using a consistent value, such as the common ApplyOptions from tests/common/support.go.
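
For illustration, a minimal sketch of sharing one server-side-apply option set across both tests; the package name and the "upgrade-tests" field-manager value are illustrative, not from the PR:

```go
package trainer

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// upgradeApplyOptions is one shared option set for every server-side Apply in
// the upgrade tests, so all applied fields have a single, consistent field manager.
var upgradeApplyOptions = metav1.ApplyOptions{
	FieldManager: "upgrade-tests", // illustrative name
	Force:        true,
}
```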


127-137: Defensive delete logic is appropriate, but check scope is broader than necessary.

The delete-before-create pattern handles test reruns gracefully. However, line 134 waits for all TrainJobs in the namespace to be empty (TrainJobs(test, namespace).Should(BeEmpty())), even though only one specific TrainJob was deleted. This works but could be more precise.

Consider using a targeted check:

```diff
-		test.Eventually(TrainJobs(test, namespace), TestTimeoutShort).Should(BeEmpty())
+		test.Eventually(func(g Gomega) {
+			_, err := test.Client().Trainer().TrainerV1alpha1().TrainJobs(namespace).Get(test.Ctx(), upgradeTrainJobName, metav1.GetOptions{})
+			g.Expect(errors.IsNotFound(err)).To(BeTrue())
+		}, TestTimeoutShort).Should(Succeed())
```
tests/common/environment.go (1)

159-161: Guard against empty SLEEP_IMAGE values.

If SLEEP_IMAGE is set but empty/whitespace, the current logic returns an empty image and the pod will fail to start. Consider trimming and falling back to the default.

♻️ Proposed update
```diff
 func GetSleepImage() string {
-	return lookupEnvOrDefault(sleepImageEnvVar, "gcr.io/k8s-staging-perf-tests/sleep@sha256:8d91ddf9f145b66475efda1a1b52269be542292891b5de2a7fad944052bab6ea")
+	v := strings.TrimSpace(os.Getenv(sleepImageEnvVar))
+	if v == "" {
+		return "gcr.io/k8s-staging-perf-tests/sleep@sha256:8d91ddf9f145b66475efda1a1b52269be542292891b5de2a7fad944052bab6ea"
+	}
+	return v
 }
```

Comment on lines +60 to +77
```go
func TestVerifySleepTrainJob(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, sleepNamespaceName)

	// Cleanup namespace in the end
	defer DeleteTestNamespace(test, namespace)

	// Pod should be still running without restart
	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
		Should(
			And(
				HaveLen(1),
				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
			),
		)
	test.T().Logf("TrainJob %s/%s is still running after upgrade with no pod restarts", sleepNamespaceName, sleepTrainJobName)
}
```

⚠️ Potential issue | 🟡 Minor

Verify phase should assert pod is still running.

Right now the check only validates restart count and length. If the pod completes or moves to another phase with zero restarts, the test would still pass.

🐛 Suggested assertion addition
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_upgrade_sleep_test.go` around lines 60 - 77, The test
TestVerifySleepTrainJob currently only checks pod count and restart count (via
GetPods and sleepPodRestartCount) which can miss non-Running phases; update the
assertion to also verify the pod's phase is Running (use the same GetPods call
and add a ContainElement(WithTransform(sleepPodPhase, Equal("Running"))) or
equivalent matcher referencing sleepPodPhase) so the test fails if the pod
completed or entered a different phase while still having zero restarts.

@ChughShilpa ChughShilpa force-pushed the upgradeTests branch 2 times, most recently from 86f0fce to 49efd36 on January 23, 2026 at 16:07
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go`:
- Around line 60-83: The test currently only verifies existence; update
TestVerifyTrainingRuntime to also assert the custom TrainingRuntime's PodSpec
preserved fields by locating the found runtime from runtimes.Items, then
extracting its container spec (e.g., runtime.Spec.Template.Spec.Containers[0])
and asserting container.Image, container.Command and container.Args equal the
expected values used when creating the custom runtime (use the same expected
literals/variables as the original test setup). Use
test.Expect/...To(Equal(...)) on container.Image, container.Command and
container.Args to fail the test if they differ.
♻️ Duplicate comments (1)
tests/trainer/trainer_upgrade_sleep_test.go (1)

60-76: Also assert the pod is still Running post-upgrade.
Right now only restart count is checked, so a Completed/Failed pod with zero restarts would pass.

🐛 Suggested assertion addition
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```
🧹 Nitpick comments (1)
tests/common/support/namespace.go (1)

89-104: Apply options when reusing an existing namespace.
If the namespace already exists, provided options (labels/annotations) are ignored, which can leave required settings (e.g., Kueue managed) missing on reruns. Consider applying the options and updating before returning.

♻️ Possible update
```diff
 	namespace, err := t.Client().Core().CoreV1().Namespaces().Get(t.Ctx(), name, metav1.GetOptions{})
 	if err == nil {
-		return namespace
+		for _, option := range options {
+			t.Expect(option.ApplyTo(namespace)).To(gomega.Succeed())
+		}
+		if len(options) > 0 {
+			namespace, err = t.Client().Core().CoreV1().Namespaces().Update(t.Ctx(), namespace, metav1.UpdateOptions{})
+			t.Expect(err).NotTo(gomega.HaveOccurred())
+		}
+		return namespace
 	} else if errors.IsNotFound(err) {
```

Comment on lines 60 to 83
```go
// TestVerifyTrainingRuntime verifies the custom TrainingRuntime survives upgrade
// with its spec preserved.
func TestVerifyTrainingRuntime(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)

	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)

	// Verify TrainingRuntime still exists after upgrade by listing all runtimes
	runtimes, err := test.Client().Trainer().TrainerV1alpha1().TrainingRuntimes(runtimeNamespaceName).List(
		test.Ctx(), metav1.ListOptions{})
	test.Expect(err).NotTo(HaveOccurred(), "Failed to list TrainingRuntimes")

	// Find Custom TrainingRuntime in the list
	var found bool
	for _, runtime := range runtimes.Items {
		if runtime.Name == customRuntimeName {
			found = true
			break
		}
	}
	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
	test.T().Logf("TrainingRuntime %s/%s is preserved after upgrade", runtimeNamespaceName, customRuntimeName)
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
# First, check the file exists and get its size
wc -l tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 137

🏁 Script executed:

```shell
# Read the entire file to understand context, setup, and whether spec checks exist
cat -n tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 6001

🏁 Script executed:

```shell
# Search for customRuntimeName definition and setup
rg "customRuntimeName" tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 803

🏁 Script executed:

```shell
# Check the trainerv1alpha1 package structure to verify field paths in suggested diff
fd "trainingruntime" --type f | head -20
```

Repository: opendatahub-io/distributed-workloads
Length of output: 133


Verify spec preservation, not just existence.

The test claims to verify spec preservation (function name and comment) but only checks that the runtime exists. Add assertions to verify the container image, command, and args survive the upgrade.

Suggested changes
```diff
-	// Find Custom TrainingRuntime in the list
-	var found bool
-	for _, runtime := range runtimes.Items {
-		if runtime.Name == customRuntimeName {
-			found = true
-			break
-		}
-	}
-	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
+	// Find Custom TrainingRuntime in the list
+	var foundRuntime *trainerv1alpha1.TrainingRuntime
+	for i := range runtimes.Items {
+		if runtimes.Items[i].Name == customRuntimeName {
+			foundRuntime = &runtimes.Items[i]
+			break
+		}
+	}
+	test.Expect(foundRuntime).NotTo(BeNil(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
+	if foundRuntime != nil {
+		container := foundRuntime.Spec.Template.Spec.ReplicatedJobs[0].Template.Spec.Template.Spec.Containers[0]
+		test.Expect(container.Image).To(Equal(GetSleepImage()))
+		test.Expect(container.Command).To(Equal([]string{"sleep"}))
+		test.Expect(container.Args).To(Equal([]string{"24h"}))
+	}
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
// TestVerifyTrainingRuntime verifies the custom TrainingRuntime survives upgrade
// with its spec preserved.
func TestVerifyTrainingRuntime(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)
	// Verify TrainingRuntime still exists after upgrade by listing all runtimes
	runtimes, err := test.Client().Trainer().TrainerV1alpha1().TrainingRuntimes(runtimeNamespaceName).List(
		test.Ctx(), metav1.ListOptions{})
	test.Expect(err).NotTo(HaveOccurred(), "Failed to list TrainingRuntimes")
	// Find Custom TrainingRuntime in the list
	var foundRuntime *trainerv1alpha1.TrainingRuntime
	for i := range runtimes.Items {
		if runtimes.Items[i].Name == customRuntimeName {
			foundRuntime = &runtimes.Items[i]
			break
		}
	}
	test.Expect(foundRuntime).NotTo(BeNil(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
	if foundRuntime != nil {
		container := foundRuntime.Spec.Template.Spec.ReplicatedJobs[0].Template.Spec.Template.Spec.Containers[0]
		test.Expect(container.Image).To(Equal(GetSleepImage()))
		test.Expect(container.Command).To(Equal([]string{"sleep"}))
		test.Expect(container.Args).To(Equal([]string{"24h"}))
	}
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go` around lines 60 - 83,
The test currently only verifies existence; update TestVerifyTrainingRuntime to
also assert the custom TrainingRuntime's PodSpec preserved fields by locating
the found runtime from runtimes.Items, then extracting its container spec (e.g.,
runtime.Spec.Template.Spec.Containers[0]) and asserting container.Image,
container.Command and container.Args equal the expected values used when
creating the custom runtime (use the same expected literals/variables as the
original test setup). Use test.Expect/...To(Equal(...)) on container.Image,
container.Command and container.Args to fail the test if they differ.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_upgrade_sleep_test.go`:
- Around line 114-116: The helper sleepPodRestartCount reads
pod.Status.ContainerStatuses[0] without checking the slice length, which can
panic if ContainerStatuses is empty; update sleepPodRestartCount to defensively
check len(pod.Status.ContainerStatuses) > 0 and return a safe default (e.g., 0)
when empty, otherwise return int(pod.Status.ContainerStatuses[0].RestartCount),
so callers of sleepPodRestartCount won't encounter an out-of-range panic.
♻️ Duplicate comments (1)
tests/trainer/trainer_upgrade_sleep_test.go (1)

60-77: Verify phase should assert pod is still running.

The check validates restart count and pod count but does not confirm the pod is still in Running phase. A pod that completed or crashed with zero restarts would pass this assertion.

🐛 Suggested fix
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```

Comment on lines +114 to +116
```go
func sleepPodRestartCount(pod corev1.Pod) int {
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```

⚠️ Potential issue | 🟡 Minor

Add bounds check to prevent potential panic.

Accessing ContainerStatuses[0] without verifying the slice is non-empty could panic if the pod hasn't initialized container status yet.

🐛 Suggested defensive fix
```diff
 func sleepPodRestartCount(pod corev1.Pod) int {
+	if len(pod.Status.ContainerStatuses) == 0 {
+		return 0
+	}
 	return int(pod.Status.ContainerStatuses[0].RestartCount)
 }
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
func sleepPodRestartCount(pod corev1.Pod) int {
	if len(pod.Status.ContainerStatuses) == 0 {
		return 0
	}
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_upgrade_sleep_test.go` around lines 114 - 116, The
helper sleepPodRestartCount reads pod.Status.ContainerStatuses[0] without
checking the slice length, which can panic if ContainerStatuses is empty; update
sleepPodRestartCount to defensively check len(pod.Status.ContainerStatuses) > 0
and return a safe default (e.g., 0) when empty, otherwise return
int(pod.Status.ContainerStatuses[0].RestartCount), so callers of
sleepPodRestartCount won't encounter an out-of-range panic.

```go
// CreateOrGetUpgradeTestNamespace creates a namespace with the given name if it doesn't exist,
// or returns the existing namespace if it does. This is useful for upgrade tests where
// the namespace needs to persist across pre-upgrade and post-upgrade test phases.
func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
```
Contributor

Nitpick - suggested change:

```diff
-func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
+func CreateOrGetTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
```

Contributor Author

CreateOrGetTestNamespace already exists (https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/common/support/test.go#L170), so this would cause a naming conflict.

Contributor

In that case, something like CreateOrGetTestNamespaceWithName?
The current name hints that the function is usable only for upgrades, though in reality it can be used anywhere.
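
For reference, a rough sketch of the helper under discussion, reconstructed from the excerpts in this review; CreateTestNamespaceWithName is assumed to be the existing creation helper in the support package:

```go
func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
	// Reuse the namespace if it already exists from the pre-upgrade phase.
	namespace, err := t.Client().Core().CoreV1().Namespaces().Get(t.Ctx(), name, metav1.GetOptions{})
	if err == nil {
		return namespace
	} else if errors.IsNotFound(err) {
		// Otherwise create it fresh with the provided options.
		return CreateTestNamespaceWithName(t, name, options...)
	}
	t.Expect(err).NotTo(gomega.HaveOccurred())
	return nil
}
```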

```go
			break
		}
	}
	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
```
Contributor

Nitpick - It is better to assert on the whole list of runtimes; that way you get more context (the list of all existing runtimes) when the assertion fails. See the sketch below.
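
A hedged sketch of such an assertion, using Gomega matchers so that a failure message prints the entire runtimes.Items slice; runtimeName is a small illustrative transform helper:

```go
// Asserting against the full list means a failure dumps every runtime, not just a boolean.
runtimeName := func(r trainerv1alpha1.TrainingRuntime) string { return r.Name }
test.Expect(runtimes.Items).To(
	ContainElement(WithTransform(runtimeName, Equal(customRuntimeName))),
	"Custom TrainingRuntime %s should exist after upgrade", customRuntimeName,
)
```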

```go
	customRuntimeName = "custom-sleep-runtime"
)

func TestSetupTrainingRuntime(t *testing.T) {
```
Contributor

Can you cover an additional scenario: all TrainingRuntimes that exist in PreUpgrade have to be kept after the upgrade.
To persist the runtime names you can use, for example, a ConfigMap (or anything else), as in the sketch below.
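
A minimal sketch of that idea, assuming the support package's Test helpers plus imports for corev1, metav1, and strings; the ConfigMap name and data key are illustrative, not from the PR:

```go
// storeRuntimeNames persists the pre-upgrade runtime names in a ConfigMap so
// the post-upgrade phase can compare the surviving runtimes against them.
func storeRuntimeNames(test Test, namespace string, names []string) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "pre-upgrade-runtimes", Namespace: namespace},
		Data:       map[string]string{"names": strings.Join(names, ",")},
	}
	_, err := test.Client().Core().CoreV1().ConfigMaps(namespace).Create(test.Ctx(), cm, metav1.CreateOptions{})
	test.Expect(err).NotTo(HaveOccurred())
}
```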

```go
	upgradeTrainJobName = "trainjob-upgrade"
)

func TestSetupUpgradeTrainJob(t *testing.T) {
```
Contributor

I am thinking about another test case: the same as this one, except that it uses a specific TrainingRuntime (e.g. torch_distributed_cuda128_torch29_py312.yaml; the runtime name would have to be fetched somehow so we don't have to hardcode it here, see the sketch below).

The reason for this additional test is to verify that a runtime which is not updated still works properly. It would matter mainly for upgrades between stable versions, where the default training runtime is updated but a specific training runtime isn't. It is good to verify that the updated Trainer v2 operator still works properly with an old training runtime.

WDYT?
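
A hedged sketch of fetching the runtime name dynamically instead of hardcoding it, assuming ClusterTrainingRuntime is the cluster-scoped resource behind the pre-installed runtimes; picking the first list item is illustrative only:

```go
// List the cluster-scoped training runtimes shipped with the operator and
// pick one to exercise, rather than hardcoding its name in the test.
runtimes, err := test.Client().Trainer().TrainerV1alpha1().ClusterTrainingRuntimes().List(
	test.Ctx(), metav1.ListOptions{})
test.Expect(err).NotTo(HaveOccurred())
test.Expect(runtimes.Items).NotTo(BeEmpty())
runtimeName := runtimes.Items[0].Name
test.T().Logf("Using pre-installed TrainingRuntime %s", runtimeName)
```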

@ChughShilpa ChughShilpa force-pushed the upgradeTests branch 2 times, most recently from 49efd36 to 99c1724 on January 28, 2026 at 15:11
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go`:
- Around line 66-68: The current test calls DeleteTestNamespace immediately
after creating namespace via GetNamespaceWithName which can remove the custom
runtime before TestVerifyAllTrainingRuntimes and other post-upgrade checks run;
remove the early DeleteTestNamespace calls (references: GetNamespaceWithName,
DeleteTestNamespace) from within the per-test setup and instead perform
namespace deletion in a final cleanup step that runs after all post-upgrade
validations (e.g., a dedicated TestSuiteTeardown or a final test that deletes
the namespace), ensuring TestVerifyAllTrainingRuntimes and related checks run
against the created namespace first.

Comment on lines +66 to +68
```go
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)
```

⚠️ Potential issue | 🟠 Major

Avoid deleting the namespace before all post-upgrade validations finish.
DeleteTestNamespace here can remove the custom runtime before TestVerifyAllTrainingRuntimes runs, which makes the suite order-dependent and flaky.

💡 Suggested change (remove early cleanup; move to a dedicated final cleanup step)
```diff
-	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
-	defer DeleteTestNamespace(test, namespace)
+	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
```

Please relocate namespace cleanup to a stage that’s guaranteed to run after all post-upgrade checks (e.g., a dedicated cleanup test or suite-level teardown).

Also applies to: 174-205
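
One way to realize that final cleanup step, sketched under the assumption that the upgrade pipeline invokes this test explicitly as the last post-upgrade action; the test name is illustrative, not from the PR:

```go
// TestCleanupTrainingRuntimeNamespace deletes the shared upgrade namespace.
// It must be invoked only after all post-upgrade verification tests have run.
func TestCleanupTrainingRuntimeNamespace(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	DeleteTestNamespace(test, namespace)
}
```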

🤖 Prompt for AI Agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go` around lines 66 - 68,
The current test calls DeleteTestNamespace immediately after creating namespace
via GetNamespaceWithName which can remove the custom runtime before
TestVerifyAllTrainingRuntimes and other post-upgrade checks run; remove the
early DeleteTestNamespace calls (references: GetNamespaceWithName,
DeleteTestNamespace) from within the per-test setup and instead perform
namespace deletion in a final cleanup step that runs after all post-upgrade
validations (e.g., a dedicated TestSuiteTeardown or a final test that deletes
the namespace), ensuring TestVerifyAllTrainingRuntimes and related checks run
against the created namespace first.

