
Conversation

@ChughShilpa
Contributor

@ChughShilpa ChughShilpa commented Jan 23, 2026

Resolves RHOAIENG-41638

Description

Add upgrade tests for Trainer v2

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added multiple end-to-end upgrade test suites covering Kueue, trainer sleep-based jobs, and TrainingRuntime upgrade scenarios with setup/verify flows.
  • Bug Fixes / Improvements

    • Improved test namespace handling and capture of namespace logs/events during teardown for better diagnostics.
  • Refactor

    • Promoted namespace helper to a shared exported utility and consolidated sleep-image lookup into common test utilities.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci

openshift-ci bot commented Jan 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szaher for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Moves GetSleepImage to a common package, adds exported namespace helpers and namespace log storage, and introduces multiple trainer upgrade integration tests (Kueue, sleep TrainJob, TrainingRuntime) with supporting helper functions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Environment helpers: tests/common/environment.go, tests/kfto/environment.go | Added GetSleepImage() in tests/common/environment.go; removed the duplicate GetSleepImage() from tests/kfto/environment.go. |
| Namespace helpers & logs: tests/common/support/namespace.go | Added exported CreateOrGetTestNamespaceWithName and StoreNamespaceLogs; DeleteTestNamespace now calls StoreNamespaceLogs; imported k8s API error utilities for NotFound checks. |
| Upgrade namespace usage updates: tests/kfto/kfto_kueue_mnist_upgrade_training_test.go, tests/kfto/kfto_upgrade_sleep_test.go | Replaced calls to the unexported helper with CreateOrGetTestNamespaceWithName; removed the local helper createOrGetUpgradeTestNamespace. |
| Trainer Kueue upgrade tests: tests/trainer/trainer_kueue_upgrade_training_test.go | New tests TestSetupUpgradeTrainJob and TestRunUpgradeTrainJob covering the ResourceFlavor, ClusterQueue, LocalQueue, and TrainJob lifecycle for upgrade scenarios. |
| Trainer sleep upgrade tests: tests/trainer/trainer_upgrade_sleep_test.go | New tests TestSetupSleepTrainJob and TestVerifySleepTrainJob; added createSleepTrainJob, sleepPodRestartCount, and podPhase helpers. |
| TrainingRuntime upgrade tests: tests/trainer/trainer_trainingruntime_upgrade_test.go | New tests TestSetupTrainingRuntime, TestVerifyTrainingRuntime, TestSetupAllTrainingRuntimes, TestVerifyAllTrainingRuntimes; added helpers to create a TrainingRuntime and store/verify runtime names in a ConfigMap. |

Sequence Diagram(s)

(Skipped — changes are test additions and helper relocations without a new multi-component runtime control flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through tests both new and wide,
Namespaces kept and sleep images tied.
Helpers exported, logs tucked in a sack,
Upgrade jobs waiting — I peeked and came back.
A carrot, a commit, then a quiet nap.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Add upgrade tests for Trainer v2" clearly and accurately summarizes the main change: adding multiple new upgrade test files and test functions for the Trainer v2 component. |
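
Regarding the docstring-coverage warning above, a minimal example of the kind of doc comment the check expects, applied to one of the helpers added in this PR (shown with the bounds check suggested later in this review; assumes the corev1 import already used throughout these tests):

```go
// sleepPodRestartCount returns the restart count of the pod's first container,
// or 0 when container statuses have not been populated yet.
func sleepPodRestartCount(pod corev1.Pod) int {
	if len(pod.Status.ContainerStatuses) == 0 {
		return 0
	}
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```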

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_upgrade_sleep_test.go`:
- Around line 60-77: The test TestVerifySleepTrainJob currently only checks pod
count and restart count (via GetPods and sleepPodRestartCount) which can miss
non-Running phases; update the assertion to also verify the pod's phase is
Running (use the same GetPods call and add a
ContainElement(WithTransform(sleepPodPhase, Equal("Running"))) or equivalent
matcher referencing sleepPodPhase) so the test fails if the pod completed or
entered a different phase while still having zero restarts.
🧹 Nitpick comments (4)
tests/trainer/trainer_kueue_upgrade_training_test.go (3)

19-35: Consider reformatting imports per openshift-goimports convention.

Based on learnings, imports should have blank lines between different third-party groups (e.g., separating github.com, k8s.io, and sigs.k8s.io groups).

♻️ Suggested import organization
```diff
 import (
 	"testing"
 
 	trainerv1alpha1 "github.com/kubeflow/trainer/v2/pkg/apis/trainer/v1alpha1"
 	. "github.com/onsi/gomega"
+
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/api/resource"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
 	kueuev1beta1 "sigs.k8s.io/kueue/apis/kueue/v1beta1"
 	kueueacv1beta1 "sigs.k8s.io/kueue/client-go/applyconfiguration/kueue/v1beta1"
 
 	. "github.com/opendatahub-io/distributed-workloads/tests/common"
 	. "github.com/opendatahub-io/distributed-workloads/tests/common/support"
 	trainerutils "github.com/opendatahub-io/distributed-workloads/tests/trainer/utils"
 )
```

110-114: Minor: FieldManager value inconsistency.

The FieldManager is "application/apply-patch" here but "setup-TrainJob" in TestSetupUpgradeTrainJob. Consider using a consistent value, such as the common ApplyOptions from tests/common/support.go.
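
For illustration, a minimal sketch of sharing one server-side-apply option set across both tests; the package name and the "upgrade-tests" field-manager value are illustrative, not from the PR:

```go
package trainer

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// upgradeApplyOptions is one shared option set for every server-side Apply in
// the upgrade tests, so all applied fields have a single, consistent field manager.
var upgradeApplyOptions = metav1.ApplyOptions{
	FieldManager: "upgrade-tests", // illustrative name
	Force:        true,
}
```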


127-137: Defensive delete logic is appropriate, but check scope is broader than necessary.

The delete-before-create pattern handles test reruns gracefully. However, line 134 waits for all TrainJobs in the namespace to be empty (TrainJobs(test, namespace).Should(BeEmpty())), even though only one specific TrainJob was deleted. This works but could be more precise.

Consider using a targeted check:

```diff
-		test.Eventually(TrainJobs(test, namespace), TestTimeoutShort).Should(BeEmpty())
+		test.Eventually(func(g Gomega) {
+			_, err := test.Client().Trainer().TrainerV1alpha1().TrainJobs(namespace).Get(test.Ctx(), upgradeTrainJobName, metav1.GetOptions{})
+			g.Expect(errors.IsNotFound(err)).To(BeTrue())
+		}, TestTimeoutShort).Should(Succeed())
```
tests/common/environment.go (1)

159-161: Guard against empty SLEEP_IMAGE values.

If SLEEP_IMAGE is set but empty/whitespace, the current logic returns an empty image and the pod will fail to start. Consider trimming and falling back to the default.

♻️ Proposed update
```diff
 func GetSleepImage() string {
-	return lookupEnvOrDefault(sleepImageEnvVar, "gcr.io/k8s-staging-perf-tests/sleep@sha256:8d91ddf9f145b66475efda1a1b52269be542292891b5de2a7fad944052bab6ea")
+	v := strings.TrimSpace(os.Getenv(sleepImageEnvVar))
+	if v == "" {
+		return "gcr.io/k8s-staging-perf-tests/sleep@sha256:8d91ddf9f145b66475efda1a1b52269be542292891b5de2a7fad944052bab6ea"
+	}
+	return v
 }
```

Comment on lines +60 to +77
```go
func TestVerifySleepTrainJob(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, sleepNamespaceName)

	// Cleanup namespace in the end
	defer DeleteTestNamespace(test, namespace)

	// Pod should be still running without restart
	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
		Should(
			And(
				HaveLen(1),
				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
			),
		)
	test.T().Logf("TrainJob %s/%s is still running after upgrade with no pod restarts", sleepNamespaceName, sleepTrainJobName)
}
```

⚠️ Potential issue | 🟡 Minor

Verify phase should assert pod is still running.

Right now the check only validates restart count and length. If the pod completes or moves to another phase with zero restarts, the test would still pass.

🐛 Suggested assertion addition
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_upgrade_sleep_test.go` around lines 60 - 77, The test
TestVerifySleepTrainJob currently only checks pod count and restart count (via
GetPods and sleepPodRestartCount) which can miss non-Running phases; update the
assertion to also verify the pod's phase is Running (use the same GetPods call
and add a ContainElement(WithTransform(sleepPodPhase, Equal("Running"))) or
equivalent matcher referencing sleepPodPhase) so the test fails if the pod
completed or entered a different phase while still having zero restarts.

@ChughShilpa ChughShilpa force-pushed the upgradeTests branch 2 times, most recently from 86f0fce to 49efd36 on January 23, 2026 at 16:07
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go`:
- Around line 60-83: The test currently only verifies existence; update
TestVerifyTrainingRuntime to also assert the custom TrainingRuntime's PodSpec
preserved fields by locating the found runtime from runtimes.Items, then
extracting its container spec (e.g., runtime.Spec.Template.Spec.Containers[0])
and asserting container.Image, container.Command and container.Args equal the
expected values used when creating the custom runtime (use the same expected
literals/variables as the original test setup). Use
test.Expect/...To(Equal(...)) on container.Image, container.Command and
container.Args to fail the test if they differ.
♻️ Duplicate comments (1)
tests/trainer/trainer_upgrade_sleep_test.go (1)

60-76: Also assert the pod is still Running post-upgrade.
Right now only restart count is checked, so a Completed/Failed pod with zero restarts would pass.

🐛 Suggested assertion addition
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```
🧹 Nitpick comments (1)
tests/common/support/namespace.go (1)

89-104: Apply options when reusing an existing namespace.
If the namespace already exists, provided options (labels/annotations) are ignored, which can leave required settings (e.g., Kueue managed) missing on reruns. Consider applying the options and updating before returning.

♻️ Possible update
```diff
 	namespace, err := t.Client().Core().CoreV1().Namespaces().Get(t.Ctx(), name, metav1.GetOptions{})
 	if err == nil {
-		return namespace
+		for _, option := range options {
+			t.Expect(option.ApplyTo(namespace)).To(gomega.Succeed())
+		}
+		if len(options) > 0 {
+			namespace, err = t.Client().Core().CoreV1().Namespaces().Update(t.Ctx(), namespace, metav1.UpdateOptions{})
+			t.Expect(err).NotTo(gomega.HaveOccurred())
+		}
+		return namespace
 	} else if errors.IsNotFound(err) {
```

Comment on lines 60 to 83
```go
// TestVerifyTrainingRuntime verifies the custom TrainingRuntime survives upgrade
// with its spec preserved.
func TestVerifyTrainingRuntime(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)

	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)

	// Verify TrainingRuntime still exists after upgrade by listing all runtimes
	runtimes, err := test.Client().Trainer().TrainerV1alpha1().TrainingRuntimes(runtimeNamespaceName).List(
		test.Ctx(), metav1.ListOptions{})
	test.Expect(err).NotTo(HaveOccurred(), "Failed to list TrainingRuntimes")

	// Find Custom TrainingRuntime in the list
	var found bool
	for _, runtime := range runtimes.Items {
		if runtime.Name == customRuntimeName {
			found = true
			break
		}
	}
	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
	test.T().Logf("TrainingRuntime %s/%s is preserved after upgrade", runtimeNamespaceName, customRuntimeName)
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
# First, check the file exists and get its size
wc -l tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 137

🏁 Script executed:

```shell
# Read the entire file to understand context, setup, and whether spec checks exist
cat -n tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 6001

🏁 Script executed:

```shell
# Search for customRuntimeName definition and setup
rg "customRuntimeName" tests/trainer/trainer_trainingruntime_upgrade_test.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 803

🏁 Script executed:

```shell
# Check the trainerv1alpha1 package structure to verify field paths in suggested diff
fd "trainingruntime" --type f | head -20
```

Repository: opendatahub-io/distributed-workloads
Length of output: 133


Verify spec preservation, not just existence.

The test claims to verify spec preservation (function name and comment) but only checks that the runtime exists. Add assertions to verify the container image, command, and args survive the upgrade.

Suggested changes
```diff
-	// Find Custom TrainingRuntime in the list
-	var found bool
-	for _, runtime := range runtimes.Items {
-		if runtime.Name == customRuntimeName {
-			found = true
-			break
-		}
-	}
-	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
+	// Find Custom TrainingRuntime in the list
+	var foundRuntime *trainerv1alpha1.TrainingRuntime
+	for i := range runtimes.Items {
+		if runtimes.Items[i].Name == customRuntimeName {
+			foundRuntime = &runtimes.Items[i]
+			break
+		}
+	}
+	test.Expect(foundRuntime).NotTo(BeNil(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
+	if foundRuntime != nil {
+		container := foundRuntime.Spec.Template.Spec.ReplicatedJobs[0].Template.Spec.Template.Spec.Containers[0]
+		test.Expect(container.Image).To(Equal(GetSleepImage()))
+		test.Expect(container.Command).To(Equal([]string{"sleep"}))
+		test.Expect(container.Args).To(Equal([]string{"24h"}))
+	}
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
// TestVerifyTrainingRuntime verifies the custom TrainingRuntime survives upgrade
// with its spec preserved.
func TestVerifyTrainingRuntime(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)
	// Verify TrainingRuntime still exists after upgrade by listing all runtimes
	runtimes, err := test.Client().Trainer().TrainerV1alpha1().TrainingRuntimes(runtimeNamespaceName).List(
		test.Ctx(), metav1.ListOptions{})
	test.Expect(err).NotTo(HaveOccurred(), "Failed to list TrainingRuntimes")
	// Find Custom TrainingRuntime in the list
	var foundRuntime *trainerv1alpha1.TrainingRuntime
	for i := range runtimes.Items {
		if runtimes.Items[i].Name == customRuntimeName {
			foundRuntime = &runtimes.Items[i]
			break
		}
	}
	test.Expect(foundRuntime).NotTo(BeNil(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
	if foundRuntime != nil {
		container := foundRuntime.Spec.Template.Spec.ReplicatedJobs[0].Template.Spec.Template.Spec.Containers[0]
		test.Expect(container.Image).To(Equal(GetSleepImage()))
		test.Expect(container.Command).To(Equal([]string{"sleep"}))
		test.Expect(container.Args).To(Equal([]string{"24h"}))
	}
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go` around lines 60 - 83,
The test currently only verifies existence; update TestVerifyTrainingRuntime to
also assert the custom TrainingRuntime's PodSpec preserved fields by locating
the found runtime from runtimes.Items, then extracting its container spec (e.g.,
runtime.Spec.Template.Spec.Containers[0]) and asserting container.Image,
container.Command and container.Args equal the expected values used when
creating the custom runtime (use the same expected literals/variables as the
original test setup). Use test.Expect/...To(Equal(...)) on container.Image,
container.Command and container.Args to fail the test if they differ.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_upgrade_sleep_test.go`:
- Around line 114-116: The helper sleepPodRestartCount reads
pod.Status.ContainerStatuses[0] without checking the slice length, which can
panic if ContainerStatuses is empty; update sleepPodRestartCount to defensively
check len(pod.Status.ContainerStatuses) > 0 and return a safe default (e.g., 0)
when empty, otherwise return int(pod.Status.ContainerStatuses[0].RestartCount),
so callers of sleepPodRestartCount won't encounter an out-of-range panic.
♻️ Duplicate comments (1)
tests/trainer/trainer_upgrade_sleep_test.go (1)

60-77: Verify phase should assert pod is still running.

The check validates restart count and pod count but does not confirm the pod is still in Running phase. A pod that completed or crashed with zero restarts would pass this assertion.

🐛 Suggested fix
```diff
 	test.Expect(GetPods(test, sleepNamespaceName, metav1.ListOptions{})).
 		Should(
 			And(
 				HaveLen(1),
+				ContainElement(WithTransform(podPhase, Equal(corev1.PodRunning))),
 				ContainElement(WithTransform(sleepPodRestartCount, BeNumerically("==", 0))),
 			),
 		)
```

Comment on lines +114 to +116
```go
func sleepPodRestartCount(pod corev1.Pod) int {
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```

⚠️ Potential issue | 🟡 Minor

Add bounds check to prevent potential panic.

Accessing ContainerStatuses[0] without verifying the slice is non-empty could panic if the pod hasn't initialized container status yet.

🐛 Suggested defensive fix
```diff
 func sleepPodRestartCount(pod corev1.Pod) int {
+	if len(pod.Status.ContainerStatuses) == 0 {
+		return 0
+	}
 	return int(pod.Status.ContainerStatuses[0].RestartCount)
 }
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
func sleepPodRestartCount(pod corev1.Pod) int {
	if len(pod.Status.ContainerStatuses) == 0 {
		return 0
	}
	return int(pod.Status.ContainerStatuses[0].RestartCount)
}
```
🤖 Prompt for AI Agents
In `@tests/trainer/trainer_upgrade_sleep_test.go` around lines 114 - 116, The
helper sleepPodRestartCount reads pod.Status.ContainerStatuses[0] without
checking the slice length, which can panic if ContainerStatuses is empty; update
sleepPodRestartCount to defensively check len(pod.Status.ContainerStatuses) > 0
and return a safe default (e.g., 0) when empty, otherwise return
int(pod.Status.ContainerStatuses[0].RestartCount), so callers of
sleepPodRestartCount won't encounter an out-of-range panic.

```go
// CreateOrGetUpgradeTestNamespace creates a namespace with the given name if it doesn't exist,
// or returns the existing namespace if it does. This is useful for upgrade tests where
// the namespace needs to persist across pre-upgrade and post-upgrade test phases.
func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
```
Contributor

Nitpick - suggested change:

```diff
-func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
+func CreateOrGetTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
```

Contributor Author

CreateOrGetTestNamespace already exists (https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/common/support/test.go#L170), so this would cause a naming conflict.

Contributor

In that case, something like CreateOrGetTestNamespaceWithName?
The current name hints that the function is usable only for upgrades, though in reality it can be used anywhere.
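
For reference, a rough sketch of the helper under discussion, reconstructed from the excerpts in this review; CreateTestNamespaceWithName is assumed to be the existing creation helper in the support package:

```go
func CreateOrGetUpgradeTestNamespace(t Test, name string, options ...Option[*corev1.Namespace]) *corev1.Namespace {
	// Reuse the namespace if it already exists from the pre-upgrade phase.
	namespace, err := t.Client().Core().CoreV1().Namespaces().Get(t.Ctx(), name, metav1.GetOptions{})
	if err == nil {
		return namespace
	} else if errors.IsNotFound(err) {
		// Otherwise create it fresh with the provided options.
		return CreateTestNamespaceWithName(t, name, options...)
	}
	t.Expect(err).NotTo(gomega.HaveOccurred())
	return nil
}
```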

```go
			break
		}
	}
	test.Expect(found).To(BeTrue(), "Custom TrainingRuntime %s should exist after upgrade", customRuntimeName)
```
Contributor

Nitpick - It is better to assert on the whole list of runtimes; that way you get more context (the list of all existing runtimes) when the assertion fails. See the sketch below.
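
A hedged sketch of such an assertion, using Gomega matchers so that a failure message prints the entire runtimes.Items slice; runtimeName is a small illustrative transform helper:

```go
// Asserting against the full list means a failure dumps every runtime, not just a boolean.
runtimeName := func(r trainerv1alpha1.TrainingRuntime) string { return r.Name }
test.Expect(runtimes.Items).To(
	ContainElement(WithTransform(runtimeName, Equal(customRuntimeName))),
	"Custom TrainingRuntime %s should exist after upgrade", customRuntimeName,
)
```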

```go
	customRuntimeName = "custom-sleep-runtime"
)

func TestSetupTrainingRuntime(t *testing.T) {
```
Contributor

Can you cover an additional scenario: all TrainingRuntimes that exist in PreUpgrade have to be kept after the upgrade.
To persist the runtime names you can use, for example, a ConfigMap (or anything else), as in the sketch below.
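
A minimal sketch of that idea, assuming the support package's Test helpers plus imports for corev1, metav1, and strings; the ConfigMap name and data key are illustrative, not from the PR:

```go
// storeRuntimeNames persists the pre-upgrade runtime names in a ConfigMap so
// the post-upgrade phase can compare the surviving runtimes against them.
func storeRuntimeNames(test Test, namespace string, names []string) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "pre-upgrade-runtimes", Namespace: namespace},
		Data:       map[string]string{"names": strings.Join(names, ",")},
	}
	_, err := test.Client().Core().CoreV1().ConfigMaps(namespace).Create(test.Ctx(), cm, metav1.CreateOptions{})
	test.Expect(err).NotTo(HaveOccurred())
}
```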

```go
	upgradeTrainJobName = "trainjob-upgrade"
)

func TestSetupUpgradeTrainJob(t *testing.T) {
```
Contributor

I am thinking about another test case: the same as this one, except that it uses a specific TrainingRuntime (e.g. torch_distributed_cuda128_torch29_py312.yaml; the runtime name would have to be fetched somehow so we don't have to hardcode it here, see the sketch below).

The reason for this additional test is to verify that a runtime which is not updated still works properly. It would matter mainly for upgrades between stable versions, where the default training runtime is updated but a specific training runtime isn't. It is good to verify that the updated Trainer v2 operator still works properly with an old training runtime.

WDYT?
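
A hedged sketch of fetching the runtime name dynamically instead of hardcoding it, assuming ClusterTrainingRuntime is the cluster-scoped resource behind the pre-installed runtimes; picking the first list item is illustrative only:

```go
// List the cluster-scoped training runtimes shipped with the operator and
// pick one to exercise, rather than hardcoding its name in the test.
runtimes, err := test.Client().Trainer().TrainerV1alpha1().ClusterTrainingRuntimes().List(
	test.Ctx(), metav1.ListOptions{})
test.Expect(err).NotTo(HaveOccurred())
test.Expect(runtimes.Items).NotTo(BeEmpty())
runtimeName := runtimes.Items[0].Name
test.T().Logf("Using pre-installed TrainingRuntime %s", runtimeName)
```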

@ChughShilpa ChughShilpa force-pushed the upgradeTests branch 2 times, most recently from 49efd36 to 99c1724 on January 28, 2026 at 15:11
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go`:
- Around line 66-68: The current test calls DeleteTestNamespace immediately
after creating namespace via GetNamespaceWithName which can remove the custom
runtime before TestVerifyAllTrainingRuntimes and other post-upgrade checks run;
remove the early DeleteTestNamespace calls (references: GetNamespaceWithName,
DeleteTestNamespace) from within the per-test setup and instead perform
namespace deletion in a final cleanup step that runs after all post-upgrade
validations (e.g., a dedicated TestSuiteTeardown or a final test that deletes
the namespace), ensuring TestVerifyAllTrainingRuntimes and related checks run
against the created namespace first.

Comment on lines +66 to +68
```go
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	defer DeleteTestNamespace(test, namespace)
```

⚠️ Potential issue | 🟠 Major

Avoid deleting the namespace before all post-upgrade validations finish.
DeleteTestNamespace here can remove the custom runtime before TestVerifyAllTrainingRuntimes runs, which makes the suite order-dependent and flaky.

💡 Suggested change (remove early cleanup; move to a dedicated final cleanup step)
```diff
-	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
-	defer DeleteTestNamespace(test, namespace)
+	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
```

Please relocate namespace cleanup to a stage that’s guaranteed to run after all post-upgrade checks (e.g., a dedicated cleanup test or suite-level teardown).

Also applies to: 174-205
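
One way to realize that final cleanup step, sketched under the assumption that the upgrade pipeline invokes this test explicitly as the last post-upgrade action; the test name is illustrative, not from the PR:

```go
// TestCleanupTrainingRuntimeNamespace deletes the shared upgrade namespace.
// It must be invoked only after all post-upgrade verification tests have run.
func TestCleanupTrainingRuntimeNamespace(t *testing.T) {
	Tags(t, PostUpgrade)
	test := With(t)
	namespace := GetNamespaceWithName(test, runtimeNamespaceName)
	DeleteTestNamespace(test, namespace)
}
```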

🤖 Prompt for AI Agents
In `@tests/trainer/trainer_trainingruntime_upgrade_test.go` around lines 66 - 68,
The current test calls DeleteTestNamespace immediately after creating namespace
via GetNamespaceWithName which can remove the custom runtime before
TestVerifyAllTrainingRuntimes and other post-upgrade checks run; remove the
early DeleteTestNamespace calls (references: GetNamespaceWithName,
DeleteTestNamespace) from within the per-test setup and instead perform
namespace deletion in a final cleanup step that runs after all post-upgrade
validations (e.g., a dedicated TestSuiteTeardown or a final test that deletes
the namespace), ensuring TestVerifyAllTrainingRuntimes and related checks run
against the created namespace first.

