@ChughShilpa (Contributor) commented Nov 14, 2025

Add test coverage for JobSet Workflow Orchestration

Description

How Has This Been Tested?

go test ./tests/trainer/ -run TestJobSetWorkflow -v

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests
    • Added test utilities to locate Jobs and JobSets in test clusters.
    • Introduced end-to-end tests for the JobSet workflow, validating both success and failure scenarios.
    • Added helpers to create, configure and clean up training resources (TrainingRuntime, TrainJob, PVC) and to exercise initializer behavior in tests.

openshift-ci bot commented Nov 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chipspeak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot commented Nov 14, 2025

Walkthrough

Adds test support helpers to locate Jobs and fetch a single JobSet via the dynamic client, and introduces two end-to-end tests that exercise JobSet initializer workflows (successful and failing) using a TrainingRuntime, TrainJob, and shared PVC.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Test Support: Job helpers<br>tests/common/support/jobs.go | Add GetJobByNamePattern(test Test, namespace string, pattern string) (*batchv1.Job, error), which lists BatchV1 Jobs and returns the first Job whose name contains the given pattern (or nil). |
| Test Support: JobSet helper<br>tests/common/support/jobset.go | Add GetSingleJobSet(test Test, namespace string) (*unstructured.Unstructured, error), which defines the JobSet GVR, lists JobSets via the dynamic client, asserts exactly one exists, and returns it. |
| Trainer tests: JobSet workflows<br>tests/trainer/jobset_workflow_test.go | Add tests TestJobSetWorkflow and TestFailedJobSetWorkflow plus helpers createTrainingRuntimeWithInitializers, createTrainJobWithInitializers, createTrainJobWithFailingInitializer, and deleteTrainingRuntime. Tests create namespace/PVC, create TrainingRuntime (JobSet with three replicated jobs), create TrainJob(s), validate JobSet contents and TrainJob success/failure, and clean up. |

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant K8s as Kubernetes API
    participant Dyn as Dynamic Client
    participant JobSet
    participant Jobs as Replicated Jobs

    rect rgb(240,248,255)
    Note over Test,K8s: Setup resources
    Test->>K8s: create Namespace & PVC
    Test->>K8s: create TrainingRuntime (JobSet with 3 replicatedJobs)
    end

    rect rgb(245,255,240)
    Note over Test,JobSet: Execution
    Test->>K8s: create TrainJob referencing TrainingRuntime
    JobSet->>Jobs: instantiate orchestrated replicated jobs (dataset/model/node)
    Jobs-->>JobSet: report status (succeeded / failed)
    end

    rect rgb(255,250,240)
    Note over Test,Dyn: Verification & cleanup
    Test->>Dyn: GetSingleJobSet(namespace)  — list JobSets via GVR
    Test->>K8s: GetJobByNamePattern(namespace, pattern) — find job by name substring
    Test->>K8s: assert TrainJob status (succeeded / failed & message)
    Test->>K8s: delete TrainingRuntime
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to dynamic client GVR definition and List usage in tests/common/support/jobset.go.
  • Verify error handling and string-matching logic in tests/common/support/jobs.go.
  • Inspect timing, polling, and failure-message extraction in tests/trainer/jobset_workflow_test.go (flaky test risks, resource cleanup).

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Add test coverage for JobSet Workflow Orchestration' directly and clearly describes the main change in the changeset, which adds comprehensive test coverage (two new test functions and supporting helpers) for JobSet workflow functionality. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/common/support/jobs.go (1)

26-41: Tighten not‑found semantics and avoid pointer-to-loop-variable confusion

Two small improvements you might consider:

  1. Returning nil, nil for “not found” is a bit ambiguous; a signature like (*batchv1.Job, bool, error) (or documenting the nil,nil case clearly) would make call sites more robust.
  2. For clarity, prefer taking the address of the slice element instead of the loop variable, e.g.:
 func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) {
 	test.T().Helper()

 	jobs, err := test.Client().Core().BatchV1().Jobs(namespace).List(test.Ctx(), metav1.ListOptions{})
 	if err != nil {
 		return nil, err
 	}

-	for _, job := range jobs.Items {
-		if strings.Contains(job.Name, pattern) {
-			return &job, nil
-		}
-	}
+	for i := range jobs.Items {
+		if strings.Contains(jobs.Items[i].Name, pattern) {
+			return &jobs.Items[i], nil
+		}
+	}

 	return nil, nil
 }

This avoids the common “pointer to range variable” gotcha and makes the intent explicit.
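Both suggestions can be combined in one helper. The following is a minimal, standard-library-only sketch, not the code from this PR: `Job` is a hypothetical stand-in for `batchv1.Job`, `findJobByNamePattern` is an illustrative name, and the error return is omitted because no API call is made. Note that since Go 1.22 each loop iteration gets its own variable, so the index form is mostly about making the intent explicit and returning a pointer into the slice rather than to a copy.

```go
package main

import (
	"fmt"
	"strings"
)

// Job is a hypothetical stand-in for batchv1.Job; only the name matters here.
type Job struct {
	Name string
}

// findJobByNamePattern returns a pointer to the first element of jobs whose
// name contains pattern. The boolean reports whether a match was found, so
// callers never have to interpret a (nil, nil) result.
func findJobByNamePattern(jobs []Job, pattern string) (*Job, bool) {
	for i := range jobs {
		if strings.Contains(jobs[i].Name, pattern) {
			// Address of the slice element, not of a loop variable copy.
			return &jobs[i], true
		}
	}
	return nil, false
}

func main() {
	// The job names below are made up for illustration.
	jobs := []Job{{Name: "demo-dataset-initializer-0"}, {Name: "demo-node-0"}}

	if job, ok := findJobByNamePattern(jobs, "node"); ok {
		fmt.Println("found:", job.Name)
	}
	if _, ok := findJobByNamePattern(jobs, "missing"); !ok {
		fmt.Println("no job matched")
	}
}
```

With the explicit boolean, a call site can distinguish "no match yet" from a listing error without a nil check on the returned pointer.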

tests/trainer/jobset_workflow_test.go (1)

131-357: TrainingRuntime / JobSet spec is consistent; consider small extraction to reduce duplication

The createTrainingRuntimeWithInitializers helper wires three ReplicatedJobs with the expected dependencies and shared workspace PVC, and the resource requests/limits look reasonable for tests. The structure is clear and matches the intended workflow.

If you ever find this growing or needing reuse, you could optionally extract the common Volume/VolumeMount and BackoffLimit/RestartPolicyNever pieces into small local helpers to reduce repetition, but it’s fine as-is for readability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54395f4 and e5e350d.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (38-49)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (39-50)
🔇 Additional comments (4)
tests/common/support/jobset.go (1)

27-48: JobSet helper is clear and idiomatic for test usage

The jobsetOperatorGVR definition and GetSingleJobSet helper are straightforward and fit well with the existing Test/Gomega patterns. Asserting a single JobSet via HaveLen(1) before returning the first item is reasonable in this test-only context.

tests/trainer/jobset_workflow_test.go (3)

80-129: Failed workflow test logic looks solid and exercises JobSet status conditions

The TestFailedJobSetWorkflow path correctly:

  • Waits for the JobSet to exist via GetSingleJobSet.
  • Polls status.conditions with unstructured.NestedSlice and explicitly errors when the field is missing.
  • Locates a "Failed" condition with status == "True", captures the message, and asserts the TrainJob eventually transitions to TrainJobFailed.

This gives good coverage of the failure behavior and ensures the TrainJob reflects JobSet failure as expected.
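The condition lookup described above can be sketched without any Kubernetes dependency. This is an illustration under assumptions, not the test's actual code: `failedConditionMessage` is a hypothetical name, and the input is a plain `[]interface{}` shaped like the `status.conditions` list of an unstructured object.

```go
package main

import "fmt"

// failedConditionMessage scans a list shaped like status.conditions of an
// unstructured object and returns the message of the first condition with
// type "Failed" and status "True". Entries that are not maps, or whose
// fields are not strings, are skipped rather than causing a panic.
func failedConditionMessage(conditions []interface{}) (string, bool) {
	for _, c := range conditions {
		// Comma-ok assertion: a malformed entry is ignored.
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		condType, _ := cond["type"].(string)
		status, _ := cond["status"].(string)
		if condType != "Failed" || status != "True" {
			continue
		}
		message, _ := cond["message"].(string)
		return message, true
	}
	return "", false
}

func main() {
	// Example data only; the message text is made up.
	conditions := []interface{}{
		"malformed entry",
		map[string]interface{}{"type": "Completed", "status": "False"},
		map[string]interface{}{"type": "Failed", "status": "True", "message": "example failure message"},
	}
	if msg, ok := failedConditionMessage(conditions); ok {
		fmt.Println("JobSet failed:", msg)
	}
}
```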


359-519: TrainJob helpers align well with initializer semantics

Both createTrainJobWithInitializers and createTrainJobWithFailingInitializer:

  • Correctly reference the TrainingRuntime via RuntimeRef.
  • Configure dataset/model initializer env vars that match what the initializer Jobs expect.
  • Use clear shell commands to validate dataset/model visibility (success case) or ensure the trainer won’t run if the initializer fails (failure case).

These helpers nicely encapsulate the TrainJob setup for the two scenarios and keep the tests themselves concise.


521-534: Runtime cleanup helper is safe and test-friendly

deleteTrainingRuntime wraps the delete call and logs a warning instead of failing the test on cleanup errors, which is a good pattern for integration tests where teardown shouldn’t mask primary failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI


Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5e350d and 8609f3a.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/jobset_workflow_test.go (10)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/test_tag.go (1)
  • Tags (32-40)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (38-49)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (39-50)
🔇 Additional comments (1)
tests/trainer/jobset_workflow_test.go (1)

56-74: LGTM: Eventually block correctly handles missing field.

The code properly handles the case when spec.replicatedJobs is not found by returning an error (lines 67-68), ensuring that Eventually keeps polling until the field exists. This addresses the concern raised in the previous review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8609f3a and fc6e728.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🔇 Additional comments (5)
tests/common/support/jobset.go (1)

27-49: LGTM! Clean test helper for JobSet retrieval.

The implementation correctly uses the dynamic client with the JobSet v1alpha2 API. Error handling is appropriate: the function returns errors from the List operation and uses a gomega assertion to verify exactly one JobSet exists, which is the expected behavior for a test helper.

tests/common/support/jobs.go (1)

26-42: LGTM! Loop variable issue correctly resolved.

The past critical issue regarding taking the address of a loop variable has been properly fixed. The code now iterates by index and returns &jobs.Items[i], avoiding the pointer-to-reused-variable bug.

One minor observation: the function returns (nil, nil) when no job matches the pattern. While this may be intentional for your test usage, consider documenting this behavior in a comment if it's not immediately obvious to future readers.

tests/trainer/jobset_workflow_test.go (3)

56-74: Well done fixing the NestedSlice polling logic.

The past issue where Eventually could succeed when spec.replicatedJobs was missing has been correctly addressed. The code now properly separates the error check from the found check (lines 64-69), ensuring the polling loop continues until the field actually exists.


102-143: Excellent error handling improvements in the failure detection logic.

The past issues with unchecked type assertions and ignored NestedString return values have been thoroughly addressed:

  1. Line 118: Type assertion now uses the comma-ok form with proper fallback
  2. Lines 122-125, 127-130, 132-138: All NestedString calls now check both found and err before using values

The current implementation is robust and handles missing or malformed fields gracefully.


151-554: Well-designed test helpers and workflow structure.

The helper functions create a realistic multi-stage JobSet workflow:

  • Dataset initializer → Model initializer → Trainer node, with proper DependsOn chains (lines 237-241, 307-311)
  • BackoffLimit: 0 throughout prevents retry noise in test runs
  • Shell scripts include appropriate validation (e.g., FAIL_ON_PURPOSE check at lines 190-193)
  • Cleanup helper logs warnings rather than failing tests (lines 549-553)

The test coverage effectively validates both success and failure paths for the JobSet orchestration.
