Add test coverage for JobSet Workflow Orchestration #513

ChughShilpa · 2025-11-14T11:38:56Z

Add test coverage for JobSet Workflow Orchestration

Description

How Has This Been Tested?

go test ./tests/trainer/ -run TestJobSetWorkflow -v

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

Tests
- Added test utilities to locate Jobs and JobSets in test clusters.
- Introduced end-to-end tests for the JobSet workflow, validating both success and failure scenarios.
- Added helpers to create, configure and clean up training resources (TrainingRuntime, TrainJob, PVC) and to exercise initializer behavior in tests.

openshift-ci · 2025-11-14T11:39:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chipspeak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2025-11-14T11:39:05Z

Walkthrough

Adds test support helpers to locate Jobs and fetch a single JobSet via the dynamic client, and introduces two end-to-end tests that exercise JobSet initializer workflows (successful and failing) using a TrainingRuntime, TrainJob, and shared PVC.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Test Support: Job helpers `tests/common/support/jobs.go` | Add `GetJobByNamePattern(test Test, namespace string, pattern string) (*batchv1.Job, error)`: lists BatchV1 Jobs and returns the first Job whose name contains the given pattern (or nil). |
| Test Support: JobSet helper `tests/common/support/jobset.go` | Add `GetSingleJobSet(test Test, namespace string) (*unstructured.Unstructured, error)`: defines the JobSet GVR, lists JobSets via the dynamic client, asserts exactly one exists, and returns it. |
| Trainer tests: JobSet workflows `tests/trainer/jobset_workflow_test.go` | Add tests `TestJobSetWorkflow` and `TestFailedJobSetWorkflow` plus helpers `createTrainingRuntimeWithInitializers`, `createTrainJobWithInitializers`, `createTrainJobWithFailingInitializer`, and `deleteTrainingRuntime`. Tests create a namespace and PVC, create a TrainingRuntime (JobSet with three replicated jobs), create TrainJob(s), validate JobSet contents and TrainJob success/failure, and clean up. |

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant K8s as Kubernetes API
    participant Dyn as Dynamic Client
    participant JobSet
    participant Jobs as Replicated Jobs

    rect rgb(240,248,255)
    Note over Test,K8s: Setup resources
    Test->>K8s: create Namespace & PVC
    Test->>K8s: create TrainingRuntime (JobSet with 3 replicatedJobs)
    end

    rect rgb(245,255,240)
    Note over Test,JobSet: Execution
    Test->>K8s: create TrainJob referencing TrainingRuntime
    JobSet->>Jobs: instantiate orchestrated replicated jobs (dataset/model/node)
    Jobs-->>JobSet: report status (succeeded / failed)
    end

    rect rgb(255,250,240)
    Note over Test,Dyn: Verification & cleanup
    Test->>Dyn: GetSingleJobSet(namespace)  — list JobSets via GVR
    Test->>K8s: GetJobByNamePattern(namespace, pattern) — find job by name substring
    Test->>K8s: assert TrainJob status (succeeded / failed & message)
    Test->>K8s: delete TrainingRuntime
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pay attention to dynamic client GVR definition and List usage in tests/common/support/jobset.go.
Verify error handling and string-matching logic in tests/common/support/jobs.go.
Inspect timing, polling, and failure-message extraction in tests/trainer/jobset_workflow_test.go (flaky test risks, resource cleanup).

Poem

🐰 I hop through tests with gentle cheer,
Finding Jobs by name far and near,
JobSets form threefold in a row,
Initializers run — or sometimes slow,
I nibble logs, then twitch my ear.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Add test coverage for JobSet Workflow Orchestration' directly and clearly describes the main change in the changeset, which adds comprehensive test coverage (two new test functions and supporting helpers) for JobSet workflow functionality. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/common/support/jobs.go (1)
26-41: Tighten not‑found semantics and avoid pointer-to-loop-variable confusion

Two small improvements you might consider:

Returning nil, nil for “not found” is a bit ambiguous; a signature like (*batchv1.Job, bool, error) (or documenting the nil,nil case clearly) would make call sites more robust.

For clarity, prefer taking the address of the slice element instead of the loop variable, e.g.:
 func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) {
 	test.T().Helper()

 	jobs, err := test.Client().Core().BatchV1().Jobs(namespace).List(test.Ctx(), metav1.ListOptions{})
 	if err != nil {
 		return nil, err
 	}

-	for _, job := range jobs.Items {
-		if strings.Contains(job.Name, pattern) {
-			return &job, nil
-		}
-	}
+	for i := range jobs.Items {
+		if strings.Contains(jobs.Items[i].Name, pattern) {
+			return &jobs.Items[i], nil
+		}
+	}

 	return nil, nil
 }
This avoids the common “pointer to range variable” gotcha and makes the intent explicit.
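
Both suggestions above can be sketched together. This is only an illustration under stated assumptions: `Job` is a hypothetical stand-in for `batchv1.Job`, `findJobByNamePattern` is a made-up name, and the cluster List call is replaced by a plain slice so the example runs on its own. It is not the helper from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Job is a hypothetical stand-in for batchv1.Job, reduced to the one field
// the lookup needs.
type Job struct {
	Name string
}

// findJobByNamePattern returns a pointer to the first element of jobs whose
// name contains pattern. The bool makes "not found" explicit, so callers do
// not have to interpret a nil pointer.
func findJobByNamePattern(jobs []Job, pattern string) (*Job, bool) {
	// Iterate by index so the returned pointer refers to the slice element
	// itself rather than to a copy held in a range variable.
	for i := range jobs {
		if strings.Contains(jobs[i].Name, pattern) {
			return &jobs[i], true
		}
	}
	return nil, false
}

func main() {
	// Job names here are illustrative only.
	jobs := []Job{{Name: "train-dataset-initializer-0"}, {Name: "train-node-0"}}

	if job, ok := findJobByNamePattern(jobs, "node"); ok {
		fmt.Println("found:", job.Name)
	}
	if _, ok := findJobByNamePattern(jobs, "missing"); !ok {
		fmt.Println("no job matches")
	}
}
```

With the extra boolean, a call site can tell "no match" apart from a lookup error without relying on a documented nil, nil convention.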
tests/trainer/jobset_workflow_test.go (1)

131-357: TrainingRuntime / JobSet spec is consistent; consider small extraction to reduce duplication

The createTrainingRuntimeWithInitializers helper wires three ReplicatedJobs with the expected dependencies and shared workspace PVC, and the resource requests/limits look reasonable for tests. The structure is clear and matches the intended workflow.

If you ever find this growing or needing reuse, you could optionally extract the common Volume/VolumeMount and BackoffLimit/RestartPolicyNever pieces into small local helpers to reduce repetition, but it’s fine as-is for readability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54395f4 and e5e350d.

📒 Files selected for processing (3)

tests/common/support/jobs.go (1 hunks)
tests/common/support/jobset.go (1 hunks)
tests/trainer/jobset_workflow_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

tests/common/support/jobset.go (2)

tests/common/support/test.go (2)

Test (34-45)

T (90-102)

tests/common/support/client.go (1)

Client (39-50)

tests/common/support/jobs.go (3)

tests/common/support/test.go (2)

Test (34-45)

T (90-102)

tests/common/support/batch.go (1)

Job (27-33)

tests/common/support/client.go (1)

Client (39-50)

tests/trainer/jobset_workflow_test.go (9)

tests/common/support/test.go (3)

T (90-102)

With (61-63)

Test (34-45)

tests/common/support/core.go (2)

CreatePersistentVolumeClaim (242-276)

AccessModes (235-240)

tests/common/support/jobset.go (1)

GetSingleJobSet (38-49)

tests/common/support/support.go (3)

TestTimeoutShort (33-33)

TestTimeoutLong (35-35)

TestTimeoutMedium (34-34)

tests/common/support/utils.go (2)

Log (34-34)

Ptr (27-29)

tests/common/support/trainjob.go (4)

TrainJob (26-32)

TrainJobConditionComplete (42-44)

TrainJobConditionFailed (46-48)

TrainJobs (34-40)

tests/kfto/environment.go (2)

GetAlpacaDatasetImage (36-38)

GetBloomModelImage (32-34)

tests/common/support/environment.go (1)

GetTrainingCudaPyTorch28Image (103-105)

tests/common/support/client.go (1)

Client (39-50)

🔇 Additional comments (4)

tests/common/support/jobset.go (1)

27-48: JobSet helper is clear and idiomatic for test usage

The jobsetOperatorGVR definition and GetSingleJobSet helper are straightforward and fit well with the existing Test/Gomega patterns. Asserting a single JobSet via HaveLen(1) before returning the first item is reasonable in this test-only context.
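
The "exactly one" check described above can be sketched without any Kubernetes dependencies. This assumes a hypothetical generic helper named `single` that returns an error where the real helper uses a Gomega `HaveLen(1)` assertion; the item name is illustrative.

```go
package main

import "fmt"

// single returns the only element of items, or an error when the slice holds
// zero or more than one element.
func single[T any](items []T) (T, error) {
	var zero T
	if len(items) != 1 {
		return zero, fmt.Errorf("expected exactly 1 item, got %d", len(items))
	}
	return items[0], nil
}

func main() {
	// Stand-in for the Items of a listed JobSet collection.
	names := []string{"trainjob-abc"}
	name, err := single(names)
	fmt.Println(name, err)

	// An empty list is reported as an error rather than indexing items[0].
	_, err = single([]string{})
	fmt.Println(err)
}
```

Checking the length before indexing is what keeps the helper from panicking when the JobSet has not been created yet.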

tests/trainer/jobset_workflow_test.go (3)

80-129: Failed workflow test logic looks solid and exercises JobSet status conditions

The TestFailedJobSetWorkflow path correctly:

Waits for the JobSet to exist via GetSingleJobSet.

Polls status.conditions with unstructured.NestedSlice and explicitly errors when the field is missing.

Locates a "Failed" condition with status == "True", captures the message, and asserts the TrainJob eventually transitions to TrainJobFailed.

This gives good coverage of the failure behavior and ensures the TrainJob reflects JobSet failure as expected.
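
The condition lookup described above can be sketched on a plain `map[string]interface{}`, which is the shape an unstructured object carries. This is a simplified illustration: `failedMessage` is a hypothetical name, and the hand-written comma-ok assertions stand in for the `unstructured.NestedSlice` and `unstructured.NestedString` calls the test uses.

```go
package main

import (
	"errors"
	"fmt"
)

// failedMessage looks for a condition with type "Failed" and status "True"
// under status.conditions and returns its message. Every type assertion uses
// the comma-ok form, so a missing or malformed field yields an error instead
// of a panic.
func failedMessage(obj map[string]interface{}) (string, error) {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return "", errors.New("status not found")
	}
	conditions, ok := status["conditions"].([]interface{})
	if !ok {
		return "", errors.New("status.conditions not found")
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue // skip malformed entries
		}
		condType, _ := cond["type"].(string)
		condStatus, _ := cond["status"].(string)
		if condType == "Failed" && condStatus == "True" {
			msg, _ := cond["message"].(string)
			return msg, nil
		}
	}
	return "", errors.New("no Failed condition with status True yet")
}

func main() {
	// Illustrative object; the message text is made up.
	jobSet := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Failed", "status": "True", "message": "initializer failed"},
			},
		},
	}
	msg, err := failedMessage(jobSet)
	fmt.Println(msg, err)
}
```

Returning an error for the "not there yet" cases lets a polling caller retry until the condition appears.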

359-519: TrainJob helpers align well with initializer semantics

Both createTrainJobWithInitializers and createTrainJobWithFailingInitializer:

Correctly reference the TrainingRuntime via RuntimeRef.

Configure dataset/model initializer env vars that match what the initializer Jobs expect.

Use clear shell commands to validate dataset/model visibility (success case) or ensure the trainer won’t run if the initializer fails (failure case).

These helpers nicely encapsulate the TrainJob setup for the two scenarios and keep the tests themselves concise.

521-534: Runtime cleanup helper is safe and test-friendly

deleteTrainingRuntime wraps the delete call and logs a warning instead of failing the test on cleanup errors, which is a good pattern for integration tests where teardown shouldn’t mask primary failures.

tests/trainer/jobset_workflow_test.go

coderabbitai

Actionable comments posted: 2

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e5e350d and 8609f3a.

📒 Files selected for processing (3)

tests/common/support/jobs.go (1 hunks)
tests/common/support/jobset.go (1 hunks)
tests/trainer/jobset_workflow_test.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/common/support/jobset.go

🧰 Additional context used

🧬 Code graph analysis (2)

tests/trainer/jobset_workflow_test.go (10)

tests/common/support/test.go (3)

T (90-102)

With (61-63)

Test (34-45)

tests/common/test_tag.go (1)

Tags (32-40)

tests/common/support/core.go (2)

CreatePersistentVolumeClaim (242-276)

AccessModes (235-240)

tests/common/support/jobset.go (1)

GetSingleJobSet (38-49)

tests/common/support/support.go (3)

TestTimeoutShort (33-33)

TestTimeoutLong (35-35)

TestTimeoutMedium (34-34)

tests/common/support/utils.go (2)

Log (34-34)

Ptr (27-29)

tests/common/support/trainjob.go (4)

TrainJob (26-32)

TrainJobConditionComplete (42-44)

TrainJobConditionFailed (46-48)

TrainJobs (34-40)

tests/kfto/environment.go (2)

GetAlpacaDatasetImage (36-38)

GetBloomModelImage (32-34)

tests/common/support/environment.go (1)

GetTrainingCudaPyTorch28Image (103-105)

tests/common/support/client.go (1)

Client (39-50)

tests/common/support/jobs.go (3)

tests/common/support/test.go (2)

Test (34-45)

T (90-102)

tests/common/support/batch.go (1)

Job (27-33)

tests/common/support/client.go (1)

Client (39-50)

🔇 Additional comments (1)

tests/trainer/jobset_workflow_test.go (1)

56-74: LGTM: Eventually block correctly handles missing field.

The code properly handles the case when spec.replicatedJobs is not found by returning an error (lines 67-68), ensuring that Eventually keeps polling until the field exists. This addresses the concern raised in the previous review.

tests/common/support/jobs.go

tests/trainer/jobset_workflow_test.go

coderabbitai

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8609f3a and fc6e728.

📒 Files selected for processing (3)

tests/common/support/jobs.go (1 hunks)
tests/common/support/jobset.go (1 hunks)
tests/trainer/jobset_workflow_test.go (1 hunks)

🔇 Additional comments (5)

tests/common/support/jobset.go (1)

27-49: LGTM! Clean test helper for JobSet retrieval.

The implementation correctly uses the dynamic client with the JobSet v1alpha2 API. Error handling is appropriate: the function returns errors from the List operation and uses a gomega assertion to verify exactly one JobSet exists, which is the expected behavior for a test helper.

tests/common/support/jobs.go (1)

26-42: LGTM! Loop variable issue correctly resolved.

The past critical issue regarding taking the address of a loop variable has been properly fixed. The code now iterates by index and returns &jobs.Items[i], avoiding the pointer-to-reused-variable bug.

One minor observation: the function returns (nil, nil) when no job matches the pattern. While this may be intentional for your test usage, consider documenting this behavior in a comment if it's not immediately obvious to future readers.

tests/trainer/jobset_workflow_test.go (3)

56-74: Well done fixing the NestedSlice polling logic.

The past issue where Eventually could succeed when spec.replicatedJobs was missing has been correctly addressed. The code now properly separates the error check from the found check (lines 64-69), ensuring the polling loop continues until the field actually exists.

102-143: Excellent error handling improvements in the failure detection logic.

The past issues with unchecked type assertions and ignored NestedString return values have been thoroughly addressed:

Line 118: Type assertion now uses the comma-ok form with proper fallback

Lines 122-125, 127-130, 132-138: All NestedString calls now check both found and err before using values

The current implementation is robust and handles missing or malformed fields gracefully.

151-554: Well-designed test helpers and workflow structure.

The helper functions create a realistic multi-stage JobSet workflow:

Dataset initializer → Model initializer → Trainer node, with proper DependsOn chains (lines 237-241, 307-311)

BackoffLimit: 0 throughout prevents retry noise in test runs

Shell scripts include appropriate validation (e.g., FAIL_ON_PURPOSE check at lines 190-193)

Cleanup helper logs warnings rather than failing tests (lines 549-553)

The test coverage effectively validates both success and failure paths for the JobSet orchestration.
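
The ordering implied by the DependsOn chain can be illustrated with a small dependency sort. This is not how the JobSet controller is implemented; `replicatedJob` is a hypothetical reduced type, `startOrder` is a made-up name, and the job names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// replicatedJob is a hypothetical, reduced stand-in for a JobSet
// ReplicatedJob: only the name and the names it depends on.
type replicatedJob struct {
	Name      string
	DependsOn []string
}

// startOrder returns job names in an order where every job comes after the
// jobs it depends on. It reports an error for unknown or cyclic dependencies.
func startOrder(jobs []replicatedJob) ([]string, error) {
	byName := make(map[string]replicatedJob, len(jobs))
	for _, j := range jobs {
		byName[j.Name] = j
	}
	const (
		visiting = 1
		done     = 2
	)
	state := make(map[string]int, len(jobs))
	var order []string
	var visit func(name string) error
	visit = func(name string) error {
		job, ok := byName[name]
		if !ok {
			return fmt.Errorf("unknown dependency %q", name)
		}
		switch state[name] {
		case done:
			return nil
		case visiting:
			return errors.New("dependency cycle detected")
		}
		state[name] = visiting
		for _, dep := range job.DependsOn {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}
	for _, j := range jobs {
		if err := visit(j.Name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	// Declared out of order on purpose; the dependencies decide the result.
	jobs := []replicatedJob{
		{Name: "node", DependsOn: []string{"model-initializer"}},
		{Name: "model-initializer", DependsOn: []string{"dataset-initializer"}},
		{Name: "dataset-initializer"},
	}
	order, err := startOrder(jobs)
	fmt.Println(order, err)
}
```

In the failure scenario the chain stops at the failing initializer, which is why the trainer node is not expected to run.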

tests/trainer/jobset_workflow_test.go

ChughShilpa requested review from abhijeet-dhumal, kapil27, robert-bell and sutaakar November 14, 2025 11:38

openshift-ci bot requested review from Fiona-Waters and kryanbeane November 14, 2025 11:39

ChughShilpa removed request for Fiona-Waters and kryanbeane November 14, 2025 11:39

coderabbitai bot reviewed Nov 14, 2025

tests/trainer/jobset_workflow_test.go (resolved)

ChughShilpa force-pushed the jobset branch from e5e350d to 8609f3a Compare November 14, 2025 11:59

coderabbitai bot reviewed Nov 14, 2025

tests/common/support/jobs.go (outdated, resolved)

tests/trainer/jobset_workflow_test.go (resolved)

Add test coverage for JobSet Workflow Orchestration

fc6e728

ChughShilpa force-pushed the jobset branch from 8609f3a to fc6e728 Compare November 14, 2025 12:18

coderabbitai bot reviewed Nov 14, 2025

tests/trainer/jobset_workflow_test.go (resolved)
