-
Notifications
You must be signed in to change notification settings - Fork 70
Add test coverage for JobSet Workflow Orchestration #513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAdds test support helpers to locate Jobs and fetch a single JobSet via the dynamic client, and introduces two end-to-end tests that exercise JobSet initializer workflows (successful and failing) using a TrainingRuntime, TrainJob, and shared PVC. Changes
Sequence Diagram(s)sequenceDiagram
participant Test
participant K8s as Kubernetes API
participant Dyn as Dynamic Client
participant JobSet
participant Jobs as Replicated Jobs
rect rgb(240,248,255)
Note over Test,K8s: Setup resources
Test->>K8s: create Namespace & PVC
Test->>K8s: create TrainingRuntime (JobSet with 3 replicatedJobs)
end
rect rgb(245,255,240)
Note over Test,JobSet: Execution
Test->>K8s: create TrainJob referencing TrainingRuntime
JobSet->>Jobs: instantiate orchestrated replicated jobs (dataset/model/node)
Jobs-->>JobSet: report status (succeeded / failed)
end
rect rgb(255,250,240)
Note over Test,Dyn: Verification & cleanup
Test->>Dyn: GetSingleJobSet(namespace) — list JobSets via GVR
Test->>K8s: GetJobByNamePattern(namespace, pattern) — find job by name substring
Test->>K8s: assert TrainJob status (succeeded / failed & message)
Test->>K8s: delete TrainingRuntime
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/common/support/jobs.go (1)
26-41: Tighten not‑found semantics and avoid pointer-to-loop-variable confusionTwo small improvements you might consider:
- Returning
nil, nilfor “not found” is a bit ambiguous; a signature like(*batchv1.Job, bool, error)(or documenting the nil,nil case clearly) would make call sites more robust.- For clarity, prefer taking the address of the slice element instead of the loop variable, e.g.:
-func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) { +func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) { test.T().Helper() jobs, err := test.Client().Core().BatchV1().Jobs(namespace).List(test.Ctx(), metav1.ListOptions{}) if err != nil { return nil, err } - for _, job := range jobs.Items { - if strings.Contains(job.Name, pattern) { - return &job, nil - } - } + for i := range jobs.Items { + if strings.Contains(jobs.Items[i].Name, pattern) { + return &jobs.Items[i], nil + } + } return nil, nil }This avoids the common “pointer to range variable” gotcha and makes the intent explicit.
tests/trainer/jobset_workflow_test.go (1)
131-357: TrainingRuntime / JobSet spec is consistent; consider small extraction to reduce duplicationThe
createTrainingRuntimeWithInitializershelper wires threeReplicatedJobs with the expected dependencies and sharedworkspacePVC, and the resource requests/limits look reasonable for tests. The structure is clear and matches the intended workflow.If you ever find this growing or needing reuse, you could optionally extract the common
Volume/VolumeMountandBackoffLimit/RestartPolicyNeverpieces into small local helpers to reduce repetition, but it’s fine as-is for readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (4)
tests/common/support/jobset.go (1)
27-48: JobSet helper is clear and idiomatic for test usageThe
jobsetOperatorGVRdefinition andGetSingleJobSethelper are straightforward and fit well with the existingTest/Gomega patterns. Asserting a single JobSet viaHaveLen(1)before returning the first item is reasonable in this test-only context.tests/trainer/jobset_workflow_test.go (3)
80-129: Failed workflow test logic looks solid and exercises JobSet status conditionsThe
TestFailedJobSetWorkflowpath correctly:
- Waits for the JobSet to exist via
GetSingleJobSet.- Polls
status.conditionswithunstructured.NestedSliceand explicitly errors when the field is missing.- Locates a
"Failed"condition withstatus == "True", captures the message, and asserts the TrainJob eventually transitions toTrainJobFailed.This gives good coverage of the failure behavior and ensures the TrainJob reflects JobSet failure as expected.
359-519: TrainJob helpers align well with initializer semanticsBoth
createTrainJobWithInitializersandcreateTrainJobWithFailingInitializer:
- Correctly reference the
TrainingRuntimeviaRuntimeRef.- Configure dataset/model initializer env vars that match what the initializer Jobs expect.
- Use clear shell commands to validate dataset/model visibility (success case) or ensure the trainer won’t run if the initializer fails (failure case).
These helpers nicely encapsulate the TrainJob setup for the two scenarios and keep the tests themselves concise.
521-534: Runtime cleanup helper is safe and test-friendly
deleteTrainingRuntimewraps the delete call and logs a warning instead of failing the test on cleanup errors, which is a good pattern for integration tests where teardown shouldn’t mask primary failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/jobset_workflow_test.go (10)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/test_tag.go (1)
Tags(32-40)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (1)
tests/trainer/jobset_workflow_test.go (1)
56-74: LGTM: Eventually block correctly handles missing field.The code properly handles the case when
spec.replicatedJobsis not found by returning an error (lines 67-68), ensuring thatEventuallykeeps polling until the field exists. This addresses the concern raised in the previous review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/common/support/jobs.go(1 hunks)tests/common/support/jobset.go(1 hunks)tests/trainer/jobset_workflow_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/client.go (1)
Client(39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
Test(34-45)T(90-102)tests/common/support/batch.go (1)
Job(27-33)tests/common/support/client.go (1)
Client(39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
T(90-102)With(61-63)Test(34-45)tests/common/support/core.go (2)
CreatePersistentVolumeClaim(242-276)AccessModes(235-240)tests/common/support/jobset.go (1)
GetSingleJobSet(38-49)tests/common/support/support.go (3)
TestTimeoutShort(33-33)TestTimeoutLong(35-35)TestTimeoutMedium(34-34)tests/common/support/utils.go (2)
Log(34-34)Ptr(27-29)tests/common/support/trainjob.go (4)
TrainJob(26-32)TrainJobConditionComplete(42-44)TrainJobConditionFailed(46-48)TrainJobs(34-40)tests/kfto/environment.go (2)
GetAlpacaDatasetImage(36-38)GetBloomModelImage(32-34)tests/common/support/environment.go (1)
GetTrainingCudaPyTorch28Image(103-105)tests/common/support/client.go (1)
Client(39-50)
🔇 Additional comments (5)
tests/common/support/jobset.go (1)
27-49: LGTM! Clean test helper for JobSet retrieval.The implementation correctly uses the dynamic client with the JobSet v1alpha2 API. Error handling is appropriate: the function returns errors from the List operation and uses a gomega assertion to verify exactly one JobSet exists, which is the expected behavior for a test helper.
tests/common/support/jobs.go (1)
26-42: LGTM! Loop variable issue correctly resolved.The past critical issue regarding taking the address of a loop variable has been properly fixed. The code now iterates by index and returns
&jobs.Items[i], avoiding the pointer-to-reused-variable bug.One minor observation: the function returns
(nil, nil)when no job matches the pattern. While this may be intentional for your test usage, consider documenting this behavior in a comment if it's not immediately obvious to future readers.tests/trainer/jobset_workflow_test.go (3)
56-74: Well done fixing theNestedSlicepolling logic.The past issue where
Eventuallycould succeed whenspec.replicatedJobswas missing has been correctly addressed. The code now properly separates the error check from thefoundcheck (lines 64-69), ensuring the polling loop continues until the field actually exists.
102-143: Excellent error handling improvements in the failure detection logic.The past issues with unchecked type assertions and ignored
NestedStringreturn values have been thoroughly addressed:
- Line 118: Type assertion now uses the comma-ok form with proper fallback
- Lines 122-125, 127-130, 132-138: All
NestedStringcalls now check bothfoundanderrbefore using valuesThe current implementation is robust and handles missing or malformed fields gracefully.
151-554: Well-designed test helpers and workflow structure.The helper functions create a realistic multi-stage JobSet workflow:
- Dataset initializer → Model initializer → Trainer node, with proper
DependsOnchains (lines 237-241, 307-311)BackoffLimit: 0throughout prevents retry noise in test runs- Shell scripts include appropriate validation (e.g.,
FAIL_ON_PURPOSEcheck at lines 190-193)- Cleanup helper logs warnings rather than failing tests (lines 549-553)
The test coverage effectively validates both success and failure paths for the JobSet orchestration.
Add test coverage for JobSet Workflow Orchestration
Description
How Has This Been Tested?
go test ./tests/trainer/ -run TestJobSetWorkflow -vMerge criteria:
Summary by CodeRabbit