-
Notifications
You must be signed in to change notification settings - Fork 73
Updating training runtimes for training hub e2es #596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughUpdates training runtime test data by introducing an ODHImage field to the ClusterTrainingRuntime struct, populating it with new runtime identifiers and images, and adding registry-specific image validation logic for RedHat and Quay.io registries. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/trainer/cluster_training_runtimes_test.go`:
- Around line 37-43: The test uses ODHImage in ClusterTrainingRuntime literals
but the ClusterTrainingRuntime struct is missing that field; update the
ClusterTrainingRuntime type to add an ODHImage string field (alongside existing
Name and RHOAIImage) so the literals in tests like expectedRuntimes compile, and
ensure any constructors or usages of ClusterTrainingRuntime (e.g., in tests or
factory functions) are updated to accommodate the new ODHImage field.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/trainer/cluster_training_runtimes_test.gotests/trainer/resources/osft.ipynbtests/trainer/resources/sft.ipynb
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-15T09:49:36.063Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/resources/osft.ipynb:460-493
Timestamp: 2025-12-15T09:49:36.063Z
Learning: In the Kubeflow Trainer SDK version used by this repository, get_job_logs(follow=False) returns a generator of log lines, just like when follow=True. When using this API, collect the generator into a list and then join the lines to form the full log string (e.g., logs = ''.join(list(get_job_logs(follow=False)))).
Applied to files:
tests/trainer/resources/osft.ipynbtests/trainer/resources/sft.ipynb
📚 Learning: 2025-12-12T12:23:09.527Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/sdk_tests/osft_traininghub_tests.go:33-36
Timestamp: 2025-12-12T12:23:09.527Z
Learning: In Go tests, the working directory during go test is the package directory being tested. Relative paths (e.g., "resources/file.txt") resolve to paths under that package directory. When tests live in subdirectories (e.g., tests/trainer/sdk_tests/), referencing resources should still be relative to the package directory being tested (tests/trainer/), so ensure resource paths are resolved from the package root of each test package.
Applied to files:
tests/trainer/cluster_training_runtimes_test.go
📚 Learning: 2025-12-16T07:55:49.810Z
Learnt from: ChughShilpa
Repo: opendatahub-io/distributed-workloads PR: 568
File: tests/trainer/cluster_training_runtimes_test.go:100-101
Timestamp: 2025-12-16T07:55:49.810Z
Learning: In Go tests using Gomega, ContainSubstring(substr string, args ...any) supports variadic args that are used with fmt.Sprintf to format the substring. This means you can pass formatting operands such as ContainSubstring("%s/rhoai/%s", registryName, expectedRuntime.Image) and have the matcher evaluate the formatted substring. Ensure you import fmt only if you actually format strings beforehand; otherwise rely on the variadic args as shown. This pattern applies to test files under the tests directory (e.g., tests/trainer/cluster_training_runtimes_test.go) and can be generalized to other tests using Gomega.
Applied to files:
tests/trainer/cluster_training_runtimes_test.go
📚 Learning: 2026-01-06T17:26:58.604Z
Learnt from: sutaakar
Repo: opendatahub-io/distributed-workloads PR: 579
File: tests/common/support/olm.go:1-25
Timestamp: 2026-01-06T17:26:58.604Z
Learning: In the distributed-workloads repository, Go imports should be formatted so there is a blank line between different third-party import groups (e.g., between github.com/operator-framework and k8s.io). Use openshift-goimports to format imports for Go files, ensuring the separation between distinct third-party groups is preserved. This convention applies to all Go files (e.g., tests/common/support/olm.go and beyond).
Applied to files:
tests/trainer/cluster_training_runtimes_test.go
🔇 Additional comments (3)
tests/trainer/resources/osft.ipynb (1)
402-402: Runtime name update looks good.The lookup and error message are aligned with the new runtime identifier.
tests/trainer/resources/sft.ipynb (1)
475-475: Runtime name update looks good.The runtime lookup and error string are consistent with the new runtime identifier.
tests/trainer/cluster_training_runtimes_test.go (1)
102-111: Registry-specific expected image logic looks good.The switch cleanly maps registries to the correct image source and handles unexpected registries.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| var expectedRuntimes = []ClusterTrainingRuntime{ | ||
| {Name: "torch-distributed", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | ||
| {Name: "torch-distributed-rocm", RHOAIImage: "odh-training-rocm64-torch28-py312-rhel9"}, | ||
| {Name: "torch-distributed-th03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | ||
| {Name: "training-hub", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | ||
| {Name: "training-hub03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | ||
| {Name: "torch-distributed", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | ||
| {Name: "torch-distributed-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | ||
| {Name: "torch-distributed-rocm", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"}, | ||
| {Name: "torch-distributed-rocm64-torch29-py312", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"}, | ||
| {Name: "training-hub", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | ||
| {Name: "training-hub-th05-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add ODHImage to ClusterTrainingRuntime to avoid compile failure.
Line 37 uses ODHImage in struct literals, but the struct definition only has Name and RHOAIImage, so this won’t compile.
✅ Suggested fix
type ClusterTrainingRuntime struct {
- Name string
- RHOAIImage string
+ Name string
+ ODHImage string
+ RHOAIImage string
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| var expectedRuntimes = []ClusterTrainingRuntime{ | |
| {Name: "torch-distributed", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | |
| {Name: "torch-distributed-rocm", RHOAIImage: "odh-training-rocm64-torch28-py312-rhel9"}, | |
| {Name: "torch-distributed-th03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | |
| {Name: "training-hub", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | |
| {Name: "training-hub03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"}, | |
| {Name: "torch-distributed", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | |
| {Name: "torch-distributed-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | |
| {Name: "torch-distributed-rocm", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"}, | |
| {Name: "torch-distributed-rocm64-torch29-py312", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"}, | |
| {Name: "training-hub", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | |
| {Name: "training-hub-th05-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"}, | |
| type ClusterTrainingRuntime struct { | |
| Name string | |
| ODHImage string | |
| RHOAIImage string | |
| } |
🤖 Prompt for AI Agents
In `@tests/trainer/cluster_training_runtimes_test.go` around lines 37 - 43, The
test uses ODHImage in ClusterTrainingRuntime literals but the
ClusterTrainingRuntime struct is missing that field; update the
ClusterTrainingRuntime type to add an ODHImage string field (alongside existing
Name and RHOAIImage) so the literals in tests like expectedRuntimes compile, and
ensure any constructors or usages of ClusterTrainingRuntime (e.g., in tests or
factory functions) are updated to accommodate the new ODHImage field.
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Updating the training runtimes used in the training hub (sft + osft) e2e tests since they have been updated here.
Description
How Has This Been Tested?
Not tested yet, will test by running tests on openshift cluster.
Merge criteria:
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.