Skip to content

Conversation

@Fiona-Waters
Copy link
Contributor

@Fiona-Waters Fiona-Waters commented Jan 16, 2026

Updating the training runtimes used in the training hub (sft + osft) e2e tests since they have been updated here.

Description

How Has This Been Tested?

Not tested yet, will test by running tests on openshift cluster.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Tests
    • Updated test suite to validate new training runtime versions with updated naming conventions (torch 2.9, Python 3.12).
    • Added registry-specific image validation logic for runtime configurations.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci openshift-ci bot requested review from kapil27 and pawelpaszki January 16, 2026 15:35
@openshift-ci
Copy link

openshift-ci bot commented Jan 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign briangallagher for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Walkthrough

Updates training runtime test data by introducing an ODHImage field to the ClusterTrainingRuntime struct, populating it with new runtime identifiers and images, and adding registry-specific image validation logic for RedHat and Quay.io registries.

Changes

Cohort / File(s) Summary
Test struct expansion
tests/trainer/cluster_training_runtimes_test.go
Adds ODHImage field to ClusterTrainingRuntime struct; updates expectedRuntimes data with new runtime names (torch-distributed-cuda128-torch29-py312, training-hub-th05-cuda128-torch29-py312) and corresponding ODHImage values; implements registry-specific image validation for registry.redhat.io (uses RHOAIImage) and quay.io (uses ODHImage)
Runtime identifier updates
tests/trainer/resources/osft.ipynb, tests/trainer/resources/sft.ipynb
Updates selected training runtime references from "training-hub03-cuda128-torch28-py312" to "training-hub-th05-cuda128-torch29-py312"; adjusts corresponding error messages

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 The runtimes dance with images new,
ODHImage fields added to the view,
Training-hub paths all updated with care,
Registry logic floating through the air! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: updating training runtimes used in training hub end-to-end tests, which is reflected across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/trainer/cluster_training_runtimes_test.go`:
- Around line 37-43: The test uses ODHImage in ClusterTrainingRuntime literals
but the ClusterTrainingRuntime struct is missing that field; update the
ClusterTrainingRuntime type to add an ODHImage string field (alongside existing
Name and RHOAIImage) so the literals in tests like expectedRuntimes compile, and
ensure any constructors or usages of ClusterTrainingRuntime (e.g., in tests or
factory functions) are updated to accommodate the new ODHImage field.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 226b7bc and d071bc7.

📒 Files selected for processing (3)
  • tests/trainer/cluster_training_runtimes_test.go
  • tests/trainer/resources/osft.ipynb
  • tests/trainer/resources/sft.ipynb
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-15T09:49:36.063Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/resources/osft.ipynb:460-493
Timestamp: 2025-12-15T09:49:36.063Z
Learning: In the Kubeflow Trainer SDK version used by this repository, get_job_logs(follow=False) returns a generator of log lines, just like when follow=True. When using this API, collect the generator into a list and then join the lines to form the full log string (e.g., logs = ''.join(list(get_job_logs(follow=False)))).

Applied to files:

  • tests/trainer/resources/osft.ipynb
  • tests/trainer/resources/sft.ipynb
📚 Learning: 2025-12-12T12:23:09.527Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/sdk_tests/osft_traininghub_tests.go:33-36
Timestamp: 2025-12-12T12:23:09.527Z
Learning: In Go tests, the working directory during go test is the package directory being tested. Relative paths (e.g., "resources/file.txt") resolve to paths under that package directory. When tests live in subdirectories (e.g., tests/trainer/sdk_tests/), referencing resources should still be relative to the package directory being tested (tests/trainer/), so ensure resource paths are resolved from the package root of each test package.

Applied to files:

  • tests/trainer/cluster_training_runtimes_test.go
📚 Learning: 2025-12-16T07:55:49.810Z
Learnt from: ChughShilpa
Repo: opendatahub-io/distributed-workloads PR: 568
File: tests/trainer/cluster_training_runtimes_test.go:100-101
Timestamp: 2025-12-16T07:55:49.810Z
Learning: In Go tests using Gomega, ContainSubstring(substr string, args ...any) supports variadic args that are used with fmt.Sprintf to format the substring. This means you can pass formatting operands such as ContainSubstring("%s/rhoai/%s", registryName, expectedRuntime.Image) and have the matcher evaluate the formatted substring. Ensure you import fmt only if you actually format strings beforehand; otherwise rely on the variadic args as shown. This pattern applies to test files under the tests directory (e.g., tests/trainer/cluster_training_runtimes_test.go) and can be generalized to other tests using Gomega.

Applied to files:

  • tests/trainer/cluster_training_runtimes_test.go
📚 Learning: 2026-01-06T17:26:58.604Z
Learnt from: sutaakar
Repo: opendatahub-io/distributed-workloads PR: 579
File: tests/common/support/olm.go:1-25
Timestamp: 2026-01-06T17:26:58.604Z
Learning: In the distributed-workloads repository, Go imports should be formatted so there is a blank line between different third-party import groups (e.g., between github.com/operator-framework and k8s.io). Use openshift-goimports to format imports for Go files, ensuring the separation between distinct third-party groups is preserved. This convention applies to all Go files (e.g., tests/common/support/olm.go and beyond).

Applied to files:

  • tests/trainer/cluster_training_runtimes_test.go
🔇 Additional comments (3)
tests/trainer/resources/osft.ipynb (1)

402-402: Runtime name update looks good.

The lookup and error message are aligned with the new runtime identifier.

tests/trainer/resources/sft.ipynb (1)

475-475: Runtime name update looks good.

The runtime lookup and error string are consistent with the new runtime identifier.

tests/trainer/cluster_training_runtimes_test.go (1)

102-111: Registry-specific expected image logic looks good.

The switch cleanly maps registries to the correct image source and handles unexpected registries.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 37 to +43
var expectedRuntimes = []ClusterTrainingRuntime{
{Name: "torch-distributed", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "torch-distributed-rocm", RHOAIImage: "odh-training-rocm64-torch28-py312-rhel9"},
{Name: "torch-distributed-th03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "training-hub", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "training-hub03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "torch-distributed", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "torch-distributed-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "torch-distributed-rocm", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"},
{Name: "torch-distributed-rocm64-torch29-py312", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"},
{Name: "training-hub", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "training-hub-th05-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add ODHImage to ClusterTrainingRuntime to avoid compile failure.

Line 37 uses ODHImage in struct literals, but the struct definition only has Name and RHOAIImage, so this won’t compile.

✅ Suggested fix
 type ClusterTrainingRuntime struct {
-	Name       string
-	RHOAIImage string
+	Name       string
+	ODHImage   string
+	RHOAIImage string
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var expectedRuntimes = []ClusterTrainingRuntime{
{Name: "torch-distributed", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "torch-distributed-rocm", RHOAIImage: "odh-training-rocm64-torch28-py312-rhel9"},
{Name: "torch-distributed-th03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "training-hub", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "training-hub03-cuda128-torch28-py312", RHOAIImage: "odh-training-cuda128-torch28-py312-rhel9"},
{Name: "torch-distributed", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "torch-distributed-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "torch-distributed-rocm", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"},
{Name: "torch-distributed-rocm64-torch29-py312", ODHImage: "opendatahub/odh-training-rocm64-torch29-py312", RHOAIImage: "odh-training-rocm64-torch29-py312-rhel9"},
{Name: "training-hub", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
{Name: "training-hub-th05-cuda128-torch29-py312", ODHImage: "opendatahub/odh-training-cuda128-torch29-py312", RHOAIImage: "odh-training-cuda128-torch29-py312-rhel9"},
type ClusterTrainingRuntime struct {
Name string
ODHImage string
RHOAIImage string
}
🤖 Prompt for AI Agents
In `@tests/trainer/cluster_training_runtimes_test.go` around lines 37 - 43, The
test uses ODHImage in ClusterTrainingRuntime literals but the
ClusterTrainingRuntime struct is missing that field; update the
ClusterTrainingRuntime type to add an ODHImage string field (alongside existing
Name and RHOAIImage) so the literals in tests like expectedRuntimes compile, and
ensure any constructors or usages of ClusterTrainingRuntime (e.g., in tests or
factory functions) are updated to accommodate the new ODHImage field.

@openshift-merge-robot
Copy link

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants