
Conversation

Contributor

@MStokluska MStokluska commented Nov 6, 2025

Add initial e2e sanity test for the Kubeflow SDK

Description

As part of this work we are adding the initial e2e test for the Kubeflow SDK (two-node CPU Fashion-MNIST training).
The goal is to make it as easy as possible to extend with further notebook-based tests.

How Has This Been Tested?

  • Provisioned my cluster and installed trainer v2 with CRDs
  • Created a ClusterTrainingRuntime CR called "universal" that relies on an early build of the universal image
  • Ran the tests following the trainer README with the following environment variables:
    -- "TEST_TIER": "Sanity"
    -- "ODH_NAMESPACE": "opendatahub"
    -- "NOTEBOOK_USER_NAME": "xyz"
    -- "NOTEBOOK_USER_TOKEN": "some_token"
    -- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: the PR will not work without the trainer v2 controller, its CRDs, and a custom ClusterTrainingRuntime that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2).

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests

    • Added distributed training sanity tests for Kubeflow trainer functionality.
    • Added Fashion‑MNIST distributed training validation test that runs a notebook end‑to‑end.
    • Added utilities for cluster preparation and notebook lifecycle management used by tests.
  • Chores

    • Updated .gitignore to exclude VSCode configuration directory.

@openshift-ci openshift-ci bot requested a review from kapil27 November 6, 2025 09:30

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from pawelpaszki November 6, 2025 09:30

coderabbitai bot commented Nov 6, 2025

Walkthrough

Adds end-to-end Kubeflow distributed FashionMNIST tests: a Jupyter notebook, Kubernetes-based test orchestration, cluster and RBAC preparation utilities, notebook lifecycle helpers, a sanity test entry, and a .gitignore entry to exclude .vscode/.

Changes

Cohort / File(s) Summary
Configuration
.gitignore
Added VSCode directory ignore pattern (.vscode/*).
Test Entry Point
tests/trainer/kubeflow_sdk_test.go
New sanity test function TestKubeflowSDK_Sanity that registers Sanity and invokes the FashionMNIST distributed training test.
Training Notebook
tests/trainer/resources/mnist.ipynb
New Jupyter notebook implementing distributed FashionMNIST training (PyTorch model, NCCL/Gloo setup, DistributedSampler, SGD loop) and Kubeflow trainer client submission/monitoring.
Test Implementation
tests/trainer/sdk_tests/fashion_mnist_tests.go
New orchestration test RunFashionMnistCpuDistributedTraining: namespace creation, cluster readiness checks, RBAC/setup, package notebook into ConfigMap, create PVC, deploy Notebook CR with Papermill command, wait for pod, poll completion marker, and cleanup.
Cluster Prep Utilities
tests/trainer/utils/utils_cluster_prep.go
New helpers EnsureTrainerClusterReady and EnsureNotebookRBAC to check JobSet CRD, trainer controller readiness, expected runtimes, and to create ServiceAccount/ClusterRole/Role bindings for notebooks.
Notebook Utilities
tests/trainer/utils/utils_notebook.go
New helpers: BuildPapermillShellCmd, CreateNotebookFromBytes, WaitForNotebookPodRunning, and PollNotebookCompletionMarker to build papermill commands, create ConfigMap+Notebook CR, wait for notebook pod, and poll SUCCESS/FAILURE marker via kubectl exec.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Runner
    participant K8s as Kubernetes Cluster
    participant Notebook as Notebook Pod
    participant Trainer as Kubeflow Trainer

    Test->>K8s: Ensure JobSet CRD & trainer controller ready
    Test->>K8s: Create test namespace
    Test->>K8s: Ensure Notebook RBAC (SA, ClusterRole, Role)
    Test->>K8s: Create ConfigMap (notebook bytes) & PVC
    Test->>K8s: Deploy Notebook CR with Papermill command

    rect rgb(235,245,255)
    Note over Notebook,Trainer: Notebook execution and trainer submission
    Notebook->>Notebook: Install papermill + extra packages
    Notebook->>Trainer: Query runtimes & submit CustomTrainer
    Trainer->>Trainer: Run distributed training (workers)
    Trainer->>Notebook: Stream job status/logs
    Notebook->>Notebook: Write completion marker (SUCCESS/FAILURE)
    end

    Test->>Notebook: Poll marker file until SUCCESS/FAILURE
    alt SUCCESS
        Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
    else FAILURE or timeout
        Test->>K8s: Gather logs, delete resources
    end
    Test->>Test: Report test result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • tests/trainer/resources/mnist.ipynb — distributed PyTorch init, barriers, and trainer client usage.
    • tests/trainer/sdk_tests/fashion_mnist_tests.go — Kubernetes resource lifecycle, timeouts, and cleanup correctness.
    • tests/trainer/utils/utils_cluster_prep.go and tests/trainer/utils/utils_notebook.go — RBAC creation, CRD checks, polling logic, and kubectl exec interactions.

Poem

🐰 In a cluster of pods I hop and spin,
Papermill whirs as the notebooks begin,
Workers sync tightly, gradients take flight,
Fashion-MNIST learns through day and night,
A rabbit's small test hops into the light.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: adding initial end-to-end SDK sanity tests with supporting test infrastructure and utilities.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which meets the required threshold of 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)

53-55: Consider making expected runtimes configurable.

The hardcoded list of ClusterTrainingRuntimes (torch-cuda-241, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change.

Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+	if len(expectedRuntimes) == 0 {
+		// Default runtimes if none specified
+		expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+	}
 	...
-	for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+	for _, name := range expectedRuntimes {

64-66: Hardcoded ServiceAccount name reduces reusability.

The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches common.NOTEBOOK_CONTAINER_NAME, but this tight coupling limits flexibility for other test scenarios.

Consider accepting the SA name as a parameter:

-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
 	t.Helper()
-	saName := "jupyter-nb-kube-3aadmin"

Alternatively, if this SA name is always tied to notebooks, accept it from common.NOTEBOOK_CONTAINER_NAME explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)

17-125: Consider adding training validation.

The training function completes successfully but doesn't validate that the model actually learned (e.g., checking loss decreased or accuracy improved). For a sanity test, consider adding basic validation.

For example, after training completes, you could add:

# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
tests/trainer/utils/utils_notebook.go (1)

49-53: Unused helper function.

CreateNotebookFromBytes is defined but not used anywhere in this PR. The test in fashion_mnist_tests.go creates the ConfigMap and Notebook separately.

Consider either:

  1. Using this helper in fashion_mnist_tests.go to reduce duplication
  2. Removing it if it's not needed yet
  3. Adding a comment explaining it's for future use
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2595866 and 796d164.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/kubeflow_sdk_test.go (2)
tests/common/test_tag.go (2)
  • Tags (32-40)
  • Sanity (48-50)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)
  • RunFashionMnistCpuDistributedTraining (39-86)
tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (2)
  • EnsureTrainerClusterReady (33-56)
  • EnsureNotebookRBAC (60-83)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (3)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/trainer/utils/utils_notebook.go (3)
  • BuildPapermillShellCmd (36-46)
  • WaitForNotebookPodRunning (56-64)
  • PollNotebookCompletionMarker (67-85)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_cluster_prep.go (4)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/rbac.go (4)
  • CreateClusterRole (48-70)
  • CreateClusterRoleBinding (135-169)
  • CreateRole (27-46)
  • CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

tests/trainer/utils/utils_cluster_prep.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🔇 Additional comments (5)
.gitignore (1)

2-2: LGTM!

Adding .vscode/* to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)

26-30: LGTM!

Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.

tests/trainer/sdk_tests/fashion_mnist_tests.go (2)

39-86: Well-structured test implementation.

The test follows good practices:

  • Proper test namespace isolation
  • Prerequisite checks before execution
  • RBAC setup
  • Resource cleanup with defer
  • Clear error messages

The flow is logical and the use of helper utilities keeps the test readable.


33-36: Relative path verified—notebook file found at expected location.

The notebook file exists at tests/trainer/resources/mnist.ipynb, confirming the relative path "resources/" resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)

36-46: LGTM!

The Papermill shell command construction is clean and handles optional extra packages appropriately. The set -e ensures failures are caught, and the marker file pattern is a good approach for async completion detection.

Comment on lines +156 to +160
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Hardcoded runtime name needs to be resolved.

The comment on line 158 indicates this uses a placeholder runtime name "universal" that needs updating. This conflicts with the hardcoded runtime checks in utils_cluster_prep.go (lines 53-54) which expect different runtime names.

The PR description mentions needing a custom clusterTrainingRuntime named "universal", but the cluster prep utility checks for different names. Ensure consistency between:

  1. The runtime name expected by this notebook (line 158)
  2. The runtimes validated in EnsureTrainerClusterReady (utils_cluster_prep.go:53)
  3. The actual runtime available in the test environment

Consider parameterizing the runtime name or documenting the expected setup more clearly.



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract and examine the notebook file structure and content around lines 156-160
fd -e ipynb -path '*/tests/trainer/resources/mnist.ipynb' --exec cat {} \; | python3 << 'EOF'
import json
import sys

try:
    data = json.load(sys.stdin)
    cells = data.get('cells', [])
    
    # Flatten all cells and count lines
    line_count = 0
    for cell_idx, cell in enumerate(cells):
        source = cell.get('source', [])
        if isinstance(source, str):
            source = source.split('\n')
        
        for line_idx, line in enumerate(source):
            line_count += 1
            # Print lines 150-165 to capture context around 156-160
            if 150 <= line_count <= 165:
                print(f"Line {line_count}: {line}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
EOF

Length of output: 217


🏁 Script executed:

#!/bin/bash
# Examine utils_cluster_prep.go around lines 53-54
fd -e go -name 'utils_cluster_prep.go' --exec cat {} \; | head -70 | tail -25

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Search for runtime-related configurations and definitions
rg -n "runtime" -i --type go -A2 -B2 | head -100

Length of output: 7893



---

🏁 Script executed:

```shell
#!/bin/bash
# Find and read the notebook file directly
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Once found, extract notebook content and look for "universal" runtime reference
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f -exec python3 << 'EOF' {} \;
import json
import sys

try:
    with open(sys.argv[1], 'r') as f:
        notebook = json.load(f)
    
    # Search through all cells for "universal" string
    for cell_idx, cell in enumerate(notebook.get('cells', [])):
        source = cell.get('source', [])
        if isinstance(source, str):
            source_lines = source.split('\n')
        else:
            source_lines = source
        
        for line_idx, line in enumerate(source_lines):
            if 'universal' in line.lower() or 'torch_runtime' in line.lower():
                print(f"Cell {cell_idx}, Line {line_idx}: {line}")
except Exception as e:
    print(f"Error processing {sys.argv[1]}: {e}", file=sys.stderr)
EOF

Length of output: 218


🏁 Script executed:

#!/bin/bash
# Also search codebase for references to "universal" runtime
rg -n "universal" --type go

Length of output: 113



---

🏁 Script executed:

```shell
#!/bin/bash
# Search notebook file for "universal" using grep/rg
rg -n "universal" ./tests/trainer/resources/mnist.ipynb

Length of output: 108


🏁 Script executed:

#!/bin/bash
# Search for torch_runtime usage in notebook
rg -n "torch_runtime\|runtime" ./tests/trainer/resources/mnist.ipynb | head -30

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Extract notebook with cat and parse JSON properly
cat ./tests/trainer/resources/mnist.ipynb | python3 << 'PYTHON_EOF'
import json
import sys

try:
    nb = json.load(sys.stdin)
    for idx, cell in enumerate(nb.get('cells', [])):
        source = cell.get('source', [])
        if isinstance(source, list):
            content = ''.join(source)
        else:
            content = source
        
        # Look for runtime-related content
        if 'universal' in content.lower() or 'torch_runtime' in content.lower():
            lines = content.split('\n')
            for line_num, line in enumerate(lines):
                if 'universal' in line.lower() or 'torch_runtime' in line.lower():
                    print(f"Cell {idx}: {line}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
PYTHON_EOF

Length of output: 48


Update notebook runtime selection to match validated test runtimes.

The notebook hardcodes a check for runtime.name == "universal" at line 158, but the test utility (utils_cluster_prep.go lines 53-54) validates only torch-cuda-241, torch-cuda-251, torch-rocm-241, and torch-rocm-251 runtimes are available. The "universal" runtime is marked as a future addition (see the TODO at utils_cluster_prep.go line 52). Since "universal" does not currently exist in the test environment, the notebook's torch_runtime variable will remain unset, causing failures in subsequent code.

Align the notebook to use one of the currently available runtimes (e.g., "torch-cuda-241") or parameterize it with a fallback mechanism.

🤖 Prompt for AI Agents
In tests/trainer/resources/mnist.ipynb around lines 156 to 160, the notebook
checks for runtime.name == "universal" which doesn't exist in the test
environment and leaves torch_runtime unset; change the selection to match
validated runtimes (for example "torch-cuda-241") or implement a fallback loop
that picks the first available runtime from the allowed set
["torch-cuda-241","torch-cuda-251","torch-rocm-241","torch-rocm-251"]; update
the condition to test membership in that list and set torch_runtime accordingly
so downstream cells always have a valid runtime.

@@ -0,0 +1,83 @@
/*
⚠️ Potential issue | 🔴 Critical

Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted.

Run make imports to fix the import organization as indicated by the pipeline failure.

🧰 Tools
🪛 GitHub Actions: Verify Generated Files and Import Organization

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 1 to 1 the import
statements are not properly organized causing the CI pipeline to fail; run the
project's import formatter (make imports) or manually sort and group imports
according to the project's gofmt/gofmtimports rules (standard libs first, blank
line, third-party, blank line, internal packages) and ensure there are no unused
imports, then save and re-run the pipeline.

Comment on lines +37 to +39
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}
@coderabbitai coderabbitai bot Nov 6, 2025

⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().

Apply this diff to use the Kubernetes API client:

-	if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-		t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
-	}
+	_, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+		Group:    "apiextensions.k8s.io",
+		Version:  "v1",
+		Resource: "customresourcedefinitions",
+	}).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+	test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).

Contributor Author

This is a good suggestion!
Will switch to it most likely but after comments from others.

@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.

Contributor

This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

If so, then we can keep it as it is now, so as not to invest much effort into something that will be deleted soon.

Comment on lines +42 to +44
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}
@coderabbitai coderabbitai bot Nov 6, 2025

⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.

Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.

Apply this diff to use the Kubernetes API client with Eventually for polling:

-	if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-		t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
-	}
+	odhNamespace := "opendatahub" // Consider making this configurable
+	test.Eventually(func(g Gomega) {
+		deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+		g.Expect(err).NotTo(HaveOccurred())
+		g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+			"Type":   Equal(appsv1.DeploymentAvailable),
+			"Status": Equal(corev1.ConditionTrue),
+		})))
+	}, 180*time.Second).Should(Succeed(), "Trainer controller not available")

Committable suggestion skipped: line range outside the PR's diff.

Contributor Author

same as above

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50

Length of output: 25


🏁 Script executed:

#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100

Length of output: 18212


@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.

🐰✨

// Ensure ServiceAccount exists
saName := "jupyter-nb-kube-3aadmin"
sa := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: saName, Namespace: namespace}}
_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
⚠️ Potential issue | 🟡 Minor

Ignoring Create error could mask issues.

The error from ServiceAccounts(...).Create(...) is silently ignored, which might hide actual problems (e.g., permission issues).

Consider logging the error or checking if it's an AlreadyExists error:

-	_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+	_, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+	if err != nil && !errors.IsAlreadyExists(err) {
+		test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
_, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
if err != nil && !errors.IsAlreadyExists(err) {
test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
}
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around line 66, the code currently
ignores the error returned by ServiceAccounts(...).Create(...); capture the
returned error, check if it's nil or an apierrors.IsAlreadyExists(err) and only
treat non-nil/non-AlreadyExists errors as failures (fail the test or log the
error), otherwise proceed; update the call to handle the error accordingly (use
t.Fatalf/t.Fatalf-like helper or test.Logger.Errorf) so permission or creation
failures are not silently swallowed.

Comment on lines +67 to +85
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
out, err := exec.Command("kubectl", "-n", namespace, "exec", podName, "-c", containerName, "--", "cat", marker).CombinedOutput()
if err != nil {
return false
}
switch strings.TrimSpace(string(out)) {
case "SUCCESS":
return true
case "FAILURE":
finalErr = fmt.Errorf("Notebook execution failed")
return true
default:
return false
}
}, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
return finalErr
}
⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) to read the marker file is brittle and adds external dependencies.

Consider using the Kubernetes API to read the file contents via the pod exec subresource:

func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
		req := test.Client().Core().CoreV1().RESTClient().
			Get().
			Namespace(namespace).
			Resource("pods").
			Name(podName).
			SubResource("exec").
			VersionedParams(&corev1.PodExecOptions{
				Container: containerName,
				Command:   []string{"cat", marker},
				Stdout:    true,
				Stderr:    true,
			}, scheme.ParameterCodec)
		
		exec, err := remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL())
		if err != nil {
			return false
		}
		
		var stdout, stderr bytes.Buffer
		err = exec.Stream(remotecommand.StreamOptions{
			Stdout: &stdout,
			Stderr: &stderr,
		})
		if err != nil {
			return false
		}
		
		switch strings.TrimSpace(stdout.String()) {
		case "SUCCESS":
			return true
		case "FAILURE":
			finalErr = fmt.Errorf("Notebook execution failed")
			return true
		default:
			return false
		}
	}, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
	return finalErr
}
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_notebook.go around lines 67 to 85, replace the
brittle exec.Command("kubectl", ...) call with a Kubernetes client-based pod
exec using test.Client() and the pod exec subresource: build a REST GET request
against pods/{name}/exec with corev1.PodExecOptions (container, command ["cat",
marker], Stdout/Stderr true), create an SPDY executor with
remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL()), stream into
bytes.Buffer for stdout/stderr, handle executor creation/stream errors by
returning false (and record finalErr when stdout trimmed equals "FAILURE"),
return true on "SUCCESS" or "FAILURE", and ensure you add necessary imports
(bytes, corev1, scheme, remotecommand) and preserve the existing
Eventually/timeout behavior and return finalErr at the end.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)

17-32: Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted. Run make imports to fix the import organization.


66-85: Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.

🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)

55-64: Eliminate redundant GetPods call.

Lines 62-63 repeat the same GetPods call that was already executed in the Eventually block (lines 58-60), which is inefficient.

Apply this diff to store and reuse the pods result:

 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
 	labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+	var pods []corev1.Pod
 	test.Eventually(func() []corev1.Pod {
-		return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		return pods
 	}, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
 
-	pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
 	return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796d164 and 4e71919.

📒 Files selected for processing (2)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)

34-46: LGTM! Papermill command construction is sound.

The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses sleep infinity to keep the container running for inspection. The use of set -e ensures proper error propagation.


48-53: LGTM! Notebook setup orchestration is correct.

The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.

" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=True,\n",
Contributor

This will become an issue when running tests on disconnected clusters.
In the Trainer v1 tests we uploaded the dataset to AWS S3, and it is downloaded from there when the AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67
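The source-selection logic described above can be sketched as a small helper. This is only an illustration: the env variable names below are assumptions, not necessarily the exact ones the v1 script reads.

```python
import os

# Hypothetical env variable names -- the actual v1 script may read different ones.
REQUIRED_S3_VARS = ("AWS_DEFAULT_ENDPOINT", "AWS_ACCESS_KEY_ID",
                    "AWS_SECRET_ACCESS_KEY", "AWS_STORAGE_BUCKET")

def dataset_source(env=None):
    """Return 's3' when all AWS variables are set (disconnected clusters),
    otherwise 'web' to let torchvision download Fashion-MNIST directly."""
    env = os.environ if env is None else env
    return "s3" if all(env.get(v) for v in REQUIRED_S3_VARS) else "web"
```

With this shape, the notebook only needs one branch point before the `datasets.FashionMNIST(...)` call.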

"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
Contributor

Why this approach instead of getting the universal runtime via client.get_runtime("universal")?

Contributor Author

That notebook is basically a copy of the upstream test that the SDK relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!
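The two lookup styles can be contrasted with a minimal stand-in client. This is not the real SDK; `list_runtimes`/`get_runtime` are modeled on the method names discussed above purely for illustration.

```python
from types import SimpleNamespace

class FakeTrainerClient:
    """Minimal stand-in for the SDK client, just to contrast the two lookups."""
    def __init__(self, runtimes):
        self._runtimes = runtimes

    def list_runtimes(self):
        return list(self._runtimes)

    def get_runtime(self, name):
        # Direct lookup -- what the review suggests instead of iterating.
        for r in self._runtimes:
            if r.name == name:
                return r
        raise KeyError(name)

client = FakeTrainerClient([SimpleNamespace(name="torch-cuda-251"),
                            SimpleNamespace(name="universal")])

# Original notebook style: iterate and match.
torch_runtime = None
for runtime in client.list_runtimes():
    if runtime.name == "universal":
        torch_runtime = runtime

# Suggested style: one call, and it fails loudly when the runtime is missing.
assert client.get_runtime("universal") is torch_runtime
```

A direct lookup also avoids silently leaving `torch_runtime` as `None` when the runtime is absent.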

" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" },\n",
" packages_to_install=[\"torchvision\"],\n",
Contributor

I think it would be good to install a specific version, to make sure that a future upgrade doesn't break the test.

Contributor Author

Yes, I think it makes sense. Thanks.
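One lightweight way to guard against unpinned dependencies is a small check over `packages_to_install`. The version number below is illustrative only, not a recommendation.

```python
import re

# Accept only exact pkg==x.y.z pins; anything looser fails the check.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==\d[\w.\-]*$")

def all_pinned(packages):
    """True when every entry in packages_to_install carries an exact version."""
    return all(PINNED.match(p) for p in packages)

assert all_pinned(["torchvision==0.19.1"])   # pinned: a future upgrade can't break the test
assert not all_pinned(["torchvision"])       # unpinned: rejected
```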

}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
Contributor

A test verifying which TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?

Contributor Author

Initially I was planning on making this test a smoke test. Given that your runtime check is also a smoke test, I wasn't sure which one would run first, so I decided to add the additional check to capture a missing-runtime error if it happens.
But since I've moved it to sanity, and I'm assuming sanity will run after smoke, I guess it's fine to remove this check? WDYT?

Contributor

Yes, can be removed


// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
Contributor

I would prefer passing the user token into the SDK client to provide the needed access.
IMHO that is the approach the majority of customers would take.

Contributor Author

Agree, thanks Karel.

// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
Contributor

I'm thinking about whether it would be simpler to print the status directly into the Workbench log.

It could potentially mix with other log messages produced by the Workbench, but it would avoid having to access the Pod filesystem.
WDYT?
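If the notebook printed its result to stdout instead of a marker file, the poller would only need to scan a pod-log snapshot for a sentinel. The sentinel strings below are assumptions, and the log-fetching step itself (e.g. client-go `GetLogs` or `kubectl logs`) is omitted.

```python
SUCCESS_SENTINEL = "NOTEBOOK_RESULT: SUCCESS"   # assumed marker printed by the notebook
FAILURE_SENTINEL = "NOTEBOOK_RESULT: FAILURE"

def scan_log(lines):
    """Classify a pod-log snapshot: 'success', 'failure', or None (keep polling).
    Scans from the end, since the sentinel is printed last."""
    for line in reversed(list(lines)):
        if SUCCESS_SENTINEL in line:
            return "success"
        if FAILURE_SENTINEL in line:
            return "failure"
    return None
```

Using unusual sentinel strings keeps false positives from ordinary Workbench log lines unlikely.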


// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

Contributor

Would it make sense to add an assertion checking that the TrainJob is created and successfully finished?

Contributor Author

Yes, I think this is very important :)
Thanks
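The suggested assertion could reduce to checking the TrainJob's status conditions. The `Complete` condition type follows the trainer v1alpha1 API, but treat the exact field names here as assumptions.

```python
def trainjob_succeeded(status):
    """True when the TrainJob status carries a Complete=True condition.
    Condition type names are assumed from the trainer v1alpha1 API."""
    for cond in (status or {}).get("conditions", []):
        if cond.get("type") == "Complete" and cond.get("status") == "True":
            return True
    return False
```

In the Go test this would sit inside an `Eventually` block polling the TrainJob object until the condition appears or a timeout fires.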
