
Conversation

Contributor

@MStokluska MStokluska commented Nov 6, 2025

Add initial e2e sanity test for the Kubeflow SDK

Description

As part of this work we are adding the initial e2e test for the Kubeflow SDK (two-node CPU Fashion-MNIST training).
The goal is to make it as easy as possible to extend with further notebook-based tests.

How Has This Been Tested?

  • Provisioned my cluster and installed trainer v2 with CRDs
  • Created a ClusterTrainingRuntime CR called "universal" that relies on an early build of the universal image
  • Ran the tests following the trainer README with the following environment variables:
    -- "TEST_TIER": "Sanity"
    -- "ODH_NAMESPACE": "opendatahub"
    -- "NOTEBOOK_USER_NAME": "xyz"
    -- "NOTEBOOK_USER_TOKEN": "some_token"
    -- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: the PR will not work without the trainer v2 controller, its CRDs, and a custom ClusterTrainingRuntime that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2).

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests

    • Added distributed training sanity tests for Kubeflow trainer functionality.
    • Added Fashion‑MNIST distributed training validation test that runs a notebook end‑to‑end.
    • Added utilities for cluster preparation and notebook lifecycle management used by tests.
  • Chores

    • Updated .gitignore to exclude VSCode configuration directory.

@openshift-ci openshift-ci bot requested a review from kapil27 November 6, 2025 09:30

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from pawelpaszki November 6, 2025 09:30

coderabbitai bot commented Nov 6, 2025

Walkthrough

Adds end-to-end Kubeflow distributed FashionMNIST tests: a Jupyter notebook, Kubernetes-based test orchestration, cluster and RBAC preparation utilities, notebook lifecycle helpers, a sanity test entry, and a .gitignore entry to exclude .vscode/.

Changes

Cohort / File(s) Summary
Configuration
.gitignore
Added VSCode directory ignore pattern (.vscode/*).
Test Entry Point
tests/trainer/kubeflow_sdk_test.go
New sanity test function TestKubeflowSDK_Sanity that registers Sanity and invokes the FashionMNIST distributed training test.
Training Notebook
tests/trainer/resources/mnist.ipynb
New Jupyter notebook implementing distributed FashionMNIST training (PyTorch model, NCCL/Gloo setup, DistributedSampler, SGD loop) and Kubeflow trainer client submission/monitoring.
Test Implementation
tests/trainer/sdk_tests/fashion_mnist_tests.go
New orchestration test RunFashionMnistCpuDistributedTraining: namespace creation, cluster readiness checks, RBAC/setup, package notebook into ConfigMap, create PVC, deploy Notebook CR with Papermill command, wait for pod, poll completion marker, and cleanup.
Cluster Prep Utilities
tests/trainer/utils/utils_cluster_prep.go
New helpers EnsureTrainerClusterReady and EnsureNotebookRBAC to check JobSet CRD, trainer controller readiness, expected runtimes, and to create ServiceAccount/ClusterRole/Role bindings for notebooks.
Notebook Utilities
tests/trainer/utils/utils_notebook.go
New helpers: BuildPapermillShellCmd, CreateNotebookFromBytes, WaitForNotebookPodRunning, and PollNotebookCompletionMarker to build papermill commands, create ConfigMap+Notebook CR, wait for notebook pod, and poll SUCCESS/FAILURE marker via kubectl exec.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Runner
    participant K8s as Kubernetes Cluster
    participant Notebook as Notebook Pod
    participant Trainer as Kubeflow Trainer

    Test->>K8s: Ensure JobSet CRD & trainer controller ready
    Test->>K8s: Create test namespace
    Test->>K8s: Ensure Notebook RBAC (SA, ClusterRole, Role)
    Test->>K8s: Create ConfigMap (notebook bytes) & PVC
    Test->>K8s: Deploy Notebook CR with Papermill command

    rect rgb(235,245,255)
    Note over Notebook,Trainer: Notebook execution and trainer submission
    Notebook->>Notebook: Install papermill + extra packages
    Notebook->>Trainer: Query runtimes & submit CustomTrainer
    Trainer->>Trainer: Run distributed training (workers)
    Trainer->>Notebook: Stream job status/logs
    Notebook->>Notebook: Write completion marker (SUCCESS/FAILURE)
    end

    Test->>Notebook: Poll marker file until SUCCESS/FAILURE
    alt SUCCESS
        Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
    else FAILURE or timeout
        Test->>K8s: Gather logs, delete resources
    end
    Test->>Test: Report test result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • tests/trainer/resources/mnist.ipynb — distributed PyTorch init, barriers, and trainer client usage.
    • tests/trainer/sdk_tests/fashion_mnist_tests.go — Kubernetes resource lifecycle, timeouts, and cleanup correctness.
    • tests/trainer/utils/utils_cluster_prep.go and tests/trainer/utils/utils_notebook.go — RBAC creation, CRD checks, polling logic, and kubectl exec interactions.

Poem

🐰 In a cluster of pods I hop and spin,
Papermill whirs as the notebooks begin,
Workers sync tightly, gradients take flight,
Fashion-MNIST learns through day and night,
A rabbit's small test hops into the light.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: adding initial end-to-end SDK sanity tests with supporting test infrastructure and utilities.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which meets the required threshold of 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)

53-55: Consider making expected runtimes configurable.

The hardcoded list of ClusterTrainingRuntimes (torch-cuda-241, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change.

Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+	if len(expectedRuntimes) == 0 {
+		// Default runtimes if none specified
+		expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+	}
 	...
-	for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+	for _, name := range expectedRuntimes {

64-66: Hardcoded ServiceAccount name reduces reusability.

The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches common.NOTEBOOK_CONTAINER_NAME, but this tight coupling limits flexibility for other test scenarios.

Consider accepting the SA name as a parameter:

-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
 	t.Helper()
-	saName := "jupyter-nb-kube-3aadmin"

Alternatively, if this SA name is always tied to notebooks, accept it from common.NOTEBOOK_CONTAINER_NAME explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)

17-125: Consider adding training validation.

The training function completes successfully but doesn't validate that the model actually learned (e.g., checking loss decreased or accuracy improved). For a sanity test, consider adding basic validation.

For example, after training completes, you could add:

# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
tests/trainer/utils/utils_notebook.go (1)

49-53: Unused helper function.

CreateNotebookFromBytes is defined but not used anywhere in this PR. The test in fashion_mnist_tests.go creates the ConfigMap and Notebook separately.

Consider either:

  1. Using this helper in fashion_mnist_tests.go to reduce duplication
  2. Removing it if it's not needed yet
  3. Adding a comment explaining it's for future use
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2595866 and 796d164.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/kubeflow_sdk_test.go (2)
tests/common/test_tag.go (2)
  • Tags (32-40)
  • Sanity (48-50)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)
  • RunFashionMnistCpuDistributedTraining (39-86)
tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (2)
  • EnsureTrainerClusterReady (33-56)
  • EnsureNotebookRBAC (60-83)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (3)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/trainer/utils/utils_notebook.go (3)
  • BuildPapermillShellCmd (36-46)
  • WaitForNotebookPodRunning (56-64)
  • PollNotebookCompletionMarker (67-85)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_cluster_prep.go (4)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/rbac.go (4)
  • CreateClusterRole (48-70)
  • CreateClusterRoleBinding (135-169)
  • CreateRole (27-46)
  • CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

tests/trainer/utils/utils_cluster_prep.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🔇 Additional comments (5)
.gitignore (1)

2-2: LGTM!

Adding .vscode/* to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)

26-30: LGTM!

Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.

tests/trainer/sdk_tests/fashion_mnist_tests.go (2)

39-86: Well-structured test implementation.

The test follows good practices:

  • Proper test namespace isolation
  • Prerequisite checks before execution
  • RBAC setup
  • Resource cleanup with defer
  • Clear error messages

The flow is logical and the use of helper utilities keeps the test readable.


33-36: Relative path verified—notebook file found at expected location.

The notebook file exists at tests/trainer/resources/mnist.ipynb, confirming the relative path "resources/" resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)

36-46: LGTM!

The Papermill shell command construction is clean and handles optional extra packages appropriately. The set -e ensures failures are caught, and the marker file pattern is a good approach for async completion detection.

Comment on lines +156 to +160
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Hardcoded runtime name needs to be resolved.

The comment on line 158 indicates this uses a placeholder runtime name "universal" that needs updating. This conflicts with the hardcoded runtime checks in utils_cluster_prep.go (lines 53-54) which expect different runtime names.

The PR description mentions needing a custom clusterTrainingRuntime named "universal", but the cluster prep utility checks for different names. Ensure consistency between:

  1. The runtime name expected by this notebook (line 158)
  2. The runtimes validated in EnsureTrainerClusterReady (utils_cluster_prep.go:53)
  3. The actual runtime available in the test environment

Consider parameterizing the runtime name or documenting the expected setup more clearly.



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract and examine the notebook file structure and content around lines 156-160
fd -e ipynb -path '*/tests/trainer/resources/mnist.ipynb' --exec cat {} \; | python3 << 'EOF'
import json
import sys

try:
    data = json.load(sys.stdin)
    cells = data.get('cells', [])
    
    # Flatten all cells and count lines
    line_count = 0
    for cell_idx, cell in enumerate(cells):
        source = cell.get('source', [])
        if isinstance(source, str):
            source = source.split('\n')
        
        for line_idx, line in enumerate(source):
            line_count += 1
            # Print lines 150-165 to capture context around 156-160
            if 150 <= line_count <= 165:
                print(f"Line {line_count}: {line}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
EOF

Length of output: 217


🏁 Script executed:

#!/bin/bash
# Examine utils_cluster_prep.go around lines 53-54
fd -e go -name 'utils_cluster_prep.go' --exec cat {} \; | head -70 | tail -25

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Search for runtime-related configurations and definitions
rg -n "runtime" -i --type go -A2 -B2 | head -100

Length of output: 7893



---

🏁 Script executed:

```shell
#!/bin/bash
# Find and read the notebook file directly
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Once found, extract notebook content and look for "universal" runtime reference
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f -exec python3 << 'EOF' {} \;
import json
import sys

try:
    with open(sys.argv[1], 'r') as f:
        notebook = json.load(f)
    
    # Search through all cells for "universal" string
    for cell_idx, cell in enumerate(notebook.get('cells', [])):
        source = cell.get('source', [])
        if isinstance(source, str):
            source_lines = source.split('\n')
        else:
            source_lines = source
        
        for line_idx, line in enumerate(source_lines):
            if 'universal' in line.lower() or 'torch_runtime' in line.lower():
                print(f"Cell {cell_idx}, Line {line_idx}: {line}")
except Exception as e:
    print(f"Error processing {sys.argv[1]}: {e}", file=sys.stderr)
EOF

Length of output: 218


🏁 Script executed:

#!/bin/bash
# Also search codebase for references to "universal" runtime
rg -n "universal" --type go

Length of output: 113



---

🏁 Script executed:

```shell
#!/bin/bash
# Search notebook file for "universal" using grep/rg
rg -n "universal" ./tests/trainer/resources/mnist.ipynb

Length of output: 108


🏁 Script executed:

#!/bin/bash
# Search for torch_runtime usage in notebook
rg -n "torch_runtime\|runtime" ./tests/trainer/resources/mnist.ipynb | head -30

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Extract notebook with cat and parse JSON properly
cat ./tests/trainer/resources/mnist.ipynb | python3 << 'PYTHON_EOF'
import json
import sys

try:
    nb = json.load(sys.stdin)
    for idx, cell in enumerate(nb.get('cells', [])):
        source = cell.get('source', [])
        if isinstance(source, list):
            content = ''.join(source)
        else:
            content = source
        
        # Look for runtime-related content
        if 'universal' in content.lower() or 'torch_runtime' in content.lower():
            lines = content.split('\n')
            for line_num, line in enumerate(lines):
                if 'universal' in line.lower() or 'torch_runtime' in line.lower():
                    print(f"Cell {idx}: {line}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
PYTHON_EOF

Length of output: 48


Update notebook runtime selection to match validated test runtimes.

The notebook hardcodes a check for runtime.name == "universal" at line 158, but the test utility (utils_cluster_prep.go lines 53-54) validates only torch-cuda-241, torch-cuda-251, torch-rocm-241, and torch-rocm-251 runtimes are available. The "universal" runtime is marked as a future addition (see the TODO at utils_cluster_prep.go line 52). Since "universal" does not currently exist in the test environment, the notebook's torch_runtime variable will remain unset, causing failures in subsequent code.

Align the notebook to use one of the currently available runtimes (e.g., "torch-cuda-241") or parameterize it with a fallback mechanism.

🤖 Prompt for AI Agents
In tests/trainer/resources/mnist.ipynb around lines 156 to 160, the notebook
checks for runtime.name == "universal" which doesn't exist in the test
environment and leaves torch_runtime unset; change the selection to match
validated runtimes (for example "torch-cuda-241") or implement a fallback loop
that picks the first available runtime from the allowed set
["torch-cuda-241","torch-cuda-251","torch-rocm-241","torch-rocm-251"]; update
the condition to test membership in that list and set torch_runtime accordingly
so downstream cells always have a valid runtime.

@@ -0,0 +1,83 @@
/*
⚠️ Potential issue | 🔴 Critical

Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted.

Run make imports to fix the import organization as indicated by the pipeline failure.

🧰 Tools
🪛 GitHub Actions: Verify Generated Files and Import Organization

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 1 to 1 the import
statements are not properly organized causing the CI pipeline to fail; run the
project's import formatter (make imports) or manually sort and group imports
according to the project's gofmt/gofmtimports rules (standard libs first, blank
line, third-party, blank line, internal packages) and ensure there are no unused
imports, then save and re-run the pipeline.

Comment on lines +37 to +39
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}
@coderabbitai coderabbitai bot Nov 6, 2025

⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().

Apply this diff to use the Kubernetes API client:

-	if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-		t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
-	}
+	_, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+		Group:    "apiextensions.k8s.io",
+		Version:  "v1",
+		Resource: "customresourcedefinitions",
+	}).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+	test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).

Contributor Author

This is a good suggestion!
Will switch to it most likely but after comments from others.

@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.

Contributor

This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

If so, then we can keep it as it is now, so as not to invest much effort into something that will be deleted soon.

Comment on lines +42 to +44
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}
@coderabbitai coderabbitai bot Nov 6, 2025

⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.

Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.

Apply this diff to use the Kubernetes API client with Eventually for polling:

-	if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-		t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
-	}
+	odhNamespace := "opendatahub" // Consider making this configurable
+	test.Eventually(func(g Gomega) {
+		deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+		g.Expect(err).NotTo(HaveOccurred())
+		g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+			"Type":   Equal(appsv1.DeploymentAvailable),
+			"Status": Equal(corev1.ConditionTrue),
+		})))
+	}, 180*time.Second).Should(Succeed(), "Trainer controller not available")

Committable suggestion skipped: line range outside the PR's diff.

Contributor Author

same as above

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50

Length of output: 25


🏁 Script executed:

#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100

Length of output: 18212


@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.

🐰✨

// Ensure ServiceAccount exists
saName := "jupyter-nb-kube-3aadmin"
sa := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: saName, Namespace: namespace}}
_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
⚠️ Potential issue | 🟡 Minor

Ignoring Create error could mask issues.

The error from ServiceAccounts(...).Create(...) is silently ignored, which might hide actual problems (e.g., permission issues).

Consider logging the error or checking if it's an AlreadyExists error:

-	_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+	_, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+	if err != nil && !errors.IsAlreadyExists(err) {
+		test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
_, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
if err != nil && !errors.IsAlreadyExists(err) {
test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
}
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around line 66, the code currently
ignores the error returned by ServiceAccounts(...).Create(...); capture the
returned error, check if it's nil or an apierrors.IsAlreadyExists(err) and only
treat non-nil/non-AlreadyExists errors as failures (fail the test or log the
error), otherwise proceed; update the call to handle the error accordingly (use
t.Fatalf/t.Fatalf-like helper or test.Logger.Errorf) so permission or creation
failures are not silently swallowed.

Comment on lines +67 to +85
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
out, err := exec.Command("kubectl", "-n", namespace, "exec", podName, "-c", containerName, "--", "cat", marker).CombinedOutput()
if err != nil {
return false
}
switch strings.TrimSpace(string(out)) {
case "SUCCESS":
return true
case "FAILURE":
finalErr = fmt.Errorf("Notebook execution failed")
return true
default:
return false
}
}, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
return finalErr
}
⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) to read the marker file is brittle and adds external dependencies.

Consider using the Kubernetes API to read the file contents via the pod exec subresource:

func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
		req := test.Client().Core().CoreV1().RESTClient().
			Get().
			Namespace(namespace).
			Resource("pods").
			Name(podName).
			SubResource("exec").
			VersionedParams(&corev1.PodExecOptions{
				Container: containerName,
				Command:   []string{"cat", marker},
				Stdout:    true,
				Stderr:    true,
			}, scheme.ParameterCodec)
		
		exec, err := remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL())
		if err != nil {
			return false
		}
		
		var stdout, stderr bytes.Buffer
		err = exec.Stream(remotecommand.StreamOptions{
			Stdout: &stdout,
			Stderr: &stderr,
		})
		if err != nil {
			return false
		}
		
		switch strings.TrimSpace(stdout.String()) {
		case "SUCCESS":
			return true
		case "FAILURE":
			finalErr = fmt.Errorf("Notebook execution failed")
			return true
		default:
			return false
		}
	}, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
	return finalErr
}
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_notebook.go around lines 67 to 85, replace the
brittle exec.Command("kubectl", ...) call with a Kubernetes client-based pod
exec using test.Client() and the pod exec subresource: build a REST GET request
against pods/{name}/exec with corev1.PodExecOptions (container, command ["cat",
marker], Stdout/Stderr true), create an SPDY executor with
remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL()), stream into
bytes.Buffer for stdout/stderr, handle executor creation/stream errors by
returning false (and record finalErr when stdout trimmed equals "FAILURE"),
return true on "SUCCESS" or "FAILURE", and ensure you add necessary imports
(bytes, corev1, scheme, remotecommand) and preserve the existing
Eventually/timeout behavior and return finalErr at the end.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)

17-32: Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted. Run make imports to fix the import organization.


66-85: Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.

🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)

55-64: Eliminate redundant GetPods call.

Lines 62-63 repeat the same GetPods call that was already executed in the Eventually block (lines 58-60), which is inefficient.

Apply this diff to store and reuse the pods result:

 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
 	labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+	var pods []corev1.Pod
 	test.Eventually(func() []corev1.Pod {
-		return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		return pods
 	}, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
 
-	pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
 	return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796d164 and 4e71919.

📒 Files selected for processing (2)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)

34-46: LGTM! Papermill command construction is sound.

The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses sleep infinity to keep the container running for inspection. The use of set -e ensures proper error propagation.


48-53: LGTM! Notebook setup orchestration is correct.

The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.

" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=True,\n",
Contributor

This will become an issue when running tests on disconnected clusters.
In the Trainer v1 tests we uploaded the dataset to AWS S3, and it is downloaded from there when the AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67
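The source-selection logic described above can be sketched as a small helper. This is only an illustration: the env variable names below are assumptions, not necessarily the exact ones the v1 script reads.

```python
import os

# Hypothetical env variable names -- the actual v1 script may read different ones.
REQUIRED_S3_VARS = ("AWS_DEFAULT_ENDPOINT", "AWS_ACCESS_KEY_ID",
                    "AWS_SECRET_ACCESS_KEY", "AWS_STORAGE_BUCKET")

def dataset_source(env=None):
    """Return 's3' when all AWS variables are set (disconnected clusters),
    otherwise 'web' to let torchvision download Fashion-MNIST directly."""
    env = os.environ if env is None else env
    return "s3" if all(env.get(v) for v in REQUIRED_S3_VARS) else "web"
```

With this shape, the notebook only needs one branch point before the `datasets.FashionMNIST(...)` call.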

"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
Contributor

Why this approach instead of getting the universal runtime via client.get_runtime("universal")?

Contributor Author

That notebook is basically a copy of the upstream test that the SDK relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!
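The two lookup styles can be contrasted with a minimal stand-in client. This is not the real SDK; `list_runtimes`/`get_runtime` are modeled on the method names discussed above purely for illustration.

```python
from types import SimpleNamespace

class FakeTrainerClient:
    """Minimal stand-in for the SDK client, just to contrast the two lookups."""
    def __init__(self, runtimes):
        self._runtimes = runtimes

    def list_runtimes(self):
        return list(self._runtimes)

    def get_runtime(self, name):
        # Direct lookup -- what the review suggests instead of iterating.
        for r in self._runtimes:
            if r.name == name:
                return r
        raise KeyError(name)

client = FakeTrainerClient([SimpleNamespace(name="torch-cuda-251"),
                            SimpleNamespace(name="universal")])

# Original notebook style: iterate and match.
torch_runtime = None
for runtime in client.list_runtimes():
    if runtime.name == "universal":
        torch_runtime = runtime

# Suggested style: one call, and it fails loudly when the runtime is missing.
assert client.get_runtime("universal") is torch_runtime
```

A direct lookup also avoids silently leaving `torch_runtime` as `None` when the runtime is absent.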

" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" },\n",
" packages_to_install=[\"torchvision\"],\n",
Contributor

I think it would be good to install a specific version, to make sure that a future upgrade doesn't break the test.

Contributor Author

Yes, I think it makes sense. Thanks.
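One lightweight way to guard against unpinned dependencies is a small check over `packages_to_install`. The version number below is illustrative only, not a recommendation.

```python
import re

# Accept only exact pkg==x.y.z pins; anything looser fails the check.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==\d[\w.\-]*$")

def all_pinned(packages):
    """True when every entry in packages_to_install carries an exact version."""
    return all(PINNED.match(p) for p in packages)

assert all_pinned(["torchvision==0.19.1"])   # pinned: a future upgrade can't break the test
assert not all_pinned(["torchvision"])       # unpinned: rejected
```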

}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
Contributor

A test verifying which TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?

Contributor Author

Initially I was planning on making this test a smoke test. Given that your runtime check is also a smoke test, I wasn't sure which one would run first, so I decided to add the additional check to capture a missing-runtime error if it happens.
But since I've moved it to sanity, and I'm assuming sanity will run after smoke, I guess it's fine to remove this check? WDYT?

Contributor

Yes, can be removed


// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
Contributor

I would prefer passing the user token into the SDK client to provide the needed access.
IMHO that is the approach the majority of customers would take.

Contributor Author

Agree, thanks Karel.

// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
Contributor

I'm thinking about whether it would be simpler to print the status directly into the Workbench log.

It could potentially mix with other log messages produced by the Workbench, but it would avoid having to access the Pod filesystem.
WDYT?
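If the notebook printed its result to stdout instead of a marker file, the poller would only need to scan a pod-log snapshot for a sentinel. The sentinel strings below are assumptions, and the log-fetching step itself (e.g. client-go `GetLogs` or `kubectl logs`) is omitted.

```python
SUCCESS_SENTINEL = "NOTEBOOK_RESULT: SUCCESS"   # assumed marker printed by the notebook
FAILURE_SENTINEL = "NOTEBOOK_RESULT: FAILURE"

def scan_log(lines):
    """Classify a pod-log snapshot: 'success', 'failure', or None (keep polling).
    Scans from the end, since the sentinel is printed last."""
    for line in reversed(list(lines)):
        if SUCCESS_SENTINEL in line:
            return "success"
        if FAILURE_SENTINEL in line:
            return "failure"
    return None
```

Using unusual sentinel strings keeps false positives from ordinary Workbench log lines unlikely.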


// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

Contributor

Would it make sense to add an assertion checking that the TrainJob is created and successfully finished?

Contributor Author

Yes, I think this is very important :)
Thanks
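The suggested assertion could reduce to checking the TrainJob's status conditions. The `Complete` condition type follows the trainer v1alpha1 API, but treat the exact field names here as assumptions.

```python
def trainjob_succeeded(status):
    """True when the TrainJob status carries a Complete=True condition.
    Condition type names are assumed from the trainer v1alpha1 API."""
    for cond in (status or {}).get("conditions", []):
        if cond.get("type") == "Complete" and cond.get("status") == "True":
            return True
    return False
```

In the Go test this would sit inside an `Eventually` block polling the TrainJob object until the condition appears or a timeout fires.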
