
Conversation

@abhijeet-dhumal (Member) commented Jan 8, 2026

Description

RHOAIENG-42362
Enables RHAI feature tests (progression tracking, JIT checkpointing) to run in disconnected OpenShift environments where cluster nodes have no internet access.

Key Features

  • S3/MinIO Integration: Models and datasets are pre-staged to S3; the notebook downloads them to a shared PVC, and training pods load from local paths (no S3 access is needed in the training pods)
  • Kubeflow SDK Installation: install_kubeflow.py tries PyPI first and falls back to an S3-hosted wheel on version mismatch (this flow will change once the disconnected mirror index includes the official Red Hat test indexes; TODO for the next release)
  • Multi-GPU Support: New test cases for multi-node multi-GPU scenarios
  • Prestage Script: Supports presets (rhai, sft, all), custom models/datasets, --check and --force flags, auto-resume on interruption
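
The PyPI-first, S3-fallback install flow described above can be sketched as follows. This is a hedged reconstruction, not the real install_kubeflow.py API: the helper names (pip_install, download_wheel_from_s3) are illustrative stand-ins and are injected so the control flow can be exercised without network access.

```python
def install_kubeflow(version, pip_install, download_wheel_from_s3):
    """PyPI first; on failure, fall back to a pre-staged wheel from S3."""
    if pip_install(f"kubeflow=={version}"):
        return True
    wheel = download_wheel_from_s3(version)  # e.g. a path to a pre-staged .whl
    return wheel is not None and pip_install(wheel)

# Simulate a disconnected cluster where the PyPI mirror lacks the version,
# but a wheel was pre-staged to S3:
attempted = []

def fake_pip_install(target):
    attempted.append(target)
    return target.endswith(".whl")  # only the wheel install succeeds

ok = install_kubeflow(
    "2.0.0",
    pip_install=fake_pip_install,
    download_wheel_from_s3=lambda v: f"/tmp/kubeflow-{v}-py3-none-any.whl",
)
print(ok)         # True
print(attempted)  # ['kubeflow==2.0.0', '/tmp/kubeflow-2.0.0-py3-none-any.whl']
```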

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added multi-GPU test cases for CUDA and ROCm environments with 2-node, 2-GPU configurations
    • Extended test infrastructure with S3 and PyPI mirror support for offline environments
  • Documentation

    • Added comprehensive disconnected environment setup guide for offline OpenShift clusters, including model/dataset pre-staging and environment configuration procedures


@openshift-ci bot requested review from chipspeak and sutaakar January 8, 2026 19:43

@openshift-ci bot commented Jan 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign efazal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

This PR extends multi-GPU training test coverage for Kubeflow SDK with CUDA and ROCm test variants, adds comprehensive disconnected environment setup documentation with Python utilities for S3 pre-staging and Kubeflow installation, and enhances the training notebook to support local and S3-based data loading for offline deployments.
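
The local-vs-HuggingFace auto-detection can be sketched roughly as follows. Names here are illustrative, not taken from the notebook; per the review comments below, the notebook keys off the presence of config.json in the pre-staged directory.

```python
import os
import tempfile

def resolve_model_source(local_dir: str, hf_model_id: str) -> str:
    # If a pre-staged copy exists on the shared PVC (signalled by
    # config.json), load from the local path; otherwise fall back
    # to downloading from the HuggingFace Hub by model id.
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hf_model_id

# No pre-staged copy -> falls back to the Hub id:
print(resolve_model_source("/nonexistent/models/distilgpt2", "distilgpt2"))

# Pre-staged copy present -> uses the local path:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "config.json"), "w").close()
    print(resolve_model_source(d, "distilgpt2") == d)  # True
```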

Changes

  • Multi-GPU Test Functions (tests/trainer/kubeflow_sdk_test.go): Added 6 new test functions (3 for CUDA, 3 for ROCm) that configure 2-node, 2-GPU setups and delegate to existing multi-GPU test runners for progression, checkpointing, and feature validation.
  • Disconnected Environment Setup Guide (tests/trainer/resources/disconnected_env/README.md): Comprehensive documentation detailing prerequisites, step-by-step procedures for image mirroring, model/dataset pre-staging, Kubeflow SDK wheel preparation, environment variable configuration, and test execution with troubleshooting scenarios.
  • Kubeflow Installation Utility (tests/trainer/resources/disconnected_env/install_kubeflow.py): Python script implementing PyPI-first installation with S3 fallback, version verification, and credential-based S3 wheel download for disconnected environments.
  • Model/Dataset Pre-staging Utility (tests/trainer/resources/disconnected_env/prestage_models_datasets.py): Python tool for downloading HuggingFace models and datasets with resume capability, uploading to S3/MinIO, supporting presets, dry-run mode, and force overwrite with progress tracking.
  • Training Notebook (tests/trainer/resources/rhai_features.ipynb): Extended training routine with auto-detection of local vs. HuggingFace data sources, S3 download logic for disconnected environments, and environment-based configuration for multi-node, multi-GPU resource provisioning.
  • Test Configuration & Helpers (tests/trainer/sdk_tests/rhai_features_tests.go): Extended RhaiFeatureConfig with NumNodes and NumGpusPerNode fields; added three public multi-GPU test entry points; enhanced test setup to inject install_kubeflow.py into the ConfigMap and handle S3/PyPI mirror environment variables.
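
The pre-staging options mentioned in the PR description (presets rhai/sft/all, --check, --force, custom models/datasets) could map onto an argparse surface like this hedged reconstruction; the actual flag names and defaults live in prestage_models_datasets.py and may differ.

```python
import argparse

parser = argparse.ArgumentParser(prog="prestage_models_datasets.py")
parser.add_argument("--preset", choices=["rhai", "sft", "all"],
                    help="bundle of models/datasets to stage")
parser.add_argument("--models", nargs="*", default=[],
                    help="extra HuggingFace model ids")
parser.add_argument("--datasets", nargs="*", default=[],
                    help="extra HuggingFace dataset ids")
parser.add_argument("--check", action="store_true",
                    help="report what already exists in S3; transfer nothing")
parser.add_argument("--force", action="store_true",
                    help="overwrite data already present in S3")

args = parser.parse_args(["--preset", "rhai", "--check"])
print(args.preset, args.check, args.force)  # rhai True False
```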

Sequence Diagram(s)

sequenceDiagram
    participant Operator as Test Operator
    participant TestEnv as Test Environment
    participant S3 as S3/MinIO
    participant Registry as Container Registry
    participant Kubeflow as Kubeflow Cluster
    participant Notebook as Training Notebook

    Operator->>TestEnv: 1. Run image mirror script
    TestEnv->>Registry: Pull public images
    TestEnv->>Registry: Push to private registry
    
    Operator->>S3: 2. Run prestage_models_datasets.py
    S3->>S3: Download HF models/datasets
    S3->>S3: Upload to S3 bucket
    
    Operator->>TestEnv: 3. Prepare Kubeflow SDK wheel
    TestEnv->>S3: Store wheel in S3
    
    Operator->>TestEnv: 4. Configure env vars (S3, PyPI, GPU count)
    Operator->>Kubeflow: 5. Trigger multi-GPU test
    
    Kubeflow->>Notebook: Launch training pods (2 nodes × 2 GPUs)
    Notebook->>Notebook: Detect local data via PVC
    alt Local Data Present
        Notebook->>Notebook: Load model/dataset from PVC
    else Local Data Absent
        Notebook->>S3: Download model/dataset from S3
        S3->>Notebook: Stream to local PVC
    end
    
    Notebook->>Kubeflow: Execute distributed training
    Kubeflow->>Operator: Report test results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hops of joy, a multi-GPU dream,
CUDA, ROCm—what a gleaming team!
S3 mirrors in the dark, no internet near,
Models pre-staged and notebooks clear,
Distributed training, nodes entwined,
Disconnected labs with peace of mind! 🚀

🚥 Pre-merge checks (3 passed)
  • Title check (✅ Passed): The title accurately describes the main change: enabling SDK TransformersTrainer end-to-end tests to work in disconnected environments, which aligns with all file modifications across test configs, utility scripts, documentation, and test infrastructure.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@abhijeet-dhumal changed the title from "fix: Update SDK RHAI TransfomersTrainer based Feature tests for disco…" to "Adjust SDK TransfomersTrainer e2e tests for Trainer v2 to work in disconnected environment" on Jan 8, 2026
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/trainer/sdk_tests/rhai_features_tests.go:
- Around line 214-269: The PyPI mirror log currently prints pipIndexUrl raw
which may contain embedded credentials; sanitize before logging by parsing
pipIndexUrl (e.g., via net/url) to remove Userinfo or log only the hostname
(reuse pipTrustedHost) instead of the full URL; update the test.T().Logf call
that references pipIndexUrl and ensure pipTrustedHost computation still extracts
the host for trusted-host usage.
🧹 Nitpick comments (10)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (6)

113-119: Minor: format_size precision is inconsistent and the unit range stops at PB.

For bytes (< 1024) it returns an integer format, while larger sizes get one decimal place. The post-loop fallthrough value is in PB units (five divisions by 1024), so the current code is actually correct up to PB; the change below simply extends the range by one unit.

♻️ Suggested fix
 def format_size(size_bytes: int) -> str:
     """Format file size in human-readable format."""
-    for unit in ["B", "KB", "MB", "GB", "TB"]:
+    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
         if size_bytes < 1024:
             return f"{size_bytes:.1f} {unit}" if unit != "B" else f"{size_bytes} {unit}"
         size_bytes /= 1024
-    return f"{size_bytes:.1f} PB"
+    return f"{size_bytes:.1f} EB"
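
For reference, a standalone copy of the corrected helper behaves like this:

```python
def format_size(size_bytes: int) -> str:
    """Format file size in human-readable form (corrected version)."""
    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}" if unit != "B" else f"{size_bytes} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.1f} EB"

print(format_size(512))          # 512 B
print(format_size(2048))         # 2.0 KB
print(format_size(5 * 1024**3))  # 5.0 GB
```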

243-261: Unused function parameters prefix and is_last.

The static analysis correctly flags that prefix and is_last parameters on print_tree are never used in the function body. These appear to be leftover from a different design.

♻️ Suggested fix
-def print_tree(tree: Dict, prefix: str = "", is_last: bool = True, indent: str = ""):
+def print_tree(tree: Dict, indent: str = ""):
     """
     Print tree structure with proper indentation.
     """
     items = sorted(tree.items(), key=lambda x: (isinstance(x[1], dict), x[0]))
     
     for i, (name, value) in enumerate(items):
         is_last_item = i == len(items) - 1
         connector = "└── " if is_last_item else "├── "
         
         if isinstance(value, dict):
             # Directory
             print(f"{indent}{connector}{name}/")
             new_indent = indent + ("    " if is_last_item else "│   ")
-            print_tree(value, name, is_last_item, new_indent)
+            print_tree(value, new_indent)
         else:
             # File with size
             size_str = format_size(value)
             print(f"{indent}{connector}{name} ({size_str})")

322-327: Unused loop variables file_size and relative.

The unpacked tuple values file_size and relative are not used inside the loop body. Use _ to indicate intentionally ignored values.

♻️ Suggested fix
-        for file_path, s3_key, file_size, relative in files_to_upload:
+        for file_path, s3_key, _, _ in files_to_upload:
             # Upload with callback for progress
             with open(file_path, "rb") as f:
                 wrapped_file = CallbackIOWrapper(pbar.update, f, "read")
                 s3_client.upload_fileobj(wrapped_file, bucket, s3_key)
             file_count += 1

463-465: Use explicit Optional type hint for nullable parameters.

PEP 484 prohibits implicit Optional. Use str | None or Optional[str] for parameters that can be None.

♻️ Suggested fix
+from typing import Dict, List, Tuple, Optional

-def process_model(model_name: str, s3_client, bucket: str, download_dir: str, 
-                  skip_download: bool, skip_upload: bool, force: bool = False,
-                  custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_model(model_name: str, s3_client, bucket: str, download_dir: str, 
+                  skip_download: bool, skip_upload: bool, force: bool = False,
+                  custom_s3_prefix: Optional[str] = None) -> Tuple[str, int, str]:

-def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
-                    skip_download: bool, skip_upload: bool, force: bool = False,
-                    custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
+                    skip_download: bool, skip_upload: bool, force: bool = False,
+                    custom_s3_prefix: Optional[str] = None) -> Tuple[str, int, str]:

Also applies to: 502-504


387-387: Remove extraneous f prefixes from strings without placeholders.

Several strings use f"..." format but contain no placeholders. This adds minor overhead and can confuse readers.

♻️ Examples to fix
-        print(f"    (Saving as JSONL for SFT notebook compatibility)")
+        print("    (Saving as JSONL for SFT notebook compatibility)")

-    print(f"    (HuggingFace datasets library shows download progress)")
+    print("    (HuggingFace datasets library shows download progress)")

-    print(f"  Saving dataset to disk...")
+    print("  Saving dataset to disk...")

-            print(f"     Use --force to overwrite existing data")
+            print("     Use --force to overwrite existing data")

Also applies to: 419-419, 425-425, 481-481, 520-520, 846-847


197-213: Consider logging exception details when catching blind exceptions.

The check_s3_prefix_exists function catches all exceptions silently, which can make debugging difficult. Consider at least logging at debug level.

♻️ Suggested improvement
 def check_s3_prefix_exists(s3_client, bucket: str, s3_prefix: str) -> Tuple[bool, int]:
     """
     Check if an S3 prefix already has objects.
     Returns (exists, file_count).
     """
     try:
         paginator = s3_client.get_paginator("list_objects_v2")
         file_count = 0
         for page in paginator.paginate(Bucket=bucket, Prefix=s3_prefix + "/", MaxKeys=100):
             contents = page.get("Contents", [])
             file_count += len(contents)
             if file_count > 0:
                 # Early exit once we confirm existence
-                break
-        return (file_count > 0, file_count)
-    except Exception:
+                return (True, file_count)
+        return (file_count > 0, file_count)
+    except Exception as e:
+        print(f"S3 prefix check failed ({type(e).__name__}): {e}")
         return (False, 0)
tests/trainer/resources/disconnected_env/README.md (1)

158-174: Add language specifiers to fenced code blocks for better rendering.

Several code blocks are missing language specifiers, which affects syntax highlighting and tooling compatibility.

♻️ Examples of fixes needed

Line 158 (bucket structure):

-```
+```text
 <bucket>/
 ├── models/

Line 314 (error symptom):

-```
+```text
 TypeError: expected str, bytes or os.PathLike object, not NoneType

Line 384 (test output):

-```
+```text
 === RUN   TestRhaiTrainingProgressionCPU

Line 411 (ASCII diagram):

-```
+```text
 ┌─────────────────┐

Also applies to: 314-316, 328-330, 344-346, 369-371, 384-391, 394-405, 411-444

tests/trainer/sdk_tests/rhai_features_tests.go (1)

252-268: Hostname extraction is correct but could be simplified.

The PIP_TRUSTED_HOST extraction handles : (port) and / (path) sequentially: for https://host:8080/simple/ the colon branch strips the port and path, and for https://host/simple/ the slash branch strips the path, so both cases yield the bare hostname. A single strings.IndexAny expresses the same intent more directly:

♻️ Suggested clarification (optional)
 		if pipTrustedHost == "" {
-			// Extract hostname from URL
-			pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
-			if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
-				pipTrustedHost = pipTrustedHost[:idx]
-			} else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
-				pipTrustedHost = pipTrustedHost[:idx]
-			}
+			// Extract hostname from URL (strip scheme, then take up to first : or /)
+			pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
+			// Find the first occurrence of : or / to get just the hostname
+			if idx := strings.IndexAny(pipTrustedHost, ":/"); idx > 0 {
+				pipTrustedHost = pipTrustedHost[:idx]
+			}
 		}
tests/trainer/resources/disconnected_env/install_kubeflow.py (1)

137-139: Consider more specific exception handling for S3 operations.

The broad except Exception catch makes debugging harder. Consider catching specific boto3 exceptions or at least logging the exception type.

♻️ Suggested improvement
-    except Exception as e:
-        print(f"S3 fallback failed: {e}")
+    except ImportError as e:
+        print(f"S3 fallback failed - boto3 not installed: {e}")
+        return False
+    except Exception as e:
+        print(f"S3 fallback failed ({type(e).__name__}): {e}")
         return False
tests/trainer/resources/rhai_features.ipynb (1)

141-144: Consider adding strict=True to zip() for safety.

Python 3.10+ supports strict=True on zip() to raise an error if iterables have different lengths. This would catch data corruption issues early.

♻️ Suggested fix (if Python 3.10+ is guaranteed)
 def tokenize_function(examples):
     texts = [f"### Instruction:\n{i}\n\n### Response:\n{o}"
-             for i, o in zip(examples["instruction"], examples["output"])]
+             for i, o in zip(examples["instruction"], examples["output"], strict=True)]
     return tokenizer(texts, padding="max_length", truncation=True, max_length=128)

Note: Only apply if the notebook runtime guarantees Python 3.10+. The README mentions Python 3.9+ requirement, so this may not be applicable yet.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a056f04 and 5f1609e.

📒 Files selected for processing (6)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/disconnected_env/README.md
  • tests/trainer/resources/disconnected_env/install_kubeflow.py
  • tests/trainer/resources/disconnected_env/prestage_models_datasets.py
  • tests/trainer/resources/rhai_features.ipynb
  • tests/trainer/sdk_tests/rhai_features_tests.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-15T09:49:36.063Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/resources/osft.ipynb:460-493
Timestamp: 2025-12-15T09:49:36.063Z
Learning: In the Kubeflow Trainer SDK version used by this repository, get_job_logs(follow=False) returns a generator of log lines, just like when follow=True. When using this API, collect the generator into a list and then join the lines to form the full log string (e.g., logs = ''.join(list(get_job_logs(follow=False)))).

Applied to files:

  • tests/trainer/resources/rhai_features.ipynb
📚 Learning: 2025-12-12T12:23:09.527Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/sdk_tests/osft_traininghub_tests.go:33-36
Timestamp: 2025-12-12T12:23:09.527Z
Learning: In Go tests, the working directory during go test is the package directory being tested. Relative paths (e.g., "resources/file.txt") resolve to paths under that package directory. When tests live in subdirectories (e.g., tests/trainer/sdk_tests/), referencing resources should still be relative to the package directory being tested (tests/trainer/), so ensure resource paths are resolved from the package root of each test package.

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2025-12-16T07:55:49.810Z
Learnt from: ChughShilpa
Repo: opendatahub-io/distributed-workloads PR: 568
File: tests/trainer/cluster_training_runtimes_test.go:100-101
Timestamp: 2025-12-16T07:55:49.810Z
Learning: In Go tests using Gomega, ContainSubstring(substr string, args ...any) supports variadic args that are used with fmt.Sprintf to format the substring. This means you can pass formatting operands such as ContainSubstring("%s/rhoai/%s", registryName, expectedRuntime.Image) and have the matcher evaluate the formatted substring. Ensure you import fmt only if you actually format strings beforehand; otherwise rely on the variadic args as shown. This pattern applies to test files under the tests directory (e.g., tests/trainer/cluster_training_runtimes_test.go) and can be generalized to other tests using Gomega.

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2026-01-06T17:26:58.604Z
Learnt from: sutaakar
Repo: opendatahub-io/distributed-workloads PR: 579
File: tests/common/support/olm.go:1-25
Timestamp: 2026-01-06T17:26:58.604Z
Learning: In the distributed-workloads repository, Go imports should be formatted so there is a blank line between different third-party import groups (e.g., between github.com/operator-framework and k8s.io). Use openshift-goimports to format imports for Go files, ensuring the separation between distinct third-party groups is preserved. This convention applies to all Go files (e.g., tests/common/support/olm.go and beyond).

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
🧬 Code graph analysis (2)
tests/trainer/kubeflow_sdk_test.go (3)
tests/common/test_tag.go (4)
  • Tags (32-40)
  • KftoCuda (72-74)
  • MultiNodeMultiGpu (118-122)
  • KftoRocm (76-78)
tests/common/support/accelerator.go (2)
  • NVIDIA (22-22)
  • AMD (20-20)
tests/trainer/sdk_tests/rhai_features_tests.go (3)
  • RunRhaiFeaturesProgressionMultiGpuTest (116-127)
  • RunRhaiFeaturesCheckpointMultiGpuTest (130-141)
  • RunRhaiFeaturesAllMultiGpuTest (144-155)
tests/trainer/sdk_tests/rhai_features_tests.go (2)
tests/common/support/core.go (1)
  • CreateConfigMap (33-54)
tests/common/support/environment.go (4)
  • GetStorageBucketDefaultEndpoint (155-158)
  • GetStorageBucketAccessKeyId (165-168)
  • GetStorageBucketSecretKey (170-173)
  • GetStorageBucketName (175-178)
🪛 markdownlint-cli2 (0.18.1)
tests/trainer/resources/disconnected_env/README.md

158-158: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


314-314: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


328-328: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


344-344: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


369-369: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


384-384: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


394-394: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


411-411: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
tests/trainer/resources/disconnected_env/install_kubeflow.py

59-59: subprocess call: check for execution of untrusted input

(S603)


118-118: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)


124-124: subprocess call: check for execution of untrusted input

(S603)


135-135: Consider moving this statement to an else block

(TRY300)


137-137: Do not catch blind exception: Exception

(BLE001)

tests/trainer/resources/rhai_features.ipynb

125-125: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


249-249: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/trainer/resources/disconnected_env/prestage_models_datasets.py

211-211: Consider moving this statement to an else block

(TRY300)


212-212: Do not catch blind exception: Exception

(BLE001)


238-238: Do not catch blind exception: Exception

(BLE001)


243-243: Unused function argument: prefix

(ARG001)


243-243: Unused function argument: is_last

(ARG001)


300-301: try-except-pass detected, consider logging the exception

(S110)


300-300: Do not catch blind exception: Exception

(BLE001)


322-322: Loop control variable file_size not used within loop body

(B007)


322-322: Loop control variable relative not used within loop body

(B007)


349-349: Do not catch blind exception: Exception

(BLE001)


387-387: f-string without any placeholders

Remove extraneous f prefix

(F541)


419-419: f-string without any placeholders

Remove extraneous f prefix

(F541)


425-425: f-string without any placeholders

Remove extraneous f prefix

(F541)


465-465: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


481-481: f-string without any placeholders

Remove extraneous f prefix

(F541)


504-504: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


520-520: f-string without any placeholders

Remove extraneous f prefix

(F541)


774-774: Do not catch blind exception: Exception

(BLE001)


846-846: f-string without any placeholders

Remove extraneous f prefix

(F541)


847-847: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (16)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (1)

1-91: LGTM - Well-structured CLI tool with good UX.

The prestage script is well-designed with:

  • Clear preset system for common use cases
  • Dry-run and check modes for safe operation
  • Resume support for interrupted uploads
  • Good progress feedback with tqdm

Also applies to: 545-851

tests/trainer/resources/disconnected_env/README.md (2)

1-469: LGTM - Comprehensive and well-organized documentation.

The README provides excellent guidance for disconnected environment setup with:

  • Clear step-by-step instructions
  • Helpful troubleshooting section
  • Visual test flow diagram
  • Quick reference tables

This will significantly help operators deploy in air-gapped environments.


12-12: The Go 1.24 requirement stated in the README is accurate. The go.mod file specifies go 1.24.6 with toolchain go1.24.11, and the README correctly documents this constraint. Additionally, section 5.1 of the same README provides clear instructions for verifying and setting up Go 1.24. No issues found.

Likely an incorrect or invalid review comment.

tests/trainer/kubeflow_sdk_test.go (1)

91-122: LGTM - Well-structured multi-GPU test additions.

The new multi-GPU tests follow the established patterns:

  • Consistent naming convention (TestRhai...MultiGpu{Cuda,Rocm})
  • Proper use of MultiNodeMultiGpu(2, support.NVIDIA/AMD, 2) tags for 2x2 GPU configuration
  • Correct delegation to the corresponding runner functions with explicit node/GPU counts

The tests align well with the RunRhaiFeaturesXxxMultiGpuTest functions defined in rhai_features_tests.go.

tests/trainer/sdk_tests/rhai_features_tests.go (4)

69-71: LGTM - Clean multi-GPU configuration extension.

The NumNodes and NumGpusPerNode fields are well-integrated into RhaiFeatureConfig with sensible defaults (2 nodes, 1 GPU per node). The existing single-GPU test runners now explicitly set these defaults, maintaining backward compatibility.

Also applies to: 82-84, 96-98, 110-112


115-155: LGTM - Multi-GPU test runners follow established patterns.

The three new multi-GPU runner functions (RunRhaiFeaturesProgressionMultiGpuTest, RunRhaiFeaturesCheckpointMultiGpuTest, RunRhaiFeaturesAllMultiGpuTest) correctly:

  • Accept numNodes and numGpusPerNode parameters
  • Delegate to the shared runRhaiFeaturesTestWithConfig with appropriate config
  • Mirror the structure of existing single-GPU runners

174-187: LGTM - ConfigMap now includes install helper script.

The ConfigMap creation now properly bundles both the notebook and the install_kubeflow.py script, enabling the disconnected installation workflow. The relative path resources/disconnected_env/install_kubeflow.py correctly references from the sdk_tests package directory. Based on learnings, this path resolution is correct since the working directory during go test is the package directory being tested.


281-317: LGTM - Shell command properly propagates multi-GPU config.

The shell command correctly exports NUM_NODES and NUM_GPUS_PER_NODE environment variables, which the notebook consumes to configure the training job. The format string properly interpolates all the configuration values.
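
On the notebook side, consuming those exports might look like this sketch; the defaults mirror the single-GPU configuration noted earlier in this review (2 nodes, 1 GPU per node), but the exact parsing code is an assumption, not taken from the PR.

```python
import os

def training_topology(env=os.environ):
    # Read the node/GPU counts exported by the test harness, falling
    # back to the single-GPU defaults (2 nodes, 1 GPU per node).
    nodes = int(env.get("NUM_NODES", "2"))
    gpus = int(env.get("NUM_GPUS_PER_NODE", "1"))
    return nodes, gpus, nodes * gpus

nodes, gpus, total = training_topology({"NUM_NODES": "2", "NUM_GPUS_PER_NODE": "2"})
print(f"{nodes} node(s) x {gpus} GPU(s) = {total} GPUs")  # 2 node(s) x 2 GPU(s) = 4 GPUs
```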

tests/trainer/resources/disconnected_env/install_kubeflow.py (3)

37-52: Version verification imports kubeflow but doesn't handle stale imports.

If install_from_pypi() installs a new version after a previous import failed, the verify_kubeflow_version() call will import the newly installed package. However, if kubeflow was previously imported (e.g., an older version), Python won't re-import it. This scenario is unlikely in the current flow but worth noting.

The current implementation is fine for the intended use case (fresh install in container).


76-139: S3 fallback implementation is solid with appropriate error handling.

The S3 fallback logic:

  • Validates all required credentials before attempting download
  • Uses proper S3v4 signing and path-style addressing for MinIO compatibility
  • Preserves the original wheel filename (important for pip)
  • Disables SSL verification appropriately for self-signed certs

The use of /tmp/ for the wheel download is acceptable for a container environment where the script runs once during initialization.


142-156: LGTM - Clean main function with proper exit codes.

The main function follows a clear PyPI-first, S3-fallback pattern and returns appropriate exit codes for success (0) and failure (1).

tests/trainer/resources/rhai_features.ipynb (5)

20-51: LGTM - Well-documented training function with auto-detection.

The updated train_bloom() function has a clear docstring explaining the auto-detection behavior. The local path check (config.json existence) is a reliable way to detect pre-downloaded models.


95-133: LGTM - Local vs HuggingFace mode switching is well-implemented.

The branching logic correctly handles:

  • Local mode: Loads from pre-downloaded paths on shared PVC
  • HuggingFace mode: Downloads with rank-0 coordination via dist.barrier()

The rank-based synchronization prevents redundant downloads in distributed training.


242-341: LGTM - S3 pre-download logic is well-structured.

The S3 download cell:

  • Properly explains the PVC mount path differences (notebook vs training pods)
  • Cleans up old checkpoints to avoid resume conflicts
  • Uses idempotent downloads (skips if already present)
  • Has clear progress feedback

The path mapping comment is helpful for understanding the shared storage architecture.


363-391: LGTM - Multi-GPU configuration correctly propagates.

The notebook correctly:

  • Reads NUM_NODES and NUM_GPUS_PER_NODE from environment
  • Uses num_gpus_per_node for GPU resource requests
  • Logs the training configuration for debugging

259-265: Good practice: Cleaning up old checkpoints before training.

The cleanup of old checkpoints (/opt/app-root/src/checkpoints) is important to prevent the JIT checkpoint feature from resuming from stale checkpoints during test reruns.
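
A minimal sketch of that cleanup step; the checkpoint path is the one quoted above, while the helper name and exact logic are illustrative rather than the notebook's actual code.

```python
import os
import shutil
import tempfile

def reset_checkpoints(path):
    # Remove stale checkpoints so JIT checkpointing cannot resume from
    # a previous test run, then recreate an empty directory.
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)

# Exercise it against a throwaway directory standing in for
# /opt/app-root/src/checkpoints:
with tempfile.TemporaryDirectory() as root:
    ckpt = os.path.join(root, "checkpoints")
    os.makedirs(ckpt)
    open(os.path.join(ckpt, "stale.pt"), "w").close()
    reset_checkpoints(ckpt)
    print(os.listdir(ckpt))  # []
```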

Comment on lines +214 to +269
// S3/MinIO configuration for disconnected environments (optional)
s3Endpoint, _ := GetStorageBucketDefaultEndpoint()
s3AccessKey, _ := GetStorageBucketAccessKeyId()
s3SecretKey, _ := GetStorageBucketSecretKey()
s3Bucket, _ := GetStorageBucketName()
modelS3Prefix := os.Getenv("MODEL_S3_PREFIX")
if modelS3Prefix == "" {
    modelS3Prefix = "models/distilgpt2"
}
datasetS3Prefix := os.Getenv("DATASET_S3_PREFIX")
if datasetS3Prefix == "" {
    datasetS3Prefix = "alpaca-cleaned-datasets"
}

// Build S3 export commands (only if configured)
s3Exports := ""
if s3Endpoint != "" && s3Bucket != "" {
    test.T().Logf("S3 mode: endpoint=%s, bucket=%s", s3Endpoint, s3Bucket)
    s3Exports = fmt.Sprintf(
        "export AWS_DEFAULT_ENDPOINT='%s'; "+
            "export AWS_ACCESS_KEY_ID='%s'; "+
            "export AWS_SECRET_ACCESS_KEY='%s'; "+
            "export AWS_STORAGE_BUCKET='%s'; "+
            "export MODEL_S3_PREFIX='%s'; "+
            "export DATASET_S3_PREFIX='%s'; ",
        s3Endpoint, s3AccessKey, s3SecretKey, s3Bucket, modelS3Prefix, datasetS3Prefix,
    )
} else {
    test.T().Log("HuggingFace mode: S3 not configured, will download from HF Hub")
}

// PyPI mirror configuration for disconnected environments (optional)
pipIndexUrl := os.Getenv("PIP_INDEX_URL")
pipExports := ""
pipInstallFlags := ""
if pipIndexUrl != "" {
    test.T().Logf("PyPI mirror: %s", pipIndexUrl)
    // Extract hostname for trusted-host
    pipTrustedHost := os.Getenv("PIP_TRUSTED_HOST")
    if pipTrustedHost == "" {
        // Extract hostname from URL
        pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
        if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        } else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        }
    }
    pipExports = fmt.Sprintf(
        "export PIP_INDEX_URL='%s'; "+
            "export PIP_TRUSTED_HOST='%s'; "+
            "export PYTHONHTTPSVERIFY='0'; ",
        pipIndexUrl, pipTrustedHost,
    )
    pipInstallFlags = fmt.Sprintf("--index-url '%s' --trusted-host '%s' ", pipIndexUrl, pipTrustedHost)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if any logging statements might expose credentials in the test file
rg -n "Logf.*secret|Logf.*password|Logf.*access.?key" tests/trainer/sdk_tests/

Repository: opendatahub-io/distributed-workloads

Length of output: 62


🏁 Script executed:

#!/bin/bash
# Read the file to see the full context
cat -n tests/trainer/sdk_tests/rhai_features_tests.go | sed -n '214,269p'

Repository: opendatahub-io/distributed-workloads

Length of output: 2566


🏁 Script executed:

#!/bin/bash
# Search for uses of s3Exports and pipExports that might log them
rg -n "s3Exports|pipExports" tests/trainer/sdk_tests/rhai_features_tests.go -A 3 -B 1

Repository: opendatahub-io/distributed-workloads

Length of output: 908


🏁 Script executed:

#!/bin/bash
# Find the function call around line 313
sed -n '300,330p' tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 1253


🏁 Script executed:

#!/bin/bash
# Search for any logging that includes s3Exports or pipExports
rg -n "Log.*s3Exports|Log.*pipExports|Logf.*s3Exports|Logf.*pipExports" tests/trainer/sdk_tests/

Repository: opendatahub-io/distributed-workloads

Length of output: 62


🏁 Script executed:

#!/bin/bash
# Check how these export strings are used - likely passed to shell execution
rg -n "sh\.|bash|exec|Cmd|Command" tests/trainer/sdk_tests/rhai_features_tests.go | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 161


🏁 Script executed:

#!/bin/bash
# See the full shellCmd construction
sed -n '270,295p' tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 870


🏁 Script executed:

#!/bin/bash
# Search for PIP_INDEX_URL usage across the codebase to understand if credentials can be embedded
rg -n "PIP_INDEX_URL" tests/ -B 2 -A 2

Repository: opendatahub-io/distributed-workloads

Length of output: 6240


🏁 Script executed:

#!/bin/bash
# Check what environment variables are being used for credentials
rg -n "GetStorageBucket|PIP_" tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 463


Address potential credential exposure in PyPI URL logging at line 250.

The S3 credentials are handled safely—they're passed via environment variables in the shell command without being logged directly (line 231 only logs endpoint and bucket).

However, the PyPI URL logging at line 250 (test.T().Logf("PyPI mirror: %s", pipIndexUrl)) is a security risk: if PIP_INDEX_URL contains embedded credentials in the format https://user:password@registry/simple/, they will be exposed in test logs. Consider masking the credentials before logging, or logging only the hostname.

🤖 Prompt for AI Agents
In @tests/trainer/sdk_tests/rhai_features_tests.go around lines 214 - 269, The
PyPI mirror log currently prints pipIndexUrl raw which may contain embedded
credentials; sanitize before logging by parsing pipIndexUrl (e.g., via net/url)
to remove Userinfo or log only the hostname (reuse pipTrustedHost) instead of
the full URL; update the test.T().Logf call that references pipIndexUrl and
ensure pipTrustedHost computation still extracts the host for trusted-host
usage.
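A minimal sketch of that sanitization, written in Python for illustration (the Go fix would do the same with net/url); the example URL is hypothetical:

```python
from urllib.parse import urlparse, urlunparse

def sanitize_index_url(raw: str) -> tuple[str, str]:
    """Strip embedded credentials so the URL is safe to log.

    Returns (display_url, hostname); the hostname can also feed
    pip's --trusted-host flag.
    """
    parts = urlparse(raw)
    if not parts.hostname:
        # Unparseable input: caller decides whether to log it at all
        return raw, ""
    netloc = parts.hostname
    if parts.port:
        netloc += f":{parts.port}"
    # Rebuild the URL without the userinfo component
    display = urlunparse(parts._replace(netloc=netloc))
    return display, parts.hostname

display, host = sanitize_index_url("https://user:secret@mirror.example.com:8443/simple/")
print(display)  # https://mirror.example.com:8443/simple/
print(host)     # mirror.example.com
```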

required_version = get_required_version()
print(f"Attempting to install kubeflow=={required_version} from PyPI...")
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={required_version}"],
Contributor

The kubeflow package is not in the default PyPI index, only in the Red Hat index, so the script should use that index when installing the SDK.
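A sketch of how the install command could honor a mirror index (PIP_INDEX_URL and PIP_TRUSTED_HOST are the env vars the tests export; the flag handling here is an assumption, not the script's actual logic):

```python
import os
import sys

def build_pip_command(version: str) -> list[str]:
    """Build the pip invocation, adding --index-url/--trusted-host
    when the mirror env vars are set (hypothetical handling)."""
    cmd = [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={version}"]
    index = os.environ.get("PIP_INDEX_URL")
    if index:
        cmd += ["--index-url", index]
        host = os.environ.get("PIP_TRUSTED_HOST")
        if host:
            cmd += ["--trusted-host", host]
    return cmd

# Usage (runs pip for real, so it is commented out here):
# import subprocess
# subprocess.run(build_pip_command("0.2.1+rhai0"), check=True)
```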

        print(f"Verified: kubeflow version {installed_version} matches required {required_version}")
        return True
    else:
        print(f"Version mismatch: installed '{installed_version}', required '{required_version}'")
Contributor

When installing from the Red Hat index, I get:

Version mismatch: installed 'v0.2.1+rhai0', required '0.2.1+rhai0'
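One way to make the comparison tolerant of the leading "v" (a sketch, not the script's current code):

```python
def versions_match(installed: str, required: str) -> bool:
    """Compare versions while ignoring a leading 'v', so 'v0.2.1+rhai0'
    reported after a Red Hat index install matches the required
    '0.2.1+rhai0'."""
    return installed.removeprefix("v") == required.removeprefix("v")

print(versions_match("v0.2.1+rhai0", "0.2.1+rhai0"))  # True
```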

}

// Multi-GPU CUDA tests - 2 nodes, 2 GPUs each (requires 4 total NVIDIA GPUs)
func TestRhaiTrainingProgressionMultiGpuCuda(t *testing.T) {
Contributor

Do we need separate tests for multi-node single-GPU and multi-node multi-GPU?
How long does one test take to run?

Contributor

@sutaakar sutaakar Jan 16, 2026

It looks like the new model download implementation is quite slow when downloading from Hugging Face.

abhijeet-dhumal

This comment was marked as spam.
