
Conversation

@abhijeet-dhumal (Member) commented Jan 8, 2026

Description

RHOAIENG-42362
Enables RHAI feature tests (progression tracking, JIT checkpointing) to run in disconnected OpenShift environments where cluster nodes have no internet access.

Key Features

  • S3/MinIO Integration: Models and datasets are pre-staged to S3; the notebook downloads them to a shared PVC, and training pods load from local paths (no S3 access is needed in the training pods)
  • Kubeflow SDK Installation: install_kubeflow.py tries PyPI first and falls back to an S3-hosted wheel on version mismatch (this flow will change once the disconnected mirror index includes the official Red Hat test indexes; TODO for the next release)
  • Multi-GPU Support: New test cases for multi-node multi-GPU scenarios
  • Prestage Script: Supports presets (rhai, sft, all), custom models/datasets, --check and --force flags, auto-resume on interruption
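
The PyPI-first, S3-fallback install flow described above can be sketched as follows. This is a hedged reconstruction, not the real install_kubeflow.py API: the helper names (pip_install, download_wheel_from_s3) are illustrative stand-ins and are injected so the control flow can be exercised without network access.

```python
def install_kubeflow(version, pip_install, download_wheel_from_s3):
    """PyPI first; on failure, fall back to a pre-staged wheel from S3."""
    if pip_install(f"kubeflow=={version}"):
        return True
    wheel = download_wheel_from_s3(version)  # e.g. a path to a pre-staged .whl
    return wheel is not None and pip_install(wheel)

# Simulate a disconnected cluster where the PyPI mirror lacks the version,
# but a wheel was pre-staged to S3:
attempted = []

def fake_pip_install(target):
    attempted.append(target)
    return target.endswith(".whl")  # only the wheel install succeeds

ok = install_kubeflow(
    "2.0.0",
    pip_install=fake_pip_install,
    download_wheel_from_s3=lambda v: f"/tmp/kubeflow-{v}-py3-none-any.whl",
)
print(ok)         # True
print(attempted)  # ['kubeflow==2.0.0', '/tmp/kubeflow-2.0.0-py3-none-any.whl']
```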

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added multi-GPU test cases for CUDA and ROCm environments with 2-node, 2-GPU configurations
    • Extended test infrastructure with S3 and PyPI mirror support for offline environments
  • Documentation

    • Added comprehensive disconnected environment setup guide for offline OpenShift clusters, including model/dataset pre-staging and environment configuration procedures


@openshift-ci bot requested review from chipspeak and sutaakar January 8, 2026 19:43

@openshift-ci bot commented Jan 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign efazal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

This PR extends multi-GPU training test coverage for Kubeflow SDK with CUDA and ROCm test variants, adds comprehensive disconnected environment setup documentation with Python utilities for S3 pre-staging and Kubeflow installation, and enhances the training notebook to support local and S3-based data loading for offline deployments.
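
The local-vs-HuggingFace auto-detection can be sketched roughly as follows. Names here are illustrative, not taken from the notebook; per the review comments below, the notebook keys off the presence of config.json in the pre-staged directory.

```python
import os
import tempfile

def resolve_model_source(local_dir: str, hf_model_id: str) -> str:
    # If a pre-staged copy exists on the shared PVC (signalled by
    # config.json), load from the local path; otherwise fall back
    # to downloading from the HuggingFace Hub by model id.
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hf_model_id

# No pre-staged copy -> falls back to the Hub id:
print(resolve_model_source("/nonexistent/models/distilgpt2", "distilgpt2"))

# Pre-staged copy present -> uses the local path:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "config.json"), "w").close()
    print(resolve_model_source(d, "distilgpt2") == d)  # True
```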

Changes

  • Multi-GPU Test Functions (tests/trainer/kubeflow_sdk_test.go): Added 6 new test functions (3 for CUDA, 3 for ROCm) that configure 2-node, 2-GPU setups and delegate to existing multi-GPU test runners for progression, checkpointing, and feature validation.
  • Disconnected Environment Setup Guide (tests/trainer/resources/disconnected_env/README.md): Comprehensive documentation detailing prerequisites, step-by-step procedures for image mirroring, model/dataset pre-staging, Kubeflow SDK wheel preparation, environment variable configuration, and test execution with troubleshooting scenarios.
  • Kubeflow Installation Utility (tests/trainer/resources/disconnected_env/install_kubeflow.py): Python script implementing PyPI-first installation with S3 fallback, version verification, and credential-based S3 wheel download for disconnected environments.
  • Model/Dataset Pre-staging Utility (tests/trainer/resources/disconnected_env/prestage_models_datasets.py): Python tool for downloading HuggingFace models and datasets with resume capability, uploading to S3/MinIO, supporting presets, dry-run mode, and force overwrite with progress tracking.
  • Training Notebook (tests/trainer/resources/rhai_features.ipynb): Extended training routine with auto-detection of local vs. HuggingFace data sources, S3 download logic for disconnected environments, and environment-based configuration for multi-node, multi-GPU resource provisioning.
  • Test Configuration & Helpers (tests/trainer/sdk_tests/rhai_features_tests.go): Extended RhaiFeatureConfig with NumNodes and NumGpusPerNode fields; added three public multi-GPU test entry points; enhanced test setup to inject install_kubeflow.py into the ConfigMap and handle S3/PyPI mirror environment variables.
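
The pre-staging options mentioned in the PR description (presets rhai/sft/all, --check, --force, custom models/datasets) could map onto an argparse surface like this hedged reconstruction; the actual flag names and defaults live in prestage_models_datasets.py and may differ.

```python
import argparse

parser = argparse.ArgumentParser(prog="prestage_models_datasets.py")
parser.add_argument("--preset", choices=["rhai", "sft", "all"],
                    help="bundle of models/datasets to stage")
parser.add_argument("--models", nargs="*", default=[],
                    help="extra HuggingFace model ids")
parser.add_argument("--datasets", nargs="*", default=[],
                    help="extra HuggingFace dataset ids")
parser.add_argument("--check", action="store_true",
                    help="report what already exists in S3; transfer nothing")
parser.add_argument("--force", action="store_true",
                    help="overwrite data already present in S3")

args = parser.parse_args(["--preset", "rhai", "--check"])
print(args.preset, args.check, args.force)  # rhai True False
```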

Sequence Diagram(s)

sequenceDiagram
    participant Operator as Test Operator
    participant TestEnv as Test Environment
    participant S3 as S3/MinIO
    participant Registry as Container Registry
    participant Kubeflow as Kubeflow Cluster
    participant Notebook as Training Notebook

    Operator->>TestEnv: 1. Run image mirror script
    TestEnv->>Registry: Pull public images
    TestEnv->>Registry: Push to private registry
    
    Operator->>S3: 2. Run prestage_models_datasets.py
    S3->>S3: Download HF models/datasets
    S3->>S3: Upload to S3 bucket
    
    Operator->>TestEnv: 3. Prepare Kubeflow SDK wheel
    TestEnv->>S3: Store wheel in S3
    
    Operator->>TestEnv: 4. Configure env vars (S3, PyPI, GPU count)
    Operator->>Kubeflow: 5. Trigger multi-GPU test
    
    Kubeflow->>Notebook: Launch training pods (2 nodes × 2 GPUs)
    Notebook->>Notebook: Detect local data via PVC
    alt Local Data Present
        Notebook->>Notebook: Load model/dataset from PVC
    else Local Data Absent
        Notebook->>S3: Download model/dataset from S3
        S3->>Notebook: Stream to local PVC
    end
    
    Notebook->>Kubeflow: Execute distributed training
    Kubeflow->>Operator: Report test results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hops of joy, a multi-GPU dream,
CUDA, ROCm—what a gleaming team!
S3 mirrors in the dark, no internet near,
Models pre-staged and notebooks clear,
Distributed training, nodes entwined,
Disconnected labs with peace of mind! 🚀

🚥 Pre-merge checks (3 passed)
  • Title check (✅ Passed): The title accurately describes the main change: enabling SDK TransformersTrainer end-to-end tests to work in disconnected environments, which aligns with all file modifications across test configs, utility scripts, documentation, and test infrastructure.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@abhijeet-dhumal changed the title from "fix: Update SDK RHAI TransfomersTrainer based Feature tests for disco…" to "Adjust SDK TransfomersTrainer e2e tests for Trainer v2 to work in disconnected environment" on Jan 8, 2026
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/trainer/sdk_tests/rhai_features_tests.go:
- Around line 214-269: The PyPI mirror log currently prints pipIndexUrl raw
which may contain embedded credentials; sanitize before logging by parsing
pipIndexUrl (e.g., via net/url) to remove Userinfo or log only the hostname
(reuse pipTrustedHost) instead of the full URL; update the test.T().Logf call
that references pipIndexUrl and ensure pipTrustedHost computation still extracts
the host for trusted-host usage.
🧹 Nitpick comments (10)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (6)

113-119: Minor: format_size precision is inconsistent and the unit range stops at PB.

For bytes (< 1024) it returns an integer format, while larger sizes get one decimal place. The post-loop fallthrough value is in PB units (five divisions by 1024), so the current code is actually correct up to PB; the change below simply extends the range by one unit.

♻️ Suggested fix
 def format_size(size_bytes: int) -> str:
     """Format file size in human-readable format."""
-    for unit in ["B", "KB", "MB", "GB", "TB"]:
+    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
         if size_bytes < 1024:
             return f"{size_bytes:.1f} {unit}" if unit != "B" else f"{size_bytes} {unit}"
         size_bytes /= 1024
-    return f"{size_bytes:.1f} PB"
+    return f"{size_bytes:.1f} EB"
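
For reference, a standalone copy of the corrected helper behaves like this:

```python
def format_size(size_bytes: int) -> str:
    """Format file size in human-readable form (corrected version)."""
    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}" if unit != "B" else f"{size_bytes} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.1f} EB"

print(format_size(512))          # 512 B
print(format_size(2048))         # 2.0 KB
print(format_size(5 * 1024**3))  # 5.0 GB
```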

243-261: Unused function parameters prefix and is_last.

The static analysis correctly flags that prefix and is_last parameters on print_tree are never used in the function body. These appear to be leftover from a different design.

♻️ Suggested fix
-def print_tree(tree: Dict, prefix: str = "", is_last: bool = True, indent: str = ""):
+def print_tree(tree: Dict, indent: str = ""):
     """
     Print tree structure with proper indentation.
     """
     items = sorted(tree.items(), key=lambda x: (isinstance(x[1], dict), x[0]))
     
     for i, (name, value) in enumerate(items):
         is_last_item = i == len(items) - 1
         connector = "└── " if is_last_item else "├── "
         
         if isinstance(value, dict):
             # Directory
             print(f"{indent}{connector}{name}/")
             new_indent = indent + ("    " if is_last_item else "│   ")
-            print_tree(value, name, is_last_item, new_indent)
+            print_tree(value, new_indent)
         else:
             # File with size
             size_str = format_size(value)
             print(f"{indent}{connector}{name} ({size_str})")

322-327: Unused loop variables file_size and relative.

The unpacked tuple values file_size and relative are not used inside the loop body. Use _ to indicate intentionally ignored values.

♻️ Suggested fix
-        for file_path, s3_key, file_size, relative in files_to_upload:
+        for file_path, s3_key, _, _ in files_to_upload:
             # Upload with callback for progress
             with open(file_path, "rb") as f:
                 wrapped_file = CallbackIOWrapper(pbar.update, f, "read")
                 s3_client.upload_fileobj(wrapped_file, bucket, s3_key)
             file_count += 1

463-465: Use explicit Optional type hint for nullable parameters.

PEP 484 prohibits implicit Optional. Use str | None or Optional[str] for parameters that can be None.

♻️ Suggested fix
+from typing import Dict, List, Tuple, Optional

-def process_model(model_name: str, s3_client, bucket: str, download_dir: str, 
-                  skip_download: bool, skip_upload: bool, force: bool = False,
-                  custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_model(model_name: str, s3_client, bucket: str, download_dir: str, 
+                  skip_download: bool, skip_upload: bool, force: bool = False,
+                  custom_s3_prefix: Optional[str] = None) -> Tuple[str, int, str]:

-def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
-                    skip_download: bool, skip_upload: bool, force: bool = False,
-                    custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
+                    skip_download: bool, skip_upload: bool, force: bool = False,
+                    custom_s3_prefix: Optional[str] = None) -> Tuple[str, int, str]:

Also applies to: 502-504


387-387: Remove extraneous f prefixes from strings without placeholders.

Several strings use f"..." format but contain no placeholders. This adds minor overhead and can confuse readers.

♻️ Examples to fix
-        print(f"    (Saving as JSONL for SFT notebook compatibility)")
+        print("    (Saving as JSONL for SFT notebook compatibility)")

-    print(f"    (HuggingFace datasets library shows download progress)")
+    print("    (HuggingFace datasets library shows download progress)")

-    print(f"  Saving dataset to disk...")
+    print("  Saving dataset to disk...")

-            print(f"     Use --force to overwrite existing data")
+            print("     Use --force to overwrite existing data")

Also applies to: 419-419, 425-425, 481-481, 520-520, 846-847


197-213: Consider logging exception details when catching blind exceptions.

The check_s3_prefix_exists function catches all exceptions silently, which can make debugging difficult. Consider at least logging at debug level.

♻️ Suggested improvement
 def check_s3_prefix_exists(s3_client, bucket: str, s3_prefix: str) -> Tuple[bool, int]:
     """
     Check if an S3 prefix already has objects.
     Returns (exists, file_count).
     """
     try:
         paginator = s3_client.get_paginator("list_objects_v2")
         file_count = 0
         for page in paginator.paginate(Bucket=bucket, Prefix=s3_prefix + "/", MaxKeys=100):
             contents = page.get("Contents", [])
             file_count += len(contents)
             if file_count > 0:
                 # Early exit once we confirm existence
-                break
-        return (file_count > 0, file_count)
-    except Exception:
+                return (True, file_count)
+        return (file_count > 0, file_count)
+    except Exception as e:
+        print(f"S3 prefix check failed ({type(e).__name__}): {e}")
         return (False, 0)
tests/trainer/resources/disconnected_env/README.md (1)

158-174: Add language specifiers to fenced code blocks for better rendering.

Several code blocks are missing language specifiers, which affects syntax highlighting and tooling compatibility.

♻️ Examples of fixes needed

Line 158 (bucket structure):

-```
+```text
 <bucket>/
 ├── models/

Line 314 (error symptom):

-```
+```text
 TypeError: expected str, bytes or os.PathLike object, not NoneType

Line 384 (test output):

-```
+```text
 === RUN   TestRhaiTrainingProgressionCPU

Line 411 (ASCII diagram):

-```
+```text
 ┌─────────────────┐

Also applies to: 314-316, 328-330, 344-346, 369-371, 384-391, 394-405, 411-444

tests/trainer/sdk_tests/rhai_features_tests.go (1)

252-268: Hostname extraction is correct but could be simplified.

The PIP_TRUSTED_HOST extraction handles : (port) and / (path) sequentially: for https://host:8080/simple/ the colon branch strips the port and path, and for https://host/simple/ the slash branch strips the path, so both cases yield the bare hostname. A single strings.IndexAny expresses the same intent more directly:

♻️ Suggested clarification (optional)
 		if pipTrustedHost == "" {
-			// Extract hostname from URL
-			pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
-			if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
-				pipTrustedHost = pipTrustedHost[:idx]
-			} else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
-				pipTrustedHost = pipTrustedHost[:idx]
-			}
+			// Extract hostname from URL (strip scheme, then take up to first : or /)
+			pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
+			// Find the first occurrence of : or / to get just the hostname
+			if idx := strings.IndexAny(pipTrustedHost, ":/"); idx > 0 {
+				pipTrustedHost = pipTrustedHost[:idx]
+			}
 		}
tests/trainer/resources/disconnected_env/install_kubeflow.py (1)

137-139: Consider more specific exception handling for S3 operations.

The broad except Exception catch makes debugging harder. Consider catching specific boto3 exceptions or at least logging the exception type.

♻️ Suggested improvement
-    except Exception as e:
-        print(f"S3 fallback failed: {e}")
+    except ImportError as e:
+        print(f"S3 fallback failed - boto3 not installed: {e}")
+        return False
+    except Exception as e:
+        print(f"S3 fallback failed ({type(e).__name__}): {e}")
         return False
tests/trainer/resources/rhai_features.ipynb (1)

141-144: Consider adding strict=True to zip() for safety.

Python 3.10+ supports strict=True on zip() to raise an error if iterables have different lengths. This would catch data corruption issues early.

♻️ Suggested fix (if Python 3.10+ is guaranteed)
 def tokenize_function(examples):
     texts = [f"### Instruction:\n{i}\n\n### Response:\n{o}"
-             for i, o in zip(examples["instruction"], examples["output"])]
+             for i, o in zip(examples["instruction"], examples["output"], strict=True)]
     return tokenizer(texts, padding="max_length", truncation=True, max_length=128)

Note: Only apply if the notebook runtime guarantees Python 3.10+. The README mentions Python 3.9+ requirement, so this may not be applicable yet.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a056f04 and 5f1609e.

📒 Files selected for processing (6)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/disconnected_env/README.md
  • tests/trainer/resources/disconnected_env/install_kubeflow.py
  • tests/trainer/resources/disconnected_env/prestage_models_datasets.py
  • tests/trainer/resources/rhai_features.ipynb
  • tests/trainer/sdk_tests/rhai_features_tests.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-15T09:49:36.063Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/resources/osft.ipynb:460-493
Timestamp: 2025-12-15T09:49:36.063Z
Learning: In the Kubeflow Trainer SDK version used by this repository, get_job_logs(follow=False) returns a generator of log lines, just like when follow=True. When using this API, collect the generator into a list and then join the lines to form the full log string (e.g., logs = ''.join(list(get_job_logs(follow=False)))).

Applied to files:

  • tests/trainer/resources/rhai_features.ipynb
📚 Learning: 2025-12-12T12:23:09.527Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/sdk_tests/osft_traininghub_tests.go:33-36
Timestamp: 2025-12-12T12:23:09.527Z
Learning: In Go tests, the working directory during go test is the package directory being tested. Relative paths (e.g., "resources/file.txt") resolve to paths under that package directory. When tests live in subdirectories (e.g., tests/trainer/sdk_tests/), referencing resources should still be relative to the package directory being tested (tests/trainer/), so ensure resource paths are resolved from the package root of each test package.

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2025-12-16T07:55:49.810Z
Learnt from: ChughShilpa
Repo: opendatahub-io/distributed-workloads PR: 568
File: tests/trainer/cluster_training_runtimes_test.go:100-101
Timestamp: 2025-12-16T07:55:49.810Z
Learning: In Go tests using Gomega, ContainSubstring(substr string, args ...any) supports variadic args that are used with fmt.Sprintf to format the substring. This means you can pass formatting operands such as ContainSubstring("%s/rhoai/%s", registryName, expectedRuntime.Image) and have the matcher evaluate the formatted substring. Ensure you import fmt only if you actually format strings beforehand; otherwise rely on the variadic args as shown. This pattern applies to test files under the tests directory (e.g., tests/trainer/cluster_training_runtimes_test.go) and can be generalized to other tests using Gomega.

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2026-01-06T17:26:58.604Z
Learnt from: sutaakar
Repo: opendatahub-io/distributed-workloads PR: 579
File: tests/common/support/olm.go:1-25
Timestamp: 2026-01-06T17:26:58.604Z
Learning: In the distributed-workloads repository, Go imports should be formatted so there is a blank line between different third-party import groups (e.g., between github.com/operator-framework and k8s.io). Use openshift-goimports to format imports for Go files, ensuring the separation between distinct third-party groups is preserved. This convention applies to all Go files (e.g., tests/common/support/olm.go and beyond).

Applied to files:

  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/sdk_tests/rhai_features_tests.go
🧬 Code graph analysis (2)
tests/trainer/kubeflow_sdk_test.go (3)
tests/common/test_tag.go (4)
  • Tags (32-40)
  • KftoCuda (72-74)
  • MultiNodeMultiGpu (118-122)
  • KftoRocm (76-78)
tests/common/support/accelerator.go (2)
  • NVIDIA (22-22)
  • AMD (20-20)
tests/trainer/sdk_tests/rhai_features_tests.go (3)
  • RunRhaiFeaturesProgressionMultiGpuTest (116-127)
  • RunRhaiFeaturesCheckpointMultiGpuTest (130-141)
  • RunRhaiFeaturesAllMultiGpuTest (144-155)
tests/trainer/sdk_tests/rhai_features_tests.go (2)
tests/common/support/core.go (1)
  • CreateConfigMap (33-54)
tests/common/support/environment.go (4)
  • GetStorageBucketDefaultEndpoint (155-158)
  • GetStorageBucketAccessKeyId (165-168)
  • GetStorageBucketSecretKey (170-173)
  • GetStorageBucketName (175-178)
🪛 markdownlint-cli2 (0.18.1)
tests/trainer/resources/disconnected_env/README.md

158-158: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


314-314: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


328-328: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


344-344: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


369-369: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


384-384: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


394-394: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


411-411: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
tests/trainer/resources/disconnected_env/install_kubeflow.py

59-59: subprocess call: check for execution of untrusted input

(S603)


118-118: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)


124-124: subprocess call: check for execution of untrusted input

(S603)


135-135: Consider moving this statement to an else block

(TRY300)


137-137: Do not catch blind exception: Exception

(BLE001)

tests/trainer/resources/rhai_features.ipynb

125-125: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


249-249: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/trainer/resources/disconnected_env/prestage_models_datasets.py

211-211: Consider moving this statement to an else block

(TRY300)


212-212: Do not catch blind exception: Exception

(BLE001)


238-238: Do not catch blind exception: Exception

(BLE001)


243-243: Unused function argument: prefix

(ARG001)


243-243: Unused function argument: is_last

(ARG001)


300-301: try-except-pass detected, consider logging the exception

(S110)


300-300: Do not catch blind exception: Exception

(BLE001)


322-322: Loop control variable file_size not used within loop body

(B007)


322-322: Loop control variable relative not used within loop body

(B007)


349-349: Do not catch blind exception: Exception

(BLE001)


387-387: f-string without any placeholders

Remove extraneous f prefix

(F541)


419-419: f-string without any placeholders

Remove extraneous f prefix

(F541)


425-425: f-string without any placeholders

Remove extraneous f prefix

(F541)


465-465: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


481-481: f-string without any placeholders

Remove extraneous f prefix

(F541)


504-504: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


520-520: f-string without any placeholders

Remove extraneous f prefix

(F541)


774-774: Do not catch blind exception: Exception

(BLE001)


846-846: f-string without any placeholders

Remove extraneous f prefix

(F541)


847-847: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (16)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (1)

1-91: LGTM - Well-structured CLI tool with good UX.

The prestage script is well-designed with:

  • Clear preset system for common use cases
  • Dry-run and check modes for safe operation
  • Resume support for interrupted uploads
  • Good progress feedback with tqdm

Also applies to: 545-851

tests/trainer/resources/disconnected_env/README.md (2)

1-469: LGTM - Comprehensive and well-organized documentation.

The README provides excellent guidance for disconnected environment setup with:

  • Clear step-by-step instructions
  • Helpful troubleshooting section
  • Visual test flow diagram
  • Quick reference tables

This will significantly help operators deploy in air-gapped environments.


12-12: The Go 1.24 requirement stated in the README is accurate. The go.mod file specifies go 1.24.6 with toolchain go1.24.11, and the README correctly documents this constraint. Additionally, section 5.1 of the same README provides clear instructions for verifying and setting up Go 1.24. No issues found.

Likely an incorrect or invalid review comment.

tests/trainer/kubeflow_sdk_test.go (1)

91-122: LGTM - Well-structured multi-GPU test additions.

The new multi-GPU tests follow the established patterns:

  • Consistent naming convention (TestRhai...MultiGpu{Cuda,Rocm})
  • Proper use of MultiNodeMultiGpu(2, support.NVIDIA/AMD, 2) tags for 2x2 GPU configuration
  • Correct delegation to the corresponding runner functions with explicit node/GPU counts

The tests align well with the RunRhaiFeaturesXxxMultiGpuTest functions defined in rhai_features_tests.go.

tests/trainer/sdk_tests/rhai_features_tests.go (4)

69-71: LGTM - Clean multi-GPU configuration extension.

The NumNodes and NumGpusPerNode fields are well-integrated into RhaiFeatureConfig with sensible defaults (2 nodes, 1 GPU per node). The existing single-GPU test runners now explicitly set these defaults, maintaining backward compatibility.

Also applies to: 82-84, 96-98, 110-112


115-155: LGTM - Multi-GPU test runners follow established patterns.

The three new multi-GPU runner functions (RunRhaiFeaturesProgressionMultiGpuTest, RunRhaiFeaturesCheckpointMultiGpuTest, RunRhaiFeaturesAllMultiGpuTest) correctly:

  • Accept numNodes and numGpusPerNode parameters
  • Delegate to the shared runRhaiFeaturesTestWithConfig with appropriate config
  • Mirror the structure of existing single-GPU runners

174-187: LGTM - ConfigMap now includes install helper script.

The ConfigMap creation now properly bundles both the notebook and the install_kubeflow.py script, enabling the disconnected installation workflow. The relative path resources/disconnected_env/install_kubeflow.py correctly references from the sdk_tests package directory. Based on learnings, this path resolution is correct since the working directory during go test is the package directory being tested.


281-317: LGTM - Shell command properly propagates multi-GPU config.

The shell command correctly exports NUM_NODES and NUM_GPUS_PER_NODE environment variables, which the notebook consumes to configure the training job. The format string properly interpolates all the configuration values.
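
On the notebook side, consuming those exports might look like this sketch; the defaults mirror the single-GPU configuration noted earlier in this review (2 nodes, 1 GPU per node), but the exact parsing code is an assumption, not taken from the PR.

```python
import os

def training_topology(env=os.environ):
    # Read the node/GPU counts exported by the test harness, falling
    # back to the single-GPU defaults (2 nodes, 1 GPU per node).
    nodes = int(env.get("NUM_NODES", "2"))
    gpus = int(env.get("NUM_GPUS_PER_NODE", "1"))
    return nodes, gpus, nodes * gpus

nodes, gpus, total = training_topology({"NUM_NODES": "2", "NUM_GPUS_PER_NODE": "2"})
print(f"{nodes} node(s) x {gpus} GPU(s) = {total} GPUs")  # 2 node(s) x 2 GPU(s) = 4 GPUs
```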

tests/trainer/resources/disconnected_env/install_kubeflow.py (3)

37-52: Version verification imports kubeflow but doesn't handle stale imports.

If install_from_pypi() installs a new version after a previous import failed, the verify_kubeflow_version() call will import the newly installed package. However, if kubeflow was previously imported (e.g., an older version), Python won't re-import it. This scenario is unlikely in the current flow but worth noting.

The current implementation is fine for the intended use case (fresh install in container).


76-139: S3 fallback implementation is solid with appropriate error handling.

The S3 fallback logic:

  • Validates all required credentials before attempting download
  • Uses proper S3v4 signing and path-style addressing for MinIO compatibility
  • Preserves the original wheel filename (important for pip)
  • Disables SSL verification appropriately for self-signed certs

The use of /tmp/ for the wheel download is acceptable for a container environment where the script runs once during initialization.


142-156: LGTM - Clean main function with proper exit codes.

The main function follows a clear PyPI-first, S3-fallback pattern and returns appropriate exit codes for success (0) and failure (1).

tests/trainer/resources/rhai_features.ipynb (5)

20-51: LGTM - Well-documented training function with auto-detection.

The updated train_bloom() function has a clear docstring explaining the auto-detection behavior. The local path check (config.json existence) is a reliable way to detect pre-downloaded models.


95-133: LGTM - Local vs HuggingFace mode switching is well-implemented.

The branching logic correctly handles:

  • Local mode: Loads from pre-downloaded paths on shared PVC
  • HuggingFace mode: Downloads with rank-0 coordination via dist.barrier()

The rank-based synchronization prevents redundant downloads in distributed training.


242-341: LGTM - S3 pre-download logic is well-structured.

The S3 download cell:

  • Properly explains the PVC mount path differences (notebook vs training pods)
  • Cleans up old checkpoints to avoid resume conflicts
  • Uses idempotent downloads (skips if already present)
  • Has clear progress feedback

The path mapping comment is helpful for understanding the shared storage architecture.


363-391: LGTM - Multi-GPU configuration correctly propagates.

The notebook correctly:

  • Reads NUM_NODES and NUM_GPUS_PER_NODE from environment
  • Uses num_gpus_per_node for GPU resource requests
  • Logs the training configuration for debugging

259-265: Good practice: Cleaning up old checkpoints before training.

The cleanup of old checkpoints (/opt/app-root/src/checkpoints) is important to prevent the JIT checkpoint feature from resuming from stale checkpoints during test reruns.
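
A minimal sketch of that cleanup step; the checkpoint path is the one quoted above, while the helper name and exact logic are illustrative rather than the notebook's actual code.

```python
import os
import shutil
import tempfile

def reset_checkpoints(path):
    # Remove stale checkpoints so JIT checkpointing cannot resume from
    # a previous test run, then recreate an empty directory.
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)

# Exercise it against a throwaway directory standing in for
# /opt/app-root/src/checkpoints:
with tempfile.TemporaryDirectory() as root:
    ckpt = os.path.join(root, "checkpoints")
    os.makedirs(ckpt)
    open(os.path.join(ckpt, "stale.pt"), "w").close()
    reset_checkpoints(ckpt)
    print(os.listdir(ckpt))  # []
```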

Comment on lines +214 to +269
// S3/MinIO configuration for disconnected environments (optional)
s3Endpoint, _ := GetStorageBucketDefaultEndpoint()
s3AccessKey, _ := GetStorageBucketAccessKeyId()
s3SecretKey, _ := GetStorageBucketSecretKey()
s3Bucket, _ := GetStorageBucketName()
modelS3Prefix := os.Getenv("MODEL_S3_PREFIX")
if modelS3Prefix == "" {
    modelS3Prefix = "models/distilgpt2"
}
datasetS3Prefix := os.Getenv("DATASET_S3_PREFIX")
if datasetS3Prefix == "" {
    datasetS3Prefix = "alpaca-cleaned-datasets"
}

// Build S3 export commands (only if configured)
s3Exports := ""
if s3Endpoint != "" && s3Bucket != "" {
    test.T().Logf("S3 mode: endpoint=%s, bucket=%s", s3Endpoint, s3Bucket)
    s3Exports = fmt.Sprintf(
        "export AWS_DEFAULT_ENDPOINT='%s'; "+
            "export AWS_ACCESS_KEY_ID='%s'; "+
            "export AWS_SECRET_ACCESS_KEY='%s'; "+
            "export AWS_STORAGE_BUCKET='%s'; "+
            "export MODEL_S3_PREFIX='%s'; "+
            "export DATASET_S3_PREFIX='%s'; ",
        s3Endpoint, s3AccessKey, s3SecretKey, s3Bucket, modelS3Prefix, datasetS3Prefix,
    )
} else {
    test.T().Log("HuggingFace mode: S3 not configured, will download from HF Hub")
}

// PyPI mirror configuration for disconnected environments (optional)
pipIndexUrl := os.Getenv("PIP_INDEX_URL")
pipExports := ""
pipInstallFlags := ""
if pipIndexUrl != "" {
    test.T().Logf("PyPI mirror: %s", pipIndexUrl)
    // Extract hostname for trusted-host
    pipTrustedHost := os.Getenv("PIP_TRUSTED_HOST")
    if pipTrustedHost == "" {
        // Extract hostname from URL
        pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
        if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        } else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        }
    }
    pipExports = fmt.Sprintf(
        "export PIP_INDEX_URL='%s'; "+
            "export PIP_TRUSTED_HOST='%s'; "+
            "export PYTHONHTTPSVERIFY='0'; ",
        pipIndexUrl, pipTrustedHost,
    )
    pipInstallFlags = fmt.Sprintf("--index-url '%s' --trusted-host '%s' ", pipIndexUrl, pipTrustedHost)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if any logging statements might expose credentials in the test file
rg -n "Logf.*secret|Logf.*password|Logf.*access.?key" tests/trainer/sdk_tests/

Repository: opendatahub-io/distributed-workloads

Length of output: 62


🏁 Script executed:

#!/bin/bash
# Read the file to see the full context
cat -n tests/trainer/sdk_tests/rhai_features_tests.go | sed -n '214,269p'

Repository: opendatahub-io/distributed-workloads

Length of output: 2566


🏁 Script executed:

#!/bin/bash
# Search for uses of s3Exports and pipExports that might log them
rg -n "s3Exports|pipExports" tests/trainer/sdk_tests/rhai_features_tests.go -A 3 -B 1

Repository: opendatahub-io/distributed-workloads

Length of output: 908


🏁 Script executed:

#!/bin/bash
# Find the function call around line 313
sed -n '300,330p' tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 1253


🏁 Script executed:

#!/bin/bash
# Search for any logging that includes s3Exports or pipExports
rg -n "Log.*s3Exports|Log.*pipExports|Logf.*s3Exports|Logf.*pipExports" tests/trainer/sdk_tests/

Repository: opendatahub-io/distributed-workloads

Length of output: 62


🏁 Script executed:

#!/bin/bash
# Check how these export strings are used - likely passed to shell execution
rg -n "sh\.|bash|exec|Cmd|Command" tests/trainer/sdk_tests/rhai_features_tests.go | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 161


🏁 Script executed:

#!/bin/bash
# See the full shellCmd construction
sed -n '270,295p' tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 870


🏁 Script executed:

#!/bin/bash
# Search for PIP_INDEX_URL usage across the codebase to understand if credentials can be embedded
rg -n "PIP_INDEX_URL" tests/ -B 2 -A 2

Repository: opendatahub-io/distributed-workloads

Length of output: 6240


🏁 Script executed:

#!/bin/bash
# Check what environment variables are being used for credentials
rg -n "GetStorageBucket|PIP_" tests/trainer/sdk_tests/rhai_features_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 463


Address potential credential exposure in PyPI URL logging at line 250.

The S3 credentials are handled safely—they're passed via environment variables in the shell command without being logged directly (line 231 only logs endpoint and bucket).

However, the PyPI URL logging at line 250 (test.T().Logf("PyPI mirror: %s", pipIndexUrl)) is a security risk: if PIP_INDEX_URL contains embedded credentials in the format https://user:password@registry/simple/, they will be exposed in test logs. Consider masking the credentials before logging, or logging only the hostname.

🤖 Prompt for AI Agents
In @tests/trainer/sdk_tests/rhai_features_tests.go around lines 214 - 269, The
PyPI mirror log currently prints pipIndexUrl raw which may contain embedded
credentials; sanitize before logging by parsing pipIndexUrl (e.g., via net/url)
to remove Userinfo or log only the hostname (reuse pipTrustedHost) instead of
the full URL; update the test.T().Logf call that references pipIndexUrl and
ensure pipTrustedHost computation still extracts the host for trusted-host
usage.
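A minimal sketch of that sanitization, written in Python for illustration (the Go fix would do the same with net/url); the example URL is hypothetical:

```python
from urllib.parse import urlparse, urlunparse

def sanitize_index_url(raw: str) -> tuple[str, str]:
    """Strip embedded credentials so the URL is safe to log.

    Returns (display_url, hostname); the hostname can also feed
    pip's --trusted-host flag.
    """
    parts = urlparse(raw)
    if not parts.hostname:
        # Unparseable input: caller decides whether to log it at all
        return raw, ""
    netloc = parts.hostname
    if parts.port:
        netloc += f":{parts.port}"
    # Rebuild the URL without the userinfo component
    display = urlunparse(parts._replace(netloc=netloc))
    return display, parts.hostname

display, host = sanitize_index_url("https://user:secret@mirror.example.com:8443/simple/")
print(display)  # https://mirror.example.com:8443/simple/
print(host)     # mirror.example.com
```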

required_version = get_required_version()
print(f"Attempting to install kubeflow=={required_version} from PyPI...")
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={required_version}"],
Contributor

The kubeflow package is not in the default PyPI index, only in the Red Hat index, so the script should use that index when installing the SDK.
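A sketch of how the install command could honor a mirror index (PIP_INDEX_URL and PIP_TRUSTED_HOST are the env vars the tests export; the flag handling here is an assumption, not the script's actual logic):

```python
import os
import sys

def build_pip_command(version: str) -> list[str]:
    """Build the pip invocation, adding --index-url/--trusted-host
    when the mirror env vars are set (hypothetical handling)."""
    cmd = [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={version}"]
    index = os.environ.get("PIP_INDEX_URL")
    if index:
        cmd += ["--index-url", index]
        host = os.environ.get("PIP_TRUSTED_HOST")
        if host:
            cmd += ["--trusted-host", host]
    return cmd

# Usage (runs pip for real, so it is commented out here):
# import subprocess
# subprocess.run(build_pip_command("0.2.1+rhai0"), check=True)
```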

        print(f"Verified: kubeflow version {installed_version} matches required {required_version}")
        return True
    else:
        print(f"Version mismatch: installed '{installed_version}', required '{required_version}'")
Contributor

When installing from the Red Hat index, I get:

Version mismatch: installed 'v0.2.1+rhai0', required '0.2.1+rhai0'
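One way to make the comparison tolerant of the leading "v" (a sketch, not the script's current code):

```python
def versions_match(installed: str, required: str) -> bool:
    """Compare versions while ignoring a leading 'v', so 'v0.2.1+rhai0'
    reported after a Red Hat index install matches the required
    '0.2.1+rhai0'."""
    return installed.removeprefix("v") == required.removeprefix("v")

print(versions_match("v0.2.1+rhai0", "0.2.1+rhai0"))  # True
```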

}

// Multi-GPU CUDA tests - 2 nodes, 2 GPUs each (requires 4 total NVIDIA GPUs)
func TestRhaiTrainingProgressionMultiGpuCuda(t *testing.T) {
Contributor

Do we need separate tests for multi-node single-GPU and multi-node multi-GPU?
How long does one test take to run?

Contributor

@sutaakar sutaakar Jan 16, 2026

It looks like the new model download implementation is quite slow when downloading from Hugging Face.

abhijeet-dhumal

This comment was marked as spam.
