Adjust SDK TransformersTrainer e2e tests for Trainer v2 to work in disconnected environment #582
Conversation
…nnected env Signed-off-by: abhijeet-dhumal <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough

This PR extends multi-GPU training test coverage for the Kubeflow SDK with CUDA and ROCm test variants, adds comprehensive disconnected-environment setup documentation with Python utilities for S3 pre-staging and Kubeflow installation, and enhances the training notebook to support local and S3-based data loading for offline deployments.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Operator as Test Operator
    participant TestEnv as Test Environment
    participant S3 as S3/MinIO
    participant Registry as Container Registry
    participant Kubeflow as Kubeflow Cluster
    participant Notebook as Training Notebook
    Operator->>TestEnv: 1. Run image mirror script
    TestEnv->>Registry: Pull public images
    TestEnv->>Registry: Push to private registry
    Operator->>S3: 2. Run prestage_models_datasets.py
    S3->>S3: Download HF models/datasets
    S3->>S3: Upload to S3 bucket
    Operator->>TestEnv: 3. Prepare Kubeflow SDK wheel
    TestEnv->>S3: Store wheel in S3
    Operator->>TestEnv: 4. Configure env vars (S3, PyPI, GPU count)
    Operator->>Kubeflow: 5. Trigger multi-GPU test
    Kubeflow->>Notebook: Launch training pods (2 nodes × 2 GPUs)
    Notebook->>Notebook: Detect local data via PVC
    alt Local Data Present
        Notebook->>Notebook: Load model/dataset from PVC
    else Local Data Absent
        Notebook->>S3: Download model/dataset from S3
        S3->>Notebook: Stream to local PVC
    end
    Notebook->>Kubeflow: Execute distributed training
    Kubeflow->>Operator: Report test results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @tests/trainer/sdk_tests/rhai_features_tests.go:
- Around line 214-269: The PyPI mirror log currently prints pipIndexUrl raw
which may contain embedded credentials; sanitize before logging by parsing
pipIndexUrl (e.g., via net/url) to remove Userinfo or log only the hostname
(reuse pipTrustedHost) instead of the full URL; update the test.T().Logf call
that references pipIndexUrl and ensure pipTrustedHost computation still extracts
the host for trusted-host usage.
🧹 Nitpick comments (10)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (6)
113-119: Minor: `format_size` returns inconsistent precision for bytes. For bytes (< 1024) it returns an integer format, but after the loop finishes without returning (≥ PB), the final `size_bytes` is already divided by 1024^5, so it represents exabytes not petabytes.

♻️ Suggested fix

```diff
 def format_size(size_bytes: int) -> str:
     """Format file size in human-readable format."""
-    for unit in ["B", "KB", "MB", "GB", "TB"]:
+    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
         if size_bytes < 1024:
             return f"{size_bytes:.1f} {unit}" if unit != "B" else f"{size_bytes} {unit}"
         size_bytes /= 1024
-    return f"{size_bytes:.1f} PB"
+    return f"{size_bytes:.1f} EB"
```
243-261: Unused function parameters `prefix` and `is_last`. The static analysis correctly flags that the `prefix` and `is_last` parameters on `print_tree` are never used in the function body. These appear to be leftover from a different design.

♻️ Suggested fix

```diff
-def print_tree(tree: Dict, prefix: str = "", is_last: bool = True, indent: str = ""):
+def print_tree(tree: Dict, indent: str = ""):
     """
     Print tree structure with proper indentation.
     """
     items = sorted(tree.items(), key=lambda x: (isinstance(x[1], dict), x[0]))
     for i, (name, value) in enumerate(items):
         is_last_item = i == len(items) - 1
         connector = "└── " if is_last_item else "├── "
         if isinstance(value, dict):
             # Directory
             print(f"{indent}{connector}{name}/")
             new_indent = indent + ("    " if is_last_item else "│   ")
-            print_tree(value, name, is_last_item, new_indent)
+            print_tree(value, new_indent)
         else:
             # File with size
             size_str = format_size(value)
             print(f"{indent}{connector}{name} ({size_str})")
```
322-327: Unused loop variables `file_size` and `relative`. The unpacked tuple values `file_size` and `relative` are not used inside the loop body. Use `_` to indicate intentionally ignored values.

♻️ Suggested fix

```diff
-    for file_path, s3_key, file_size, relative in files_to_upload:
+    for file_path, s3_key, _, _ in files_to_upload:
         # Upload with callback for progress
         with open(file_path, "rb") as f:
             wrapped_file = CallbackIOWrapper(pbar.update, f, "read")
             s3_client.upload_fileobj(wrapped_file, bucket, s3_key)
             file_count += 1
```
463-465: Use explicit `Optional` type hint for nullable parameters. PEP 484 prohibits implicit `Optional`. Use `str | None` or `Optional[str]` for parameters that can be `None`.

♻️ Suggested fix

```diff
+from typing import Dict, List, Tuple, Optional

-def process_model(model_name: str, s3_client, bucket: str, download_dir: str,
-                  skip_download: bool, skip_upload: bool, force: bool = False,
-                  custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_model(model_name: str, s3_client, bucket: str, download_dir: str,
+                  skip_download: bool, skip_upload: bool, force: bool = False,
+                  custom_s3_prefix: str | None = None) -> Tuple[str, int, str]:

-def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
-                    skip_download: bool, skip_upload: bool, force: bool = False,
-                    custom_s3_prefix: str = None) -> Tuple[str, int, str]:
+def process_dataset(dataset_name: str, s3_client, bucket: str, download_dir: str,
+                    skip_download: bool, skip_upload: bool, force: bool = False,
+                    custom_s3_prefix: str | None = None) -> Tuple[str, int, str]:
```

Also applies to: 502-504
387-387: Remove extraneous `f` prefixes from strings without placeholders. Several strings use the `f"..."` format but contain no placeholders. This adds minor overhead and can confuse readers.

♻️ Examples to fix

```diff
-print(f" (Saving as JSONL for SFT notebook compatibility)")
+print(" (Saving as JSONL for SFT notebook compatibility)")
-print(f" (HuggingFace datasets library shows download progress)")
+print(" (HuggingFace datasets library shows download progress)")
-print(f" Saving dataset to disk...")
+print(" Saving dataset to disk...")
-print(f" Use --force to overwrite existing data")
+print(" Use --force to overwrite existing data")
```

Also applies to: 419-419, 425-425, 481-481, 520-520, 846-847
197-213: Consider logging exception details when catching blind exceptions. The `check_s3_prefix_exists` function catches all exceptions silently, which can make debugging difficult. Consider at least logging at debug level.

♻️ Suggested improvement

```diff
 def check_s3_prefix_exists(s3_client, bucket: str, s3_prefix: str) -> Tuple[bool, int]:
     """
     Check if an S3 prefix already has objects.
     Returns (exists, file_count).
     """
     try:
         paginator = s3_client.get_paginator("list_objects_v2")
         file_count = 0
         for page in paginator.paginate(Bucket=bucket, Prefix=s3_prefix + "/", MaxKeys=100):
             contents = page.get("Contents", [])
             file_count += len(contents)
             if file_count > 0:
                 # Early exit once we confirm existence
-                break
-        return (file_count > 0, file_count)
-    except Exception:
+                return (True, file_count)
+        return (file_count > 0, file_count)
+    except Exception as e:
+        # Silently return False - caller should handle missing data
         return (False, 0)
```

tests/trainer/resources/disconnected_env/README.md (1)
158-174: Add language specifiers to fenced code blocks for better rendering. Several code blocks are missing language specifiers, which affects syntax highlighting and tooling compatibility.

♻️ Examples of fixes needed

Line 158 (bucket structure):

````diff
-```
+```text
 <bucket>/
 ├── models/
````

Line 314 (error symptom):

````diff
-```
+```text
 TypeError: expected str, bytes or os.PathLike object, not NoneType
````

Line 384 (test output):

````diff
-```
+```text
 === RUN   TestRhaiTrainingProgressionCPU
````

Line 411 (ASCII diagram):

````diff
-```
+```text
 ┌─────────────────┐
````

Also applies to: 314-316, 328-330, 344-346, 369-371, 384-391, 394-405, 411-444
tests/trainer/sdk_tests/rhai_features_tests.go (1)
252-268: Potential issue: hostname extraction may fail for URLs with paths. The `PIP_TRUSTED_HOST` extraction logic handles `:` (port) and `/` (path) but processes them sequentially. If the URL is `https://host:8080/simple/`, the first condition extracts `host` correctly. But if there's no port (`https://host/simple/`), it correctly extracts `host`. The logic appears correct. However, the order of operations could be clearer:

♻️ Suggested clarification (optional)

```diff
 if pipTrustedHost == "" {
-    // Extract hostname from URL
-    pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
-    if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
-        pipTrustedHost = pipTrustedHost[:idx]
-    } else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
-        pipTrustedHost = pipTrustedHost[:idx]
-    }
+    // Extract hostname from URL (strip scheme, then take up to first : or /)
+    pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
+    // Find the first occurrence of : or / to get just the hostname
+    if idx := strings.IndexAny(pipTrustedHost, ":/"); idx > 0 {
+        pipTrustedHost = pipTrustedHost[:idx]
+    }
 }
```

tests/trainer/resources/disconnected_env/install_kubeflow.py (1)
137-139: Consider more specific exception handling for S3 operations. The broad `except Exception` catch makes debugging harder. Consider catching specific boto3 exceptions or at least logging the exception type.

♻️ Suggested improvement

```diff
-    except Exception as e:
-        print(f"S3 fallback failed: {e}")
+    except ImportError as e:
+        print(f"S3 fallback failed - boto3 not installed: {e}")
+        return False
+    except Exception as e:
+        print(f"S3 fallback failed ({type(e).__name__}): {e}")
         return False
```

tests/trainer/resources/rhai_features.ipynb (1)
141-144: Consider adding `strict=True` to `zip()` for safety. Python 3.10+ supports `strict=True` on `zip()` to raise an error if the iterables have different lengths. This would catch data corruption issues early.

♻️ Suggested fix (if Python 3.10+ is guaranteed)

```diff
 def tokenize_function(examples):
     texts = [f\"### Instruction:\\n{i}\\n\\n### Response:\\n{o}\"
-             for i, o in zip(examples[\"instruction\"], examples[\"output\"])]
+             for i, o in zip(examples[\"instruction\"], examples[\"output\"], strict=True)]
     return tokenizer(texts, padding=\"max_length\", truncation=True, max_length=128)
```

Note: Only apply if the notebook runtime guarantees Python 3.10+. The README mentions a Python 3.9+ requirement, so this may not be applicable yet.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tests/trainer/kubeflow_sdk_test.go
- tests/trainer/resources/disconnected_env/README.md
- tests/trainer/resources/disconnected_env/install_kubeflow.py
- tests/trainer/resources/disconnected_env/prestage_models_datasets.py
- tests/trainer/resources/rhai_features.ipynb
- tests/trainer/sdk_tests/rhai_features_tests.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-15T09:49:36.063Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/resources/osft.ipynb:460-493
Timestamp: 2025-12-15T09:49:36.063Z
Learning: In the Kubeflow Trainer SDK version used by this repository, get_job_logs(follow=False) returns a generator of log lines, just like when follow=True. When using this API, collect the generator into a list and then join the lines to form the full log string (e.g., logs = ''.join(list(get_job_logs(follow=False)))).
Applied to files:
tests/trainer/resources/rhai_features.ipynb
📚 Learning: 2025-12-12T12:23:09.527Z
Learnt from: Fiona-Waters
Repo: opendatahub-io/distributed-workloads PR: 536
File: tests/trainer/sdk_tests/osft_traininghub_tests.go:33-36
Timestamp: 2025-12-12T12:23:09.527Z
Learning: In Go tests, the working directory during go test is the package directory being tested. Relative paths (e.g., "resources/file.txt") resolve to paths under that package directory. When tests live in subdirectories (e.g., tests/trainer/sdk_tests/), referencing resources should still be relative to the package directory being tested (tests/trainer/), so ensure resource paths are resolved from the package root of each test package.
Applied to files:
- tests/trainer/kubeflow_sdk_test.go
- tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2025-12-16T07:55:49.810Z
Learnt from: ChughShilpa
Repo: opendatahub-io/distributed-workloads PR: 568
File: tests/trainer/cluster_training_runtimes_test.go:100-101
Timestamp: 2025-12-16T07:55:49.810Z
Learning: In Go tests using Gomega, ContainSubstring(substr string, args ...any) supports variadic args that are used with fmt.Sprintf to format the substring. This means you can pass formatting operands such as ContainSubstring("%s/rhoai/%s", registryName, expectedRuntime.Image) and have the matcher evaluate the formatted substring. Ensure you import fmt only if you actually format strings beforehand; otherwise rely on the variadic args as shown. This pattern applies to test files under the tests directory (e.g., tests/trainer/cluster_training_runtimes_test.go) and can be generalized to other tests using Gomega.
Applied to files:
- tests/trainer/kubeflow_sdk_test.go
- tests/trainer/sdk_tests/rhai_features_tests.go
📚 Learning: 2026-01-06T17:26:58.604Z
Learnt from: sutaakar
Repo: opendatahub-io/distributed-workloads PR: 579
File: tests/common/support/olm.go:1-25
Timestamp: 2026-01-06T17:26:58.604Z
Learning: In the distributed-workloads repository, Go imports should be formatted so there is a blank line between different third-party import groups (e.g., between github.com/operator-framework and k8s.io). Use openshift-goimports to format imports for Go files, ensuring the separation between distinct third-party groups is preserved. This convention applies to all Go files (e.g., tests/common/support/olm.go and beyond).
Applied to files:
- tests/trainer/kubeflow_sdk_test.go
- tests/trainer/sdk_tests/rhai_features_tests.go
🧬 Code graph analysis (2)
tests/trainer/kubeflow_sdk_test.go (3)

- tests/common/test_tag.go (4)
  - Tags (32-40)
  - KftoCuda (72-74)
  - MultiNodeMultiGpu (118-122)
  - KftoRocm (76-78)
- tests/common/support/accelerator.go (2)
  - NVIDIA (22-22)
  - AMD (20-20)
- tests/trainer/sdk_tests/rhai_features_tests.go (3)
  - RunRhaiFeaturesProgressionMultiGpuTest (116-127)
  - RunRhaiFeaturesCheckpointMultiGpuTest (130-141)
  - RunRhaiFeaturesAllMultiGpuTest (144-155)
tests/trainer/sdk_tests/rhai_features_tests.go (2)

- tests/common/support/core.go (1)
  - CreateConfigMap (33-54)
- tests/common/support/environment.go (4)
  - GetStorageBucketDefaultEndpoint (155-158)
  - GetStorageBucketAccessKeyId (165-168)
  - GetStorageBucketSecretKey (170-173)
  - GetStorageBucketName (175-178)
🪛 markdownlint-cli2 (0.18.1)
tests/trainer/resources/disconnected_env/README.md
158-158: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
314-314: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
328-328: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
344-344: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
369-369: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
384-384: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
394-394: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
411-411: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.10)
tests/trainer/resources/disconnected_env/install_kubeflow.py
59-59: subprocess call: check for execution of untrusted input
(S603)
118-118: Probable insecure usage of temporary file or directory: "/tmp/"
(S108)
124-124: subprocess call: check for execution of untrusted input
(S603)
135-135: Consider moving this statement to an else block
(TRY300)
137-137: Do not catch blind exception: Exception
(BLE001)
tests/trainer/resources/rhai_features.ipynb
125-125: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
249-249: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py
211-211: Consider moving this statement to an else block
(TRY300)
212-212: Do not catch blind exception: Exception
(BLE001)
238-238: Do not catch blind exception: Exception
(BLE001)
243-243: Unused function argument: prefix
(ARG001)
243-243: Unused function argument: is_last
(ARG001)
300-301: try-except-pass detected, consider logging the exception
(S110)
300-300: Do not catch blind exception: Exception
(BLE001)
322-322: Loop control variable file_size not used within loop body
(B007)
322-322: Loop control variable relative not used within loop body
(B007)
349-349: Do not catch blind exception: Exception
(BLE001)
387-387: f-string without any placeholders
Remove extraneous f prefix
(F541)
419-419: f-string without any placeholders
Remove extraneous f prefix
(F541)
425-425: f-string without any placeholders
Remove extraneous f prefix
(F541)
465-465: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
481-481: f-string without any placeholders
Remove extraneous f prefix
(F541)
504-504: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
520-520: f-string without any placeholders
Remove extraneous f prefix
(F541)
774-774: Do not catch blind exception: Exception
(BLE001)
846-846: f-string without any placeholders
Remove extraneous f prefix
(F541)
847-847: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (16)
tests/trainer/resources/disconnected_env/prestage_models_datasets.py (1)
1-91: LGTM - Well-structured CLI tool with good UX. The prestage script is well-designed with:
- Clear preset system for common use cases
- Dry-run and check modes for safe operation
- Resume support for interrupted uploads
- Good progress feedback with tqdm
Also applies to: 545-851
tests/trainer/resources/disconnected_env/README.md (2)
1-469: LGTM - Comprehensive and well-organized documentation. The README provides excellent guidance for disconnected environment setup with:
- Clear step-by-step instructions
- Helpful troubleshooting section
- Visual test flow diagram
- Quick reference tables
This will significantly help operators deploy in air-gapped environments.
12-12: The Go 1.24 requirement stated in the README is accurate. The `go.mod` file specifies `go 1.24.6` with toolchain `go1.24.11`, and the README correctly documents this constraint. Additionally, section 5.1 of the same README provides clear instructions for verifying and setting up Go 1.24. No issues found.

Likely an incorrect or invalid review comment.
tests/trainer/kubeflow_sdk_test.go (1)
91-122: LGTM - Well-structured multi-GPU test additions. The new multi-GPU tests follow the established patterns:

- Consistent naming convention (`TestRhai...MultiGpu{Cuda,Rocm}`)
- Proper use of `MultiNodeMultiGpu(2, support.NVIDIA/AMD, 2)` tags for the 2x2 GPU configuration
- Correct delegation to the corresponding runner functions with explicit node/GPU counts

The tests align well with the `RunRhaiFeaturesXxxMultiGpuTest` functions defined in `rhai_features_tests.go`.

tests/trainer/sdk_tests/rhai_features_tests.go (4)
69-71: LGTM - Clean multi-GPU configuration extension. The `NumNodes` and `NumGpusPerNode` fields are well-integrated into `RhaiFeatureConfig` with sensible defaults (2 nodes, 1 GPU per node). The existing single-GPU test runners now explicitly set these defaults, maintaining backward compatibility.

Also applies to: 82-84, 96-98, 110-112
115-155: LGTM - Multi-GPU test runners follow established patterns. The three new multi-GPU runner functions (`RunRhaiFeaturesProgressionMultiGpuTest`, `RunRhaiFeaturesCheckpointMultiGpuTest`, `RunRhaiFeaturesAllMultiGpuTest`) correctly:

- Accept `numNodes` and `numGpusPerNode` parameters
- Delegate to the shared `runRhaiFeaturesTestWithConfig` with appropriate config
- Mirror the structure of existing single-GPU runners
174-187: LGTM - ConfigMap now includes the install helper script. The ConfigMap creation now properly bundles both the notebook and the `install_kubeflow.py` script, enabling the disconnected installation workflow. The relative path `resources/disconnected_env/install_kubeflow.py` correctly references from the `sdk_tests` package directory. Based on learnings, this path resolution is correct since the working directory during `go test` is the package directory being tested.
281-317: LGTM - Shell command properly propagates multi-GPU config. The shell command correctly exports the `NUM_NODES` and `NUM_GPUS_PER_NODE` environment variables, which the notebook consumes to configure the training job. The format string properly interpolates all the configuration values.

tests/trainer/resources/disconnected_env/install_kubeflow.py (3)
37-52: Version verification imports kubeflow but doesn't handle stale imports. If `install_from_pypi()` installs a new version after a previous import failed, the `verify_kubeflow_version()` call will import the newly installed package. However, if kubeflow was previously imported (e.g., an older version), Python won't re-import it. This scenario is unlikely in the current flow but worth noting.

The current implementation is fine for the intended use case (a fresh install in a container).
76-139: S3 fallback implementation is solid with appropriate error handling. The S3 fallback logic:

- Validates all required credentials before attempting download
- Uses proper S3v4 signing and path-style addressing for MinIO compatibility
- Preserves the original wheel filename (important for pip)
- Disables SSL verification, appropriate for self-signed certs (see the client-setup sketch after this list)

The use of `/tmp/` for the wheel download is acceptable for a container environment where the script runs once during initialization.
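As a point of reference for the bullets above, here is a minimal sketch of such a client setup, assuming the same environment variable names the test code exports; the helper name is hypothetical and this is not the script's actual code:

```python
import os

import boto3
from botocore.config import Config


def make_s3_client():
    """Hypothetical sketch: build a boto3 client for a MinIO-style S3 endpoint."""
    return boto3.client(
        "s3",
        endpoint_url=os.environ["AWS_DEFAULT_ENDPOINT"],
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        # SigV4 signing and path-style addressing are what MinIO expects.
        config=Config(signature_version="s3v4", s3={"addressing_style": "path"}),
        verify=False,  # self-signed certificates are common in disconnected clusters
    )
```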
142-156: LGTM - Clean main function with proper exit codes. The main function follows a clear PyPI-first, S3-fallback pattern and returns appropriate exit codes for success (0) and failure (1).
tests/trainer/resources/rhai_features.ipynb (5)
20-51: LGTM - Well-documented training function with auto-detection. The updated `train_bloom()` function has a clear docstring explaining the auto-detection behavior. The local path check (`config.json` existence) is a reliable way to detect pre-downloaded models.
95-133: LGTM - Local vs HuggingFace mode switching is well-implemented. The branching logic correctly handles:

- Local mode: loads from pre-downloaded paths on the shared PVC
- HuggingFace mode: downloads with rank-0 coordination via `dist.barrier()`

The rank-based synchronization prevents redundant downloads in distributed training.
242-341: LGTM - S3 pre-download logic is well-structured. The S3 download cell:

- Properly explains the PVC mount path differences (notebook vs training pods)
- Cleans up old checkpoints to avoid resume conflicts
- Uses idempotent downloads, skipping files already present (a sketch of this pattern follows the list)
- Has clear progress feedback

The path mapping comment is helpful for understanding the shared storage architecture.
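As a rough illustration of that idempotent-download pattern (a sketch under assumed names, not the notebook's actual cell):

```python
import os


def download_prefix_if_missing(s3_client, bucket: str, prefix: str, dest_dir: str) -> None:
    """Hypothetical sketch: fetch every object under `prefix`, skipping files already staged."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip directory marker objects
            target = os.path.join(dest_dir, os.path.relpath(obj["Key"], prefix))
            if os.path.exists(target):
                continue  # idempotent: rerunning the cell does no redundant work
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3_client.download_file(bucket, obj["Key"], target)
```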
363-391: LGTM - Multi-GPU configuration correctly propagates. The notebook correctly:

- Reads `NUM_NODES` and `NUM_GPUS_PER_NODE` from the environment
- Uses `num_gpus_per_node` for GPU resource requests
- Logs the training configuration for debugging
259-265: Good practice: cleaning up old checkpoints before training. The cleanup of old checkpoints (`/opt/app-root/src/checkpoints`) is important to prevent the JIT checkpoint feature from resuming from stale checkpoints during test reruns.
```go
// S3/MinIO configuration for disconnected environments (optional)
s3Endpoint, _ := GetStorageBucketDefaultEndpoint()
s3AccessKey, _ := GetStorageBucketAccessKeyId()
s3SecretKey, _ := GetStorageBucketSecretKey()
s3Bucket, _ := GetStorageBucketName()
modelS3Prefix := os.Getenv("MODEL_S3_PREFIX")
if modelS3Prefix == "" {
    modelS3Prefix = "models/distilgpt2"
}
datasetS3Prefix := os.Getenv("DATASET_S3_PREFIX")
if datasetS3Prefix == "" {
    datasetS3Prefix = "alpaca-cleaned-datasets"
}

// Build S3 export commands (only if configured)
s3Exports := ""
if s3Endpoint != "" && s3Bucket != "" {
    test.T().Logf("S3 mode: endpoint=%s, bucket=%s", s3Endpoint, s3Bucket)
    s3Exports = fmt.Sprintf(
        "export AWS_DEFAULT_ENDPOINT='%s'; "+
            "export AWS_ACCESS_KEY_ID='%s'; "+
            "export AWS_SECRET_ACCESS_KEY='%s'; "+
            "export AWS_STORAGE_BUCKET='%s'; "+
            "export MODEL_S3_PREFIX='%s'; "+
            "export DATASET_S3_PREFIX='%s'; ",
        s3Endpoint, s3AccessKey, s3SecretKey, s3Bucket, modelS3Prefix, datasetS3Prefix,
    )
} else {
    test.T().Log("HuggingFace mode: S3 not configured, will download from HF Hub")
}

// PyPI mirror configuration for disconnected environments (optional)
pipIndexUrl := os.Getenv("PIP_INDEX_URL")
pipExports := ""
pipInstallFlags := ""
if pipIndexUrl != "" {
    test.T().Logf("PyPI mirror: %s", pipIndexUrl)
    // Extract hostname for trusted-host
    pipTrustedHost := os.Getenv("PIP_TRUSTED_HOST")
    if pipTrustedHost == "" {
        // Extract hostname from URL
        pipTrustedHost = strings.TrimPrefix(strings.TrimPrefix(pipIndexUrl, "https://"), "http://")
        if idx := strings.Index(pipTrustedHost, ":"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        } else if idx := strings.Index(pipTrustedHost, "/"); idx > 0 {
            pipTrustedHost = pipTrustedHost[:idx]
        }
    }
    pipExports = fmt.Sprintf(
        "export PIP_INDEX_URL='%s'; "+
            "export PIP_TRUSTED_HOST='%s'; "+
            "export PYTHONHTTPSVERIFY='0'; ",
        pipIndexUrl, pipTrustedHost,
    )
    pipInstallFlags = fmt.Sprintf("--index-url '%s' --trusted-host '%s' ", pipIndexUrl, pipTrustedHost)
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check if any logging statements might expose credentials in the test file
rg -n "Logf.*secret|Logf.*password|Logf.*access.?key" tests/trainer/sdk_tests/
```

Repository: opendatahub-io/distributed-workloads
Length of output: 62

🏁 Script executed:

```bash
#!/bin/bash
# Read the file to see the full context
cat -n tests/trainer/sdk_tests/rhai_features_tests.go | sed -n '214,269p'
```

Repository: opendatahub-io/distributed-workloads
Length of output: 2566

🏁 Script executed:

```bash
#!/bin/bash
# Search for uses of s3Exports and pipExports that might log them
rg -n "s3Exports|pipExports" tests/trainer/sdk_tests/rhai_features_tests.go -A 3 -B 1
```

Repository: opendatahub-io/distributed-workloads
Length of output: 908

🏁 Script executed:

```bash
#!/bin/bash
# Find the function call around line 313
sed -n '300,330p' tests/trainer/sdk_tests/rhai_features_tests.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 1253

🏁 Script executed:

```bash
#!/bin/bash
# Search for any logging that includes s3Exports or pipExports
rg -n "Log.*s3Exports|Log.*pipExports|Logf.*s3Exports|Logf.*pipExports" tests/trainer/sdk_tests/
```

Repository: opendatahub-io/distributed-workloads
Length of output: 62

🏁 Script executed:

```bash
#!/bin/bash
# Check how these export strings are used - likely passed to shell execution
rg -n "sh\.|bash|exec|Cmd|Command" tests/trainer/sdk_tests/rhai_features_tests.go | head -20
```

Repository: opendatahub-io/distributed-workloads
Length of output: 161

🏁 Script executed:

```bash
#!/bin/bash
# See the full shellCmd construction
sed -n '270,295p' tests/trainer/sdk_tests/rhai_features_tests.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 870

🏁 Script executed:

```bash
#!/bin/bash
# Search for PIP_INDEX_URL usage across the codebase to understand if credentials can be embedded
rg -n "PIP_INDEX_URL" tests/ -B 2 -A 2
```

Repository: opendatahub-io/distributed-workloads
Length of output: 6240

🏁 Script executed:

```bash
#!/bin/bash
# Check what environment variables are being used for credentials
rg -n "GetStorageBucket|PIP_" tests/trainer/sdk_tests/rhai_features_tests.go
```

Repository: opendatahub-io/distributed-workloads
Length of output: 463
Address potential credential exposure in PyPI URL logging at line 250.

The S3 credentials are handled safely: they're passed via environment variables in the shell command without being logged directly (line 231 only logs the endpoint and bucket).

However, the PyPI URL logging at line 250 (`test.T().Logf("PyPI mirror: %s", pipIndexUrl)`) is a security risk: if `PIP_INDEX_URL` contains embedded credentials in the format `https://user:password@registry/simple/`, they will be exposed in test logs. Consider masking the credentials before logging, or extract only the hostname.
🤖 Prompt for AI Agents
In @tests/trainer/sdk_tests/rhai_features_tests.go around lines 214 - 269, The
PyPI mirror log currently prints pipIndexUrl raw which may contain embedded
credentials; sanitize before logging by parsing pipIndexUrl (e.g., via net/url)
to remove Userinfo or log only the hostname (reuse pipTrustedHost) instead of
the full URL; update the test.T().Logf call that references pipIndexUrl and
ensure pipTrustedHost computation still extracts the host for trusted-host
usage.
```python
    required_version = get_required_version()
    print(f"Attempting to install kubeflow=={required_version} from PyPI...")
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={required_version}"],
```
The kubeflow package is not in the default PyPI index but only in the Red Hat index, so the script should use that index to install the SDK (see the sketch below).
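A minimal sketch of what honoring a configured mirror could look like, assuming the same `PIP_INDEX_URL`/`PIP_TRUSTED_HOST` environment variables the test code exports; the function name is hypothetical and this is not the script's actual code:

```python
import os
import subprocess
import sys


def install_from_configured_index(required_version: str) -> bool:
    """Hypothetical sketch: install the kubeflow SDK from a mirror (e.g., the Red Hat index)."""
    cmd = [sys.executable, "-m", "pip", "install", "--quiet", f"kubeflow=={required_version}"]
    index_url = os.environ.get("PIP_INDEX_URL")
    if index_url:
        # Point pip at the mirror instead of the default PyPI index.
        cmd += ["--index-url", index_url]
        trusted_host = os.environ.get("PIP_TRUSTED_HOST")
        if trusted_host:
            cmd += ["--trusted-host", trusted_host]
    return subprocess.run(cmd, check=False).returncode == 0
```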
| print(f"Verified: kubeflow version {installed_version} matches required {required_version}") | ||
| return True | ||
| else: | ||
| print(f"Version mismatch: installed '{installed_version}', required '{required_version}'") |
When installing from the Red Hat index, I get:
Version mismatch: installed 'v0.2.1+rhai0', required '0.2.1+rhai0'
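A minimal sketch of one possible fix, assuming the mismatch is only the leading `v` in the installed version string (the helper name is hypothetical, not from the script):

```python
def normalize_version(version: str) -> str:
    """Hypothetical helper: treat 'v0.2.1+rhai0' and '0.2.1+rhai0' as equal."""
    return version[1:] if version.startswith("v") else version


# Both sides normalize to '0.2.1+rhai0', so the comparison passes.
assert normalize_version("v0.2.1+rhai0") == normalize_version("0.2.1+rhai0")
```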
```go
}

// Multi-GPU CUDA tests - 2 nodes, 2 GPUs each (requires 4 total NVIDIA GPUs)
func TestRhaiTrainingProgressionMultiGpuCuda(t *testing.T) {
```
Do we need to test multinode single-GPU and multinode multi-GPU separately?
How much time does it take to run one test?
It looks like the new model-downloading implementation is quite slow when downloading from HuggingFace.
Description
RHOAIENG-42362
Enables RHAI feature tests (progression tracking, JIT checkpointing) to run in disconnected OpenShift environments where cluster nodes have no internet access.
Key Features
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
Tests
Documentation