RHOAIENG-33283: Elegently handle working dir and runtime env #922
|
@kryanbeane: This pull request references RHOAIENG-33283 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c4a0d92 to aeba0c0
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@                Coverage Diff                 @@
##           ray-jobs-feature     #922      +/-   ##
====================================================
+ Coverage             94.04%   94.26%   +0.22%
====================================================
  Files                    22       24       +2
  Lines                  1914     2040     +126
====================================================
+ Hits                   1800     1923     +123
- Misses                  114      117       +3
|
39320fc to 8974240
Signed-off-by: Pat O'Connor <[email protected]>
8974240 to ef92ba7
|
/hold |
7cd5a95 to b195de7
|
/override codecov/patch |
|
@kryanbeane: Overrode contexts on behalf of kryanbeane: codecov/patch In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Thanks! Verified the changes against a ROSA cluster and executed the rayjob e2e tests (also against a ROSA cluster) successfully. What do we do with the failing e2e tests here? |
66eab4c to ba7a926
Co-authored-by: Pat O'Connor <[email protected]>
ba7a926 to e9a7766
Verification
Test 1: Lifecycled Cluster - Verified as described
Test 2: Long-Lived Cluster - Verified
The verification steps in the description needed an update as they were not providing a
from codeflare_sdk import Cluster, ClusterConfiguration, RayJob |
|
I didn't see a failure on OR |
c2d866b to c5ef9a6
c5ef9a6 to a97d969
|
Verification - Existing Cluster
Verification - Lifecycled Cluster
Verifying that file not present produces an error
Verification - .ipynb file is not included |
|
/approve
Verified and feel free to remove the hold when you are ready @kryanbeane |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: laurafitzgerald
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/unhold |
1a61bfd
into
project-codeflare:ray-jobs-feature
## WHAT
Addition of rayjob tests (with Bryan's latest changes, matching the timestamp of this PR, from [this PR](project-codeflare/codeflare-sdk#922)) and bumping the codeflare-sdk tag (to match the upcoming release version).

## VERIFICATION
Verified locally with a [test branch](https://github.com/red-hat-data-services/ods-ci/compare/master...pawelpaszki:ods-ci:RHOAIENG-33408-test) with temp tags to execute only relevant tests.

```
bash-5.2# git checkout RHOAIENG-33408-test
branch 'RHOAIENG-33408-test' set up to track 'pawel/RHOAIENG-33408-test'.
Switched to a new branch 'RHOAIENG-33408-test'
bash-5.2# export WORKSPACE="/workspace/ods-ci/ods_ci"
bash-5.2# cd ods_ci/
bash-5.2# ./run_robot_test.sh --include ttt --skip-oclogin true test-variables.yml
./run_robot_test.sh: line 211: distro: command not found
INFO: we found a yq executable
skipping OC login as per parameter --skip-oclogin
Git revision refname='RHOAIENG-33408-test', venvdir='RHOAIENG-33408-test'.
Checking whether '/root/.local/ods-ci/RHOAIENG-33408-test/.venv' exists.
Checking whether '/root/.local/ods-ci/master/.venv' exists.
Pre-created virtual environment has not been found in '/root/.local/ods-ci/master/.venv'. All dependencies will be installed from scratch.
Python '' is not of the correct version
Configuring poetry to use Python /root/.pyenv/shims/python3.11
Creating virtualenv ods-ci-VVJNOhYl-py3.11 in /root/.cache/pypoetry/virtualenvs
Using virtualenv: /root/.cache/pypoetry/virtualenvs/ods-ci-VVJNOhYl-py3.11
Installing dependencies from lock file
Package operations: 239 installs, 0 updates, 0 removals
- Installing attrs (24.2.0)
- Installing pyasn1 (0.6.1)
...
- Installing robotframework-openshift (1.0.0 1297347) Installing the current project: ods-ci (0.1.0) ============================================================================== Tests ============================================================================== Tests.Distributed Workloads ============================================================================== Tests.Distributed Workloads.Workloads Orchestration ============================================================================== 2025-10-14 13:59:32,344 - RPA.core.certificates - INFO - Truststore not in use, HTTPS traffic validated against `certifi` package. (requires Python 3.10.12 and 'pip' 23.2.1 at minimum) Tests.Distributed Workloads.Workloads Orchestration.Test-Run-Codeflare-Sdk-... ============================================================================== Cloning into 'codeflare-sdk'... Note: switching to 'c5ef9a6c5384e167a24c5c1ac261d3a1f6e3d432'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -c with the switch command. Example: git switch -c <new-branch-name> Or undo this operation with: git switch - Turn off this advice by setting config variable advice.detachedHead to false Product:RHODS Version:2.24.0 [ WARN ] No Prometheus found Run TestRayJobRayVersionValidationOauth test with Python 3.11 :: R... 
"Running codeflare-sdk test: ray_version_validation_oauth_test.py" HEAD is now at c5ef9a6 RHOAIENG-33283: Change ConfigMaps to Secrets * (no branch) Creating virtualenv codeflare-sdk-_B-kuLxP-py3.11 in /root/.cache/pypoetry/virtualenvs Using virtualenv: /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11 Installing dependencies from lock file Package operations: 175 installs, 0 updates, 0 removals - Installing attrs (25.3.0) - Installing rpds-py (0.26.0) .. - Installing python-client (0.0.0-dev b2fd91b) Warning: The file chosen for install of virtualenv 20.35.1 (virtualenv-20.35.1-py3-none-any.whl) is yanked. Reason for being yanked: Backwards incompatible changes Installing the current project: codeflare-sdk (0.31.1) ============================= test session starts ============================== platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.6.0 -- /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/bin/python cachedir: .pytest_cache rootdir: /workspace/ods-ci/ods_ci/codeflare-sdk configfile: pyproject.toml plugins: anyio-4.9.0, mock-3.11.1, timeout-2.3.1 timeout: 900.0s timeout method: signal timeout func_only: False collecting ... collected 2 items tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_incompatible_ray_version_oauth creating Kueue resources ... 'test-resource-flavor-zdzxw' created! 'test-cluster-queue-6p6d6' created 'test-local-queue-97jxx' created in namespace 'test-ns-11cqv' Creating RayJob with incompatible Ray image in cluster config: quay.io/modh/ray:2.46.1-py311-cu121 Attempting to submit RayJob 'incompatible-lifecycle-rayjob' with incompatible Ray version... ✅ Ray version validation correctly prevented RayJob submission with incompatible cluster config! 
PASSED 'test-cluster-queue-6p6d6' cluster-queue deleted 'test-resource-flavor-zdzxw' resource-flavor deleted tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_unknown_ray_version_oauth creating Kueue resources ... 'test-resource-flavor-apb9z' created! 'test-cluster-queue-lfkds' created 'test-local-queue-6tos0' created in namespace 'test-ns-nlaen' Creating RayJob with image where Ray version cannot be determined: quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e Attempting to submit RayJob 'unknown-version-rayjob' with unknown Ray version... ✅ RayJob submission succeeded with warning for unknown Ray version! Note: RayJob 'unknown-version-rayjob' was submitted successfully but may need manual cleanup. PASSED 'test-cluster-queue-lfkds' cluster-queue deleted 'test-resource-flavor-apb9z' resource-flavor deleted =============================== warnings summary =============================== ../../../../root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373 /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: collect_ignore self._warn_or_fail_if_strict(f"Unknown config option: {key}\n") tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_incompatible_ray_version_oauth tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_incompatible_ray_version_oauth tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_incompatible_ray_version_oauth 
tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_unknown_ray_version_oauth tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_unknown_ray_version_oauth tests/e2e/rayjob/ray_version_validation_oauth_test.py::TestRayJobRayVersionValidationOauth::test_rayjob_lifecycled_cluster_unknown_ray_version_oauth /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/kubernetes/client/rest.py:44: DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.6.0. Instead access HTTPResponse.headers directly. return self.urllib3_response.getheaders() -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================== 2 passed, 7 warnings in 8.21s ========================= Run TestRayJobRayVersionValidationOauth test with Python 3.11 :: R... | PASS | ------------------------------------------------------------------------------ Run TestRayJobExistingCluster test with Python 3.11 :: Run Python ... "Running codeflare-sdk test: rayjob_existing_cluster_test.py" HEAD is now at c5ef9a6 RHOAIENG-33283: Change ConfigMaps to Secrets * (no branch) Using virtualenv: /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11 Installing dependencies from lock file No dependencies to install or update Installing the current project: codeflare-sdk (0.31.1) ============================= test session starts ============================== platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.6.0 -- /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/bin/python cachedir: .pytest_cache rootdir: /workspace/ods-ci/ods_ci/codeflare-sdk configfile: pyproject.toml plugins: anyio-4.9.0, mock-3.11.1, timeout-2.3.1 timeout: 900.0s timeout method: signal timeout func_only: False collecting ... 
collected 1 item tests/e2e/rayjob/rayjob_existing_cluster_test.py::TestRayJobExistingCluster::test_existing_kueue_cluster creating Kueue resources ... 'test-resource-flavor-ly938' created! 'test-cluster-queue-q79cp' created 'test-local-queue-tn4bl' created in namespace 'test-ns-kh658' Insecure request warnings have been disabled Warning: TLS verification has been disabled - Endpoint checks will be bypassed Written to: /root/.codeflare/resources/kueue-cluster.yaml Written to: /root/.codeflare/resources/kueue-cluster.yaml Ray Cluster: 'kueue-cluster' has successfully been applied. For optimal resource management, you should delete this Ray Cluster when no longer in use. Waiting for cluster 'kueue-cluster' to be ready... Waiting for requested resources to be set up... Requested cluster is up and running! Dashboard is ready! ✓ Cluster 'kueue-cluster' is ready Ray Cluster: 'kueue-cluster' has successfully been deleted PASSED 'test-cluster-queue-q79cp' cluster-queue deleted 'test-resource-flavor-ly938' resource-flavor deleted =============================== warnings summary =============================== ../../../../root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373 /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: collect_ignore self._warn_or_fail_if_strict(f"Unknown config option: {key}\n") tests/e2e/rayjob/rayjob_existing_cluster_test.py::TestRayJobExistingCluster::test_existing_kueue_cluster tests/e2e/rayjob/rayjob_existing_cluster_test.py::TestRayJobExistingCluster::test_existing_kueue_cluster tests/e2e/rayjob/rayjob_existing_cluster_test.py::TestRayJobExistingCluster::test_existing_kueue_cluster /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/kubernetes/client/rest.py:44: DeprecationWarning: HTTPResponse.getheaders() is deprecated 
and will be removed in urllib3 v2.6.0. Instead access HTTPResponse.headers directly. return self.urllib3_response.getheaders() -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =================== 1 passed, 4 warnings in 76.72s (0:01:16) =================== Run TestRayJobExistingCluster test with Python 3.11 :: Run Python ... | PASS | ------------------------------------------------------------------------------ Run TestRayJobLifecycledCluster test with Python 3.11 :: Run Pytho... "Running codeflare-sdk test: rayjob_lifecycled_cluster_test.py" HEAD is now at c5ef9a6 RHOAIENG-33283: Change ConfigMaps to Secrets * (no branch) Using virtualenv: /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11 Installing dependencies from lock file No dependencies to install or update Installing the current project: codeflare-sdk (0.31.1) ============================= test session starts ============================== platform linux -- Python 3.11.5, pytest-7.4.0, pluggy-1.6.0 -- /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/bin/python cachedir: .pytest_cache rootdir: /workspace/ods-ci/ods_ci/codeflare-sdk configfile: pyproject.toml plugins: anyio-4.9.0, mock-3.11.1, timeout-2.3.1 timeout: 900.0s timeout method: signal timeout func_only: False collecting ... collected 2 items tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_managed creating Kueue resources ... 'test-resource-flavor-4eshz' created! 'test-cluster-queue-6kxf0' created 'test-local-queue-6l8bq' created in namespace 'test-ns-u48bn' ✓ Secret kueue-lifecycled-files verified with proper owner reference PASSED 'test-cluster-queue-6kxf0' cluster-queue deleted 'test-resource-flavor-4eshz' resource-flavor deleted tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_resource_queueing Creating limited Kueue resources for preemption testing... 'limited-flavor-lvt8y' created! 
✓ Created limited ClusterQueue: limited-cq-slkj3 'limited-lq-8k4tm' created in namespace 'test-ns-95l04' ✓ Limited Kueue resources created successfully Waiting for Kueue admission of job 'waiter'... ✓ Job 'waiter' admitted by Kueue (no longer suspended) PASSED 'limited-cq-slkj3' cluster-queue deleted 'limited-flavor-lvt8y' resource-flavor deleted =============================== warnings summary =============================== ../../../../root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373 /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: collect_ignore self._warn_or_fail_if_strict(f"Unknown config option: {key}\n") tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_managed tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_managed tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_managed tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_resource_queueing tests/e2e/rayjob/rayjob_lifecycled_cluster_test.py::TestRayJobLifecycledCluster::test_lifecycled_kueue_resource_queueing /root/.cache/pypoetry/virtualenvs/codeflare-sdk-_B-kuLxP-py3.11/lib/python3.11/site-packages/kubernetes/client/rest.py:44: DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.6.0. Instead access HTTPResponse.headers directly. return self.urllib3_response.getheaders() -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =================== 2 passed, 6 warnings in 98.01s (0:01:38) =================== Run TestRayJobLifecycledCluster test with Python 3.11 :: Run Pytho... 
| PASS |
------------------------------------------------------------------------------
"Removing directory codeflare-sdk"
"Log back as cluster admin"
"Logging in as cluster admin to cleanup RBAC permissions"
"Removing kueue-batch-user-rolebinding ClusterRoleBinding"
Error from server (NotFound): error when deleting "tests/Resources/Files/kueue-batch-user-rolebinding.yaml": clusterrolebindings.rbac.authorization.k8s.io "kueue-batch-user-rolebinding" not found
"Warning: Unable to delete kueue-batch-user-rolebinding ClusterRoleBinding (may not exist)"
"Removing kueue-batch-user-specific-rolebinding ClusterRoleBinding"
Error from server (NotFound): error when deleting "tests/Resources/Files/kueue-batch-user-specific-rolebinding.yaml": clusterrolebindings.rbac.authorization.k8s.io "kueue-batch-user-specific-rolebinding" not found
"Warning: Unable to delete kueue-batch-user-specific-rolebinding ClusterRoleBinding (may not exist)"
"Removing kueue-batch-user-role ClusterRole"
Error from server (NotFound): error when deleting "tests/Resources/Files/kueue-batch-user-role.yaml": clusterroles.rbac.authorization.k8s.io "kueue-batch-user-role" not found
"Warning: Unable to delete kueue-batch-user-role ClusterRole (may not exist)"
Tests.Distributed Workloads.Workloads Orchestration.Test-Run-Codef... | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Tests.Distributed Workloads.Workloads Orchestration | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Tests.Distributed Workloads | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Tests | PASS |
3 tests, 3 passed, 0 failed
==============================================================================
Output: /workspace/ods-ci/ods_ci/test-output/ods-ci-2025-10-14-13-59-qDi05w2uXs/output.xml
XUnit: /workspace/ods-ci/ods_ci/test-output/ods-ci-2025-10-14-13-59-qDi05w2uXs/xunit_test_result.xml
Log: /workspace/ods-ci/ods_ci/test-output/ods-ci-2025-10-14-13-59-qDi05w2uXs/log.html
Report: /workspace/ods-ci/ods_ci/test-output/ods-ci-2025-10-14-13-59-qDi05w2uXs/test_report.html
0
```
Issue link
RHOAIENG-33283
What changes have been made
- `runtime_env` for local files: it'll parse imports etc. and mount what it thinks is needed. Same for existing and lifecycled.
- `working_dir`: it doesn't need to mount anything, it just parses the requirements file and populates `runtimeEnvYAML` with what it needs from `runtime_env` in the SDK.

Verification steps
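When checking which files end up mounted, it can help to have a rough picture of what "parse imports" and "parse the requirements file" might involve. The following is only a minimal sketch and not the SDK's implementation: `find_local_imports` and `parse_requirements` are hypothetical helper names, and the sketch assumes that local modules live next to the entrypoint script.

```python
import ast
from pathlib import Path


def find_local_imports(entrypoint: Path) -> set[Path]:
    """Return local .py files the entrypoint imports, directly or transitively.

    Hypothetical helper: a module counts as "local" only if a matching .py
    file exists relative to the entrypoint's directory.
    """
    base = entrypoint.parent
    found: set[Path] = set()
    pending = [entrypoint]
    while pending:
        current = pending.pop()
        tree = ast.parse(current.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]
            else:
                continue
            for name in names:
                # "pkg.mod" maps to "pkg/mod.py" under the entrypoint directory.
                candidate = base / (name.replace(".", "/") + ".py")
                if candidate.is_file() and candidate not in found:
                    found.add(candidate)
                    pending.append(candidate)
    return found


def parse_requirements(path: Path) -> list[str]:
    """Read a requirements file, skipping blank lines and comment lines."""
    lines = (line.strip() for line in path.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]
```

Standard-library and third-party imports are skipped here simply because no matching file exists beside the entrypoint; the real change may use different rules.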
**IMPORTANT:** These steps assume you have files in the directory to use. Create a directory with a sample Python file inside to use for testing!
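One way to prepare such a directory is sketched below. The file names `main.py` and `helper.py` and the function `make_sample_dir` are illustrative assumptions, not names taken from the PR; any small script that imports a local module would serve the same purpose.

```python
import tempfile
from pathlib import Path


def make_sample_dir() -> Path:
    """Create a throwaway directory with a small entrypoint and a helper module."""
    root = Path(tempfile.mkdtemp(prefix="rayjob-sample-"))
    # A local module, so that import parsing has something to pick up.
    (root / "helper.py").write_text(
        "def greet(name):\n"
        "    return f'hello, {name}'\n"
    )
    # The entrypoint script that imports the local module.
    (root / "main.py").write_text(
        "from helper import greet\n"
        "\n"
        "print(greet('rayjob'))\n"
    )
    return root


if __name__ == "__main__":
    print(make_sample_dir())
```

Running the script prints the path of the new directory, which can then be used as the local directory in the tests that follow.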
Setup
Test 1: Lifecycled Cluster
Expected: Creates ConfigMap, runs job, auto-deletes cluster when done.
Test 2: Long-Lived Cluster
Expected: Job runs on existing cluster, cluster persists after job completes.
Validation Test (Should Fail)
Expected: Fails with clear error about working_dir conflict.
Checks