@kryanbeane
Contributor

@kryanbeane kryanbeane commented Sep 18, 2025

Issue link

RHOAIENG-32532

What changes have been made

  • Added local queue label support to RayJobs
  • Updated E2E tests
  • Made RayJob E2Es support both Kind and OpenShift
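
As a rough illustration of the first change, the sketch below adds a local queue label to a RayJob manifest represented as a plain dictionary. It is not the SDK's implementation: the helper name `with_local_queue`, the manifest contents, and the queue name are hypothetical. The only non-hypothetical name is `kueue.x-k8s.io/queue-name`, the label key Kueue uses to route a job to a LocalQueue.

```python
from copy import deepcopy

# Label key Kueue reads to decide which LocalQueue a job belongs to.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_local_queue(rayjob: dict, local_queue: str) -> dict:
    """Return a copy of a RayJob manifest labelled for the given LocalQueue.

    Hypothetical helper for illustration; the input manifest is not modified.
    """
    labelled = deepcopy(rayjob)
    labels = labelled.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = local_queue
    return labelled


# Illustrative manifest; names and entrypoint are made up.
rayjob = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "demo-job", "namespace": "demo"},
    "spec": {"entrypoint": "python -c 'print(1)'"},
}

labelled = with_local_queue(rayjob, "local-queue-demo")
print(labelled["metadata"]["labels"])
```

Returning a copy keeps the caller's manifest untouched, which makes it easy to compare the labelled and unlabelled forms in a test.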

Verification steps

Log in with oc to an OpenShift cluster that has KubeRay enabled. Kueue should be disabled in the DSC, and the latest version of RHBoK (Red Hat Build of Kueue) should be installed; search for Kueue in OperatorHub to find it. Then run the commands below:

pip install -e .
poetry run pytest tests/e2e/rayjob/ -v -s -x

You can also test this manually. Try to create a RayJob with the usual Kueue resources created (follow Pat's KS to set up the resources). You should see the local queue admit the job and the cluster get created.

Check the Workload CRs: there should be exactly one, for the job, and none for the RayCluster.
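
The Workload check above could be scripted along these lines. This is a minimal sketch that assumes each Workload carries an `ownerReferences` entry naming the kind of the object it was created for, and that the list of Workloads has already been fetched (for example from the `items` of `oc get workloads -n <namespace> -o json`). The function name and the sample data are hypothetical.

```python
from collections import Counter


def workload_owner_kinds(workloads: list) -> Counter:
    """Count Workloads by the kind of their owning object.

    Assumes each Workload dict has metadata.ownerReferences entries
    with a "kind" field; Workloads without owners are ignored.
    """
    counts = Counter()
    for workload in workloads:
        owners = workload.get("metadata", {}).get("ownerReferences", [])
        for owner in owners:
            counts[owner.get("kind", "")] += 1
    return counts


# Made-up sample resembling the expected state after submitting one RayJob.
sample = [
    {
        "metadata": {
            "name": "rayjob-demo-job-abc12",
            "ownerReferences": [{"kind": "RayJob", "name": "demo-job"}],
        }
    }
]

counts = workload_owner_kinds(sample)
# Expected state: one Workload for the RayJob, none for the RayCluster.
assert counts["RayJob"] == 1
assert counts["RayCluster"] == 0
```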

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 18, 2025

@kryanbeane: This pull request references RHOAIENG-32532 which is a valid jira issue.



@codecov

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.04%. Comparing base (8c9dd7d) to head (01c8a7b).
⚠️ Report is 2 commits behind head on ray-jobs-feature.

Files with missing lines                    Patch %   Lines
src/codeflare_sdk/ray/cluster/cluster.py    66.66%    2 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #910      +/-   ##
====================================================
- Coverage             94.17%   94.04%   -0.14%     
====================================================
  Files                    22       22              
  Lines                  1924     1914      -10     
====================================================
- Hits                   1812     1800      -12     
- Misses                  112      114       +2     

@kryanbeane kryanbeane force-pushed the kueue-integration branch 8 times, most recently from cb5dbdf to 80d1307 Compare September 19, 2025 11:24
@pawelpaszki
Contributor

If these tests are to be executed during release testing, can you explain whether any setup (or other) steps are required to run them?

@pawelpaszki
Contributor

I have verified the changes on a ROSA cluster by running the tests, and they look good. Waiting for the Kind E2E tests to pass.

@kryanbeane
Contributor Author

@pawelpaszki

If these tests are to be executed during release testing, can you explain whether any setup (or other) steps are required to run them?

All we'll need is RHBoK installed and the RayJob and RayCluster integrations enabled.

@kryanbeane
Contributor Author

E2E is still failing. I'll add a fix in a new commit so we only need to re-review that commit 👍🏻

@kryanbeane kryanbeane force-pushed the kueue-integration branch 5 times, most recently from aeccfc5 to e5950a0 Compare September 24, 2025 15:24
@kryanbeane kryanbeane force-pushed the kueue-integration branch 4 times, most recently from f06d22e to 2f0e79c Compare September 26, 2025 23:35
@kryanbeane kryanbeane force-pushed the kueue-integration branch 12 times, most recently from 6051ec4 to eb21aac Compare October 3, 2025 18:02
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2025
@laurafitzgerald laurafitzgerald force-pushed the ray-jobs-feature branch 4 times, most recently from ad304cc to 665dcb2 Compare October 6, 2025 11:44
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
@LilyLinh LilyLinh self-requested a review October 8, 2025 09:12
@laurafitzgerald
Contributor

/approve
/lgtm

Verified this works as expected.
Using the SDK with RHBoK, I was able to:
  • Submit a RayCluster, which was appropriately labelled and admitted.
  • Submit a RayJob to that existing cluster; the RayJob was not labelled, as expected.
  • Submit a RayJob to a lifecycled RayCluster, which was admitted at the RayJob level, as expected.

For cluster-level Kueue, we do need to ensure that the kueue-manager-config ConfigMap has a manageJobsWithoutQueueName entry that allows RayJobs to be submitted to RayClusters that are already admitted by Kueue.
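
The labelling behaviour verified in this comment can be summarised as a small decision rule. The sketch below is only an illustration of that rule, not the SDK's code: the function name `rayjob_queue_labels` and its parameters are hypothetical, and it assumes the rule is simply "label the RayJob only when it manages its own RayCluster".

```python
from typing import Optional

# Label key Kueue reads to decide which LocalQueue a job belongs to.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def rayjob_queue_labels(local_queue: Optional[str], uses_existing_cluster: bool) -> dict:
    """Return the Kueue labels to put on a RayJob (hypothetical helper).

    A RayJob submitted to an existing RayCluster is left unlabelled, because
    that cluster is already admitted by Kueue. A RayJob that creates its own
    (lifecycled) RayCluster is labelled so it is admitted at the RayJob level.
    """
    if uses_existing_cluster or not local_queue:
        return {}
    return {QUEUE_LABEL: local_queue}


# RayJob targeting an existing, already admitted RayCluster: no label.
existing = rayjob_queue_labels("local-queue-demo", uses_existing_cluster=True)

# RayJob with a lifecycled RayCluster: labelled for the local queue.
lifecycled = rayjob_queue_labels("local-queue-demo", uses_existing_cluster=False)

print(existing, lifecycled)
```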

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 8, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: laurafitzgerald


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 219d1c5 into project-codeflare:ray-jobs-feature Oct 8, 2025
11 checks passed