Refactor ray creation #751
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
- Coverage   94.12%   92.91%   -1.21%
==========================================
  Files          36       36
  Lines        2417     2400      -17
==========================================
- Hits         2275     2230      -45
- Misses        142      170      +28
ChristianZaccaria
left a comment
This is a great PR, awesome work Mark! :)
I left some minor nitpicks. I'll now give this a run in a cluster.
ChristianZaccaria
left a comment
When comparing the YAMLs between this PR and the main branch, I noticed that the RayCluster YAML generated on this PR explicitly contains:
imagePullPolicy: Always
In main, this is unset. By default, if the image tag is `:latest`, the imagePullPolicy is set to `Always`; otherwise, the default is `IfNotPresent`. `IfNotPresent` may be preferred here so that the image is pulled only if it isn't already cached on the node.
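To make the defaulting rule above concrete, here is a minimal Python sketch of how Kubernetes picks a policy when `imagePullPolicy` is left unset. The function name and the image references are hypothetical and only illustrate the rule; this is not code from the SDK or from Kubernetes itself.

```python
def default_image_pull_policy(image: str) -> str:
    """Return the policy applied when imagePullPolicy is omitted (illustrative)."""
    # An image pinned by digest is immutable, so it is pulled only when absent.
    if "@" in image:
        return "IfNotPresent"
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
    # No tag, or the :latest tag, means the image is always pulled.
    if tag is None or tag == "latest":
        return "Always"
    return "IfNotPresent"


# Hypothetical image references, used only to exercise the rule.
for image in (
    "quay.io/example/ray:latest",
    "quay.io/example/ray",
    "quay.io/example/ray:2.35.0",
    "localhost:5000/ray",
):
    print(image, "->", default_image_pull_policy(image))
```

Under this rule, a pinned, non-`latest` tag would already get `IfNotPresent` if the field were simply left out of the generated YAML.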
ChristianZaccaria
left a comment
- YAMLs look as expected when comparing `main` with this PR. I tested the following scenarios:
  - LocalQueue set, no image (defaults to py3.9 image on OpenShift).
  - LocalQueue not set, no image (defaults to py3.9 image on OpenShift).
  - Python3.11 environment: the image changes to Ray image for py3.11.
  - Testing most parameters: Set `envs`, AppWrapper true, no LQ, set custom image, and set gpus.
  - Testing most parameters: Set `envs`, AppWrapper true, set LQ, set custom image, and set gpus.
- AppWrappers and RayClusters work as expected.
- `get_cluster()` works well!

/lgtm thanks! Great work!
/retest
KPostOffice
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ChristianZaccaria, KPostOffice
Merged commit 6ec44c5 into project-codeflare:main
Issue link
Closes: RHOAIENG-10385 and RHOAIENG-8846
What changes have been made
- `ray_version` a variable for potential future automation
- `create_resource`
- `get_cluster` method to generate a new ClusterConfiguration with just the `name` and `namespace` of the cluster and retrieved yaml.
- `env` config param now actually works

Verification steps
Setup
Notebook server ODH/RHOAI/Local
- `git clone https://github.com/project-codeflare/codeflare-sdk.git`
- `poetry build` - install if needed (`pip install poetry`)
- `pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl`

Testing
All `ClusterConfiguration` parameters must be tested with the new cluster creation method.
Keep a special eye out for the following, as they were the most complex to implement:
Recommendation
Have 2 separate virtual envs, 1 with the main SDK and 1 with this PR's SDK, and compare the created Ray Clusters.
Small things like blank image pull secrets and some example metadata labels are removed from every generated Ray cluster, so they won't be an exact match. The important thing to look out for is whether the configurations match 👍
Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.
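For the side-by-side comparison recommended above, a small diff helper can hide the expected differences and surface only real configuration changes. The sketch below is illustrative: the ignored key names (`imagePullSecrets`, `labels`) and the sample dictionaries are assumptions standing in for the generated YAMLs, not output from the SDK.

```python
from typing import Any, List

# Assumed key names for the differences that are expected between main and this PR.
IGNORED_KEYS = {"imagePullSecrets", "labels"}


def diff_configs(main_cfg: Any, pr_cfg: Any, path: str = "") -> List[str]:
    """Recursively list differences between two parsed YAML documents."""
    if isinstance(main_cfg, dict) and isinstance(pr_cfg, dict):
        diffs: List[str] = []
        for key in sorted(set(main_cfg) | set(pr_cfg)):
            if key in IGNORED_KEYS:
                continue  # expected difference, skip it
            child = f"{path}.{key}" if path else key
            if key not in main_cfg:
                diffs.append(f"{child}: only in PR ({pr_cfg[key]!r})")
            elif key not in pr_cfg:
                diffs.append(f"{child}: only in main ({main_cfg[key]!r})")
            else:
                diffs.extend(diff_configs(main_cfg[key], pr_cfg[key], child))
        return diffs
    if (
        isinstance(main_cfg, list)
        and isinstance(pr_cfg, list)
        and len(main_cfg) == len(pr_cfg)
    ):
        diffs = []
        for i, (a, b) in enumerate(zip(main_cfg, pr_cfg)):
            diffs.extend(diff_configs(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars, mismatched types, and lists of different length are compared whole.
    return [] if main_cfg == pr_cfg else [f"{path}: {main_cfg!r} != {pr_cfg!r}"]


# Hypothetical, heavily trimmed stand-ins for the two generated YAMLs.
main_yaml = {
    "metadata": {"name": "raytest", "labels": {"example": "label"}},
    "spec": {"containers": [{"image": "ray:2.35.0", "imagePullSecrets": []}]},
}
pr_yaml = {
    "metadata": {"name": "raytest"},
    "spec": {"containers": [{"image": "ray:2.35.0", "imagePullPolicy": "Always"}]},
}

for line in diff_configs(main_yaml, pr_yaml):
    print(line)
```

In practice the two inputs could come from the files produced with `write_to_file=True`, loaded with a YAML parser such as PyYAML's `yaml.safe_load`.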
Test the new and improved `get_cluster()` function.
NOTE: You can compare the original & retrieved clusters by setting `write_to_file=True` on `ClusterConfiguration` and `get_cluster()`
NOTE 2: `get_cluster()` will also retrieve the mtls/oauth containers as well. This has no impact on the ability to create the cluster after deleting it through `get_cluster()` -> `cluster.down()` -> `cluster.up()`
- `cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, write_to_file=True)`
- `cluster.methods`
- `cluster.down()` then `cluster.up()`

Checks