Add documentation for ArgoCD #58340
Conversation
```yaml
ignoreDifferences:
- group: ray.io
  kind: RayCluster
  name: raycluster-kuberay
```
Bug:
The ignoreDifferences section for the RayCluster application specifies name: raycluster-kuberay, but the Helm chart's releaseName is raycluster. This mismatch means the ignoreDifferences rule won't apply to the deployed RayCluster, potentially causing conflicts between ArgoCD and the Ray Autoscaler over worker replica counts.
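For illustration only, a hedged sketch of what the rule might look like once `name` is aligned with the RayCluster resource the Helm release actually creates; the name below is a placeholder, so verify it in your cluster first:

```yaml
# Sketch only: `name` must equal the RayCluster object created by the Helm release.
# Check the real name with: kubectl get raycluster -n <namespace>
ignoreDifferences:
- group: ray.io
  kind: RayCluster
  name: raycluster-kuberay        # placeholder; replace with the actual RayCluster name
  jsonPointers:
  - /spec/workerGroupSpecs/0/replicas
```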
Code Review
This pull request adds valuable documentation for deploying Ray on Kubernetes using ArgoCD. The example is comprehensive and highlights the crucial ignoreDifferences configuration needed for the Ray autoscaler to work correctly with ArgoCD. My review includes suggestions to improve the example by using specific image tags for better reproducibility, clarifying the need to adjust jsonPointers for different numbers of worker groups, and a minor formatting fix.
```yaml
jsonPointers:
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
```
It would be helpful to add a comment here explaining that these paths need to be adjusted based on the number of worker groups in the RayCluster. This makes the example more adaptable for users with different configurations.
Suggested change:
```yaml
jsonPointers: # Adjust this list to match the number of worker groups
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
```
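To make the mapping concrete, here is a hedged sketch (the group names are invented) showing how each pointer index corresponds to a worker group's position in `workerGroupSpecs`:

```yaml
# Hypothetical RayCluster excerpt with two worker groups (names are placeholders).
spec:
  workerGroupSpecs:
  - groupName: cpu-workers    # index 0 -> /spec/workerGroupSpecs/0/replicas
    replicas: 1
  - groupName: gpu-workers    # index 1 -> /spec/workerGroupSpecs/1/replicas
    replicas: 0

# With only these two groups, the ignoreDifferences list would need just:
#   - /spec/workerGroupSpecs/0/replicas
#   - /spec/workerGroupSpecs/1/replicas
```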
```yaml
valuesObject:
  image:
    repository: docker.io/rayproject/ray
    tag: latest
```
Using the latest tag for Docker images is generally discouraged for production deployments as it can make your setup non-deterministic. It's better to pin to a specific version to ensure reproducibility. Since this example uses Autoscaler v2, a version like 2.10.0 or newer would be appropriate. This suggestion also applies to the images on lines 110 and 121.
Suggested change:
```yaml
tag: "2.10.0"
```
> It has been observed that without this `ignoreDifferences` section, ArgoCD and the Ray Autoscaler may conflict, resulting in unexpected behaviour when it comes to requesting workers dynamically (e.g. `ray.autoscaler.sdk.request_resources`). More specifically, when requesting N workers, the Autoscaler would not spin up N workers. (No newline at end of file)
Please add a newline at the end of the file. It's a common convention and can prevent issues with some tools.
@Future-Outlier @rueian PTAL
    
Description
I was investigating odd behaviour where requesting an exact number of workers via the Python SDK was not behaving as expected. I initially raised an issue here: #55736. I was then pointed to this: ray-project/kuberay#3794. However, even after the fix, I was not observing any different behaviour.
Then I thought to try having ArgoCD ignore the replicas field, and everything started working as expected.
I thought it would be best to convey this in an example, since I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines that one needs to be aware of). IIRC I pieced it together based on some GitHub issues and debugging.
The important point is that when managing Ray via ArgoCD with the Autoscaler enabled, `ignoreDifferences` must be configured properly to get the expected behaviour from the Autoscaler. I would have attached screenshots, but from a PR review perspective that doesn't prove anything. Essentially what I did was (a minimal configuration sketch follows this list):

- With the `ignoreDifferences` section: request X workers via `ray.autoscaler.sdk.request_resources` and keep changing X. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 minutes (based on my idle setting in the ArgoCD app) and then workers start spinning down. Eventually, requesting 1 worker brings it back to 1.
- Without the `ignoreDifferences` section: request X workers, then request more than X, and nothing happens. Request X=1, nothing happens. Essentially it is as if `ray.autoscaler.sdk.request_resources` does nothing. It does print some logs (showing that it is trying to do something), but looking at the number of pods/workers, nothing changes. Delete the RayCluster, start back at the original state, request Y: sometimes you get Y workers, sometimes you do not. It is not the expected behaviour; it looks very random.
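For reference, here is a minimal sketch of the kind of Application manifest the added documentation describes. It is an illustration under assumptions, not the PR's exact example: the repo URL, chart and release names, namespaces, and versions are placeholders, and the `RespectIgnoreDifferences=true` sync option is included because, to my understanding, ArgoCD otherwise applies `ignoreDifferences` only when computing diffs, not during sync.

```yaml
# Hedged sketch of an ArgoCD Application for a KubeRay-managed RayCluster.
# Repo URL, chart, release name, namespaces, and versions are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: raycluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://ray-project.github.io/kuberay-helm/   # assumed Helm repo
    chart: ray-cluster
    targetRevision: 1.1.0                                  # example chart version
    helm:
      releaseName: raycluster
      valuesObject:
        image:
          repository: docker.io/rayproject/ray
          tag: "2.10.0"                                    # pinned instead of `latest`
  destination:
    server: https://kubernetes.default.svc
    namespace: ray
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
    - RespectIgnoreDifferences=true   # assumed: respect the ignored fields during sync as well
  # Without this section, ArgoCD and the Ray Autoscaler can fight over worker replica counts.
  ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay          # placeholder; must match the RayCluster the chart creates
    jsonPointers:
    - /spec/workerGroupSpecs/0/replicas
```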