
Conversation

@SheldonTsen commented Oct 31, 2025

Description

I was investigating odd behaviour where requesting an exact number of workers via the Python SDK was not behaving as expected. I initially raised an issue here: #55736. I was then pointed to this: ray-project/kuberay#3794. However, even after that fix, I was not observing any different behaviour.

I then tried having ArgoCD ignore the replicas field, and everything started working as expected.

I thought it would be best to convey this in an example, since I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines one needs to be aware of). IIRC I pieced it together from some GitHub issues and debugging.

The important point is that when managing Ray via ArgoCD with the Autoscaler enabled, ignoreDifferences must be configured properly to get the expected Autoscaler behaviour.

I would have attached screenshots, but from a PR review perspective they don't prove anything. Essentially what I did was (see the sketch after this list):

  • Introduce the ignoreDifferences section, request X workers via ray.autoscaler.sdk.request_resources, and keep changing X. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 mins (based on the idle setting in my ArgoCD app) and then workers start spinning down. Eventually requesting 1 worker brings it back to 1.
  • Remove the ignoreDifferences section and request X workers. Then, requesting more than X, nothing happens. Requesting X=1, nothing happens. Essentially it is as if ray.autoscaler.sdk.request_resources does nothing. It does print some logs (showing it is trying to do something), but looking at the number of pods/workers, nothing changes. Delete the RayCluster, start back at the original state, request Y: sometimes you get Y workers, sometimes you do not. It is not expected behaviour and looks very random.
  • Repeat back and forth in my environment multiple times to confirm.
  • In my case, I was testing X = 40/80/100/200.
  • For small X, like <10, it appears to work, but you will see that pods can shut down well within the idle limit, only to get spun back up again.
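
A minimal sketch of the kind of test loop described above, assuming a running cluster reachable via ray.init(address="auto") and one CPU per worker pod:

import ray
from ray.autoscaler.sdk import request_resources

# Connect to the running Ray cluster (the address is an assumption for this sketch).
ray.init(address="auto")

# Ask the Autoscaler to accommodate X one-CPU bundles; with one CPU per
# worker pod, this effectively requests X worker pods.
X = 40
request_resources(bundles=[{"CPU": 1}] * X)

# Scale back down to a single worker; with ignoreDifferences in place the
# extra workers spin down after the idle timeout.
request_resources(bundles=[{"CPU": 1}] * 1)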

The ignoreDifferences section I used:

ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay

Bug

The ignoreDifferences section for the RayCluster application specifies name: raycluster-kuberay, but the Helm chart's releaseName is raycluster. This mismatch means the ignoreDifferences rule won't apply to the deployed RayCluster, potentially causing conflicts between ArgoCD and the Ray Autoscaler over worker replica counts.
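
A minimal sketch of the corrected rule, combining the name fix with the jsonPointers shown later in this review (the paths assume the worker group layout of the example and may need adjusting):

ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster   # must match the Helm chart's releaseName
    jsonPointers:
      - /spec/workerGroupSpecs/0/replicas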


@gemini-code-assist bot left a comment


Code Review

This pull request adds valuable documentation for deploying Ray on Kubernetes using ArgoCD. The example is comprehensive and highlights the crucial ignoreDifferences configuration needed for the Ray autoscaler to work correctly with ArgoCD. My review includes suggestions to improve the example by using specific image tags for better reproducibility, clarifying the need to adjust jsonPointers for different numbers of worker groups, and a minor formatting fix.

Comment on lines +73 to +76
jsonPointers:
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas

medium

It would be helpful to add a comment here explaining that these paths need to be adjusted based on the number of worker groups in the RayCluster. This makes the example more adaptable for users with different configurations.

Suggested change
jsonPointers:
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
jsonPointers: # Adjust this list to match the number of worker groups
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
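
For context, each jsonPointers index corresponds to the position of a worker group in the RayCluster's spec.workerGroupSpecs array. A minimal sketch (the group names here are hypothetical):

spec:
  workerGroupSpecs:
    - groupName: cpu-group   # index 0 -> /spec/workerGroupSpecs/0/replicas
      replicas: 1
    - groupName: gpu-group   # index 1 -> /spec/workerGroupSpecs/1/replicas
      replicas: 0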

valuesObject:
  image:
    repository: docker.io/rayproject/ray
    tag: latest

medium

Using the latest tag for Docker images is generally discouraged for production deployments as it can make your setup non-deterministic. It's better to pin to a specific version to ensure reproducibility. Since this example uses Autoscaler v2, a version like 2.10.0 or newer would be appropriate. This suggestion also applies to the images on lines 110 and 121.

Suggested change
tag: latest
tag: "2.10.0"

It has been observed that without this `ignoreDifferences` section, ArgoCD
and the Ray Autoscaler may conflict, resulting in unexpected behaviour when
it comes to requesting workers dynamically (e.g. `ray.autoscaler.sdk.request_resources`).
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
No newline at end of file

medium

Please add a newline at the end of the file. It's a common convention and can prevent issues with some tools.

Suggested change
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
(The only change is the added trailing newline.)

@ray-gardener bot added the docs, core, and community-contribution labels Oct 31, 2025
@edoakes added the kuberay label and removed the core label Oct 31, 2025
@edoakes (Collaborator) commented Oct 31, 2025

@Future-Outlier @rueian PTAL

