
Conversation

@SheldonTsen commented Oct 31, 2025

Description

I was investigating odd behaviour where requesting an exact number of workers via the Python SDK was not behaving as expected. I initially raised an issue here: #55736. I was then pointed to this: ray-project/kuberay#3794. However, even after that fix, I was not observing any different behaviour.

I then tried having ArgoCD ignore the replicas field, and everything started working as expected.

I thought it would be best to convey this in an example, since I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines one needs to be aware of). IIRC I pieced it together from some GitHub issues and debugging.

The important point is that when managing Ray via ArgoCD with the Autoscaler enabled, ignoreDifferences must be configured properly to get the expected Autoscaler behaviour.

I would have attached screenshots, but from a PR review perspective they don't prove anything. Essentially what I did was (see the sketch after this list):

  • Introduce the ignoreDifferences section, request X workers via ray.autoscaler.sdk.request_resources, and keep changing X. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 mins (based on the idle setting in my ArgoCD app) and then workers start spinning down. Eventually requesting 1 worker brings it back to 1.
  • Remove the ignoreDifferences section and request X workers. Then, requesting more than X, nothing happens. Requesting X=1, nothing happens. Essentially it is as if ray.autoscaler.sdk.request_resources does nothing. It does print some logs (showing it is trying to do something), but looking at the number of pods/workers, nothing changes. Delete the RayCluster, start back at the original state, request Y: sometimes you get Y workers, sometimes you do not. It is not expected behaviour and looks very random.
  • Repeat back and forth in my environment multiple times to confirm.
  • In my case, I was testing X = 40/80/100/200.
  • For small X, like <10, it appears to work, but you will see that pods can shut down well within the idle limit, only to get spun back up again.
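
A minimal sketch of the kind of test loop described above, assuming a running cluster reachable via ray.init(address="auto") and one CPU per worker pod:

import ray
from ray.autoscaler.sdk import request_resources

# Connect to the running Ray cluster (the address is an assumption for this sketch).
ray.init(address="auto")

# Ask the Autoscaler to accommodate X one-CPU bundles; with one CPU per
# worker pod, this effectively requests X worker pods.
X = 40
request_resources(bundles=[{"CPU": 1}] * X)

# Scale back down to a single worker; with ignoreDifferences in place the
# extra workers spin down after the idle timeout.
request_resources(bundles=[{"CPU": 1}] * 1)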

The ignoreDifferences section I used:

ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay

Bug

The ignoreDifferences section for the RayCluster application specifies name: raycluster-kuberay, but the Helm chart's releaseName is raycluster. This mismatch means the ignoreDifferences rule won't apply to the deployed RayCluster, potentially causing conflicts between ArgoCD and the Ray Autoscaler over worker replica counts.
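
A minimal sketch of the corrected rule, combining the name fix with the jsonPointers shown later in this review (the paths assume the worker group layout of the example and may need adjusting):

ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster   # must match the Helm chart's releaseName
    jsonPointers:
      - /spec/workerGroupSpecs/0/replicas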


@gemini-code-assist bot left a comment


Code Review

This pull request adds valuable documentation for deploying Ray on Kubernetes using ArgoCD. The example is comprehensive and highlights the crucial ignoreDifferences configuration needed for the Ray autoscaler to work correctly with ArgoCD. My review includes suggestions to improve the example by using specific image tags for better reproducibility, clarifying the need to adjust jsonPointers for different numbers of worker groups, and a minor formatting fix.

Comment on lines +73 to +76
jsonPointers:
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas

medium

It would be helpful to add a comment here explaining that these paths need to be adjusted based on the number of worker groups in the RayCluster. This makes the example more adaptable for users with different configurations.

Suggested change
jsonPointers:
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
jsonPointers: # Adjust this list to match the number of worker groups
- /spec/workerGroupSpecs/0/replicas
- /spec/workerGroupSpecs/1/replicas
- /spec/workerGroupSpecs/2/replicas
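
For context, each jsonPointers index corresponds to the position of a worker group in the RayCluster's spec.workerGroupSpecs array. A minimal sketch (the group names here are hypothetical):

spec:
  workerGroupSpecs:
    - groupName: cpu-group   # index 0 -> /spec/workerGroupSpecs/0/replicas
      replicas: 1
    - groupName: gpu-group   # index 1 -> /spec/workerGroupSpecs/1/replicas
      replicas: 0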

valuesObject:
  image:
    repository: docker.io/rayproject/ray
    tag: latest

medium

Using the latest tag for Docker images is generally discouraged for production deployments as it can make your setup non-deterministic. It's better to pin to a specific version to ensure reproducibility. Since this example uses Autoscaler v2, a version like 2.10.0 or newer would be appropriate. This suggestion also applies to the images on lines 110 and 121.

Suggested change
tag: latest
tag: "2.10.0"

It has been observed that without this `ignoreDifferences` section, ArgoCD
and the Ray Autoscaler may conflict, resulting in unexpected behaviour when
it comes to requesting workers dynamically (e.g. `ray.autoscaler.sdk.request_resources`).
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
No newline at end of file

medium

Please add a newline at the end of the file. It's a common convention and can prevent issues with some tools.

Suggested change
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
More specifically, when requesting N workers, the Autoscaler would not spin up N workers.
(The only change is the added trailing newline.)

@ray-gardener bot added the docs, core, and community-contribution labels Oct 31, 2025
@edoakes added the kuberay label and removed the core label Oct 31, 2025
@edoakes (Collaborator) commented Oct 31, 2025

@Future-Outlier @rueian PTAL

