Support Ray AutoscalerOptions in RayCluster spec #7005

@pingsutw

Motivation

Users who need to configure Ray autoscaler behavior (e.g., environment variables, upscaling mode, resource limits) currently cannot do so because the Flyte Ray integration doesn't pass AutoscalerOptions through to the KubeRay RayCluster CR.

Currently, only enable_autoscaling (bool) and per-worker min_replicas/max_replicas are supported. The KubeRay RayCluster CRD has a rich autoscalerOptions field that allows configuring:

  • upscalingMode: Default, Aggressive, or Conservative
  • env: environment variables for the autoscaler sidecar container
  • resources: CPU/memory requests and limits for the autoscaler container
  • image: custom autoscaler container image
  • idleTimeoutSeconds: how long idle workers are kept before scale-down

Proposal

Expose these options end-to-end across the Flyte stack:

1. flyteidl (protobuf definitions)

Add an AutoscalerOptions message to flyteidl/plugins/ray.proto:

message AutoscalerOptions {
  string upscaling_mode = 1;           // "Default", "Aggressive", "Conservative"
  int32 idle_timeout_seconds = 2;
  repeated EnvVar env = 3;             // autoscaler sidecar env vars
  string image = 4;                    // custom autoscaler image
  Resources resources = 5;             // autoscaler container resources
}

Add a field to the existing RayCluster message:

message RayCluster {
  ...
  AutoscalerOptions autoscaler_options = N;
}

2. flytepropeller (backend Ray plugin)

Update the Ray plugin handler to:

  • Read AutoscalerOptions from the task's protobuf RayCluster message
  • Map it onto the KubeRay RayCluster CR's .spec.autoscalerOptions
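
The mapping step can be sketched as follows. The real plugin is written in Go; this is only a Python illustration of the field renaming, not the plugin code. It assumes the protobuf message is represented as a plain dict, that unset fields are None, and that each EnvVar entry carries name and value keys. The function name and the key table are hypothetical.

```python
from typing import Any, Dict

# Hypothetical table: proposed proto field name -> KubeRay autoscalerOptions key.
_FIELD_MAP = {
    "upscaling_mode": "upscalingMode",
    "idle_timeout_seconds": "idleTimeoutSeconds",
    "image": "image",
    "resources": "resources",
}


def to_cr_autoscaler_options(opts: Dict[str, Any]) -> Dict[str, Any]:
    """Build the .spec.autoscalerOptions fragment from a dict-shaped message."""
    cr: Dict[str, Any] = {}
    for src, dst in _FIELD_MAP.items():
        value = opts.get(src)
        if value is not None:  # leave unset fields out of the CR entirely
            cr[dst] = value
    env = opts.get("env") or []
    if env:
        # Assumed EnvVar shape: {"name": ..., "value": ...}
        cr["env"] = [{"name": e["name"], "value": e["value"]} for e in env]
    return cr
```

One design point the sketch leaves open: proto3 scalar fields default to "" and 0 rather than None, so the Go handler would need its own rule for telling "unset" apart from an explicit zero idle timeout.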

3. flytekit / flyte-sdk (Python SDK)

In the Ray plugin (plugins/ray/src/flyteplugins/ray/task.py in flyte-sdk):

  • Add an AutoscalerOptions dataclass:
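from dataclasses import dataclass
from typing import Dict, Optional

# Resources refers to the SDK's existing resources type.
@dataclass
class AutoscalerOptions:
    upscaling_mode: Optional[str] = None        # "Default", "Aggressive", "Conservative"
    idle_timeout_seconds: Optional[int] = None
    env: Optional[Dict[str, str]] = None        # autoscaler sidecar env vars
    image: Optional[str] = None                 # custom autoscaler image
    resources: Optional[Resources] = None       # autoscaler container resources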
  • Add autoscaler_options: Optional[AutoscalerOptions] = None to RayJobConfig
  • Update RayFunctionTask.custom_config() to serialize autoscaler options into the RayCluster protobuf

Currently, the only autoscaling-related field on RayJobConfig is:

enable_autoscaling: bool = False

The new field would sit alongside it:

enable_autoscaling: bool = False
autoscaler_options: Optional[AutoscalerOptions] = None
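
A minimal sketch of how the serialization step in custom_config() might treat unset fields, assuming they are simply dropped before the protobuf is built. The dataclass here is a trimmed copy of the proposed one (resources is left out so the example has no SDK dependency). The validation in __post_init__ and the options_to_dict helper are hypothetical additions, not part of the proposal.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional

VALID_UPSCALING_MODES = {"Default", "Aggressive", "Conservative"}


@dataclass
class AutoscalerOptions:
    # Trimmed copy of the proposed dataclass, without `resources`.
    upscaling_mode: Optional[str] = None
    idle_timeout_seconds: Optional[int] = None
    env: Optional[Dict[str, str]] = None
    image: Optional[str] = None

    def __post_init__(self) -> None:
        # Hypothetical early validation so typos fail at task definition time.
        if (
            self.upscaling_mode is not None
            and self.upscaling_mode not in VALID_UPSCALING_MODES
        ):
            raise ValueError(
                f"upscaling_mode must be one of {sorted(VALID_UPSCALING_MODES)}"
            )
        if self.idle_timeout_seconds is not None and self.idle_timeout_seconds < 0:
            raise ValueError("idle_timeout_seconds must be non-negative")


def options_to_dict(opts: AutoscalerOptions) -> Dict[str, Any]:
    """Keep only the fields the user actually set."""
    return {k: v for k, v in asdict(opts).items() if v is not None}
```

Dropping None values keeps the behavior for existing users unchanged: a RayJobConfig without autoscaler_options would produce the same payload as today.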

Goal

Allow users to write:

RayJobConfig(
    enable_autoscaling=True,
    autoscaler_options=AutoscalerOptions(
        upscaling_mode="Conservative",
        idle_timeout_seconds=120,
        image="rayproject/ray:2.9.0",
    ),
    worker_node_config=[...],
)

And have KubeRay receive the full autoscalerOptions on the resulting RayCluster CR.
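
For illustration, the relevant part of the resulting RayCluster CR for the example above might look like the following, written as a Python dict. The values are copied from the RayJobConfig call; the assumption that enable_autoscaling=True corresponds to spec.enableInTreeAutoscaling is mine and should be checked against the plugin.

```python
import json

# Expected fragment of the RayCluster CR for the example above (a sketch).
ray_cluster_fragment = {
    "spec": {
        # Assumed counterpart of enable_autoscaling=True.
        "enableInTreeAutoscaling": True,
        "autoscalerOptions": {
            "upscalingMode": "Conservative",
            "idleTimeoutSeconds": 120,
            "image": "rayproject/ray:2.9.0",
        },
    }
}

print(json.dumps(ray_cluster_fragment, indent=2))
```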

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes
