Description
Motivation
Users who need to configure Ray autoscaler behavior (e.g., environment variables, upscaling mode, resource limits) currently cannot do so because the Flyte Ray integration doesn't pass AutoscalerOptions through to the KubeRay RayCluster CR.
Currently, only enable_autoscaling (bool) and per-worker min_replicas/max_replicas are supported. The KubeRay RayCluster CRD has a rich autoscalerOptions field that allows configuring:
- upscalingMode: Default, Aggressive, or Conservative
- env: environment variables for the autoscaler sidecar container
- resources: CPU/memory requests and limits for the autoscaler container
- image: custom autoscaler container image
- idleTimeoutSeconds: how long idle workers are kept before scale-down
Proposal
Expose these options end-to-end across the Flyte stack:
1. flyteidl (protobuf definitions)
Add an AutoscalerOptions message to flyteidl/plugins/ray.proto:
message AutoscalerOptions {
  string upscaling_mode = 1;       // "Default", "Aggressive", "Conservative"
  int32 idle_timeout_seconds = 2;
  repeated EnvVar env = 3;         // autoscaler sidecar env vars
  string image = 4;                // custom autoscaler image
  Resources resources = 5;         // autoscaler container resources
}
Add a field to the existing RayCluster message:
message RayCluster {
  ...
  AutoscalerOptions autoscaler_options = N;
}
2. flytepropeller (backend Ray plugin)
Update the Ray plugin handler to:
- Read AutoscalerOptions from the task's protobuf RayCluster message
- Map it onto the KubeRay RayCluster CR's .spec.autoscalerOptions
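The mapping step above could look roughly like the following. This is only an illustration in Python; the real plugin handler lives in flytepropeller and is written in Go. The function name to_kuberay_autoscaler_options is hypothetical, and the plain dict input (with env as a name-to-value dict) is an assumed stand-in for the decoded AutoscalerOptions message. Only the camelCase output keys come from the KubeRay fields listed in this issue.

```python
from typing import Any, Dict, Optional


def to_kuberay_autoscaler_options(
    opts: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Translate proto-style (snake_case) fields into the KubeRay CR shape.

    `opts` is a hypothetical stand-in for the decoded AutoscalerOptions
    message. Unset fields are omitted so KubeRay's own defaults apply.
    """
    if not opts:
        return None  # leave .spec.autoscalerOptions unset

    cr: Dict[str, Any] = {}
    if opts.get("upscaling_mode"):
        cr["upscalingMode"] = opts["upscaling_mode"]
    if opts.get("idle_timeout_seconds") is not None:
        cr["idleTimeoutSeconds"] = opts["idle_timeout_seconds"]
    if opts.get("image"):
        cr["image"] = opts["image"]
    if opts.get("env"):
        # Kubernetes containers take env as a list of {name, value} objects.
        cr["env"] = [{"name": k, "value": v} for k, v in opts["env"].items()]
    if opts.get("resources"):
        # Assumed to already be in requests/limits form; passed through as-is.
        cr["resources"] = opts["resources"]
    return cr
```

Omitting unset fields, rather than writing empty values, keeps the resulting CR identical to today's output when a task does not set autoscaler_options.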
3. flytekit / flyte-sdk (Python SDK)
In the Ray plugin (plugins/ray/src/flyteplugins/ray/task.py in flyte-sdk):
- Add an AutoscalerOptions dataclass:
@dataclass
class AutoscalerOptions:
    upscaling_mode: Optional[str] = None  # "Default", "Aggressive", "Conservative"
    idle_timeout_seconds: Optional[int] = None
    env: Optional[Dict[str, str]] = None
    image: Optional[str] = None
    resources: Optional[Resources] = None
- Add autoscaler_options: Optional[AutoscalerOptions] = None to RayJobConfig
- Update RayFunctionTask.custom_config() to serialize autoscaler options into the RayCluster protobuf
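As a rough sketch of the SDK side, the dataclass could validate its inputs early and expose a helper that custom_config() might call when building the RayCluster message. The class name AutoscalerOptionsSketch, the to_dict() helper, and the validation rules are assumptions for illustration, not part of the existing plugin. The resources field is left out here to keep the example self-contained.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

# Values listed in this issue for KubeRay's upscalingMode.
VALID_UPSCALING_MODES = {"Default", "Aggressive", "Conservative"}


@dataclass
class AutoscalerOptionsSketch:
    """Hypothetical stand-in for the proposed AutoscalerOptions dataclass."""

    upscaling_mode: Optional[str] = None
    idle_timeout_seconds: Optional[int] = None
    env: Optional[Dict[str, str]] = None
    image: Optional[str] = None

    def __post_init__(self) -> None:
        # Fail at task-definition time rather than when the CR is rejected.
        if (
            self.upscaling_mode is not None
            and self.upscaling_mode not in VALID_UPSCALING_MODES
        ):
            raise ValueError(
                f"upscaling_mode must be one of {sorted(VALID_UPSCALING_MODES)}"
            )
        if self.idle_timeout_seconds is not None and self.idle_timeout_seconds < 0:
            raise ValueError("idle_timeout_seconds must be non-negative")

    def to_dict(self) -> Dict[str, Any]:
        """Return only the fields the user actually set."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
```

Dropping None fields means an unset option never overrides KubeRay's default, which keeps the change backward compatible for tasks that only set enable_autoscaling.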
Currently RayJobConfig only has:
enable_autoscaling: bool = False
The new field would sit alongside it:
enable_autoscaling: bool = False
autoscaler_options: Optional[AutoscalerOptions] = None
Goal
Allow users to write:
RayJobConfig(
    enable_autoscaling=True,
    autoscaler_options=AutoscalerOptions(
        upscaling_mode="Conservative",
        idle_timeout_seconds=120,
        image="rayproject/ray:2.9.0",
    ),
    worker_node_config=[...],
)
And have KubeRay receive the full autoscalerOptions on the resulting RayCluster CR.
Are you sure this issue hasn't been raised already?
Yes
Have you read the Code of Conduct?
Yes