## Summary - AWS Fleet support for Ray autoscaler

### General Motivation

Today, the AWS autoscaler requires developers to choose their EC2 instance types upfront and codify them in their [autoscaler config](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml#L73-L80). EC2 offers many instance types across different families with varied pricing, so developers can realize cost savings if their workload does not strictly depend on a specific instance family. [EC2 Fleet](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet.html) is an AWS offering that lets developers launch a group of instances based on parameters such as the maximum amount per hour they are willing to pay, supplementing their primary on-demand capacity with spot capacity, specifying maximum spot prices for each instance, and choosing from multiple allocation strategies for on-demand and spot capacity. This proposal outlines the work needed to support EC2 Fleet in the Ray autoscaler.

#### Key requirements

- The autoscaler must retain its existing functionality; all changes must be backward compatible.
- Developers must be able to use most of the critical features AWS EC2 Fleet offers.
- The [Ray autoscaling monitor](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/monitor.py), [CLI](https://github.com/ray-project/ray/blob/7b4b88b4082297d3790b9e542090228970708270/python/ray/autoscaler/_private/commands.py#L692), [standard autoscaler](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/autoscaler.py), and placement groups must be able to provision nodes via EC2 Fleet.
- EC2 Fleet must not interfere with autoscaler activities.
- EC2 Fleet must be supported for both head and worker node types.

### Should this change be within `ray` or outside?

Within `ray`, specifically within the AWS autoscaler.

## Stewardship

### Required Reviewers

- TBD

### Shepherd of the Proposal (should be a senior committer)

- TBD

## Design and Architecture

### Proposed Architecture



As shown in the diagram above, we will adapt the **node provider** to support instance creation via the [create_fleet](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.create_fleet) API of EC2. The existing components that invoke the **node provider** are shown only for clarity.
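
For reference, here is a minimal sketch of what an `instant` fleet request through `boto3` could look like. The region, launch template name, and capacity values are illustrative assumptions, not part of this proposal:

```python
import boto3

# Region, launch template name, and capacity values are placeholders.
ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.create_fleet(
    Type="instant",  # the only request type this proposal uses
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "RayClusterLaunchTemplate",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "r5.8xlarge"},
                {"InstanceType": "r4.8xlarge"},
            ],
        }
    ],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 3,
        "DefaultTargetCapacityType": "on-demand",
    },
)

# For `instant` fleets, the launched instance ids are returned synchronously.
instance_ids = [
    instance_id
    for fleet_instance in response.get("Instances", [])
    for instance_id in fleet_instance.get("InstanceIds", [])
]
print(response["FleetId"], instance_ids)
```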

### How does EC2 Fleet work?

An EC2 Fleet contains the configuration to launch a group of instances. There are [three types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet-request-type.html) of fleet requests: `request`, `maintain`, and `instant`. Because of the constraints mentioned in the **Key requirements** section (i.e., EC2 must not interfere with autoscaler activities), we can only make use of the `instant` type. Unlike `maintain` and `request`, an `instant` type `create_fleet` call synchronously creates a fleet entity in the AWS account, which is simply a configuration. Below are a few observations about `instant` type EC2 fleets:

- `instant` type fleets cannot be deleted without terminating the instances they created.
- Directly terminating instances created by an `instant` fleet using the `terminate_instances` API does not affect the fleet state.
- Spot interruptions do not cause an instance created by an `instant` fleet to be replaced.
- There is no limit on the number of `instant` type EC2 fleets one can have in an account.

### How will EC2 fleets be created?

From the autoscaler's perspective, an EC2 fleet configuration will be encapsulated within a single node type. We aim to create a new EC2 `instant` fleet as part of the `create_node` method, which is effectively the entry point for any scale-up request from the other components (per the diagram above). Ray's autoscaler will make sure the `max_workers` config is honored and will perform any bin-packing of resources across node types. EC2 automatically tags the instances created this way with a tag named `aws:ec2:fleet-id`, which can be used to look them up as sketched below.
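
For illustration, a hedged sketch of how the provider could locate a fleet's instances via that EC2-managed tag (the fleet id and region below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region
fleet_id = "fleet-0123456789abcdef0"  # placeholder fleet id

# Instances launched by a fleet carry the EC2-managed `aws:ec2:fleet-id` tag,
# so they can be found without describing the fleet itself.
instances = []
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "tag:aws:ec2:fleet-id", "Values": [fleet_id]}]
):
    for reservation in page["Reservations"]:
        instances.extend(reservation["Instances"])

print([instance["InstanceId"] for instance in instances])
```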

### How will the demand be converted into resources?

The autoscaling monitor receives resource demands (vCPUs, memory, GPUs, etc.) from the metrics. [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources of each node type. Hence, the **node provider** will determine CPU, memory, and GPU from the specified `InstanceRequirements` and `InstanceType` parameters and aggregate them based on the latest instance family or the highest-spec instance. As a result, the autoscaler will end up spinning up fewer nodes than necessary, which avoids overscaling. Note that the [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect and correct any underscaling.
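
A minimal sketch of the aggregation idea, assuming a hypothetical helper and illustrative vCPU/memory figures (neither the function name nor the spec table comes from the Ray codebase):

```python
# Illustrative per-instance specs; real values would come from
# the EC2 `describe_instance_types` API or an equivalent lookup.
INSTANCE_SPECS = {
    "r3.8xlarge": {"CPU": 32, "memory_gib": 244, "GPU": 0},
    "r4.8xlarge": {"CPU": 32, "memory_gib": 244, "GPU": 0},
    "r5.8xlarge": {"CPU": 32, "memory_gib": 256, "GPU": 0},
}

def fleet_node_resources(overrides):
    """Advertise the highest-spec override for the node type, so the autoscaler
    never assumes a node is smaller than it could be (avoiding overscaling)."""
    specs = [INSTANCE_SPECS[o["InstanceType"]] for o in overrides]
    return {
        "CPU": max(s["CPU"] for s in specs),
        "memory_gib": max(s["memory_gib"] for s in specs),
        "GPU": max(s["GPU"] for s in specs),
    }

overrides = [{"InstanceType": t} for t in ("r3.8xlarge", "r4.8xlarge", "r5.8xlarge")]
print(fleet_node_resources(overrides))  # {'CPU': 32, 'memory_gib': 256, 'GPU': 0}
```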

### How will spot interruptions be handled?

Developers often choose spot instances to optimize costs when their application can tolerate faults. When a spot interruption occurs, GCS stops receiving heartbeats from that node and marks it as dead, which can trigger a scale-up signal from the autoscaler. Fortunately, there is [no quota](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/fleet-quotas.html) on the number of `instant` type fleets, which allows us to scale up an unlimited number of times, each time with a different fleet id.

### How will EC2 fleet cleanup happen?

There will be no change in how instances are terminated today. However, terminating all the instances in a fleet does not necessarily delete the fleet; the fleet entity remains in the active state. Hence, a `post_process` step will be introduced after each autoscaler update to clean up any unused active fleets. Fleets of `instant` type cannot be described via the EC2 APIs, so their ids must be stored in either the GCS KV store or some in-memory storage accessible to the autoscaler. A sketch of such a cleanup step follows.
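
A hedged sketch of what such a `post_process` cleanup could look like, assuming the fleet ids have been tracked elsewhere (function and variable names are illustrative):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

def cleanup_idle_fleets(tracked_fleet_ids):
    """Delete `instant` fleets whose instances have all been terminated already."""
    idle = []
    paginator = ec2.get_paginator("describe_instances")
    for fleet_id in tracked_fleet_ids:
        pages = paginator.paginate(
            Filters=[
                {"Name": "tag:aws:ec2:fleet-id", "Values": [fleet_id]},
                # Any instance not yet terminated means the fleet is still in use.
                {
                    "Name": "instance-state-name",
                    "Values": ["pending", "running", "shutting-down", "stopping", "stopped"],
                },
            ]
        )
        if not any(page["Reservations"] for page in pages):
            idle.append(fleet_id)
    if idle:
        # Deleting an `instant` fleet requires TerminateInstances=True; this is a
        # no-op here because all of its instances are already terminated.
        ec2.delete_fleets(FleetIds=idle, TerminateInstances=True)
    return idle
```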

### How will caching stopped nodes work?

When stopped nodes are cached, they are reused the next time the cluster is set up. There is no change in this behavior with respect to EC2 fleets. Based on the POC, changes in instance states within a fleet do not affect the fleet state; hence, unused fleets still have to be explicitly deleted as part of `post_process`.

### What will the customer experience look like?

The autoscaler config and the customer experience for provisioning workers using the EC2 `create_instances` API will not change. An optional config key `node_launch_method` (note the naming convention) will be introduced within `node_config` to determine whether the `create_fleet` API or the `create_instances` API is used to provision EC2 instances; its default value will be `create_instances`. Any other arguments within `node_config` will be passed to the appropriate API as-is, except that the fleet request type will be overridden to `instant`, since the Ray autoscaler cannot allow EC2 fleets to interfere with scaling.

```yaml
....
ray.worker.default:
  max_workers: 100
  node_config:
    node_launch_method: create_fleet
    OnDemandOptions:
      AllocationStrategy: lowest-price
    LaunchTemplateConfigs:
      - LaunchTemplateSpecification:
          LaunchTemplateName: RayClusterLaunchTemplate
          Version: $Latest
        Overrides:
          - InstanceType: r3.8xlarge
            ImageId: ami-04af5926cc5ad5248
          - InstanceType: r4.8xlarge
            ImageId: ami-04af5926cc5ad5248
          - InstanceType: r5.8xlarge
            ImageId: ami-04af5926cc5ad5248
    TargetCapacitySpecification:
      DefaultTargetCapacityType: 'on-demand'
....
```
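
A hedged sketch of how the proposed `node_launch_method` key might be consumed inside the provider's `create_node` path; the function shape and the way the target capacity is injected are assumptions for illustration, not the final implementation:

```python
import boto3

ec2 = boto3.client("ec2")  # region/credentials come from the usual boto3 config

def create_node(node_config, count):
    """Illustrative dispatch between the two launch APIs based on the proposed
    `node_launch_method` key, which is popped before forwarding the config."""
    conf = dict(node_config)
    launch_method = conf.pop("node_launch_method", "create_instances")
    if launch_method == "create_fleet":
        # The request type is always forced to `instant` so the fleet never
        # tries to manage capacity on its own.
        conf["Type"] = "instant"
        conf.setdefault("TargetCapacitySpecification", {})["TotalTargetCapacity"] = count
        return ec2.create_fleet(**conf)
    # Default path: existing behavior via run_instances (a.k.a. create_instances).
    return ec2.run_instances(MinCount=count, MaxCount=count, **conf)
```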

### What are the known limitations?

- The fleet request types `maintain` and `request` are not supported. In addition, the hard limit on the number of fleets of type `maintain` or `request` may prevent us from supporting them in the future as well.
- An EC2 fleet cannot span multiple subnets within the same availability zone. This is a limitation of EC2 Fleet itself and could impact availability.
- The autoscaler can scale up by less than `upscaling_speed` suggests when the `InstanceType` overrides are heterogeneous in instance size. Hence, developers must be cautioned, through proper documentation, to follow best practices.

### Proposed code changes

- The default version of `boto3` that is installed must be upgraded to the latest, i.e., `1.26.41`, as the `ImageId` override parameter may be missing in earlier versions.
- Update [config.py](https://github.com/ray-project/ray/blob/e464bf07af9f6513bf71156d1226885dde7a8f46/python/ray/autoscaler/_private/aws/config.py) to parse and update configuration related to fleets.
- Update [node_provider.py](https://github.com/ray-project/ray/blame/00d43d39f58f2de7bb7cd963450f7a763b928d10/python/ray/autoscaler/_private/aws/node_provider.py#L250) to create instances using the EC2 fleet API like [this](https://github.com/ray-project/ray/commit/81942b4f8c8e9d9c6a037d068e559769e8a27a70).
- EC2 does not delete a fleet when all of its instances are terminated. Hence, implement the [post_process](https://github.com/ray-project/ray/blob/c51b0c9a5664e5c6df3d92f9093b56e61b48f514/python/ray/autoscaler/node_provider.py#L258) method for the AWS node provider to clean up any active fleets that have only terminated instances.
- Add an example [autoscaler config](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws) and documentation to help developers utilize the EC2 fleet functionality.
- Update the Ray test suite to cover integration and unit tests.

## Compatibility, Deprecation, and Migration Plan

The changes are intended to be fully backward compatible.

## Test Plan and Acceptance Criteria

### Test Plan

- Set up a Ray head node using EC2 fleets via the CLI

- Set up Ray worker nodes using EC2 fleets via the CLI

- Set up a Ray cluster using EC2 fleets and run a variety of Ray applications with:

  - tasks requesting n CPUs
  - tasks requesting n GPUs
  - actors requesting n CPUs
  - actors requesting n GPUs
  - tasks and actors requesting a combination of CPU/GPU and custom resources.

- Set up clusters with the autoscaler config variations outlined below:

  - a fleet request containing `InstanceRequirements` overrides.
  - a fleet request containing `InstanceType` overrides.

### Acceptance Criteria

- Ray applications comply with resource (CPU / GPU / memory / custom) allocation.
- The `InstanceRequirements` and `InstanceType` parameters work as expected.
- Less than 30% performance overhead.