
Commit 2237ee5

[autoscaler] AWS Fleet support for ray autoscaler
Signed-off-by: Raghavendra Dani <[email protected]>
1 parent 5fcb3a0 commit 2237ee5

File tree

1 file changed: +138 −0 lines changed

reps/2023-01-09-aws-fleet-support.md

## Summary - AWS Fleet support for ray autoscaler

### General Motivation

Today, the AWS autoscaler requires developers to choose their EC2 instance types upfront and codify them in their [autoscaler config](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml#L73-L80). EC2 offers many instance types across different families with varied pricing, so developers can realize cost savings if their workload does not have a strict dependency on a specific instance family. [EC2 Fleet](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet.html) is an AWS offering that allows developers to launch a group of instances governed by various parameters, such as the maximum amount per hour they are willing to pay, whether to supplement primary on-demand capacity with spot capacity, the maximum spot price for each instance, and the allocation strategy for on-demand and spot capacity. This proposal outlines the work needed to support EC2 Fleet in the ray autoscaler.

#### Key requirements

- The autoscaler must retain all existing functionality; changes must be backward compatible.
- Developers must be able to make use of the most critical features that AWS EC2 Fleet offers.
- The [Ray autoscaling monitor](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/monitor.py), [CLI](https://github.com/ray-project/ray/blob/7b4b88b4082297d3790b9e542090228970708270/python/ray/autoscaler/_private/commands.py#L692), [standard autoscaler](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/autoscaler.py), and placement groups must be able to provision nodes via EC2 Fleet.
- EC2 Fleet must not interfere with autoscaler activities.
- EC2 Fleet must be supported for both head and worker node types.

### Should this change be within `ray` or outside?

Within `ray`, specifically within the AWS autoscaler.

## Stewardship

### Required Reviewers

- TBD

### Shepherd of the Proposal (should be a senior committer)

- TBD

## Design and Architecture

### Proposed Architecture

![Flex Fleet REP](https://user-images.githubusercontent.com/8843998/211219167-eb18a917-c17b-46df-94ad-83aedd8c878b.png)

As described above, we will adapt the **node provider** to support instance creation via the [create_fleet](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.create_fleet) API of EC2. The existing components that invoke the **node provider** are shown only for clarity.

### How does EC2 Fleet work?

An EC2 Fleet contains the configuration to launch a group of instances. There are [three types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet-request-type.html) of fleet requests: `request`, `maintain`, and `instant`. Because of the constraints mentioned in the **requirements** section (i.e., EC2 must not interfere with autoscaler activities), we can only make use of the `instant` type. Unlike `maintain` and `request`, an `instant` type `create_fleet` call synchronously creates a fleet entity in an AWS account, which is simply a configuration. Below are a couple of observations related to `instant` type EC2 fleets:

- `instant` type fleets cannot be deleted without terminating the instances they created.
- Directly terminating instances created by an `instant` fleet using the `terminate_instances` API does not affect the fleet state.
- Spot interruptions do not replace an instance created by an `instant` fleet.
- There is no limit to the number of `instant` type EC2 fleets one can have in an account.

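For concreteness, below is a minimal sketch of what an `instant` fleet request looks like via `boto3`. The launch template name and capacity values are illustrative only, mirroring the example config later in this document:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# An `instant` fleet request is fulfilled synchronously: the response
# carries the IDs of the instances that were launched.
response = ec2.create_fleet(
    Type="instant",
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                # Hypothetical launch template, for illustration only.
                "LaunchTemplateName": "RayClusterLaunchTemplate",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "r5.8xlarge"},
                {"InstanceType": "r4.8xlarge"},
            ],
        }
    ],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 2,
        "DefaultTargetCapacityType": "on-demand",
    },
)

fleet_id = response["FleetId"]
instance_ids = [
    instance_id
    for instance in response.get("Instances", [])
    for instance_id in instance["InstanceIds"]
]
```
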
### How will EC2 fleets be created?

An EC2 fleet configuration will be encapsulated within a single node type from the autoscaler's perspective. We aim to create a new EC2 `instant` fleet as part of the `create_node` method, which is the entry point for any scale-up request from any other component (per the diagram above). Ray's autoscaler will make sure the `max_workers` config is honoured and perform any bin-packing of resources across node types. EC2 automatically tags the instances created this way with a tag named `aws:ec2:fleet-id`.

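A rough sketch of the dispatch inside the AWS node provider's `create_node`; the two helper method names are hypothetical:

```python
def create_node(self, node_config, tags, count):
    # Sketch only: dispatch on the proposed `node_launch_method` key,
    # defaulting to the existing `create_instances` path for
    # backward compatibility.
    launch_method = node_config.get("node_launch_method", "create_instances")
    if launch_method == "create_fleet":
        # Hypothetical helper that issues an `instant` fleet request sized
        # to `count` and applies the autoscaler tags to the new instances.
        return self._create_node_with_fleet(node_config, tags, count)
    # Hypothetical helper wrapping the existing `create_instances` path.
    return self._create_node_with_instances(node_config, tags, count)
```
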
### How will the demand be converted into resources?

The autoscaling monitor receives resource demands in terms of metrics like vCPUs, memory, GPUs, etc. The [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources. Hence, the **node provider** will determine CPU, memory, and GPU from the `InstanceRequirements` and `InstanceType` parameters and aggregate them by assuming the latest-family or highest-spec instance among the overrides. As a result, the autoscaler may spin up fewer nodes than necessary, which avoids the overscaling problem. The [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect and correct any underscaling on subsequent iterations.

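A minimal sketch of this conservative aggregation, assuming a hypothetical static table of per-instance-type resources:

```python
# Hypothetical static table mapping instance types to their resources.
INSTANCE_RESOURCES = {
    "r3.8xlarge": {"CPU": 32, "memory_gb": 244},
    "r4.8xlarge": {"CPU": 32, "memory_gb": 244},
    "r5.8xlarge": {"CPU": 32, "memory_gb": 256},
}

def aggregate_fleet_resources(instance_types):
    """Advertise the highest-spec override per resource so the scheduler
    never overestimates how many nodes a scale-up request will yield."""
    return {
        "CPU": max(INSTANCE_RESOURCES[t]["CPU"] for t in instance_types),
        "memory_gb": max(INSTANCE_RESOURCES[t]["memory_gb"] for t in instance_types),
    }
```
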
### How will spot interruptions be handled?

Developers often choose spot instances when their application can tolerate faults and they want to optimize for cost. When there is a spot interruption, GCS will fail to receive heartbeats from that node, essentially marking the node as dead, which can trigger a scale-up signal from the autoscaler. Fortunately, there is [no quota](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/fleet-quotas.html) on the number of fleets of `instant` type, which allows us to scale up an unlimited number of times, each with a different fleet id.

### How will EC2 fleet cleanup happen?

There will be no change in how instances are terminated today. However, terminating all the instances in a fleet does not necessarily delete the fleet; the fleet entity continues to be in an active state. Hence, a `post_process` step will be introduced after each autoscaler update to clean up any unused active fleets. Since fleets of `instant` type cannot be described via EC2 APIs, their IDs must be stored in either the GCS kv store or any in-memory storage accessible to the autoscaler.

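A sketch of what such a `post_process` step could do, assuming the tracked fleet IDs are available as described above. The `boto3` calls are real; the surrounding structure is hypothetical. Note that deleting an `instant` fleet requires terminating its instances, which is safe here because they are already terminated:

```python
import boto3

ec2 = boto3.client("ec2")

def post_process(tracked_fleet_ids):
    """Sketch: delete tracked `instant` fleets whose instances have all
    terminated. `tracked_fleet_ids` comes from the GCS kv store or the
    in-memory storage described above."""
    stale = []
    for fleet_id in tracked_fleet_ids:
        # EC2 tags every instance launched by a fleet with `aws:ec2:fleet-id`.
        reservations = ec2.describe_instances(
            Filters=[{"Name": "tag:aws:ec2:fleet-id", "Values": [fleet_id]}]
        )["Reservations"]
        states = {
            instance["State"]["Name"]
            for reservation in reservations
            for instance in reservation["Instances"]
        }
        if not states or states == {"terminated"}:
            stale.append(fleet_id)
    if stale:
        # `instant` fleets cannot be deleted without terminating their
        # instances, so TerminateInstances must be True; the instances of
        # the fleets in `stale` are already terminated, so nothing new
        # gets killed here.
        ec2.delete_fleets(FleetIds=stale, TerminateInstances=True)
```
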
### How will caching stopped nodes work?

When stopped nodes are cached, they are reused the next time the cluster is set up. There is no change in this behavior w.r.t. EC2 fleets: based on the POC, changes in instance states within a fleet do not affect the fleet state. Hence, fleets have to be explicitly deleted as part of `post_process`.

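For reference, stopped-node caching itself is governed by the existing provider-level flag; a minimal excerpt:

```yaml
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: True  # Reuse stopped instances on the next cluster setup.
```
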
### What will the customer experience look like?

The autoscaler config and the customer experience for provisioning workers via the EC2 `create_instances` API will not change. An optional config key, `node_launch_method` (note the snake_case naming, which distinguishes it from the PascalCase AWS API parameters), will be introduced within `node_config` to determine whether the `create_fleet` API or the `create_instances` API is used to provision EC2 instances. The default value for this key will be `create_instances`. Any other arguments within `node_config` will be passed to the appropriate API as-is. However, the fleet request type will be overridden to `instant`, as the Ray autoscaler cannot allow EC2 fleets to interfere with scaling.

```yaml
....
ray.worker.default:
    max_workers: 100
    node_config:
        node_launch_method: create_fleet
        OnDemandOptions:
            AllocationStrategy: lowest-price
        LaunchTemplateConfigs:
            - LaunchTemplateSpecification:
                  LaunchTemplateName: RayClusterLaunchTemplate
                  Version: $Latest
              Overrides:
                  - InstanceType: r3.8xlarge
                    ImageId: ami-04af5926cc5ad5248
                  - InstanceType: r4.8xlarge
                    ImageId: ami-04af5926cc5ad5248
                  - InstanceType: r5.8xlarge
                    ImageId: ami-04af5926cc5ad5248
        TargetCapacitySpecification:
            DefaultTargetCapacityType: 'on-demand'
....
```

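For the `InstanceRequirements` variation referenced in the limitations and test plan below, a hypothetical node type could look like this (attribute values are illustrative only):

```yaml
....
ray.worker.flex:
    max_workers: 100
    node_config:
        node_launch_method: create_fleet
        LaunchTemplateConfigs:
            - LaunchTemplateSpecification:
                  LaunchTemplateName: RayClusterLaunchTemplate
                  Version: $Latest
              Overrides:
                  - InstanceRequirements:
                        VCpuCount:
                            Min: 16
                            Max: 32
                        MemoryMiB:
                            Min: 131072
                    ImageId: ami-04af5926cc5ad5248
        TargetCapacitySpecification:
            DefaultTargetCapacityType: 'on-demand'
....
```
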
### What are the known limitations?

- The fleet request types `maintain` and `request` are not supported. Moreover, the hard limit on the number of fleets of type `maintain` or `request` may block us from supporting them in the future as well.
- An EC2 fleet cannot span different subnets from the same availability zone. This is a limitation of EC2 Fleet itself, and it could impact availability.
- The autoscaler may need multiple scale-up rounds, effectively scaling more slowly than `upscaling_speed` implies, when the `InstanceType` overrides are heterogeneous in instance size. Hence, developers must be cautioned to follow best practices through proper documentation.

### Proposed code changes

- The default installed version of `boto3` must be upgraded to the latest, i.e., `1.26.41`, as the `ImageId` override parameter may be missing in earlier versions.
- Update [config.py](https://github.com/ray-project/ray/blob/e464bf07af9f6513bf71156d1226885dde7a8f46/python/ray/autoscaler/_private/aws/config.py) to parse and update configuration related to fleets.
- Update [node_provider.py](https://github.com/ray-project/ray/blame/00d43d39f58f2de7bb7cd963450f7a763b928d10/python/ray/autoscaler/_private/aws/node_provider.py#L250) to create instances using the EC2 fleet API, like [this](https://github.com/ray-project/ray/commit/81942b4f8c8e9d9c6a037d068e559769e8a27a70).
- EC2 does not delete a fleet when all of its instances are terminated. Hence, implement the [post_process](https://github.com/ray-project/ray/blob/c51b0c9a5664e5c6df3d92f9093b56e61b48f514/python/ray/autoscaler/node_provider.py#L258) method for the AWS node provider to clean up any active fleets that have only terminated instances.
- Add an example [autoscaler config](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws) and documentation to help developers utilize the EC2 fleet functionality.
- Update the ray test suite to cover integration and unit tests.

## Compatibility, Deprecation, and Migration Plan

The changes are intended to be backward compatible.

## Test Plan and Acceptance Criteria

### Test Plan

- Set up a ray head node using EC2 fleets via the CLI.

- Set up ray worker nodes using EC2 fleets via the CLI.

- Set up a ray cluster using EC2 fleets and run a variety of ray applications with:

    - tasks requesting n CPUs
    - tasks requesting n GPUs
    - actors requesting n CPUs
    - actors requesting n GPUs
    - tasks and actors requesting a combination of CPU/GPU and custom resources

- Set up clusters with the autoscaler config variations outlined below:

    - a fleet request containing `InstanceRequirements` overrides
    - a fleet request containing `InstanceType` overrides

### Acceptance Criteria

- Ray applications comply with resource (cpu / gpu / memory / custom) allocation.
- The `InstanceRequirements` and `InstanceType` parameters work as expected.
- Less than 30% performance overhead.
