|
| 1 | +# Meta |
| 2 | + |
| 3 | +[meta]: #meta |
| 4 | + |
| 5 | +- Name: Implementing a Hash-Based Load Balancing Algorithm for CF Routing |
| 6 | +- Start Date: 2025-04-07 |
| 7 | +- Author(s): b1tamara, Soha-Albaghdady |
| 8 | +- Status: Draft <!-- Acceptable values: Draft, Approved, On Hold, Superseded --> |
| 9 | +- RFC Pull Request: https://github.com/cloudfoundry/community/pull/1222 |
| 10 | + |
| 11 | +## Summary |
| 12 | + |
| 13 | +Cloud Foundry uses round-robin and least-connection algorithms for load balancing between Gorouters and backends. While |
| 14 | +effective in many scenarios, these algorithms may not be ideal for certain use cases. Therefore, this RFC proposes to |
| 15 | +introduce hash-based routing on a per-route basis. |
| 16 | +The hash-based load balancing algorithm uses the hash of a request header to make routing decisions, focusing on |
| 17 | +distributing users across instances rather than individual requests, thereby improving load balancing in specific |
| 18 | +scenarios. |
| 19 | + |
| 20 | +## Motivation |
| 21 | + |
| 22 | +Cloud Foundry offers two load balancing algorithms to manage request distribution between Gorouters and backends. The |
| 23 | +round-robin algorithm ensures the number of requests is distributed equally across all available backends, and the |
| 24 | +least-connection algorithm tries to keep the number of active requests equal across all backends. A recent enhancement |
| 25 | +allows these load balancing algorithms to be configured on the application route level. |
| 26 | + |
| 27 | +However, these existing algorithms are not ideal for scenarios that require routing based on specific identifiers. |
| 28 | + |
| 29 | +One use case is optimizing resource management of complex in-memory caches. While 12-factor apps are stateless and can |
| 30 | +retrieve necessary information from backing services, it is often useful to cache data to reduce latency. When a cache |
| 31 | +is limited in size (e.g., one with a Least Recently Used eviction policy), exposing each app instance to all users may |
| 32 | +lead to thrashing and lower cache efficiency. By "pinning" users to a particular instance, the cache can remain |
| 33 | +effective. In the event of an instance exchange (scaling up or down, evacuation, rolling update), another instance can |
| 34 | +still provide a response and fill its cache without interruption for the user. For most users, subsequent requests can |
| 35 | +be processed at lower latency by utilizing a warm and effective cache. |
| 36 | + |
| 37 | +Another use case: users from different tenants send requests to application instances that establish connections to |
| 38 | +tenant-specific databases. |
| 39 | + |
| 40 | + |
| 41 | + |
| 42 | +With the current load balancing algorithms, each tenant eventually creates a connection to |
| 43 | +each application instance, which then creates connection pools to every customer database. As a result, tenants and |
| 44 | +instances might form a full mesh, leading to too many open connections to the customer databases and degrading |
| 45 | +performance. This limitation highlights a gap in achieving efficient load distribution, particularly when dealing with |
| 46 | +limited or memory-intensive resources in backend services, and can be addressed through hash-based routing. In short, |
| 47 | +hash-based routing distributes requests to application instances by using a stable hash |
| 48 | +derived from request identifiers, such as headers. |
| 49 | + |
| 50 | +## Proposal |
| 51 | + |
| 52 | +We propose introducing hash-based routing as a load balancing algorithm for use on a per-route basis to address the |
| 53 | +issues described in the earlier use cases. |
| 54 | + |
| 55 | +The approach leverages an HTTP header that is sent with each incoming request and contains the specific |
| 56 | +identifier. This identifier is used to compute a hash value, which serves as the basis for routing decisions. |
| 57 | + |
| 58 | +In the previously mentioned use cases, the identifier included in the header (for example, a tenant ID) serves as the |
| 59 | +input for the hash calculation. The hash value determines the application instance for each request, ensuring |
| 60 | +that all requests with the same identifier are consistently routed to the same instance, unless that instance is |
| 61 | +saturated, in which case they may be routed to another instance. Consequently, the load balancing algorithm directs |
| 62 | +requests for a single tenant to a particular application instance, so that the instance can minimize database connection |
| 63 | +overhead and optimize connection pooling, improving efficiency and system performance. |
| 64 | + |
| 65 | +### Requirements |
| 66 | + |
| 67 | +#### Only Application Per-Route Load Balancing |
| 68 | + |
| 69 | +Hash-based load balancing solves a particular load pattern, rather than serving as a general-purpose load balancing |
| 70 | +algorithm. Consequently, it will be configured exclusively as a per-route option for applications and will not be |
| 71 | +offered as a global setting. |
| 72 | + |
| 73 | +#### Minimal rehashing over all Gorouter VMs |
| 74 | + |
| 75 | +Rehashing should be minimized, especially when the number of application instances changes over time. |
| 76 | + |
| 77 | +When a new application instance (e.g. app_instance3) is added, Gorouter updates the mapping so that only part of the |
| 78 | +hashes is remapped to the new instance, while all other hashes keep their current instance. |
| 79 | + |
| 80 | +| Hash | Application instance(s) before | Application instance(s) after a new instance is added | |
| 81 | +|-------|--------------------------------|----------------------------------------------------| |
| 82 | +| Hash1 | app_instance1 | app_instance1 | |
| 83 | +| Hash2 | app_instance1 | app_instance3 | |
| 84 | +| Hash3 | app_instance2 | app_instance2 | |
| 85 | +| ... | ... | ... | |
| 86 | +| HashN | app_instance2 | app_instance3 | |
| 87 | + |
| 88 | +When the application is scaled down, Gorouter updates the mapping immediately after the route update, so |
| 89 | +that only the hashes associated with the removed instance (here app_instance3) are remapped: |
| 90 | + |
| 91 | +| Hash | Application instance(s) before | Application instance(s) after app_instance3 is removed | |
| 92 | +|-------|--------------------------------|----------------------------------------------------------| |
| 93 | +| Hash1 | app_instance1 | app_instance1 | |
| 94 | +| Hash2 | app_instance3 | app_instance1 | |
| 95 | +| Hash3 | app_instance2 | app_instance2 | |
| 96 | +| ... | ... | ... | |
| 97 | +| HashN | app_instance3 | app_instance2 | |
| 98 | + |
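The two tables above match the behavior of a consistent-hash ring. The following Python sketch is only an illustration of that idea, not Gorouter's implementation (Gorouter is written in Go). The `HashRing` class, the `stable_hash` helper, the use of SHA-256, and the number of virtual nodes are all assumptions made for this example.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per instance."""

    def __init__(self, instances, vnodes=100):
        # Each instance is placed on the ring at `vnodes` pseudo-random points.
        self._points = sorted(
            (stable_hash(f"{instance}#{i}"), instance)
            for instance in instances
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, header_value: str) -> str:
        # The owner is the first point clockwise from the key's hash.
        idx = bisect.bisect_right(self._hashes, stable_hash(header_value))
        return self._points[idx % len(self._points)][1]


tenants = [f"tenant-{n}" for n in range(1000)]
before = HashRing(["app_instance1", "app_instance2"])
after = HashRing(["app_instance1", "app_instance2", "app_instance3"])

moved = [t for t in tenants if before.lookup(t) != after.lookup(t)]
# Every remapped tenant lands on the new instance; no tenant moves
# between app_instance1 and app_instance2.
assert all(after.lookup(t) == "app_instance3" for t in moved)
print(f"{len(moved)} of {len(tenants)} tenants remapped")
```

Because each Gorouter could build the same ring from the same route table, a scheme like this would need no shared state between Gorouter VMs.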
| 99 | + |
| 100 | +#### Considering a balance factor |
| 101 | + |
| 102 | +Before routing a request, the current load on each application instance must be evaluated using a balance factor. This |
| 103 | +load is measured by the number of in-flight requests. For example, with a balance factor of 1.5, no application instance |
| 104 | +should exceed 150% of the average number of in-flight requests across all application instances. Requests that would |
| 105 | +push an instance beyond this limit must be routed to other application instances that are not overloaded. |
| 106 | + |
| 107 | +Example: |
| 108 | + |
| 109 | +| Application instance | Current request count | Current request count / Average number of in-flight requests | |
| 110 | +|----------------------|-----------------------|--------------------------------------------------------------| |
| 111 | +| app_instance1 | 10 | 20% | |
| 112 | +| app_instance2 | 50 | 100% | |
| 113 | +| app_instance3 | 90 | 180% | |
| 114 | + |
| 115 | +The average is 50 in-flight requests. Assuming the balance factor of 1.5 from above, the limit per instance is 75 |
| 116 | +requests. app_instance3 (90 requests, 180% of the average) exceeds this limit, so new requests for app_instance3 must be |
| 116 | +routed to other application instances. |
| 117 | + |
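The arithmetic of the example can be sketched in a few lines of Python. The function name `overloaded_instances` is hypothetical. Using a strict "greater than" comparison against the limit is an assumption, since the RFC does not fix the exact comparison. A balance factor of 0 disables the check, as described for `hash_balance` in the Cloud Controller section.

```python
def overloaded_instances(in_flight: dict, balance_factor: float) -> list:
    """Return the instances whose in-flight count exceeds factor * average."""
    if balance_factor == 0:
        # A balance factor of 0 means the load situation is not considered.
        return []
    average = sum(in_flight.values()) / len(in_flight)
    limit = average * balance_factor
    # Assumption: an instance counts as overloaded only when strictly above the limit.
    return [name for name, count in in_flight.items() if count > limit]


in_flight = {"app_instance1": 10, "app_instance2": 50, "app_instance3": 90}
print(overloaded_instances(in_flight, 1.5))  # ['app_instance3']
```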
| 118 | +#### Deterministic handling of overflow traffic to the next application instance |
| 119 | + |
| 120 | +An application instance is considered overloaded when its current request load exceeds the limit defined by the balance |
| 121 | +factor. Overflow traffic should always be directed to the same next instance rather than to a random one. |
| 122 | + |
| 123 | +Deterministic handling can be represented as a ring: |
| 124 | + |
| 125 | + |
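One way to read the ring idea: starting at the position of the hashed header value, walk clockwise and pick the first instance that is not overloaded. The Python sketch below illustrates this under simplifying assumptions (one ring position per instance, SHA-256 as the hash, hypothetical names `stable_hash` and `pick_instance`). It is not taken from Gorouter.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def pick_instance(header_value: str, in_flight: dict, balance_factor: float) -> str:
    """Return the first non-overloaded instance clockwise from the key's position."""
    # Assumption: one ring position per instance keeps the example short.
    ring = sorted((stable_hash(name), name) for name in in_flight)
    limit = balance_factor * sum(in_flight.values()) / len(in_flight)
    start = bisect.bisect_right([h for h, _ in ring], stable_hash(header_value))
    for step in range(len(ring)):
        _, name = ring[(start + step) % len(ring)]
        # A balance factor of 0 disables the load check entirely.
        if balance_factor == 0 or in_flight[name] <= limit:
            return name
    # Every instance is above the limit: fall back to the hashed instance.
    return ring[start % len(ring)][1]


load = {"app_instance1": 10, "app_instance2": 50, "app_instance3": 90}
print(pick_instance("tenant-42", load, 1.5))
```

Since the walk order depends only on the ring, all overflow traffic from an overloaded instance goes to the same successor for as long as the load situation is unchanged.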
| 126 | + |
| 127 | +### Required Changes |
| 128 | + |
| 129 | +#### Gorouter |
| 130 | + |
| 131 | +- Gorouter MUST be extended to take a specific identifier via the request header |
| 132 | +- Gorouter MUST implement hash calculation, based on the provided header |
| 133 | +- Gorouter MAY store the mapping between computed hash values and application instances locally to avoid |
| 134 | + expensive recalculations for each incoming request |
| 135 | +- Gorouters SHOULD NOT implement a distributed shared cache |
| 136 | +- Gorouter MUST assess the current number of in-flight requests across all application instances mapped to a |
| 137 | + particular route to consider overload situations |
| 138 | +- Gorouter MAY update its local hash table following the registration or deregistration of an endpoint, ensuring |
| 139 | + minimal rehashing |
| 140 | +- Gorouter SHOULD NOT incur any performance hit when no apps use hash-based routing |
| 141 | + |
| 142 | +For a detailed understanding of the workflows on Gorouter's side, please refer to the [activity diagrams](#diagrams). |
| 143 | + |
| 144 | +#### Cloud Controller |
| 145 | + |
| 146 | +- The `loadbalancing` property of |
| 147 | + the [route options object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object) MUST be |
| 148 | + updated to include `hash` as an acceptable value |
| 149 | +- The [route options object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object) |
| 150 | + MUST include two new properties, `hash_header` and `hash_balance`, to configure a request header as the hashing key |
| 151 | + and the balance factor |
| 152 | +- Cloud Controller MUST validate the following requirements: |
| 153 | + - The `hash_header` property is mandatory when load balancing is set to hash |
| 154 | + - The `hash_balance` property is optional when load balancing is set to hash. Leaving out `hash_balance` or setting |
| 155 | + it explicitly to 0 means the load situation will not be considered |
| 156 | + - To account for overload situations, `hash_balance` values should be greater than 1.1. During the implementation |
| 157 | + phase, the values will be evaluated to identify the best fit for the recommended range |
| 158 | + - For load balancing algorithms other than hash, the `hash_balance` and `hash_header` properties MUST NOT be set |
| 159 | + |
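The validation rules above can be sketched as follows. This is an illustrative Python sketch, not Cloud Controller code; the function name `validate_route_options` and the error messages are hypothetical. Checking that `hash_balance` parses as a number is an assumption. The recommended lower bound of 1.1 is not enforced here, because the RFC leaves the final range open.

```python
def validate_route_options(options: dict) -> list:
    """Return a list of validation errors for a route options map."""
    errors = []
    if options.get("loadbalancing") == "hash":
        # hash_header is mandatory for hash-based load balancing.
        if not options.get("hash_header"):
            errors.append("hash_header is required when loadbalancing is hash")
        # hash_balance is optional; omitting it or setting 0 disables the load check.
        balance = options.get("hash_balance")
        if balance is not None:
            try:
                float(balance)
            except (TypeError, ValueError):
                # Assumption: non-numeric values are rejected.
                errors.append("hash_balance must be a number")
    else:
        # The hash-specific properties are only valid together with hash.
        for key in ("hash_header", "hash_balance"):
            if key in options:
                errors.append(f"{key} can only be set when loadbalancing is hash")
    return errors


print(validate_route_options({"loadbalancing": "hash", "hash_balance": "1.25"}))
# ['hash_header is required when loadbalancing is hash']
```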
| 160 | +An example manifest with these properties: |
| 161 | + |
| 162 | +```yaml |
| 163 | +version: 1 |
| 164 | +applications: |
| 165 | +  - name: test |
| 166 | +    routes: |
| 167 | +      - route: test.example.com |
| 168 | +        options: |
| 169 | +          loadbalancing: hash |
| 170 | +          hash_header: tenant-id |
| 171 | +          hash_balance: 1.25 |
| 172 | +      - route: anothertest.example.com |
| 173 | +        options: |
| 174 | +          loadbalancing: least-connection |
| 175 | +``` |
| 176 | + |
| 177 | +The decision to introduce plain keys was influenced by the following points: |
| 178 | + |
| 179 | +- Simple to use |
| 180 | +- It allows for easy addition of more load-balancing-related properties if new requirements arise in the future |
| 181 | +- It complies with |
| 182 | + the [RFC #0027 that introduced per-route options](https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0027-generic-per-route-features.md#proposal), |
| 183 | + which states that the map must use strings as keys and can use numbers, strings, and the literals true and false as |
| 184 | + values |
| 185 | + |
| 186 | +### Components Where No Changes Are Required |
| 187 | + |
| 188 | +#### CF CLI |
| 189 | + |
| 190 | +The [current implementation of route options in the CF CLI](https://github.com/cloudfoundry/cli/blob/main/resources/options_resource.go) |
| 191 | +supports the use of `--option KEY=VALUE`, where the key and value are sent directly to CC for validation. Consequently, |
| 192 | +the `create-route`, `update-route`, and `map-route` commands require no modifications, as they already accept the |
| 193 | +proposed properties. |
| 194 | +Example: |
| 195 | + |
| 196 | +```bash |
| 197 | +cf create-route example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=1.25 |
| 198 | +cf update-route example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=1.25 |
| 199 | +cf update-route example.com -n test -o loadbalancing=hash -o hash_header=tenant-id |
| 200 | +cf update-route example.com -n test -o loadbalancing=hash -o hash_balance=1.25 |
| 201 | +cf map-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=1.25 |
| 202 | +``` |
| 203 | + |
| 204 | +#### Route-Emitter |
| 205 | + |
| 206 | +The options are raw JSON and will be passed directly to the Gorouter without any modifications. |
| 207 | + |
| 208 | +#### Route-Registrar |
| 209 | + |
| 210 | +Implementing hash-based routing in route-registrar for platform routes is out of scope for this RFC. |
| 211 | + |
| 212 | +### Diagrams |
| 213 | + |
| 214 | +#### An activity diagram for routing decision for an incoming request |
| 215 | + |
| 216 | + |
| 217 | + |
| 218 | +#### A simplified activity diagram for Gorouter's endpoint registration process |
| 219 | + |
| 220 | + |