- [From one to many](#from-one-to-many)
- [From packets to requests](#from-packets-to-requests)
- [Not knowing service duration up front](#not-knowing-service-duration-up-front)
- [Support for LIST requests](#support-for-list-requests)
  - [Width of the request](#width-of-the-request)
  - [Determining the width](#determining-the-width)
  - [Dispatching the request](#dispatching-the-request)
- [Example Configuration](#example-configuration)
- [Reaction to Configuration Changes](#reaction-to-configuration-changes)
- [Default Behavior](#default-behavior)
The Fair Queuing for Server Requests algorithm below is used to pick a
non-empty queue at that priority level. Then the request at the head
of that queue is dispatched.

#### Fair Queuing for Server Requests

This is based on fair queuing but is modified to deal with serving
the remaining requests in that queue start getting faster service. In
both cases, the service delivery in the virtual world has reacted
properly to the true service duration.

### Support for LIST requests

Up until now, we have assumed that even though requests are not necessarily
equally expensive, their actual cost is well reflected by the time it takes
to process them, and that while being processed each of them consumes an
equal amount of resources.

This works well for requests that touch only a single object. However,
given that in practice the concurrency limit has to be set much higher than
the number of available cores to achieve reasonable system throughput, it
no longer works that well for LIST requests, which are orders of magnitude
more expensive. There are two aspects to that:
- for CPU, the hand-wavy way of rationalizing it is that the ratio of the
  time the request spends on the processor to the wall-clock time of
  processing the request starts to visibly differ (e.g. due to I/O waiting
  time - there is communication with etcd in between, for example);
- for memory, the reasoning is more obvious: we simply keep all elements
  that we process in memory.

As a result, kube-apiserver (and etcd) may easily keep up with N simple
in-flight requests (e.g. creating or getting a single Pod), but will explode
trying to process N requests listing all the pods in the system at the same
time.

#### Width of the request

In order to address this problem, we are introducing the concept of the
`width` of a request. Instead of saying that every request consumes a single
unit of concurrency, we allow a request to consume `<width>` units of
concurrency while being processed.

This basically means that the cost of processing a given request is no
longer reflected by its `<processing latency>` alone; instead, its cost is
now equal to `<width> x <processing latency>`. The rationale behind it is
that the request is now consuming `<width>` concurrency units for the
duration of its processing.

While in theory the `width` can be an arbitrary non-integer number, for
practical reasons we will assume it actually is an integer. Given that our
estimations here are very rough anyway, that seems a reasonable
simplification that makes dispatching the budget a bit simpler.

#### Determining the width

While one can imagine arbitrarily sophisticated algorithms for it (including
letting users define the width of requests via the FlowSchema API), we want
to start with something relatively simple, to first get operational
experience with it before investing in sophisticated algorithms or exposing
a knob to users.

In order to determine the function that will approximate the `width` of a
request, we should first estimate how expensive a particular request is.
And we need to think about both dimensions that we're trying to protect
from overloading (CPU and RAM) and how many concurrency units a request can
actually consume.

Let's start with CPU. The total cost of processing a LIST request should be
proportional to the number of processed objects. However, given that in
practice processing a single request isn't parallelized (and the fact that
we generally scale the total number of concurrency units linearly with the
amount of available resources), a single request should consume no more
than A concurrency units. Fortunately, this all fits together because the
`<processing latency>` of a LIST request is actually proportional to the
number of processed objects, so the cost of the request (defined above as
`<width> x <processing latency>`) really is proportional to the number of
processed objects, as expected.

For RAM the situation is different. In order to process a LIST request, we
store all objects that we process in memory. Given that memory is an
incompressible resource, we effectively need to reserve all that memory for
the whole time of processing the request. That suggests that the `width` of
the request from the RAM perspective should be proportional to the number
of processed items.

So what we get is:
```
  width_cpu(N) = min(A, B * N)
  width_ram(N) = D * N
```
where N is the number of items a given LIST request is processing.

The question is how to combine them into a single number. While the main
goal is to stay on the safe side and protect from overload, we also want to
maximize the utilization of the available concurrency units. Fortunately,
when we normalize CPU and RAM to percentages of available capacity, it
appears that almost all requests are much more cpu-intensive. Assuming a
4GB:1CPU ratio, a 10kB average object, and the fact that processing a large
number of objects can utilize exactly one core, we would need to process
400,000 objects (400,000 x 10kB = 4GB) before memory becomes the higher
cost. This means that we can afford the potential minor inefficiency that
extremely large requests would cause and simply protect each resource
independently, which translates to the following function:
```
  width(N) = max(min(A, B * N), D * N)
```
We're going to tune the function further based on experiments, but given
the above back-of-envelope calculation showing that memory should almost
never be the limiting factor, we will approximate the width simply with:
```
  width_approx(N) = min(A, ceil(N / E)), where E = 1 / B
```
Fortunately, that logic will be well separated and purely in-memory, so we
can decide to arbitrarily adjust it in future releases.

Given that the estimation is a well-separated piece of logic, we can decide
to replace it with much more sophisticated logic later (e.g. taking into
account whether the request is served from etcd or from the cache, whether
it is namespaced or not, etc.).

One more important aspect to resolve is what happens if a given priority
level doesn't have enough concurrency units assigned to it. To be on the
safe side we should probably implement borrowing across priority levels.
However, given that we don't want to block introducing the `width` concept
on the design and implementation of borrowing, until that is done we have
two main options:
- cap the `width` at the number of concurrency units assigned to the
  priority level
- reject requests for which we won't be able to allocate enough concurrency
  units

To avoid breaking users, we will proceed with the first option (when
computing the cap we should also report requests that we believe are too
wide for a given priority level - that would allow operators to adjust
their configs). That said, to accommodate the resulting inaccuracy, we will
introduce the concept of `additional latency` for a request. This basically
means that after the request finishes in the real world, we still don't
mark it as finished in the virtual world for `additional latency`.
Adjusting the virtual time of a queue to do that is trivial. The other
thing to tweak is ensuring that the concurrency units will not become
available to other requests for that time (because currently all actions
are triggered by some request starting or finishing). We will preserve that
property by wrapping the handler in another one that will sleep for
`additional latency` after the request is processed.

Note that, given that the estimated duration of processing a request is
automatically corrected (both up and down), there is no need to change that
in the initial version.

#### Dispatching the request

The hardest part of adding support for LIST requests is dispatching the
requests. Now, in order to start being processed, a request has to
accumulate `<width>` units of concurrency.

The important requirement to recast now is fairness. As soon as a single
request can consume more units of concurrency, fairness is no longer about
the number of requests from a given queue, but rather about the number of
consumed concurrency units. This justifies the above definition of
adjusting the cost of the request to be equal to
`<width> x <processing latency>` (instead of just `<processing latency>`).

At the same time, we want to maximally utilize the available capacity. In
other words, we want to minimize the time when some concurrency unit is
unused while there are requests at a given priority level that could use it.

In order to achieve the above goals, we are introducing the following
modifications to the current dispatching algorithm:
- as soon as we choose the request to dispatch (i.e. the queue from which
  the first request should be dispatched), we start accumulating concurrency
  units until we accumulate `<width>` of them, and only then dispatch the
  request. In other words, if the chosen request has width `<width>` and
  there are fewer than `<width>` available seats, we don't dispatch any
  other request (at a given priority level) until we have `<width>`
  available seats, at which point we dispatch this request.
  Such an approach (as opposed to dispatching individual concurrency units
  independently, one by one) allows us not to waste too many seats and to
  avoid the deadlocks that could arise if we were dispatching seats to
  multiple LIST requests without having enough of them for a given priority
  level.
- however, to ensure fairness (especially over longer periods of time), we
  also need to change how virtual time is advanced. We will change the
  semantics of the virtual time tracked by the queues to correspond to work,
  instead of just wall time. That means that when we estimate a request's
  virtual duration, we will use `estimated width x estimated latency`
  instead of just the estimated latency. And when a request finishes, we
  will update the virtual time for it with `seats x actual latency` (note
  that seats will always equal the estimated width, since we have no way to
  figure out whether a request used less concurrency than we granted it).

However, the queueing mechanism also requires adjustment now. So far, when
putting a request into a queue, we were choosing the shortest queue. That
worked because queue length was a good proxy for the total cost of
processing all the requests in that queue.
After the above changes, the length of the queue no longer correctly
approximates the cost of processing the queued items. Given that the total
cost of processing a request is now `<width> x <processing latency>`, the
weight of the queue should now reflect that.

### Example Configuration

For requests from admins and requests in service of other, potentially
system, requests.