Commit 1ffcd8c

Merge pull request kubernetes#2632 from wojtek-t/pf_list_watch

Update P&F KEP with support LIST requests

2 parents 2000e28 + 79df637
keps/sig-api-machinery/1040-priority-and-fairness/README.md

Lines changed: 188 additions & 2 deletions
@@ -26,6 +26,10 @@
- [From one to many](#from-one-to-many)
- [From packets to requests](#from-packets-to-requests)
- [Not knowing service duration up front](#not-knowing-service-duration-up-front)
- [Support for LIST requests](#support-for-list-requests)
  - [Width of the request](#width-of-the-request)
  - [Determining the width](#determining-the-width)
  - [Dispatching the request](#dispatching-the-request)
- [Example Configuration](#example-configuration)
- [Reaction to Configuration Changes](#reaction-to-configuration-changes)
- [Default Behavior](#default-behavior)
@@ -627,7 +631,6 @@
The Fair Queuing for Server Requests algorithm below is used to pick a
non-empty queue at that priority level. Then the request at the head
of that queue is dispatched.

#### Fair Queuing for Server Requests

This is based on fair queuing but is modified to deal with serving
@@ -1066,8 +1069,191 @@
the remaining requests in that queue start getting faster service. In
both cases, the service delivery in the virtual world has reacted
properly to the true service duration.

### Support for LIST requests

Up until now, we have assumed that even though requests aren't
necessarily equally expensive, their actual cost is largely reflected
by the time it takes to process them, and that while being processed
each of them consumes an equal amount of resources.

This works well for requests that touch only a single object.
However, given that in practice the concurrency limits have to be
set much higher than the number of available cores to achieve reasonable
system throughput, it no longer works that well for LIST requests, which
are orders of magnitude more expensive. There are two aspects to that:
- for CPU, the hand-wavy way of rationalizing it is that the ratio of time
  the request is processed by the processor to the total time of processing
  the request starts to visibly differ (e.g. due to I/O waiting time -
  there is communication with etcd in between, for example)
- for memory, the reasoning is more obvious, as we simply keep all the
  elements that we process in memory

As a result, kube-apiserver (and etcd) may easily keep up with N
simple in-flight requests (e.g. creating or getting a single Pod), but
will explode trying to process N requests listing all the pods in the
system at the same time.

#### Width of the request

In order to address this problem, we are introducing the concept of the
`width` of a request. Instead of saying that every request consumes a
single unit of concurrency, we allow a request to consume `<width>` units
of concurrency while being processed.

This basically means that the cost of processing a given request is no
longer reflected by its `<processing latency>` alone; instead, its cost is
equal to `<width> x <processing latency>`. The rationale behind this is that
the request is now consuming `<width>` concurrency units for the duration
of its processing.

While in theory the `width` can be an arbitrary non-integer number, for
practical reasons we will assume it is an integer. Given that our
estimations here are very rough anyway, that seems a reasonable
simplification which makes dispatching the budget a bit simpler.

#### Determining the width

While one can imagine arbitrarily sophisticated algorithms for it (including
exposing the definition of request width via the FlowSchema API), we want to
start with something relatively simple to first get operational experience
before investing in sophisticated algorithms or exposing a knob to users.

In order to determine the function that will approximate the `width` of
a request, we should first estimate how expensive a particular request is.
We need to think about both dimensions that we're trying to protect from
overload (CPU and RAM) and how many concurrency units a request can actually
consume.

Let's start with CPU. The total cost of processing a LIST request should be
proportional to the number of processed objects. However, given that in
practice processing a single request isn't parallelized (and the fact that
we generally scale the total number of concurrency units linearly with the
amount of available resources), a single request should consume no more than
A concurrency units. Fortunately this all fits together, because the
`<processing latency>` of a LIST request is actually proportional to the
number of processed objects, so the cost of the request (defined above as
`<width> x <processing latency>`) really is proportional to the number of
processed objects, as expected.

For RAM the situation is different. In order to process a LIST request,
we store all objects that we process in memory. Given that memory is an
incompressible resource, we effectively need to reserve all that memory
for the whole time of processing the request. That suggests that the
`width` of the request from the RAM perspective should be proportional
to the number of processed items.

So what we get is that:
```
width_cpu(N) = min(A, B * N)
width_ram(N) = D * N
```
where N is the number of items a given LIST request is processing.

The question is how to combine them into a single number. While the main
goal is to stay on the safe side and protect from overload, we also want to
maximize the utilization of the available concurrency units.
Fortunately, when we normalize CPU and RAM to percentages of available
capacity, it appears that almost all requests are much more cpu-intensive.
Assuming a 4GB:1CPU ratio, a 10kB average object, and the fact that
processing a larger number of objects can utilize exactly 1 core, we would
need to process 400,000 objects for the memory cost to become the higher
one. This means that we can afford the potential minor inefficiency that
extremely large requests would cause, and simply approximate the width by
protecting every resource independently, which translates to the following
function:

```
width(N) = max(min(A, B * N), D * N)
```
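
As a quick sanity check of the back-of-envelope threshold mentioned above (using the decimal GB/kB values assumed in the text, which are themselves rough assumptions):

```python
# Rough sanity check (assumed, illustrative numbers from the text):
# with a 4GB:1CPU ratio and a ~10kB average object size, the memory
# term D * N starts to dominate only once a single request processes
# about 4GB / 10kB objects.
GB = 1_000_000_000  # decimal gigabyte
KB = 1_000          # decimal kilobyte

objects_until_memory_bound = (4 * GB) // (10 * KB)
print(objects_until_memory_bound)  # 400000
```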

We're going to tune the function further based on experiments, but given
that the above back-of-envelope calculations show that memory should almost
never be a limiting factor, we will approximate the width simply with:
```
width_approx(N) = min(A, ceil(N / E)), where E = 1 / B
```
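
As an illustrative sketch of the approximation (Python used only for exposition; `A` and `E` are placeholder constants, not tuned values):

```python
import math

# Placeholder constants (assumptions, to be tuned experimentally):
# A - the maximum number of seats a single request may consume,
# E = 1 / B - the number of processed objects per concurrency unit.
A = 10
E = 100

def width_approx(n: int) -> int:
    """Approximate the width of a LIST request processing n objects."""
    return min(A, math.ceil(n / E))
```

With these placeholder values, a LIST touching 250 objects would get width 3, while arbitrarily large LISTs are capped at A = 10 seats.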

Fortunately that logic will be well separated and purely in-memory, so we
can arbitrarily adjust it in future releases.

Given that the estimation is a well separated piece of logic, we can decide
to replace it with much more sophisticated logic later (e.g. taking into
account whether the request is served from etcd or from the cache, whether
it is namespaced or not, etc.).

One more important aspect to resolve is what happens if a given priority
level doesn't have enough concurrency units assigned to it. To be on the
safe side we should probably implement borrowing across priority levels.
However, given that we don't want to block introducing the `width` concept
on the design and implementation of borrowing, until that is done we have
two main options:
- cap the `width` at the number of concurrency units assigned to the
  priority level
- reject requests for which we won't be able to allocate enough concurrency
  units

To avoid breaking users, we will proceed with the first option (when
computing the cap we should also report requests that we believe are too
wide for a given priority level - that would allow operators to adjust
their configs). That said, to accommodate the inaccuracy here, we will
introduce the concept of `additional latency` for a request. This basically
means that after the request finishes in the real world, we still don't
mark it as finished in the virtual world for an extra `additional latency`.
Adjusting the virtual time of a queue to do that is trivial. The other
thing to tweak is ensuring that the concurrency units will not become
available to other requests during that time (because currently all actions
are triggered by starting or finishing some request). We will maintain that
invariant by wrapping the handler into another one that will sleep for
`additional latency` after the request is processed.
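
A minimal sketch of that wrapping (Python for exposition only; the real handler chain is the apiserver's Go filter stack, and `additional_latency_s` is an illustrative parameter name):

```python
import time

def with_additional_latency(handler, additional_latency_s):
    """Wrap a request handler so that, after the real work finishes,
    the request keeps occupying its seats for an extra
    `additional_latency_s` seconds before being marked as finished."""
    def wrapped(request):
        result = handler(request)
        # From the dispatcher's point of view the request is still
        # executing during this sleep, so its seats stay occupied.
        time.sleep(additional_latency_s)
        return result
    return wrapped
```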

Note that given that the estimated duration of processing a request is
automatically corrected (both up and down), there is no need to change
that in the initial version.

#### Dispatching the request

The hardest part of adding support for LIST requests is dispatching them.
Now, in order to start processing a request, it has to accumulate `<width>`
units of concurrency.

The important requirement to recast now is fairness. As soon as a single
request can consume more units of concurrency, fairness is no longer about
the number of requests from a given queue, but rather about the number of
consumed concurrency units. This justifies the above definition of
adjusting the cost of the request to be equal to
`<width> x <processing latency>` (instead of just `<processing latency>`).

At the same time, we want to maximally utilize the available capacity.
In other words, we want to minimize the time when some concurrency unit
is not used, but there are requests at a given PL that could use it.

In order to achieve the above goals, we are introducing the following
modifications to the current dispatching algorithm:
- as soon as we choose the request to dispatch (i.e. the queue from which
  the first request should be dispatched), we start accumulating concurrency
  units until we accumulate `<width>` of them, and only then dispatch the
  request. In other words, if the chosen request has width `<width>` and
  there are fewer than `<width>` available seats, we don't dispatch any
  other request (at a given priority level) until we have `<width>`
  available seats, at which point we dispatch this request.
  Such an approach (as opposed to dispatching individual concurrency units
  independently one-by-one) allows us to not waste too many seats and to
  avoid deadlocks that could arise if we were dispatching seats to multiple
  LIST requests without having enough of them for a given priority level.
- however, to ensure fairness (especially over longer periods of time),
  we also need to change how virtual time is advanced. We will change the
  semantics of the virtual time tracked by the queues to correspond to work,
  instead of just wall time. That means that when we estimate a request's
  virtual duration, we will use `estimated width x estimated latency` instead
  of just the estimated latency. And when a request finishes, we will update
  the virtual time for it with `seats x actual latency` (note that seats
  will always equal the estimated width, since we have no way to figure out
  whether a request used less concurrency than we granted it).
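
The two modifications above can be sketched as follows (illustrative Python, not the actual implementation; queue selection and the fair-queuing bookkeeping are elided):

```python
class PriorityLevel:
    """Illustrative sketch of width-aware dispatching at one priority
    level: once a request is chosen, nothing else is dispatched here
    until it has accumulated `width` seats."""

    def __init__(self, total_seats: int):
        self.total_seats = total_seats
        self.seats_in_use = 0
        self.pending_width = None  # chosen request waiting for its seats

    def try_dispatch(self, width: int) -> bool:
        # Remember the chosen request; it blocks further dispatch at
        # this priority level until enough seats free up.
        if self.pending_width is None:
            self.pending_width = width
        if self.total_seats - self.seats_in_use >= self.pending_width:
            self.seats_in_use += self.pending_width
            self.pending_width = None
            return True   # dispatched
        return False      # keep accumulating seats

    def finish(self, width: int, actual_latency: float) -> float:
        """Release seats; return the request's virtual duration, which
        now corresponds to work: seats x actual latency."""
        self.seats_in_use -= width
        return width * actual_latency
```

For example, with 10 total seats, a width-4 request dispatches immediately; a subsequent width-8 request waits (blocking the level) until the first one finishes and its seats are released.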

However, the queueing mechanism also requires adjustment now. So far,
when putting a request into the queue, we were choosing the shortest queue.
That worked because queue length was a good proxy for the total cost of
processing all the requests in that queue.
After the above changes, the size of the queue no longer correctly
approximates the cost of processing the queued items. Given that the
total cost of processing a request is now `<width> x <processing latency>`,
the weight of the queue should now reflect that.
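
A sketch of the adjusted queue choice (illustrative Python; representing queued requests as `(estimated_width, estimated_latency)` pairs is an assumption of this sketch):

```python
def choose_queue(queues):
    """Instead of the shortest queue (fewest requests), pick the queue
    with the least total estimated work, where a request's work is
    estimated_width * estimated_latency."""
    def total_work(queue):
        return sum(width * latency for width, latency in queue)
    return min(range(len(queues)), key=lambda i: total_work(queues[i]))
```

For example, a queue holding three cheap requests (total work 3.0) would be preferred over a shorter queue holding a single wide LIST (total work 10.0).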

### Example Configuration

For requests from admins and requests in service of other, potentially
system, requests.
