Commit 8b8aa9c

Merge pull request kubernetes#2669 from wojtek-t/pf_watch_1
Update P&F KEP with support WATCH requests
2 parents af83259 + 08996fc commit 8b8aa9c

File tree

  • keps/sig-api-machinery/1040-priority-and-fairness

1 file changed: +182 −0 lines changed

keps/sig-api-machinery/1040-priority-and-fairness/README.md

Lines changed: 182 additions & 0 deletions
@@ -30,6 +30,13 @@
  - [Width of the request](#width-of-the-request)
    - [Determining the width](#determining-the-width)
    - [Dispatching the request](#dispatching-the-request)
  - [Support for WATCH requests](#support-for-watch-requests)
    - [Watch initialization](#watch-initialization)
    - [Keeping the watch up-to-date](#keeping-the-watch-up-to-date)
      - [Estimating cost of the request](#estimating-cost-of-the-request)
      - [Multiple apiservers](#multiple-apiservers)
      - [Cost of the watch event](#cost-of-the-watch-event)
      - [Dispatching the request](#dispatching-the-request-1)
  - [Example Configuration](#example-configuration)
  - [Reaction to Configuration Changes](#reaction-to-configuration-changes)
  - [Default Behavior](#default-behavior)
@@ -1253,6 +1260,181 @@ approximating the cost of processing the queued items. Given that the
total cost of processing a request is now `<width> x <processing latency>`,
the weight of the queue should now reflect that.

### Support for WATCH requests

The next thing to consider is support for long-running requests. However,
solving this in the generic case is hard, because we don't have any way to
predict how expensive those requests will be. Moreover, for requests like
port forwarding the cost is completely outside our control (being application
specific). As a result, as a first step we're going to limit our focus
to just WATCH requests.

However, even for WATCH requests, there are effectively two separate problems
that have to be considered and addressed.

#### Watch initialization

While this is an important piece to address, to allow incremental progress
we are leaving this problem for a future release.

Note for the future, when we get to this:

The most compatible approach would be to simply include WATCH requests among
those processed by the APF dispatcher, but allow them to send an artificial
`request finished` signal to the dispatcher once their initialization is done.
However, that requires more detailed design.
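
The sketch below illustrates what such an early `request finished` signal
could look like; the `dispatcher`, `Wait` and `serveWatch` names are
hypothetical and are not part of the actual APF code:

```go
// Hypothetical sketch: let a WATCH handler release its APF "seats" once its
// initialization (sending the initial state) is done, instead of when the
// long-running request itself ends. All names below are illustrative only.
package sketch

import "context"

// finishFunc is the artificial "request finished" signal.
type finishFunc func()

// dispatcher stands in for the APF dispatching machinery.
type dispatcher interface {
	// Wait admits the request and returns a function that the handler
	// calls to report that the expensive part of its work is over.
	Wait(ctx context.Context) (finished finishFunc, ok bool)
}

func serveWatch(ctx context.Context, d dispatcher, initialize, stream func() error) error {
	finished, ok := d.Wait(ctx)
	if !ok {
		return ctx.Err() // rejected or cancelled by APF
	}
	// Initialization (listing + sending initial events) is the part whose
	// cost APF should account for.
	if err := initialize(); err != nil {
		finished()
		return err
	}
	// Artificial "request finished" signal: seats are released here even
	// though the watch keeps streaming below.
	finished()
	return stream()
}
```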

#### Keeping the watch up-to-date

Once the watch is initialized, we have to keep it up-to-date with incoming
changes. That basically means that whenever some object that this particular
watch is interested in is added/updated/deleted, an appropriate event has
to be sent.

As in the generic case of long-running requests, predicting this cost
up front is impossible. Fortunately, in this case we have full control over
those events, because they are effectively a result of the mutating requests
that our mechanism is explicitly admitting.
So instead of associating the cost of sending watch events with the
corresponding WATCH request, we will reverse the situation and associate the
cost of sending watch events with the mutating request that triggered them.

In other words, we will be throttling the write requests to ensure that
the apiserver is able to keep up with the watch traffic, instead of throttling
the watchers themselves. The main reason for that is that throttling the
watchers isn't really effective: we either need to send them all the objects
anyway, or, if we close their watches, they will try to resume from the last
received event anyway. That means we don't gain anything by throttling them.

##### Estimating cost of the request

Let's start with the assumption that sending every watch event is equally
expensive. We will discuss how to generalize it below.

With the above assumption, the cost of a mutating request associated with
sending the watch events it triggers is proportional to the number of
watchers that have to process that event. So let's describe how we can
estimate this number.

Obviously, we can't afford to go over all watches to compute that - we need
to keep this information precomputed. What if we simply store an in-memory
map from a (resource type, namespace, name) tuple to the number of open
watches that are interested in a given event? The size of that map won't be
larger than the total number of watches, so that is acceptable.

Note that each watch can also specify label and field selectors. However,
in most cases a particular object has to be processed for them anyway
to check whether the selectors are satisfied. So we can ignore those
selectors, as the object contributes to the cost (even if it will not be
sent because it doesn't satisfy the selector).
The only exception to this is caused by a few predefined selectors that
kube-apiserver is optimized for (this includes pods from a given node and
nodes/secrets/configmaps specifying the metadata.name field selector). Given
their simplicity, we can extend our mapping to handle those too.

Having such an in-memory map, we can quickly estimate the cost of a request.
It's not as simple as taking a single map item, as we also have to add the
watches on the whole namespace and on all objects of a given type, but it
can be done in O(1) map accesses.
Keeping such a map up-to-date is also easy: whenever a new watch starts we
increment the corresponding entry, and when it ends we decrement it.
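
As an illustration, a minimal sketch of such a bookkeeping structure could
look as follows; the names and types are invented for this example and the
real implementation would live next to the watch cache and APF code:

```go
// Illustrative sketch: a registry counting open watches per
// (resource, namespace, name) scope, used to estimate how many watchers
// a single mutating request will fan out to.
package sketch

import "sync"

type scope struct {
	Resource  string // e.g. "pods"
	Namespace string // "" means "all namespaces"
	Name      string // "" means "all names"
}

type WatchCounter struct {
	mu     sync.Mutex
	counts map[scope]int
}

func NewWatchCounter() *WatchCounter {
	return &WatchCounter{counts: map[scope]int{}}
}

// WatchStarted and WatchStopped keep the map up-to-date.
func (c *WatchCounter) WatchStarted(s scope) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[s]++
}

func (c *WatchCounter) WatchStopped(s scope) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts[s] <= 1 {
		delete(c.counts, s)
		return
	}
	c.counts[s]--
}

// Watchers estimates how many watchers will see an event for the given
// object: exact-object watches + namespace-wide watches + cluster-wide
// watches. This is O(1) map accesses.
func (c *WatchCounter) Watchers(resource, namespace, name string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[scope{resource, namespace, name}] +
		c.counts[scope{resource, namespace, ""}] +
		c.counts[scope{resource, "", ""}]
}
```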

##### Multiple apiservers

All of the above works well in the case of a single kube-apiserver. But if
there are N kube-apiservers, there is no guarantee that the watches are
evenly distributed across them.

To address the problem, individual kube-apiservers have to publish the
information about their number of watches for other kube-apiservers. We
obviously don't want to introduce a new communication channel, so that can be
done only by writing the necessary information to the storage layer (etcd).
However, writing a map that can contain tens (or hundreds?) of thousands of
entries wouldn't be efficient. So we need to hash it down to a smaller
structure in a way that avoids losing too much information.

If we have a hashing function that only combines similar buckets
(e.g. it won't combine the "all Endpoints" bucket with "pods from node X"),
then we can simply write the maximum over all entries that are hashed to the
same value. This means that some costs may be overestimated, but if we hash
requests originating from system components reasonably, that seems acceptable.
The above can be achieved by hashing each resource type to a separate set of
buckets, and within a resource type hashing (namespace, name) as simply as:

```
hash(ns, name) = 0                                 if namespace == "" && name == ""
hash(ns, name) = 1 + hash(namespace) % A           if name == ""
hash(ns, name) = 1 + A + hash(namespace/name) % B  otherwise
```

For small enough A and B (e.g. A=3, B=6), the representation should have less
than 1000 entries, so it would be small enough to make periodic updates in
etcd reasonable.
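
A possible shape of that bucketing function, assuming an FNV hash and the
example values A=3, B=6 (all of this is illustrative, not prescribed by this
KEP):

```go
// Illustrative sketch of the per-resource bucketing described above.
package sketch

import "hash/fnv"

const (
	bucketsA = 3 // namespace-scoped buckets ("whole namespace" watches)
	bucketsB = 6 // object-scoped buckets (single-object watches)
)

func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// bucket maps (namespace, name) into one of 1 + A + B small buckets per
// resource type, so that only "similar" watch scopes are ever combined.
func bucket(namespace, name string) int {
	switch {
	case namespace == "" && name == "":
		return 0 // cluster-wide watches on this resource type
	case name == "":
		return 1 + int(hashString(namespace)%bucketsA)
	default:
		return 1 + bucketsA + int(hashString(namespace+"/"+name)%bucketsB)
	}
}
```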

We can optimize the amount of data written to etcd by frequently (say, once
per second) checking what has changed, but writing rarely (say, once per
minute), or earlier if the values in some buckets have significantly
increased.
The above algorithm lets us avoid more complicated time-smearing: whenever
something grows quickly we report it, but we don't immediately scale back
down, which is a way of incorporating some history.
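
A rough sketch of such a publishing loop; the intervals, the growth threshold
and the `current`/`publish` callbacks are assumptions made purely for
illustration:

```go
// Illustrative sketch: check bucket counts often, publish to etcd rarely,
// or immediately if some bucket grew significantly.
package sketch

import "time"

func publishLoop(current func() map[string]int, publish func(map[string]int), stop <-chan struct{}) {
	const (
		checkEvery   = time.Second // how often we look at the counts
		publishEvery = time.Minute // how often we write them anyway
		growthFactor = 2           // publish early if a bucket at least doubled
	)
	last := map[string]int{}
	lastPublish := time.Now()
	ticker := time.NewTicker(checkEvery)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			now := current()
			grew := false
			for bucket, count := range now {
				if count > last[bucket] && count >= growthFactor*last[bucket] {
					grew = true
					break
				}
			}
			if grew || time.Since(lastPublish) >= publishEvery {
				publish(now)
				last, lastPublish = now, time.Now()
			}
		}
	}
}
```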

However, we will treat the above as a feasibility proof and will start with
the simplest approach of treating each kube-apiserver independently.
We will implement the above (i.e. knowledge sharing between kube-apiservers)
only if the independence assumption doesn't work well enough.
The above description shows that, if the code is well structured, starting
that way results in almost no wasted work.

##### Cost of the watch event

We assumed above that the cost of processing every watch event is equal.
However, in practice the cost associated with sending an event consists
of two main parts:
- the cost of going through the event change logic
- the cost of processing the event object (e.g. deserialization or sending
  data over the network)

The first one is roughly equal regardless of the event; the second one is
more proportional to the size of the object.
However, the size of the object is hard to predict for PATCH or DELETE
requests. Additionally, even for POST or PUT requests, where we could
potentially estimate it based on the size of the request body, we may not yet
have access to the body when we need to make the decision.

One way to estimate it would be to keep a running average of watch event size
per bucket. While it won't give us an accurate estimate, it should amortize
well over time.
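
For illustration, such a running average could be kept as a simple
exponentially-weighted average per bucket; the smoothing factor below is an
arbitrary assumption:

```go
// Illustrative sketch: exponentially-weighted running average of watch
// event sizes, kept per bucket.
package sketch

import "sync"

const smoothing = 0.2 // weight of the newest observation (assumed value)

type EventSizeEstimator struct {
	mu  sync.Mutex
	avg map[string]float64 // bucket -> running average size in bytes
}

func NewEventSizeEstimator() *EventSizeEstimator {
	return &EventSizeEstimator{avg: map[string]float64{}}
}

// Observe records the size of an event that was actually sent for a bucket.
func (e *EventSizeEstimator) Observe(bucket string, sizeBytes int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	old, ok := e.avg[bucket]
	if !ok {
		e.avg[bucket] = float64(sizeBytes)
		return
	}
	e.avg[bucket] = (1-smoothing)*old + smoothing*float64(sizeBytes)
}

// Estimate returns the current estimate for a bucket (0 if never observed).
func (e *EventSizeEstimator) Estimate(bucket string) float64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.avg[bucket]
}
```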

Obviously some normalization will be needed here, but it's impossible to
assess it on paper, so we are leaving the details for the implementation and
tuning phase to figure out.

##### Dispatching the request

We described how we can estimate the cost of a request associated with the
watch events it triggers. But we haven't yet said how this translates to
dispatching the request.

First of all, we need to decide how to translate the estimated cost into the
`width` of the request and its latency. Second, we need to introduce changes
to our virtual world, as the fact that the request finished doesn't mean that
sending all associated watch events has also finished (they are sent
asynchronously).

Given that individual watch events are to a significant extent processed
independently in individual goroutines, it actually makes sense to adjust the
`width` of the request based on the expected number of triggered events.
However, we don't want to inflate the width of every single request that
triggers some watch event (as described in the sections above, setting the
width higher reduces our ability to fully utilize our capacity).
The exact function for computing the width should be figured out during
further experiments, but the initial candidate for it would be:

```
width(request) = min(floor(expected events / A), concurrency units in PL)
```

However, adjusting the width is not enough because, as mentioned above,
processing watch events happens asynchronously. As a result, we will use the
`additional latency` mechanism described in the section about LIST requests
to compensate for the asynchronous cost of the request (which in the virtual
world equals `<width> x <additional latency>`).
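
To make the two mechanisms concrete, here is an illustrative sketch that
derives both the width and the additional latency from the expected number of
events; the constants `eventsPerSeat` (playing the role of `A`) and
`eventCost` are assumed tuning knobs, not values defined by this KEP:

```go
// Illustrative sketch: turning the expected number of watch events into a
// width and an additional latency for the virtual world.
package sketch

import "time"

const (
	eventsPerSeat = 10                    // "A" in the width formula (assumed)
	eventCost     = 50 * time.Microsecond // assumed average cost of one event
)

type estimate struct {
	Width             int
	AdditionalLatency time.Duration
}

func watchWorkEstimate(expectedEvents, concurrencyUnitsInPL int) estimate {
	// width(request) = min(floor(expected events / A), concurrency units in PL),
	// clamped to at least 1 seat (the base width of any request).
	width := expectedEvents / eventsPerSeat
	if width > concurrencyUnitsInPL {
		width = concurrencyUnitsInPL
	}
	if width < 1 {
		width = 1
	}
	// The events are sent asynchronously after the request returns, so the
	// remaining cost is charged as additional latency; the virtual cost of
	// the request becomes <width> x <additional latency>.
	total := time.Duration(expectedEvents) * eventCost
	return estimate{
		Width:             width,
		AdditionalLatency: total / time.Duration(width),
	}
}
```

In this form a request that triggers many events occupies several seats for a
short extra time, matching the intuition that its watch fan-out is processed
concurrently.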
### Example Configuration

For requests from admins and requests in service of other, potentially
