- [Width of the request](#width-of-the-request)
  - [Determining the width](#determining-the-width)
  - [Dispatching the request](#dispatching-the-request)
- [Support for WATCH requests](#support-for-watch-requests)
  - [Watch initialization](#watch-initialization)
  - [Keeping the watch up-to-date](#keeping-the-watch-up-to-date)
    - [Estimating cost of the request](#estimating-cost-of-the-request)
    - [Multiple apiservers](#multiple-apiservers)
    - [Cost of the watch event](#cost-of-the-watch-event)
    - [Dispatching the request](#dispatching-the-request-1)
- [Example Configuration](#example-configuration)
- [Reaction to Configuration Changes](#reaction-to-configuration-changes)
- [Default Behavior](#default-behavior)

approximating the cost of processing the queued items. Given that the
total cost of processing a request is now `<width> x <processing latency>`,
the weight of the queue should now reflect that.

### Support for WATCH requests

The next thing to consider is support for long-running requests. However,
solving this in the generic case is hard, because we have no way to predict
how expensive such requests will be. Moreover, for requests like port
forwarding the cost is completely outside our control (being
application-specific). As a result, as a first step we are going to limit
our focus to just WATCH requests.

However, even for WATCH requests there are effectively two separate
problems that have to be considered and addressed.

#### Watch initialization

While this is an important piece to address, to allow incremental progress
we are deferring this problem to a future release.

A note for when we get to it:

The most compatible approach would be to simply include WATCH requests
among those processed by the APF dispatcher, but allow them to send an
artificial `request finished` signal to the dispatcher once their
initialization is done. However, that requires a more detailed design.

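To make the idea concrete, the sketch below shows what such a signal could
look like. It is only a feasibility sketch: the `Dispatcher` and `Watcher`
interfaces and all other names are hypothetical, not the actual APF code.

```go
// Sketch: a WATCH handler occupies a concurrency slot only until its
// initialization completes, then releases it with an explicit signal
// while the watch itself keeps running.
package sketch

// Dispatcher is a stand-in for the APF dispatching machinery.
type Dispatcher interface {
	// Admit blocks until the request may proceed and returns a callback
	// that tells the dispatcher the request has finished.
	Admit(flowID string) (finished func())
}

// Watcher is a stand-in for an initialized, long-running watch.
type Watcher interface {
	Serve() error
}

func handleWatch(d Dispatcher, flowID string, initialize func() (Watcher, error)) error {
	finished := d.Admit(flowID)
	w, err := initialize() // computing/sending the initial state is under APF control
	finished()             // artificial `request finished` signal to the dispatcher
	if err != nil {
		return err
	}
	return w.Serve() // the long-running phase no longer holds a concurrency slot
}
```
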
#### Keeping the watch up-to-date

Once the watch is initialized, we have to keep it up-to-date with incoming
changes. That basically means that whenever an object that a particular
watch is interested in is added, updated, or deleted, an appropriate event
has to be sent.

As in the generic case of long-running requests, predicting this cost up
front is impossible. Fortunately, in this case we have full control over
those events, because they are effectively the result of the mutating
requests that our mechanism is explicitly admitting. So instead of
attributing the cost of sending watch events to the corresponding WATCH
request, we will reverse the situation and attribute the cost of sending
watch events to the mutating request that triggers them.

In other words, we will be throttling the write requests to ensure that the
apiserver can keep up with the watch traffic, instead of throttling the
watchers themselves. The main reason is that throttling watchers isn't
really effective: either we have to send them all the events anyway, or, if
we close their connections, they will simply resume watching from the last
received event. So we gain nothing by throttling them.

##### Estimating cost of the request

Let's start with the assumption that sending every watch event is equally
expensive. We will discuss how to generalize this below.

With the above assumption, the cost of a mutating request associated with
the watch events it triggers is proportional to the number of watchers that
have to process those events. So let's describe how we can estimate this
number.

Obviously, we can't afford iterating over all watches to compute it - we
need to keep this information precomputed. We will simply store an
in-memory map from a (resource type, namespace, name) tuple to the number
of open watches that are interested in a given event. The size of that map
won't be larger than the total number of watches, so that is acceptable.

Note that each watch can also specify label and field selectors. However,
in most cases a particular object has to be processed for those watches
anyway, just to check whether the selectors are satisfied. So we can ignore
the selectors, as the object contributes to the cost even if it ends up not
being sent because it doesn't satisfy them. The only exception is the few
predefined selectors that kube-apiserver is optimized for (pods from a
given node and nodes/secrets/configmaps specifying a metadata.name field
selector). Given their simplicity, we can extend our mapping to handle
those too.

With such an in-memory map, we can quickly estimate the cost of a request.
It's not as simple as looking up a single map entry, because watches on the
whole namespace and on all objects of a given type also have to be added
in, but it can still be done in O(1) map accesses. Keeping the map
up-to-date is also easy: whenever a new watch starts we increment the
corresponding entry, and when it ends we decrement it.

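A minimal sketch of such a tracker is shown below. All names are
hypothetical, and the special-cased selectors mentioned above are omitted
for brevity:

```go
// Sketch of the in-memory watch-count map described above.
package watchcount

import "sync"

// key identifies a set of watches: namespace and/or name may be empty,
// matching watches on the whole namespace or the whole resource type.
type key struct {
	resource  string // e.g. "pods"
	namespace string // "" means "all namespaces"
	name      string // "" means "all objects"
}

type Tracker struct {
	mu     sync.Mutex
	counts map[key]int
}

func NewTracker() *Tracker {
	return &Tracker{counts: make(map[key]int)}
}

// OnWatchStarted and OnWatchEnded keep the map up-to-date as watches
// come and go.
func (t *Tracker) OnWatchStarted(resource, namespace, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[key{resource, namespace, name}]++
}

func (t *Tracker) OnWatchEnded(resource, namespace, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := key{resource, namespace, name}
	if t.counts[k]--; t.counts[k] == 0 {
		delete(t.counts, k)
	}
}

// InterestedWatchers estimates, in O(1) map accesses, how many watchers
// have to process an event for the given object: exact-object watches,
// plus whole-namespace watches, plus whole-resource-type watches.
func (t *Tracker) InterestedWatchers(resource, namespace, name string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.counts[key{resource, namespace, name}] +
		t.counts[key{resource, namespace, ""}] +
		t.counts[key{resource, "", ""}]
}
```
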
##### Multiple apiservers

All of the above works well in the case of a single kube-apiserver. But if
there are N kube-apiservers, there is no guarantee that watches are evenly
distributed across them.

To address the problem, individual kube-apiservers have to publish
information about their watch counts so the other kube-apiservers can see
it. We obviously don't want to introduce a new communication channel, so
this can only be done by writing the necessary information to the storage
layer (etcd). However, writing a map that can contain tens (or hundreds?)
of thousands of entries wouldn't be efficient. So we need to hash it
smartly into a smaller structure without losing too much information.

If we have a hashing function that only combines similar buckets (e.g. one
that won't combine the "all Endpoints" bucket with the "pods from node X"
bucket), then we can simply write the maximum over all entries that hash to
the same value. This means that some costs may be overestimated, but as
long as we hash the requests originating from system components reasonably,
that seems acceptable. The above can be achieved by hashing each resource
type to a separate set of buckets, and within a resource type hashing
(namespace, name) as simply as:
```
  hash(ns, name) = 0                               if namespace == "" && name == ""
  hash(ns, name) = 1 + hash(namespace)%A           if name == ""
  hash(ns, name) = 1 + A + hash(namespace/name)%B  otherwise
```
For small enough A and B (e.g. A=3, B=6), the representation will have
fewer than 1000 entries, which is small enough to make periodic updates in
etcd reasonable.

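For illustration, a Go version of this bucketing could look as follows. The
choice of FNV as the hash function is arbitrary, and A and B are the tuning
constants from the formula above:

```go
// Sketch of the per-resource-type bucketing of watch scopes.
package watchbuckets

import "hash/fnv"

const (
	A = 3 // buckets for namespace-wide watches
	B = 6 // buckets for single-object watches
)

// h is an arbitrary string hash (FNV-1a here).
func h(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

// bucket maps a watch scope to one of 1+A+B buckets per resource type.
func bucket(namespace, name string) uint32 {
	switch {
	case namespace == "" && name == "":
		return 0 // watches on the whole resource type
	case name == "":
		return 1 + h(namespace)%A
	default:
		return 1 + A + h(namespace+"/"+name)%B
	}
}
```

Each kube-apiserver would then publish, per resource type, the maximum
watch count over all map entries falling into each bucket.
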
We can reduce the amount of data written to etcd by checking frequently
(say, once per second) what has changed, but writing rarely (say, once per
minute), or earlier if the values in some buckets have significantly
increased. This also lets us avoid more complicated time-smearing: whenever
something grows quickly we report it immediately, but we don't immediately
scale the reported values back down, which is a simple way of incorporating
some history.

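A sketch of that publishing loop is given below; `snapshot` and `publish`
are hypothetical helpers (reading the current per-bucket counts and writing
them to etcd), and the thresholds are placeholders:

```go
// Sketch: check frequently, write rarely (or immediately on rapid growth).
package sketch

import "time"

func publishLoop(snapshot func() map[uint32]int, publish func(map[uint32]int), stop <-chan struct{}) {
	const (
		checkInterval = time.Second // check frequently...
		writeInterval = time.Minute // ...but write rarely,
		growthFactor  = 2           // unless some bucket grew significantly
	)
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()

	lastWritten := map[uint32]int{}
	lastWriteTime := time.Now()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			current := snapshot()
			if time.Since(lastWriteTime) >= writeInterval ||
				grewSignificantly(lastWritten, current, growthFactor) {
				publish(current)
				lastWritten = current
				lastWriteTime = time.Now()
			}
		}
	}
}

// grewSignificantly reports whether any bucket grew by more than the given
// factor since the last write (new buckets count as significant growth).
func grewSignificantly(old, cur map[uint32]int, factor int) bool {
	for bucket, v := range cur {
		if v > factor*(old[bucket]+1) {
			return true
		}
	}
	return false
}
```
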
However, we will treat the above only as a proof of feasibility. We will
start with the simplest approach of treating each kube-apiserver
independently, and implement the above knowledge sharing between
kube-apiservers only if the independence assumption doesn't work well
enough. The description above shows that, as long as the code is well
structured, almost none of the initial work would be wasted.

##### Cost of the watch event

We assumed above that the cost of processing every watch event is equal.
In practice, however, the cost associated with sending an event consists of
two main parts:
- the cost of going through the event change logic
- the cost of processing the event object (e.g. deserialization or sending
  data over the network)

The first is roughly constant regardless of the event; the second is
roughly proportional to the size of the object. However, the size of the
object is hard to predict for PATCH or DELETE requests. Additionally, even
for POST or PUT requests, where we could potentially estimate it from the
size of the request body, we may not yet have access to the body when we
need to make the decision.

One way to estimate it is to keep a running average of watch event sizes
per bucket. While that won't give us an accurate per-event estimate, it
should amortize well over time.

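As an illustration, such a running average could be maintained as a
per-bucket exponential moving average; all names are hypothetical and the
smoothing factor is a placeholder to be tuned:

```go
// Sketch of a per-bucket running average of watch event sizes.
package eventsize

import "sync"

const alpha = 0.05 // EMA smoothing factor; a tuning parameter

type SizeEstimator struct {
	mu  sync.Mutex
	avg map[uint32]float64 // bucket -> average event size in bytes
}

func NewSizeEstimator() *SizeEstimator {
	return &SizeEstimator{avg: make(map[uint32]float64)}
}

// Observe folds the size of an actually-sent event into the estimate.
func (e *SizeEstimator) Observe(bucket uint32, sizeBytes int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	old, ok := e.avg[bucket]
	if !ok {
		e.avg[bucket] = float64(sizeBytes)
		return
	}
	e.avg[bucket] = (1-alpha)*old + alpha*float64(sizeBytes)
}

// Estimate returns the current average for a bucket (0 if nothing observed yet).
func (e *SizeEstimator) Estimate(bucket uint32) float64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.avg[bucket]
}
```
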
Obviously some normalization will be needed here, but it's impossible to
assess it on paper, so we are leaving the details to the implementation and
tuning phase.

##### Dispatching the request

We described above how to estimate the cost that a request incurs via the
watch events it triggers. But we haven't yet said how this translates into
dispatching the request.

First of all, we need to decide how to translate the estimated cost into
the `width` of the request and its latency. Second, we need to introduce
changes to our virtual world, because the fact that the request has
finished doesn't mean that sending all the associated watch events has also
finished (they are sent asynchronously).

Given that individual watch events are, to a significant extent, processed
independently in individual goroutines, it actually makes sense to adjust
the `width` of the request based on the expected number of triggered
events. However, we don't want to inflate the width of every single request
that triggers some watch event (as described in the sections above, a
greater width reduces our ability to fully utilize our capacity). The exact
function for computing the width should be figured out during further
experiments, but the initial candidate for it is:
```
  width(request) = min(floor(expected events / A), concurrency units in PL)
```

However, adjusting the width is not enough because, as mentioned above,
processing watch events happens asynchronously. As a result, we will use
the `additional latency` mechanism described in the section about LIST
requests to compensate for the asynchronous cost of the request (which in
the virtual world equals `<width> x <additional latency>`).

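Tying the two together, the sketch below shows one way the width and
additional latency of a mutating request could be derived from the expected
number of triggered events. The constants are placeholders, and in practice
the per-event cost would come from the running averages discussed above:

```go
// Sketch: deriving width and additional latency from expected watch events.
package sketch

import "time"

const (
	eventsPerWidthUnit = 10                   // the constant A in the width formula; to be tuned
	perEventLatency    = 2 * time.Millisecond // assumed cost of sending one event; to be tuned
)

// requestWidth implements
//   width(request) = min(floor(expected events / A), concurrency units in PL)
// with a floor of 1, since every request occupies at least one unit.
func requestWidth(expectedEvents, concurrencyUnitsInPL int) int {
	w := expectedEvents / eventsPerWidthUnit
	if w < 1 {
		w = 1
	}
	if w > concurrencyUnitsInPL {
		w = concurrencyUnitsInPL
	}
	return w
}

// additionalLatency spreads the asynchronous cost of sending the events
// over the request's width, so that in the virtual world
// <width> x <additional latency> equals the total asynchronous cost.
func additionalLatency(expectedEvents, width int) time.Duration {
	if width < 1 {
		width = 1
	}
	asyncCost := time.Duration(expectedEvents) * perEventLatency
	return asyncCost / time.Duration(width)
}
```
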
### Example Configuration

For requests from admins and requests in service of other, potentially