Commit 09a5eab

Merge pull request kubernetes#3667 from wojtek-t/consistent_streaming_support_api
Update API proposal for KEP 3157
2 parents fe68421 + 96ae187 commit 09a5eab

1 file changed: +55 -22 lines changed

keps/sig-api-machinery/3157-watch-list/README.md

Lines changed: 55 additions & 22 deletions
@@ -85,7 +85,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Proposal](#proposal)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
-- [Required changes for a WATCH request with the RV="" and the ResourceVersionMatch=MostRecent](#required-changes-for-a-watch-request-with-the-rv-and-the-resourceversionmatchmostrecent)
+- [Required changes for a WATCH request with the SendInitialEvents=true](#required-changes-for-a-watch-request-with-the-sendinitialeventstrue)
 - [API changes](#api-changes)
 - [Important optimisations](#important-optimisations)
 - [Manual testing without the changes in place](#manual-testing-without-the-changes-in-place)
@@ -179,7 +179,7 @@ The kube-apiserver is vulnerable to memory explosion.
 The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.
 Uncontrolled and unbounded memory consumption of the servers does not only affect clusters that operate in an
 HA mode but also other programs that share the same machine.
-In this KEP we propose a potential solution to this issue.
+In this KEP we propose a solution to this issue.

 ## Motivation

@@ -257,7 +257,7 @@ The "Design Details" section below is for the real
 nitty-gritty.
 -->

-In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
+In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use streaming from the watch-cache instead of paging from etcd.
 Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
 The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
 This would allow us to keep memory allocations constant.
@@ -266,17 +266,17 @@ plus a few additional allocations, that will be explained later in this document
 The rough idea/plan is as follows:

 - step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
-- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
-- step 2a: send all objects currently stored in memory for the given resource.
+- step 2: upon receiving the request from an informer, compute the RV at which the result should be returned (possibly contacting etcd if consistent read was requested). It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will meet the consistency requirements of the request.
+- step 2a: send all objects currently stored in memory for the given resource type.
 - step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
 - step 2c: send a bookmark event to the informer with the given RV.
 - step 3: listen for further events using the request from step 1.

 Note: the proposed watch-list semantics (without bookmark event and without the consistency guarantee) kube-apiserver follows already in RV="0" watches.
 The mode is not used in informers today but is supported by every kube-apiserver for legacy, compatibility reasons.
-A watch started with RV="0" may return stale. It is possible for the watch to start at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches
+A watch started with RV="0" may return stale data. It is possible for the watch to start at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches.

-Note 2: informers need consistent lists to avoid time-travel when switching to another HA instance of kube-apiserver with outdated/lagging watch cache.
+Note 2: informers need consistent lists when initializing after a restart, to avoid time travel in case of switching to another HA instance of kube-apiserver with an outdated/lagging watch cache.
 See the following [issue](https://github.com/kubernetes/kubernetes/issues/59848) for more details.


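The plan in the hunk above (steps 1 to 3) can be sketched from the informer's side roughly as follows. This is a minimal illustration, not the reflector implementation: the `Event` type, the `collectInitialState` helper, and the channel standing in for the watch stream are all hypothetical, and the event type names follow the ADDED/MODIFIED/DELETED/BOOKMARK convention of watch events.

```go
package main

import "fmt"

// EventType names the kinds of watch events; the type itself is illustrative.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
	Bookmark EventType = "BOOKMARK"
)

// Event is a simplified, hypothetical stand-in for a watch event.
type Event struct {
	Type            EventType
	Key             string
	Value           string
	ResourceVersion string
}

// collectInitialState applies events to a temporary store until the first
// bookmark arrives (step 2c), then returns the store and the bookmark's RV.
// ok is false if the stream ended before any bookmark was seen.
func collectInitialState(events <-chan Event) (store map[string]string, rv string, ok bool) {
	store = make(map[string]string)
	for ev := range events {
		switch ev.Type {
		case Added, Modified:
			store[ev.Key] = ev.Value
		case Deleted:
			delete(store, ev.Key)
		case Bookmark:
			// Initial data is complete; the caller keeps reading the same
			// stream for further events (step 3).
			return store, ev.ResourceVersion, true
		}
	}
	return store, "", false
}

func main() {
	// A buffered channel simulates the server side of steps 2a to 2c.
	events := make(chan Event, 5)
	events <- Event{Type: Added, Key: "pod-a", Value: "v1", ResourceVersion: "10"}
	events <- Event{Type: Added, Key: "pod-b", Value: "v1", ResourceVersion: "11"}
	events <- Event{Type: Modified, Key: "pod-a", Value: "v2", ResourceVersion: "12"}
	events <- Event{Type: Deleted, Key: "pod-b", ResourceVersion: "13"}
	events <- Event{Type: Bookmark, ResourceVersion: "13"}
	close(events)

	store, rv, ok := collectInitialState(events)
	fmt.Println(len(store), store["pod-a"], rv, ok)
}
```

Because objects arrive one at a time, the client never holds a full LIST response in a single buffer, which is the memory property the proposal is after.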
@@ -310,7 +310,7 @@ required) or even code snippets. If there's any ambiguity about HOW your
 proposal will be implemented, this is the place to discuss them.
 -->

-### Required changes for a WATCH request with the RV="" and the ResourceVersionMatch=MostRecent
+### Required changes for a WATCH request with the SendInitialEvents=true

 The following sequence diagram depicts steps that are needed to complete the proposed feature.
 A high-level overview of each was provided in a table that follows immediately the diagram.
@@ -328,11 +328,11 @@ Whereas further down in this section we provided a detailed description of each
 </tr>
 <tr>
 <th>2.</th>
-<th>The watch cache contacts etcd for the most up-to-date ResourceVersion.</th>
+<th>If needed, the watch cache contacts etcd for the most up-to-date ResourceVersion.</th>
 </tr>
 <tr>
 <th>2a.</th>
-<th>The watch cache starts streaming initial data. The data it already has in memory.</th>
+<th>The watch cache starts streaming initial data it already has in memory.</th>
 </tr>
 <tr>
 <th>2b.</th>
@@ -352,14 +352,14 @@ Whereas further down in this section we provided a detailed description of each
 </tr>
 </table>

-Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV=”” (= unset value) and setting resourceVersionMatch=MostRecent (= ensure freshness).
+Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV=”” (= unset value) to ensure freshness and setting resourceVersionMatch=NotOlderThan and sendInitialEvents=true.
 We do that only during the initial ListAndWatch call.
 Each event (ADD, UPDATE, DELETE) except the BOOKMARK event received from the server is collected.
-Passing resourceVersionMatch=MostRecent tells the cacher it has to guarantee that the cache is at least up to date as a LIST executed at the same time.
+Passing resourceVersion="" tells the cacher it has to guarantee that the cache is at least up to date as a LIST executed at the same time.

 Note: This ensures that returned data is consistent, served from etcd via a quorum read and prevents "going back in time".

-Note 2: Unfortunately as of today, the watch cache is vulnerable to stale reads, see https://github.com/kubernetes/kubernetes/issues/59848 for more details.
+Note 2: Watch cache currently doesn't have the feature of supporting resourceVersion="" and thus is vulnerable to stale reads, see https://github.com/kubernetes/kubernetes/issues/59848 for more details.

 Step 2: Right after receiving a request from the reflector, the cacher gets the current resourceVersion (aka bookmarkAfterResourceVersion) directly from the etcd.
 It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data.
@@ -447,19 +447,52 @@ It replaces its internal store with the collected items (syncWith) and reuses th

 #### API changes

-Extend the optional `ResourceVersionMatch` query parameter of `ListOptions` with the following enumeration value:
+Extend the `ListOptions` struct with the following field:

 ```
-const (
-// ResourceVersionMatchMostRecent matches data at the most recent ResourceVersion.
-// The returned data is consistent, that is, served from etcd via a quorum read.
-// For watch calls, it begins with synthetic "Added" events of all resources up to the most recent ResourceVersion.
-// It ends with a synthetic "Bookmark" event containing the most recent ResourceVersion.
-// For list calls, it has the same semantics as leaving ResourceVersion and ResourceVersionMatch unset.
-ResourceVersionMatchMostRecent ResourceVersionMatch = "MostRecent"
-)
+type ListOptions struct {
+...
+
+// SendInitialEvents, when set together with Watch option,
+// begins the watch stream with synthetic init events to build the
+// whole state of all resources followed by a synthetic "Bookmark"
+// event containing a ResourceVersion after which the server
+// continues streaming events.
+//
+// When SendInitialEvents option is set, we require ResourceVersionMatch
+// option to also be set. The semantics of the watch request are as follows:
+// - ResourceVersionMatch = NotOlderThan
+// It starts with sending initial events for all objects (at some resource
+// version), potentially followed by an event stream until the state
+// becomes synced to a resource version as fresh as the one provided by
+// the ResourceVersion option. At this point, a synthetic bookmark event
+// is sent and the watch stream continues to be sent.
+// If RV is unset, this is interpreted as "consistent read" and the
+// bookmark event is sent when the state is synced at least to the moment
+// when request started being processed.
+// - ResourceVersionMatch = Exact
+// Unsupported error is returned.
+// - ResourceVersionMatch unset (or set to any other value)
+// BadRequest error is returned.
+//
+// Defaults to true if ResourceVersion="" or ResourceVersion="0" (for backward
+// compatibility reasons) and to false otherwise.
+SendInitialEvents bool
+}
 ```

+The watch bookmark marking the end of initial events stream will have a dedicated
+annotation:
+```
+"k8s.io/initial-events-end": "true"
+```
+(the exact name is subject to change during API review). It will allow clients to
+precisely figure out when the initial stream of events is finished.
+
+It's worth noting that explicitly setting SendInitialEvents to false with ResourceVersion="0"
+will result in not sending initial events, which makes the option work exactly the same
+across every potential resource version passed as a parameter.
+
 #### Important optimisations

 1. Avoid DeepCopying of initial data<br><br>
