Commit cdc3b32

Merge pull request kubernetes#2463 from deads2k/paging
move Paginated API Lists from community repo
2 parents 724da61 + 92fc0a7 commit cdc3b32

2 files changed: +322 −0 lines changed
Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
# Allow clients to retrieve consistent API lists in chunks

On large clusters, performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) can lead to significant variations in peak memory use on the server and contribute substantially to long tail request latency.

When loading very large sets of objects -- some clusters are now reaching 100k pods or equivalent numbers of supporting resources -- the system must:

* Construct the full range description in etcd in memory and serialize it as protobuf in the client
  * Some clusters have reported over 500MB being stored in a single object type
  * This data is read from the underlying datastore and converted to a protobuf response
  * Large reads to etcd can block writes to the same range (https://github.com/coreos/etcd/issues/7719)
* The data from etcd has to be transferred to the apiserver in one large chunk
* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
  * Much of the decoded etcd memory is copied into the struct used to serialize to the client
* An API client like `kubectl get` will then decode the response from JSON or protobuf
  * An API client with a slow connection may not be able to receive the entire response body within the default 60s timeout
    * This may cause other failures downstream of that API client with their own timeouts
  * The recently introduced client compression feature can assist
  * The large response will also be loaded entirely into memory

The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging or chunking.
By efficiently splitting large list ranges from etcd to clients into many smaller list ranges, we can reduce the peak memory allocation on etcd and the apiserver, without losing the consistent read invariant our clients depend on.

This proposal does not cover general purpose ranging or paging for arbitrary clients, such as allowing web user interfaces to offer paged output, but does define some parameters for future extension.
To that end, this proposal uses the phrase "chunking" to describe retrieving a consistent snapshot range read from the API server in distinct pieces.

Our primary consistent store etcd3 offers support for efficient chunking with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of consistent chunking.

Relevant issues:

* https://github.com/kubernetes/kubernetes/issues/2349

## Terminology

**Consistent list** - A snapshot of all resources at a particular moment in time that has a single `resourceVersion` that clients can begin watching from to receive updates. All Kubernetes controllers depend on this semantic.
It allows a controller to refresh its internal state, and then receive a stream of changes from the initial state.

**API paging** - API parameters designed to allow a human to view results in a series of "pages".

**API chunking** - API parameters designed to allow a client to break one large request into multiple smaller requests without changing the semantics of the original request.

## Proposed change:

Expose a simple chunking mechanism to allow large API responses to be broken into consistent partial responses.
Clients would indicate a tolerance for chunking (opt-in) by specifying a desired maximum number of results to return in a `LIST` call.
The server would return up to that number of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results.
The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (it may not implement `limit` without `continue`).

```
GET /api/v1/pods?limit=500
{
  "metadata": {"continue": "ABC...", "resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
GET /api/v1/pods?limit=500&continue=ABC...
{
  "metadata": {"continue": "DEF...", "resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
GET /api/v1/pods?limit=500&continue=DEF...
{
  "metadata": {"resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads.
However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification.
Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.

If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.

The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque.
Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server-side filtering excludes some of the results, such as when a `label` or `field` selector is used.
If the entire result set is filtered, the server **may** return zero results with a valid `continue` token.
A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned.
A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned.
If the server does not return a `continue` token, the server **must** return all remaining results.
The server **may** return zero results with no `continue` token on the last call.

The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

The server **must** support `continue` tokens that are valid across multiple API servers.
The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.

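To make the client side of this contract concrete, the following is a minimal sketch (not part of the proposal itself) of how a Go client using a recent client-go might drive a chunked LIST via `ListOptions.Limit` and `ListOptions.Continue`, treating the token as opaque and relying only on its presence to decide whether to request another chunk:

```go
// Sketch only: assumes a reachable cluster and a kubeconfig; error handling is minimal.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func listAllPods(ctx context.Context, client kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500} // opt in to chunking with a page size
	for {
		pods, err := client.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return err // a 410 ResourceExpired here means the list must restart
		}
		fmt.Printf("received %d pods at resourceVersion %s\n", len(pods.Items), pods.ResourceVersion)

		// The presence of a continue token, not the item count, signals more results.
		if pods.Continue == "" {
			return nil
		}
		opts.Continue = pods.Continue // pass the opaque token back unchanged
	}
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := listAllPods(context.Background(), client); err != nil {
		panic(err)
	}
}
```

If any page fails with `410 ResourceExpired`, the client restarts the list from the beginning, as described under handling expired resource versions below.
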
### Proposed Implementations

etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case.
The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys.
A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.

etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes.
Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list.
In the near term our primary store remains etcd3, which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.

#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results.
Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of results provided with the continue token (or, optionally, by a different limit specified by the client).
If more results remain after reading up to the limit, the storage layer should calculate a continue token that begins at the next possible key, and set that continue token on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range.
The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.

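As an illustration only -- the real schema is an implementation detail of the storage layer, and clients must continue to treat the token as opaque -- an etcd3 continue token might be encoded and validated along these lines:

```go
// Hypothetical sketch of an etcd3 continue token; field names and layout are
// illustrative, not a guaranteed wire format.
package storage

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

type continueToken struct {
	APIVersion      string `json:"v"`     // versioned so the schema can evolve
	ResourceVersion int64  `json:"rv"`    // the snapshot the whole LIST reads at
	StartKey        string `json:"start"` // key, relative to the list prefix, to resume from
}

// encodeContinue produces the opaque string returned in list metadata.
func encodeContinue(rv int64, nextKey string) (string, error) {
	out, err := json.Marshal(continueToken{APIVersion: "v1", ResourceVersion: rv, StartKey: nextKey})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(out), nil
}

// decodeContinue validates the token before it is used to build an etcd range read.
func decodeContinue(token, keyPrefix string) (rv int64, startKey string, err error) {
	data, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return 0, "", fmt.Errorf("continue token is not valid: %v", err)
	}
	var t continueToken
	if err := json.Unmarshal(data, &t); err != nil || t.APIVersion != "v1" {
		return 0, "", fmt.Errorf("continue token is not valid")
	}
	// Defensive checks: reject path traversal and keys outside the requested range.
	cleaned := strings.TrimPrefix(t.StartKey, "/")
	if cleaned == "" || strings.Contains(cleaned, "..") {
		return 0, "", fmt.Errorf("continue token key is not valid")
	}
	return t.ResourceVersion, keyPrefix + cleaned, nil
}
```

The apiserver would then issue the etcd range read at the decoded resource version, starting from the decoded key and bounded by the client's limit.
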
#### Possible SQL database implementation

A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive.
A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object.
A consistent paged list over such a table might be similar to:

    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? ORDER BY namespace, name ASC LIMIT ?

where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant.
The highest resource version row for each `(namespace, name)` tuple would be returned.

### Security implications of returning last or next key in the continue token

If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource).
There are a number of approaches to mitigating this impact:

1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.

In the short term we have no supported subset filtering (i.e. a user who can LIST can also LIST with `?fields=` and vice versa), so option 1 is sufficient to address the sensitive key name issue.
Because clients are required to proceed as if limiting is not possible, the server is always free to ignore a chunked request for other reasons.
In the future, option 4 may be the best choice because we assume that most users starting a consistent read intend to finish it, unlike more general user interface paging where only a small fraction of requests continue to the next page.

### Handling expired resource versions

If the required data to perform a consistent list is no longer available in the storage backend (by default, old versions of objects in etcd3 are removed after 5 minutes), the server **must** return a `410 Gone ResourceExpired` status response (the same as for watch), which means clients must start from the beginning.

```
# resourceVersion is expired
GET /api/v1/pods?limit=500&continue=DEF...
{
  "kind": "Status",
  "code": 410,
  "reason": "ResourceExpired"
}
```

Some clients may wish to follow a failed paged list with a full list attempt.

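A minimal sketch of that fallback, assuming the `listAllPods` paging loop sketched earlier and the standard `k8s.io/apimachinery` error helpers:

```go
package main

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listWithFallback retries as a single full LIST if the chunked list's
// continue token has expired (410 ResourceExpired).
func listWithFallback(ctx context.Context, client kubernetes.Interface) error {
	err := listAllPods(ctx, client) // chunked list from the earlier sketch
	if err == nil || !apierrors.IsResourceExpired(err) {
		return err
	}
	// The snapshot was compacted away; restart from the beginning without chunking.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	return err
}
```
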
The 5 minute default compaction interval for etcd3 bounds how long a list can run.
Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters.
It should be possible to alter the interval at which compaction runs to accommodate larger clusters.

#### Types of clients and impact

Some clients, such as controllers, on receiving a 410 error may instead wish to perform a full LIST without chunking.

* Controllers with full caches
  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
* Migration style commands
  * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version.
This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.

### Rate limiting

Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.

### Chunk by default?

On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization).
Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up).
In testing, no significant overhead was shown in etcd3 for a paged historical query, which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.

For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients get their first chunk of data in milliseconds, rather than waiting seconds for the full set.
It also improves the user experience for web consoles that may be accessed by administrators with access to large parts of the system.

It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking.

### Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.

Various optimizations exist that can and should be applied to minimize the amount of data that is transferred from etcd to the client, or the number of allocations made in each location, but they do not change how response size scales with the number of entries.

## Plan

The initial chunking implementation would focus on consistent listing on the server and client, as well as measuring the impact of chunking on total system load, since chunking will slightly increase the cost to view large data sets because of the additional per-page processing.
The initial implementation should make the fewest assumptions possible in constraining future backend storage.

For the initial alpha release, chunking would be behind a feature flag and attempts to provide the `continue` or `limit` flags should be ignored. While disabled, a `continue` token should never be returned by the server as part of a list.

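A rough sketch of that gating (the gate name `APIListChunking` matches the accompanying metadata file; the exact wiring inside the apiserver is illustrative only):

```go
package registry

import (
	metainternalversion "k8s.io/apimachinery/pkg/apis/meta/internalversion"
	"k8s.io/apiserver/pkg/features"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
)

// dropDisabledChunkingFields clears the chunking parameters when the
// APIListChunking gate is off, so the request degrades to a full LIST and the
// server never emits a continue token. Illustrative sketch only.
func dropDisabledChunkingFields(opts *metainternalversion.ListOptions) {
	if !utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking) {
		opts.Limit = 0
		opts.Continue = ""
	}
}
```
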
Future work might offer more options for clients to page in an inconsistent fashion, or allow clients to directly specify the parts of the namespace / name keyspace they wish to range over (paging).
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
title: Paginated API Lists
kep-number: 365
authors:
  - "@smarterclayton"
owning-sig: sig-api-machinery
participating-sigs:
  - sig-scalability
status: implemented
creation-date: 2017-08-29
reviewers:
  - "@liggitt"
approvers:
  - "@deads2k"
  - "@lavalamp"
prr-approvers:
  - "@wojtek-t"
see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.9"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.8"
  beta: "v1.9"
  # stable: "v1.21"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: APIListChunking
    components:
      - kube-apiserver
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - apiserver_request_duration_seconds
