Commit cdc3b32

Merge pull request kubernetes#2463 from deads2k/paging
move Paginated API Lists from community repo
2 parents 724da61 + 92fc0a7 commit cdc3b32

2 files changed: +322 −0 lines changed
Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
# Allow clients to retrieve consistent API lists in chunks

On large clusters, performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) can lead to significant variations in peak memory use on the server and contribute substantially to long tail request latency.

When loading very large sets of objects -- some clusters are now reaching 100k pods or equivalent numbers of supporting resources -- the system must:

* Construct the full range description in etcd in memory and serialize it as protobuf in the client
  * Some clusters have reported over 500MB being stored in a single object type
  * This data is read from the underlying datastore and converted to a protobuf response
  * Large reads to etcd can block writes to the same range (https://github.com/coreos/etcd/issues/7719)
* The data from etcd has to be transferred to the apiserver in one large chunk
* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
  * Much of the decoded etcd memory is copied into the struct used to serialize to the client
* An API client like `kubectl get` will then decode the response from JSON or protobuf
  * An API client with a slow connection may not be able to receive the entire response body within the default 60s timeout
    * This may cause other failures downstream of that API client with their own timeouts
  * The recently introduced client compression feature can assist
  * The large response will also be loaded entirely into memory

The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging or chunking.
By efficiently splitting large list ranges from etcd to clients into many smaller list ranges, we can reduce the peak memory allocation on etcd and the apiserver, without losing the consistent read invariant our clients depend on.

This proposal does not cover general purpose ranging or paging for arbitrary clients, such as allowing web user interfaces to offer paged output, but does define some parameters for future extension.
To that end, this proposal uses the phrase "chunking" to describe retrieving a consistent snapshot range read from the API server in distinct pieces.

Our primary consistent store etcd3 offers support for efficient chunking with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of consistent chunking.

Relevant issues:

* https://github.com/kubernetes/kubernetes/issues/2349

## Terminology

**Consistent list** - A snapshot of all resources at a particular moment in time that has a single `resourceVersion` that clients can begin watching from to receive updates. All Kubernetes controllers depend on this semantic.
It allows a controller to refresh its internal state, and then receive a stream of changes from the initial state.

**API paging** - API parameters designed to allow a human to view results in a series of "pages".

**API chunking** - API parameters designed to allow a client to break one large request into multiple smaller requests without changing the semantics of the original request.

## Proposed change:

Expose a simple chunking mechanism to allow large API responses to be broken into consistent partial responses.
Clients would indicate a tolerance for chunking (opt-in) by specifying a desired maximum number of results to return in a `LIST` call.
The server would return up to that number of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results.
The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (it may not implement `limit` without `continue`).

```
GET /api/v1/pods?limit=500
{
  "metadata": {"continue": "ABC...", "resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
GET /api/v1/pods?limit=500&continue=ABC...
{
  "metadata": {"continue": "DEF...", "resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
GET /api/v1/pods?limit=500&continue=DEF...
{
  "metadata": {"resourceVersion": "147"},
  "items": [
    // no more than 500 items
  ]
}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads.
However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification.
Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.

If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.

The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque.
Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server-side filtering excludes some of the results, such as when a `label` or `field` selector is used.
If the entire result set is filtered, the server **may** return zero results with a valid `continue` token.
A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned.
A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned.
If the server does not return a `continue` token, the server **must** return all remaining results.
The server **may** return zero results with no `continue` token on the last call.

The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

The server **must** support `continue` tokens that are valid across multiple API servers.
The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.

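To make the client side of this contract concrete, the following is a minimal sketch (not part of the proposal itself) of how a Go client using a recent client-go might drive a chunked LIST via `ListOptions.Limit` and `ListOptions.Continue`, treating the token as opaque and relying only on its presence to decide whether to request another chunk:

```go
// Sketch only: assumes a reachable cluster and a kubeconfig; error handling is minimal.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func listAllPods(ctx context.Context, client kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500} // opt in to chunking with a page size
	for {
		pods, err := client.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return err // a 410 ResourceExpired here means the list must restart
		}
		fmt.Printf("received %d pods at resourceVersion %s\n", len(pods.Items), pods.ResourceVersion)

		// The presence of a continue token, not the item count, signals more results.
		if pods.Continue == "" {
			return nil
		}
		opts.Continue = pods.Continue // pass the opaque token back unchanged
	}
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := listAllPods(context.Background(), client); err != nil {
		panic(err)
	}
}
```

If any page fails with `410 ResourceExpired`, the client restarts the list from the beginning, as described under handling expired resource versions below.
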
### Proposed Implementations

etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case.
The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys.
A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.

etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes.
Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list.
In the near term our primary store remains etcd3, which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.

#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results.
Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of results provided with the continue token (or, optionally, by a different limit specified by the client).
If more results remain after reading up to the limit, the storage layer should calculate a continue token that begins at the next possible key, and set that continue token on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range.
The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.

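As an illustration only -- the real schema is an implementation detail of the storage layer, and clients must continue to treat the token as opaque -- an etcd3 continue token might be encoded and validated along these lines:

```go
// Hypothetical sketch of an etcd3 continue token; field names and layout are
// illustrative, not a guaranteed wire format.
package storage

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

type continueToken struct {
	APIVersion      string `json:"v"`     // versioned so the schema can evolve
	ResourceVersion int64  `json:"rv"`    // the snapshot the whole LIST reads at
	StartKey        string `json:"start"` // key, relative to the list prefix, to resume from
}

// encodeContinue produces the opaque string returned in list metadata.
func encodeContinue(rv int64, nextKey string) (string, error) {
	out, err := json.Marshal(continueToken{APIVersion: "v1", ResourceVersion: rv, StartKey: nextKey})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(out), nil
}

// decodeContinue validates the token before it is used to build an etcd range read.
func decodeContinue(token, keyPrefix string) (rv int64, startKey string, err error) {
	data, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return 0, "", fmt.Errorf("continue token is not valid: %v", err)
	}
	var t continueToken
	if err := json.Unmarshal(data, &t); err != nil || t.APIVersion != "v1" {
		return 0, "", fmt.Errorf("continue token is not valid")
	}
	// Defensive checks: reject path traversal and keys outside the requested range.
	cleaned := strings.TrimPrefix(t.StartKey, "/")
	if cleaned == "" || strings.Contains(cleaned, "..") {
		return 0, "", fmt.Errorf("continue token key is not valid")
	}
	return t.ResourceVersion, keyPrefix + cleaned, nil
}
```

The apiserver would then issue the etcd range read at the decoded resource version, starting from the decoded key and bounded by the client's limit.
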
#### Possible SQL database implementation

A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive.
A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object.
A consistent paged list over such a table might be similar to:

    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? ORDER BY namespace, name ASC LIMIT ?

where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant.
The highest resource version row for each `(namespace, name)` tuple would be returned.

### Security implications of returning last or next key in the continue token

If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource).
There are a number of approaches to mitigating this impact:

1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.

In the short term we have no supported subset filtering (i.e. a user who can LIST can also LIST with `?fields=` and vice versa), so option 1 is sufficient to address the sensitive key name issue.
Because clients are required to proceed as if limiting is not possible, the server is always free to ignore a chunked request for other reasons.
In the future, option 4 may be the best choice because we assume that most users starting a consistent read intend to finish it, unlike more general user interface paging where only a small fraction of requests continue to the next page.

### Handling expired resource versions

If the required data to perform a consistent list is no longer available in the storage backend (by default, old versions of objects in etcd3 are removed after 5 minutes), the server **must** return a `410 Gone ResourceExpired` status response (the same as for watch), which means clients must start from the beginning.

```
# resourceVersion is expired
GET /api/v1/pods?limit=500&continue=DEF...
{
  "kind": "Status",
  "code": 410,
  "reason": "ResourceExpired"
}
```

Some clients may wish to follow a failed paged list with a full list attempt.

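A minimal sketch of that fallback, assuming the `listAllPods` paging loop sketched earlier and the standard `k8s.io/apimachinery` error helpers:

```go
package main

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listWithFallback retries as a single full LIST if the chunked list's
// continue token has expired (410 ResourceExpired).
func listWithFallback(ctx context.Context, client kubernetes.Interface) error {
	err := listAllPods(ctx, client) // chunked list from the earlier sketch
	if err == nil || !apierrors.IsResourceExpired(err) {
		return err
	}
	// The snapshot was compacted away; restart from the beginning without chunking.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	return err
}
```
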
The 5 minute default compaction interval for etcd3 bounds how long a list can run.
Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters.
It should be possible to alter the interval at which compaction runs to accommodate larger clusters.

#### Types of clients and impact

Some clients, such as controllers, on receiving a 410 error may instead wish to perform a full LIST without chunking.

* Controllers with full caches
  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
* Migration style commands
  * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version.
This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.

### Rate limiting

Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.

### Chunk by default?

On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization).
Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up).
In testing, no significant overhead was shown in etcd3 for a paged historical query, which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.

For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients get their first chunk of data in milliseconds, rather than waiting seconds for the full set.
It also improves the user experience for web consoles that may be accessed by administrators with access to large parts of the system.

It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking.

### Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.

Various optimizations exist that can and should be applied to minimize the amount of data that is transferred from etcd to the client, or the number of allocations made in each location, but they do not change how response size scales with the number of entries.

## Plan

The initial chunking implementation would focus on consistent listing on the server and client, as well as measuring the impact of chunking on total system load, since chunking will slightly increase the cost to view large data sets because of the additional per-page processing.
The initial implementation should make the fewest assumptions possible in constraining future backend storage.

For the initial alpha release, chunking would be behind a feature flag and attempts to provide the `continue` or `limit` flags should be ignored. While disabled, a `continue` token should never be returned by the server as part of a list.

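A rough sketch of that gating (the gate name `APIListChunking` matches the accompanying metadata file; the exact wiring inside the apiserver is illustrative only):

```go
package registry

import (
	metainternalversion "k8s.io/apimachinery/pkg/apis/meta/internalversion"
	"k8s.io/apiserver/pkg/features"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
)

// dropDisabledChunkingFields clears the chunking parameters when the
// APIListChunking gate is off, so the request degrades to a full LIST and the
// server never emits a continue token. Illustrative sketch only.
func dropDisabledChunkingFields(opts *metainternalversion.ListOptions) {
	if !utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking) {
		opts.Limit = 0
		opts.Continue = ""
	}
}
```
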
Future work might offer more options for clients to page in an inconsistent fashion, or allow clients to directly specify the parts of the namespace / name keyspace they wish to range over (paging).
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
title: Paginated API Lists
kep-number: 365
authors:
  - "@smarterclayton"
owning-sig: sig-api-machinery
participating-sigs:
  - sig-scalability
status: implemented
creation-date: 2017-08-29
reviewers:
  - "@liggitt"
approvers:
  - "@deads2k"
  - "@lavalamp"
prr-approvers:
  - "@wojtek-t"
see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.9"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.8"
  beta: "v1.9"
  # stable: "v1.21"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: APIListChunking
    components:
      - kube-apiserver
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - apiserver_request_duration_seconds
