@@ -63,14 +67,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 ## Summary
 
 The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
-for the latest observed state. However, `LIST` requests for previous states,
-either via pagination or by specifying a `resourceVersion`, bypass the cache and
-are served directly from etcd. This significantly increases the performance cost,
-and in aggregate, can cause stability issues. This is especially pronounced when
-dealing with large resources, as transferring large data blobs through multiple
-systems can create significant memory pressure. This document proposes an
-enhancement to the kube-apiserver's caching layer to enable efficient serving all
-`LIST` requests from the cache.
+for the latest observed state. However, `LIST` requests for previous states
+(e.g., via pagination or by specifying a `resourceVersion`) often bypass this
+cache and are served directly from etcd. This direct etcd access significantly
+increases performance costs and can lead to stability issues, particularly
+with large resources, due to memory pressure from transferring large data blobs.
+
+This KEP proposes an enhancement to the kube-apiserver's watch cache to
+generate B-tree snapshots, allowing it to serve `LIST` requests for previous
+states directly from the cache. This change aims to improve API server
+performance and stability. To support this snapshotting mechanism,
+this proposal also details changes to the watch cache's compaction behavior to maintain Kubernetes Conformance
+and introduces an automatic cache inconsistency detection mechanism.
 
 ## Motivation
 
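To make the snapshotting idea in the new Summary text concrete, here is a minimal Go sketch of a store that keeps one snapshot per `resourceVersion`, answers a historical `LIST` from the newest snapshot at or below the requested version, and drops old snapshots on compaction. Everything in it is an assumption for illustration: the names `snapshot`, `snapshotStore`, `Add`, `At`, and `Compact` are hypothetical, a copied map stands in for a cheap B-tree clone, and the boundary behavior (requests below the compacted version are rejected, the newest snapshot at or below it is kept as a base) is one plausible choice, not something the KEP text specifies.

```go
package main

import "sort"

// snapshot is a hypothetical immutable view of the cache at one resourceVersion.
// A copied map stands in for a copy-on-write B-tree clone.
type snapshot struct {
	rv    uint64
	items map[string]string
}

// snapshotStore keeps snapshots sorted by ascending resourceVersion.
type snapshotStore struct {
	snaps     []snapshot
	compacted uint64 // requests below this resourceVersion are rejected
}

// Add records the state at rv. Callers are assumed to add in increasing rv order,
// once per change, so the newest snapshot <= rv is the exact state at rv.
func (s *snapshotStore) Add(rv uint64, items map[string]string) {
	cp := make(map[string]string, len(items))
	for k, v := range items {
		cp[k] = v
	}
	s.snaps = append(s.snaps, snapshot{rv: rv, items: cp})
}

// At returns the newest snapshot whose rv is <= the requested rv.
// It reports false if rv was compacted or predates all snapshots.
func (s *snapshotStore) At(rv uint64) (snapshot, bool) {
	if rv < s.compacted {
		return snapshot{}, false
	}
	// Index of the first snapshot strictly newer than rv.
	i := sort.Search(len(s.snaps), func(i int) bool { return s.snaps[i].rv > rv })
	if i == 0 {
		return snapshot{}, false
	}
	return s.snaps[i-1], true
}

// Compact drops history older than compactRV, keeping the newest snapshot
// at or below compactRV so that reads at compactRV and later still work.
func (s *snapshotStore) Compact(compactRV uint64) {
	i := sort.Search(len(s.snaps), func(i int) bool { return s.snaps[i].rv > compactRV })
	if i > 0 {
		i-- // keep one base snapshot
	}
	s.snaps = append([]snapshot(nil), s.snaps[i:]...)
	if compactRV > s.compacted {
		s.compacted = compactRV
	}
}
```

With snapshots at versions 10, 20, and 30, `At(25)` returns the version-20 snapshot; after `Compact(20)`, `At(15)` fails while `At(20)` and `At(35)` still succeed. Driving `Compact` from the same signal that compacts etcd is what would keep cached history from outliving etcd's compaction point.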
@@ -100,33 +108,84 @@ leading to a more stable and reliable API server.
 
 ### Goals
 
-- Reduce memory allocations by supporting all types of LIST requests from cache
-- Ensure responses returned by cache are consistent with etcd
+- Reduce memory allocations by serving historical LIST requests from cache
+- Maintain Kubernetes conformance with regards to compaction
+- Prevent inconsistent responses returned by cache due to bugs in caching logic
 
 ### Non-Goals
 
 - Change semantics of the `LIST` request
 - Support indexing when serving for all types of requests.
 - Enforce that no client requests are served from etcd
+- Support etcd server side compaction for watch cache
+- Detection of watch cache memory corruption
 
 ## Proposal
 
-This proposal leverages the recent rewrite of the watchcache storage layer to
-use a B-tree ([kubernetes/kubernetes#126754](https://github.com/kubernetes/kubernetes/pull/126754)) to enable
-efficient serving of remaining types of LIST requests from the watchcache.
-This aims to improve API server performance and stability by minimizing direct etcd access for historical data retrieval.
-This aligns with the future extensions outlined in KEP-365 (Paginated Lists): [link to KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/365-paginated-lists#potential-future-extensions).
+We propose that the watch cache generate B-tree snapshots, allowing it to serve `LIST` requests for previous states.
+These snapshots will be stored for the same duration as watch history and compacted using the same mechanisms.
+This improves API server performance and stability by minimizing direct etcd access for historical data retrieval.
+It also aligns with the future extensions outlined in [KEP-365: Paginated Lists].
+
+Compaction is an important behavior, covered by Kubernetes Conformance tests.
+Supporting compaction is required to ensure consistent behavior regardless of whether the watch cache is enabled or disabled.
+Storing historical data in the watch cache, as this KEP proposes, breaks conformance.
+Currently, watch cache is only compacted when it becomes full.
+For resources with infrequent changes, this means data could be retained indefinitely,
+far beyond etcd's compaction point, as highlighted in [#131011].
+Therefore, to maintain conformance and ensure predictable behavior,
+we propose that the existing etcd compaction mechanism also be responsible for compacting the snapshots in cache.
+
+This proposal increases reliance on the watchcache, significantly elevating the impact of bugs in watch or caching logic.
+Triggering a bug would no longer impact a single client but affect the cache read by all clients connecting to a particular API server.
+As the proposed changes will result in all requests being served from the cache,
+it would be exceptionally difficult to debug errors, as comparing responses to etcd would no longer be an option.
+Consequently, we propose an automatic cache inconsistency detection mechanism that can run in production and replace manual debugging.
+It will automate checking consistency against etcd, protecting against bugs in the watch cache or etcd watch implementation.
+It is important to note that we do not plan to implement protection from memory corruption like bitflips.
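The Proposal text says the detection mechanism will automate checking the cache against etcd, but this excerpt does not say how. One plausible shape, sketched below in Go, is to list the same `resourceVersion` from both sources and compare order-independent digests. Digest comparison is an assumption here, not something the excerpt states, and the names `digest`, `lister`, and `checkConsistency` are hypothetical. As the text notes, a check like this targets logic bugs, not memory corruption such as bitflips.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"sort"
)

// digest returns a deterministic hash over key/value pairs.
// Keys are sorted so the result does not depend on map iteration order,
// and each part is length-prefixed so adjacent fields cannot be confused.
func digest(items map[string]string) string {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	var n [8]byte
	for _, k := range keys {
		for _, part := range []string{k, items[k]} {
			binary.BigEndian.PutUint64(n[:], uint64(len(part)))
			h.Write(n[:])
			h.Write([]byte(part))
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

// lister is a hypothetical source that returns the full state at a resourceVersion.
type lister func(rv uint64) (map[string]string, error)

// checkConsistency lists the same resourceVersion from the cache and from etcd
// and reports whether both produce the same digest.
func checkConsistency(rv uint64, fromCache, fromEtcd lister) (bool, error) {
	cached, err := fromCache(rv)
	if err != nil {
		return false, err
	}
	stored, err := fromEtcd(rv)
	if err != nil {
		return false, err
	}
	return digest(cached) == digest(stored), nil
}
```

Comparing digests instead of full responses keeps the check cheap enough to run periodically in production. A mismatch only says that the two views differ at that version, so a real implementation would still need to decide how to report it and how to recover.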