
Commit 16ccca2

Move snapshottable API server cache to Beta
1 parent 54d8e48 commit 16ccca2

3 files changed: +120 -71 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 4988
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-api-machinery/4988-snapshottable-api-server-cache/README.md

Lines changed: 115 additions & 69 deletions
@@ -7,11 +7,15 @@
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-  - [Risks and Mitigations](#risks-and-mitigations)
-    - [Memory overhead](#memory-overhead)
-- [Design Details](#design-details)
-  - [Snapshotting](#snapshotting)
+  - [Serving list from snapshots](#serving-list-from-snapshots)
+  - [Watch cache compaction](#watch-cache-compaction)
   - [Cache Inconsistency Detection Mechanism](#cache-inconsistency-detection-mechanism)
+- [Risks and Mitigations](#risks-and-mitigations)
+  - [Snapshot memory overhead](#snapshot-memory-overhead)
+  - [Consistency checking overhead](#consistency-checking-overhead)
+- [Design Details](#design-details)
+  - [Snapshotting algorithm](#snapshotting-algorithm)
+  - [Hashing algorithm](#hashing-algorithm)
 - [Test Plan](#test-plan)
   - [Prerequisite testing updates](#prerequisite-testing-updates)
   - [Unit tests](#unit-tests)
@@ -63,14 +67,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 ## Summary
 
 The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
-for the latest observed state. However, `LIST` requests for previous states,
-either via pagination or by specifying a `resourceVersion`, bypass the cache and
-are served directly from etcd. This significantly increases the performance cost,
-and in aggregate, can cause stability issues. This is especially pronounced when
-dealing with large resources, as transferring large data blobs through multiple
-systems can create significant memory pressure. This document proposes an
-enhancement to the kube-apiserver's caching layer to enable efficient serving all
-`LIST` requests from the cache.
+for the latest observed state. However, `LIST` requests for previous states
+(e.g., via pagination or by specifying a `resourceVersion`) often bypass this
+cache and are served directly from etcd. This direct etcd access significantly
+increases performance costs and can lead to stability issues, particularly
+with large resources, due to memory pressure from transferring large data blobs.
+
+This KEP proposes an enhancement to the kube-apiserver's watch cache to
+generate B-tree snapshots, allowing it to serve `LIST` requests for previous
+states directly from the cache. This change aims to improve API server
+performance and stability. To support this snapshotting mechanism,
+this proposal also details changes to the watch cache's compaction behavior to maintain Kubernetes Conformance
+and introduces an automatic cache inconsistency detection mechanism.
 
 ## Motivation
 
@@ -100,33 +108,84 @@ leading to a more stable and reliable API server.
 
 ### Goals
 
-- Reduce memory allocations by supporting all types of LIST requests from cache
-- Ensure responses returned by cache are consistent with etcd
+- Reduce memory allocations by serving historical LIST requests from cache
+- Maintain Kubernetes conformance with regard to compaction
+- Prevent inconsistent responses returned by the cache due to bugs in caching logic
 
 ### Non-Goals
 
 - Change semantics of the `LIST` request
 - Support indexing when serving for all types of requests.
 - Enforce that no client requests are served from etcd
+- Support etcd server-side compaction for the watch cache
+- Detect watch cache memory corruption
 
 ## Proposal
 
-This proposal leverages the recent rewrite of the watchcache storage layer to
-use a B-tree ([kubernetes/kubernetes#126754](https://github.com/kubernetes/kubernetes/pull/126754)) to enable
-efficient serving of remaining types of LIST requests from the watchcache.
-This aims to improve API server performance and stability by minimizing direct etcd access for historical data retrieval.
-This aligns with the future extensions outlined in KEP-365 (Paginated Lists): [link to KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/365-paginated-lists#potential-future-extensions).
+We propose that the watch cache generate B-tree snapshots, allowing it to serve `LIST` requests for previous states.
+These snapshots will be stored for the same duration as watch history and compacted using the same mechanisms.
+This improves API server performance and stability by minimizing direct etcd access for historical data retrieval.
+It also aligns with the future extensions outlined in [KEP-365: Paginated Lists].
+
+Compaction is an important behavior, covered by Kubernetes Conformance tests.
+Supporting compaction is required to ensure consistent behavior regardless of whether the watch cache is enabled or disabled.
+Without compaction, storing historical data in the watch cache, as this KEP proposes, would break conformance.
+Currently, the watch cache is only compacted when it becomes full.
+For resources with infrequent changes, this means data could be retained indefinitely,
+far beyond etcd's compaction point, as highlighted in [#131011].
+Therefore, to maintain conformance and ensure predictable behavior,
+we propose that the existing etcd compaction mechanism also be responsible for compacting the snapshots in the cache.
+
+This proposal increases reliance on the watchcache, significantly elevating the impact of bugs in watch or caching logic.
+Triggering a bug would no longer impact a single client but would affect the cache read by all clients connecting to a particular API server.
+As the proposed changes will result in all requests being served from the cache,
+it would be exceptionally difficult to debug errors, as comparing responses to etcd would no longer be an option.
+Consequently, we propose an automatic cache inconsistency detection mechanism that can run in production and replace manual debugging.
+It will automate checking consistency against etcd, protecting against bugs in the watch cache or the etcd watch implementation.
+It is important to note that we do not plan to implement protection from memory corruption such as bit flips.
+
+[KEP-365: Paginated Lists]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/365-paginated-lists#potential-future-extensions
+[#131011]: https://github.com/kubernetes/kubernetes/issues/131011#issuecomment-2747497808
+
+### Serving list from snapshots
+
+The snapshotting mechanism utilizes the B-tree's ability to create
+lazy copies of itself. This allows us to create a snapshot on each watch event.
+These snapshots capture the state of the cache at a historical resourceVersion,
+and can be used to serve a `LIST` request by finding the appropriate snapshot and reading from it.
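
For illustration, a minimal sketch of the snapshot lookup (hypothetical types; the real cacher keeps richer bookkeeping): a historical `LIST` at revision `rv` is served from the newest snapshot taken at or below `rv`.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/google/btree"
)

// snapshot pairs a lazy B-tree clone with the resourceVersion it captures.
// These names are illustrative, not the actual cacher types.
type snapshot struct {
	resourceVersion uint64
	tree            *btree.BTree
}

type snapshotList struct {
	snapshots []snapshot // sorted by resourceVersion, oldest first
}

// lookup returns the newest snapshot at or below rv, i.e. the cache state
// that a LIST at that resourceVersion should observe.
func (s *snapshotList) lookup(rv uint64) (*snapshot, bool) {
	i := sort.Search(len(s.snapshots), func(i int) bool {
		return s.snapshots[i].resourceVersion > rv
	})
	if i == 0 {
		return nil, false // rv predates snapshot history (already compacted)
	}
	return &s.snapshots[i-1], true
}

func main() {
	s := &snapshotList{snapshots: []snapshot{{resourceVersion: 100}, {resourceVersion: 200}}}
	if snap, ok := s.lookup(150); ok {
		fmt.Println("LIST at rv=150 served from snapshot at rv", snap.resourceVersion) // prints 100
	}
}
```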
+
+### Watch cache compaction
+
+We will expand the existing mechanism for compacting etcd to also compact the watch cache.
+Kubernetes supports configurable periodic compaction, executed every 5 minutes by default.
+In the current algorithm, each API server executes an optimistic write on the `compact_rev_key` key to store the revision to be compacted.
+The first one to write successfully executes the compaction request against etcd.
+We will expand this by opening a watch on the `compact_rev_key` key and informing the watch cache about successful compactions done by any API server.
+When the watch cache is informed about a compaction, it will truncate snapshot history up to that revision.
+To avoid changing existing behavior, we will not compact watch history; this should be considered in the future.
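
A sketch of how that fan-out could look, assuming the compacted revision is stored as a decimal string under `compact_rev_key` (key prefixing omitted) and that the cache exposes a hypothetical `CompactSnapshots` hook:

```go
package compactwatch

import (
	"context"
	"strconv"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// snapshotCompacter is a hypothetical hook into the watch cache.
type snapshotCompacter interface {
	CompactSnapshots(rev int64) // drop snapshots at or below rev
}

// watchCompactions forwards compactions performed by any API server,
// observed as writes to compact_rev_key, to the local watch cache.
func watchCompactions(ctx context.Context, client *clientv3.Client, cache snapshotCompacter) {
	for resp := range client.Watch(ctx, "compact_rev_key") {
		for _, ev := range resp.Events {
			rev, err := strconv.ParseInt(string(ev.Kv.Value), 10, 64)
			if err != nil {
				continue // sketch only: ignore values we cannot parse
			}
			cache.CompactSnapshots(rev)
		}
	}
}
```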
+
+### Cache Inconsistency Detection Mechanism
+
+The mechanism periodically calculates and compares a hash of the data for each resource in both etcd and the watch cache.
+
+It will be developed across multiple phases:
+* **Alpha:** In this phase, detection will be enabled only in the test environment.
+  Enabled via the `KUBE_WATCHCACHE_CONSISTANCY_CHECKER` environment variable,
+  it will run in Kubernetes e2e tests to ensure that the mechanism works as expected.
+  On mismatch, the apiserver will panic, making failures easy to detect in tests.
+* **Beta:** Detection will be enabled by default. If an inconsistency is detected,
+  snapshots stored in the cache will be purged and the system will automatically fall
+  back to serving LIST requests from etcd for the affected resource.
+  This mechanism will only impact LIST requests that would be served from watch cache snapshots,
+  effectively reverting to the behavior prior to this proposal,
+  while other requests will continue to be served from the cache.
+  The fallback will not be permanent, but will last until the next successful consistency check.
 
-However, this increased reliance on the watchcache significantly elevates the impact of any bugs in the caching logic.
-Incorrect behavior would be locally within API server memory, making debugging exceptionally difficult.
-Given that the proposed changes will ultimately route *all* API server LIST calls through the cache,
-a robust mechanism for detecting inconsistencies is crucial.
-Therefore, we propose an automatic mechanism to validate cache consistency with etcd,
-providing users with confidence in the cache's accuracy without requiring manual debugging efforts.
+
+To monitor consistency failures we will expose the `storage_consistency_checks_total` metric.
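
As a rough sketch of the Beta control flow (all names hypothetical; error handling reduced to skipping the round), one consistency pass per resource could look like:

```go
package checker

import (
	"context"
	"sync/atomic"
)

// listHasher is a hypothetical view over a store (etcd or the watch cache)
// that can serve a LIST for one resource and hash the result.
type listHasher interface {
	// HashLatest serves a non-consistent LIST (RV="0") and returns the
	// resourceVersion it was served at together with the hash.
	HashLatest(ctx context.Context) (rv uint64, hash uint64, err error)
	// HashAt serves a consistent LIST at exactly rv and hashes it.
	HashAt(ctx context.Context, rv uint64) (hash uint64, err error)
}

type consistencyChecker struct {
	cache          listHasher
	etcd           listHasher
	purgeSnapshots func()      // drops cached snapshots for this resource
	degraded       atomic.Bool // true => historical LISTs fall back to etcd
}

func (c *consistencyChecker) runOnce(ctx context.Context) {
	rv, cacheHash, err := c.cache.HashLatest(ctx)
	if err != nil {
		return // an error is not a mismatch; try again next period
	}
	etcdHash, err := c.etcd.HashAt(ctx, rv)
	if err != nil {
		return
	}
	if cacheHash != etcdHash {
		c.purgeSnapshots()
		c.degraded.Store(true) // fallback lasts until the next successful check
		return
	}
	c.degraded.Store(false)
	// storage_consistency_checks_total would be incremented per outcome here.
}
```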
 
-### Risks and Mitigations
+## Risks and Mitigations
 
-#### Memory overhead
+### Snapshot memory overhead
 
 B-tree snapshots are designed to minimize memory overhead by storing pointers to
 the actual objects, rather than the objects themselves. Since the objects are
@@ -141,12 +200,23 @@ The results are promising:
 * **Memory Usage:** Memory in use profile collected during the test has shown
 Btree memory usage of 300MB, representing 1.3% of total memory used.
 
+### Consistency checking overhead
+
+Periodic execution of consistency checking will introduce additional overhead.
+This load is not negligible, as it requires downloading and decoding data from etcd.
+For safety, we still think it's important that the feature is enabled by default;
+however, we want to leave an option to disable it.
+For that, we will introduce the `DetectCacheInconsistency` feature gate in Beta.
+
+In the future, we plan to improve the etcd API to support cheap consistency checks.
+At that point, disabling inconsistency checks will no longer be needed.
+
 ## Design Details
 
-### Snapshotting
+### Snapshotting algorithm
 
 1. **Snapshot Creation:** When a watch event is received, the cacher creates
-a snapshot of the B-tree based cache using the efficient [Clone()[] method.
+a snapshot of the B-tree based cache using the efficient [Clone()] method.
 This method creates a lazy copy of the tree structure, minimizing overhead.
 Since the watch cache already stores the history of watch events,
 the B-tree maintains just pointers to the in-use memory, storing only minimal necessary data.
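
The copy-on-write behavior that makes this cheap can be illustrated directly with the `github.com/google/btree` package (a self-contained example, not the cacher code; degree 32 is an arbitrary choice):

```go
package main

import (
	"fmt"

	"github.com/google/btree"
)

func main() {
	live := btree.New(32)
	for i := 0; i < 3; i++ {
		live.ReplaceOrInsert(btree.Int(i))
	}

	// Clone is cheap: both trees share all nodes until one side is written,
	// at which point only the touched nodes are copied (copy-on-write).
	snapshot := live.Clone()

	live.ReplaceOrInsert(btree.Int(42)) // mutate only the live tree

	fmt.Println(snapshot.Len(), live.Len()) // 3 4 — the snapshot is unaffected
}
```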
@@ -167,45 +237,20 @@ The results are promising:
 cache is lagging behind. The API server performs a consistent read from
 etcd to confirm the existence of the future resourceVersion or waits for
 the watch cache to catch up.
-4. **Snapshot Cleanup:** Snapshots are subject to a Time-To-Live (TTL) mechanism
-similar to watch events. The proposed approach leverages the existing process
-that limits the number of events to 10,000 within a 75-second window
-(configurable via request timeout). Additionally, snapshots are purged during
-cache re-initialization.
 
 [Clone()]: https://pkg.go.dev/github.com/google/btree#BTree.Clone
 
-### Cache Inconsistency Detection Mechanism
-
-We will periodically calculate a hash of the data in both the etcd datastore
-and the watch cache and compare them. For the Alpha phase, detection will be passive.
-A metric will be exposed, allowing users to configure alerts that trigger on hash mismatch,
-thus indicatory of potential inconsistency and enabling us to validate the mechanism.
-For the Beta phase, detection will become active, with automatic fallback to etcd if
-inconsistency is detected. This way we automaticaly restore the previous behavior.
+### Hashing algorithm
 
-The implementation works as follows. Every 5 minutes, for each resource,
-a hash calculation is performed. To avoid concurrent calculations,
-the start time for each resource's calculation is randomly offset by 1 to 5 minutes.
+Every 5 minutes, we calculate a hash for each resource.
 A non-consistent `LIST` request (`RV=0`) is sent to the watch cache to retrieve its latest available RV.
 This revision is then used to make a consistent `LIST` request (`RV=X`, where X is the revision from the cache) to etcd.
 This ensures comparison of the cache's latest state with the corresponding state in etcd,
 without explicit handling of potential cache staleness.
 
 The 64-bit FNV algorithm (as implemented in [`hash/fnv`](https://pkg.go.dev/hash/fnv))
-is used to calculate the hash over the entire structure of the `LIST` response,
-using a technique similar to [gohugoio/hashstructure](https://github.com/gohugoio/hashstructure).
-While calculating the hash of the entire structure is computationally more expensive,
-the infrequency of this operation (every 5 minutes) makes the cost acceptable compared
-to frequent `LIST` operations directly against etcd.
-Hashing the entire structure helps prevent issues arising from object versioning differences.
-
-The metric includes labels for `resource` (e.g., "pods"), `storage` (either "etcd" or "cache"), and `hash` (the calculated hash value). Example:
-```
-apiserver_storage_hash{resource="pods", storage="etcd", hash="f364dcd6b58ebf020cec3fe415e726ab16425b4d0344ac6b551d2769dd01b251"} 1
-apiserver_storage_hash{resource="pods", storage="cache", hash="f364dcd6b58ebf020cec3fe415e726ab16425b4d0344ac6b551d2769dd01b251"} 1
-```
-Metric values for each resource should be updated atomically to prevent false positives.
+is used to calculate the hash of each object's namespace, name, and resourceVersion, joined by a '/' byte.
+This should allow us to detect inconsistencies caused by bugs in applying watch events or bugs in the etcd watch stream.

210255
### Test Plan
211256

@@ -235,12 +280,15 @@ Test should cover couple of resources including resources with conversion.
235280
#### Alpha
236281

237282
- Snapshotting implemented behind a feature gate disabled by default.
238-
- Inconsistency detection is implemented behind a feature gate enabled by default.
283+
- Inconsistency detection is behind environment variable
284+
- Inconsistency detection run in e2e tests
239285

240286
#### Beta
241287

242288
- Inconsistency detection mechanism is qualified and no mismatch detected.
243-
- Fallback to etcd mechanism is implemented
289+
- Inconsistency detection moved behind a feature gate `DetectCacheInconsistency` enabled by default.
290+
- Automatic fallback to etcd is implemented
291+
- Pass Kubernetes conformance tests for compaction
244292

245293
#### GA
246294

@@ -296,15 +344,12 @@ additional value on top of feature tests themselves.
296344

297345
###### What specific metrics should inform a rollback?
298346

299-
Mismatch in hash label for different storage exposed by `apiserver_storage_hash` metric by the same apiserver.
347+
Snapshotting should automatically fallback to serving from etcd if inconsistency is detected.
348+
Rollback should be consider if there is a high number of inconsistencies detected by `storage_consistency_checks_total` metric.
300349

301350
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
302351

303-
<!--
304-
Describe manual testing that was done and the outcomes.
305-
Longer term, we may want to require automated upgrade/rollback tests, but we
306-
are missing a bunch of machinery and tooling and can't do that now.
307-
-->
352+
No need for tests, this feature doesn't cause any persistent side effects.
308353

309354
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
310355

@@ -330,7 +375,7 @@ This is control-plane feature, not a workload feature.
330375

331376
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
332377

333-
Yes, we are adding `apiserver_storage_hash` to check cache consistency.
378+
Yes, we are adding `storage_consistency_checks_total` to count the number of consistency checks performed and their outcomes.
334379

335380
### Dependencies
336381

@@ -380,15 +425,16 @@ The feature is kube-apiserver feature - it just doesn't work if kube-apiserver i
380425
###### What are other known failure modes?
381426

382427
Inconsistency of watch cache, should be addressed by the consistency checking mechanism.
383-
For the first iteration we will enable users to define an alert on a metric and detect if cache became inconsistent with etcd.
428+
For the first iteration we will enable users to define an alert on a metric and detect if cache becomes inconsistent with etcd.
384429

385430
###### What steps should be taken if SLOs are not being met to determine the problem?
386431

387432
Disabling the feature-gate.
388433

389434
## Implementation History
390435

391-
- 1.33: KEP proposed and approved for implementation
436+
- 1.33: Alpha
437+
- 1.34: Beta
392438

393439
## Drawbacks
394440

keps/sig-api-machinery/4988-snapshottable-api-server-cache/kep.yaml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,10 +16,11 @@ see-also:
1616
- "/keps/sig-api-machinery/2340-Consistent-reads-from-cache"
1717
- "/keps/sig-api-machinery/365-paginated-lists"
1818
replaces: []
19-
stage: alpha
20-
latest-milestone: "v1.33"
19+
stage: beta
20+
latest-milestone: "v1.34"
2121
milestone:
2222
alpha: "v1.33"
23+
beta: "v1.34"
2324
feature-gates:
2425
- name: DetectCacheInconsistency
2526
components:
