Commit e2c97c0

Merge pull request #4608 from wojtek-t/consistent_reads_mitigation
Consistent reads mitigation
2 parents ad0a1d5 + 9afa3e0 commit e2c97c0

2 files changed

+106
-29
lines changed

keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md

Lines changed: 103 additions & 26 deletions
@@ -49,6 +49,7 @@ read from etcd.
4949
- [Troubleshooting](#troubleshooting)
5050
- [Implementation History](#implementation-history)
5151
- [Alternatives](#alternatives)
52+
- [Per-request override](#per-request-override)
5253
<!-- /toc -->
5354

5455
## Summary
@@ -253,36 +254,94 @@ Since falling back to etcd won't work, we should fail the requests and rely on
253254
rate limiting to prevent cascading failure. I.e. `Retry-After` HTTP header (for
254255
well-behaved clients) and [Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).
255256

256-
For such situations we will provide users with following tools:
257+
To mitigate such problems, we first present how the system currently handles
258+
requests in different cases. We also add a column indicating whether a given
259+
case changes how the watchcache implementation handles the request.
260+
261+
| ResourceVersion | ResourceVersionMatch | Continuation | Limit | etcd implementation | watchcache implementation | changed |
262+
|-----------------|----------------------|-------------------|---------------|-----------------------------------------|----------------------------------------------------|----------|
263+
| _unset_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
264+
| _unset_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
265+
| _unset_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
266+
| _unset_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
267+
| _unset_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
268+
| _0_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
269+
| _0_ | _unset_ | _token_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | |
270+
| _0_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
271+
| _0_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
272+
| _0_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
273+
| _RV_ | _unset_ | _unset_ | _unset_ | Quorum read request | Wait for cache synced to _RV_+ and list from cache | |
274+
| _RV_ | _unset_ | _unset_ | _N_ | Read request from RV=_RV_ | Delegated to etcd | |
275+
| _RV_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | Deferred |
276+
| _RV_ | _Exact_ | _unset_ | _unset_ / _N_ | Read request from RV=_RV_ | Delegated to etcd | |
277+
| _RV_ | _Exact_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
278+
| _RV_ | _NotOlderThan_ | _unset_ | _unset_ | Quorum read request + check for _RV_ | Wait for cache synced to _RV_+ and list from cache | |
279+
| _RV_ | _NotOlderThan_ | _unset_ | _N_ | Quorum read request + check for _RV_ | Delegated to etcd | Deferred |
280+
| _RV_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
281+
282+
For watch requests both `Continuation` and `Limit` parameters are ignored (we should
283+
have added validation rules for them in the past), but there is the `SendInitialEvents` parameter.
284+
The table for watch requests looks as follows:
285+
286+
| ResourceVersion | ResourceVersionMatch | SendInitialEvents | etcd implementation | watchcache implementation | changed |
287+
|-----------------|----------------------|------------------------|------------------------------------------------|-----------------------------------------|----------|
288+
| _unset_ | _unset_ | _unset_ | Quorum list + watch stream | Delegate to etcd | Deferred |
289+
| _unset_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
290+
| _unset_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
291+
| _unset_ | _NotOlderThan_ | false | Watch stream from etcd RV | Read etcd RV. Watch stream from it | |
292+
| _unset_ | _NotOlderThan_ | true | Quorum list + watch stream | Wait RV > etcd RV. List + watch stream | |
293+
| _unset_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
294+
| _0_ | _unset_ | _unset_ | Quorum list + watch stream | List + watch stream | |
295+
| _0_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
296+
| _0_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
297+
| _0_ | _NotOlderThan_ | false | Watch stream from etcd RV | Watch stream from current watchcache RV | |
298+
| _0_ | _NotOlderThan_ | true | Quorum list + watch stream | List + watch stream | |
299+
| _0_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
300+
| _RV_ | _unset_ | _unset_ | Watch stream from RV | Watch stream from RV | |
301+
| _RV_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
302+
| _RV_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
303+
| _RV_ | _NotOlderThan_ | false | Check RV > etcd RV. Watch stream from RV | Watch stream from RV | |
304+
| _RV_ | _NotOlderThan_ | true | Check RV > etcd RV. Quorum list + watch stream | Wait for RV. List + watch stream | |
305+
| _RV_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
306+
307+
[validation]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/validation/validation.go#L28
308+
[etcd resolution]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L589-L627
309+
310+
As presented in the above tables, the semantics of a given request served from
311+
etcd and from watchcache differ slightly. This is a consequence of the fact that:
312+
* etcd design supports only `Exact` semantics - it allows for consistent list
313+
from a given resource version (either specific value or "now").
314+
The semantics of `NotOlderThan` is implemented as getting consistent list from
315+
"now" and checking if it satisfies the condition.
316+
* watchcache design supports only `NotOlderThan` semantics - it always waits
317+
until its resource version is at least as fresh as requested resource version
318+
and then returns the result from its current state
319+
320+
For the above reason, sending the same request to etcd and watchcache, especially
321+
when cluster state is changing, may legitimately return different results.
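
The difference between the two semantics can be illustrated with a toy model. This is only a sketch, not the apiserver implementation: the event log, the function names, and the use of an exception in place of waiting are all assumptions made for illustration.

```python
# Toy model: a log of (revision, key, value) events, sorted by revision.
# A value of None stands for a deletion.

def state_at(events, rv):
    """Replay all events with revision <= rv and return the resulting state."""
    state = {}
    for rev, key, value in events:
        if rev > rv:
            break
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def list_exact(events, requested_rv):
    """etcd-style `Exact`: the state exactly as of requested_rv."""
    return state_at(events, requested_rv)


def list_not_older_than(events, cache_rv, requested_rv):
    """watchcache-style `NotOlderThan`: the cache's current state, provided
    the cache is at least as fresh as requested_rv."""
    if cache_rv < requested_rv:
        # The real cache would wait here; the sketch simply gives up.
        raise RuntimeError("cache not yet synced to requested revision")
    return state_at(events, cache_rv)
```

With events `[(1, "a", "x"), (2, "b", "y"), (3, "a", None)]`, an `Exact` list at revision 2 still contains `a`, while a `NotOlderThan` list for revision 2 served by a cache already at revision 3 does not. Both answers are valid for the request that produced them.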
322+
323+
To allow debugging results returned from watchcache in a running cluster,
324+
the only reasonable procedure is:
325+
* send a request that is served from watchcache
326+
* send a request with `ResourceVersionMatch=Exact` and `ResourceVersion` set to the value
327+
returned by the request from the previous step
328+
* compare the two results
329+
330+
The existing API already allows us to achieve it.
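
The procedure can be sketched as follows. The sketch only builds the follow-up request and compares two already-decoded list responses; it does not talk to a cluster. `resourceVersion` and `resourceVersionMatch` are the list query parameters of the Kubernetes API, while the function names and the comparison by per-object `resourceVersion` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def list_url(path, resource_version=None, resource_version_match=None):
    """Build a list request path with optional resourceVersion parameters."""
    params = {}
    if resource_version is not None:
        params["resourceVersion"] = resource_version
    if resource_version_match is not None:
        params["resourceVersionMatch"] = resource_version_match
    return path + ("?" + urlencode(params) if params else "")


def exact_followup_url(path, cache_response):
    """Second step: request the same list at exactly the RV the cache returned."""
    rv = cache_response["metadata"]["resourceVersion"]
    return list_url(path, rv, "Exact")


def diff_items(cache_response, etcd_response):
    """Third step: return (namespace, name) keys whose presence or
    resourceVersion differs between the two list responses."""
    def index(response):
        return {
            (item["metadata"].get("namespace", ""), item["metadata"]["name"]):
                item["metadata"]["resourceVersion"]
            for item in response["items"]
        }

    cached, stored = index(cache_response), index(etcd_response)
    return sorted(k for k in cached.keys() | stored.keys()
                  if cached.get(k) != stored.get(k))
```

An empty result from `diff_items` means both responses contain the same objects at the same versions. Any non-empty result points at the objects to inspect.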
331+
332+
To further allow debugging and improve confidence we will provide users with the
333+
following tools:
257334
* a dedicated `apiserver_watch_cache_read_wait` metric to detect a problem with
258335
watch cache.
259-
* a per-request override to disable watch cache to allow debugging.
336+
* an `inconsistency detector` that, for requests served from watchcache, will be able
337+
to send a request to etcd (as described above) and compare the results
260338

261339
Metric `apiserver_watch_cache_read_wait` will measure wait time experienced by
262340
reads for watch cache to become fresh. If a user notices increased request latency,
263341
they can use this metric to confirm that the issue is caused by watch cache.
264342

265-
Per request override should allow user to compare request results without
266-
impacting other requests or requiring to redeploy whole cluster. The exact
267-
details of override API will be clarified during API review. In healthy
268-
situation, using this override should not cause any impact on the response,
269-
however it might increase resource usage. In our tests cpu load could increase
270-
tenfold. To prevent abuse access to it should be limited to users with
271-
`cluster-admin` role, rejecting the request otherwise.
272-
273-
In case of issues with watch cache users can use the `ConsistentListFromCache`
274-
feature flag to disable the feature or the existing `--watch-cache` flag to
275-
disable the whole watch cache.
276-
277-
We prefer to provide users an explicit flag and per-request override over an
278-
automatic fallback. It gives users full control and visibility into how request
279-
are handled and ensures accurate APF cost estimates. We expect watch being
280-
starved to happen very rarely, meaning its logic needs to be very simple to
281-
ensure it works properly. A simple fallback will not bring much benefit over
282-
what user can do manually. It will just make the harder to understand and
283-
predict behavior. APF estimates cost just based on request parameters,
284-
before it is passed to storage. If fallback was based on state of watch cache,
285-
cost of request would change after the APF decision increasing the risk of overload.
343+
The `inconsistency detector` will be enabled in our CI to detect issues with
344+
the introduced mechanism.
286345

287346
## Design Details
288347

@@ -379,7 +438,7 @@ Comparing resource usage and latency with and without consistent list from watch
379438

380439
- Feature is enabled by default.
381440
- Metric `apiserver_watch_cache_read_wait` is implemented.
382-
- Per-request watch cache opt-out is implemented.
441+
- Inconsistency detector is implemented and enabled in CI.
383442
- Deprecate support of etcd v3.3.X, v3.4.24 and v3.5.7
384443

385444
#### GA
@@ -529,7 +588,7 @@ Use per-request override to compare latency when reading from watch cache vs etc
529588
## Implementation History
530589

531590
* 1.28 - Alpha
532-
* 1.30 - Beta
591+
* 1.31 - Beta
533592

534593
## Alternatives
535594

@@ -547,4 +606,22 @@ Allow clients to manage the initial resource version they provide to reflectors,
547606
Do a dynamic fallback based on watch cache wait time.
548607

549608
- We expect watch being starved to happen very rarely, meaning its logic needs to be very simple to ensure it works properly.
550-
- A simple fallback is unlikely to do a better job than a manual fallback.
609+
- A simple fallback is unlikely to do a better job than a manual fallback.
610+
611+
### Per-request override
612+
613+
To enable debugging, we considered introducing a per-request override that disables
614+
watchcache and forces the request to be served from etcd. This would allow us
615+
to compare request results without impacting other requests or requiring a
616+
redeploy of the whole cluster. However, as described in the KEP itself, the
617+
same request served from watchcache and from etcd may legitimately return
618+
different results. We therefore decided that the proposed debugging mechanism
619+
serves this purpose better.
620+
621+
We also considered automatic fallback. However, we expect watch starvation
622+
to happen very rarely, meaning the fallback logic needs to be very simple to
623+
ensure it works properly. A simple fallback will not bring much benefit over
624+
what a user can do manually. It will just make the behavior harder to understand and
625+
predict. APF estimates cost based only on request parameters,
626+
before the request is passed to storage. If fallback were based on the state of watch cache,
627+
the cost of a request would change after the APF decision, increasing the risk of overload.

keps/sig-api-machinery/2340-Consistent-reads-from-cache/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -16,17 +16,17 @@ approvers:
1616
- "@wojtek-t"
1717
editor: TBD
1818
creation-date: 2019-12-10
19-
last-updated: 2023-06-15
19+
last-updated: 2024-05-09
2020
status: implementable
2121
see-also:
2222
- "/keps/sig-api-machinery/3157-watch-list"
2323
replaces:
2424
superseded-by:
2525
stage: beta
26-
latest-milestone: "v1.30"
26+
latest-milestone: "v1.31"
2727
milestone:
2828
alpha: "v1.28"
29-
beta: "v1.30"
29+
beta: "v1.31"
3030
feature-gates:
3131
- name: ConsistentListFromCache
3232
components:
