@@ -49,6 +49,7 @@ read from etcd.
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
+ - [Per-request override](#per-request-override)
<!-- /toc -->

## Summary
@@ -306,36 +307,41 @@ The table for watch requests look like the following
[validation]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/validation/validation.go#L28
[etcd resolution]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L589-L627

- For such situations we will provide users with following tools:
+ As presented in the above tables, the semantics of a given request served from
+ etcd and from watchcache differ slightly. This is a consequence of the fact that:
+ * etcd design supports only `Exact` semantics - it allows for a consistent list
+   from a given resource version (either a specific value or "now").
+   The semantics of `NotOlderThan` is implemented as getting a consistent list from
+   "now" and checking if it satisfies the condition.
+ * watchcache design supports only `NotOlderThan` semantics - it always waits
+   until its resource version is at least as fresh as the requested resource version
+   and then returns the result from its current state.
+
+ For the above reason, sending the same request to etcd and watchcache, especially
+ when cluster state is changing, may legitimately return different results.
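
The difference between the two semantics can be illustrated with a toy model. This is only a sketch: the `state` and `history` types and both methods are hypothetical and are not part of the apiserver or etcd code. It assumes integer resource versions and returns an error-like `false` where a real watchcache would wait.

```go
package main

import "fmt"

// state is a toy stand-in for the contents of a resource at one resource version.
type state struct {
	rv    int
	items []string
}

// history is a hypothetical log of states, ordered from oldest to newest.
type history []state

// listExact models etcd-style Exact semantics: it returns the contents
// as they were at the requested resource version.
func (h history) listExact(rv int) (state, bool) {
	res, found := state{}, false
	for _, s := range h {
		if s.rv > rv {
			break
		}
		res, found = state{rv: rv, items: s.items}, true
	}
	return res, found
}

// listNotOlderThan models watchcache-style semantics: it returns the current
// contents, provided they are at least as fresh as the requested version.
func (h history) listNotOlderThan(rv int) (state, bool) {
	if len(h) == 0 || h[len(h)-1].rv < rv {
		return state{}, false // a real cache would wait here instead of failing
	}
	return h[len(h)-1], true
}

func main() {
	h := history{
		{rv: 10, items: []string{"pod-a"}},
		{rv: 12, items: []string{"pod-a", "pod-b"}},
	}
	exact, _ := h.listExact(11)
	fresh, _ := h.listNotOlderThan(11)
	fmt.Println("Exact:       ", exact.rv, exact.items)
	fmt.Println("NotOlderThan:", fresh.rv, fresh.items)
}
```

In this model both calls ask about resource version 11, yet the second one already includes the object written at version 12, which is the kind of legitimate difference described above.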
+
+ In order to allow debugging results returned from watchcache in a running cluster,
+ the only reasonable procedure is:
+ * send a request that is served from watchcache
+ * send a second request, setting `ResourceVersionMatch=Exact` and `ResourceVersion` to the
+   value returned by the previous request
+ * compare the two results
+
+ The existing API already allows us to achieve it.
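
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the detector's implementation: `listResult`, `exactQuery` and `diff` are hypothetical helpers, the HTTP calls themselves are omitted, and each list response is reduced to a map from object name to object resource version. Only the `resourceVersion` and `resourceVersionMatch` query parameter names come from the existing list API.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// listResult is a simplified, hypothetical view of a list response: the
// resource version of the list plus each object's name and resource version.
type listResult struct {
	resourceVersion string
	objects         map[string]string
}

// exactQuery builds the query string for the second request: the same list,
// pinned to the resource version returned by the first request.
func exactQuery(rv string) string {
	q := url.Values{}
	q.Set("resourceVersion", rv)
	q.Set("resourceVersionMatch", "Exact")
	return q.Encode()
}

// diff returns the sorted names of objects that are present in only one
// result or that have different resource versions in the two results.
func diff(fromCache, fromEtcd listResult) []string {
	var mismatched []string
	for name, rv := range fromCache.objects {
		if etcdRV, ok := fromEtcd.objects[name]; !ok || etcdRV != rv {
			mismatched = append(mismatched, name)
		}
	}
	for name := range fromEtcd.objects {
		if _, ok := fromCache.objects[name]; !ok {
			mismatched = append(mismatched, name)
		}
	}
	sort.Strings(mismatched)
	return mismatched
}

func main() {
	// Made-up responses: the first served from watchcache, the second from etcd.
	cache := listResult{
		resourceVersion: "1042",
		objects:         map[string]string{"pod-a": "1001", "pod-b": "1042"},
	}
	etcd := listResult{
		resourceVersion: "1042",
		objects:         map[string]string{"pod-a": "1001", "pod-b": "1040"},
	}
	fmt.Println("second request query:", exactQuery(cache.resourceVersion))
	fmt.Println("mismatched objects:", diff(cache, etcd))
}
```

Because the second request is pinned to the resource version of the first response, any object reported by `diff` points at a real inconsistency rather than at a change that happened between the two requests.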
+
+ To further allow debugging and improve confidence, we will provide users with the
+ following tools:
* a dedicated `apiserver_watch_cache_read_wait` metric to detect a problem with
  watch cache.
- * a per-request override to disable watch cache to allow debugging.
+ * an `inconsistency detector` that, for requests served from watchcache, will be able
+   to send a request to etcd (as described above) and compare the results.

Metric `apiserver_watch_cache_read_wait` will measure wait time experienced by
reads for watch cache to become fresh. If a user notices increased request latency,
they can use this metric to confirm that the issue is caused by watch cache.

- Per request override should allow user to compare request results without
- impacting other requests or requiring to redeploy whole cluster. The exact
- details of override API will be clarified during API review. In healthy
- situation, using this override should not cause any impact on the response,
- however it might increase resource usage. In our tests cpu load could increase
- tenfold. To prevent abuse access to it should be limited to users with
- `cluster-admin` role, rejecting the request otherwise.
-
- In case of issues with watch cache users can use the `ConsistentListFromCache`
- feature flag to disable the feature or the existing `--watch-cache` flag to
- disable the whole watch cache.
-
- We prefer to provide users an explicit flag and per-request override over an
- automatic fallback. It gives users full control and visibility into how request
- are handled and ensures accurate APF cost estimates. We expect watch being
- starved to happen very rarely, meaning its logic needs to be very simple to
- ensure it works properly. A simple fallback will not bring much benefit over
- what user can do manually. It will just make the harder to understand and
- predict behavior. APF estimates cost just based on request parameters,
- before it is passed to storage. If fallback was based on state of watch cache,
- cost of request would change after the APF decision increasing the risk of overload.
+ The `inconsistency detector` will be enabled in our CI to detect issues with
+ the introduced mechanism.

## Design Details
@@ -432,7 +438,7 @@ Comparing resource usage and latency with and without consistent list from watch
- Feature is enabled by default.
- Metric `apiserver_watch_cache_read_wait` is implemented.
- - Per-request watch cache opt-out is implemented.
+ - Inconsistency detector is implemented and enabled in CI
- Deprecate support of etcd v3.3.X, v3.4.24 and v3.5.7

#### GA
@@ -582,7 +588,7 @@ Use per-request override to compare latency when reading from watch cache vs etc
## Implementation History

* 1.28 - Alpha
- * 1.30 - Beta
+ * 1.31 - Beta

## Alternatives

@@ -601,3 +607,21 @@ Do a dynamic fallback based on watch cache wait time.
- We expect watch being starved to happen very rarely, meaning its logic needs to be very simple to ensure it works properly.
- Simple fallback will rather not do a better job than just a manual fallback.
+
+ ### Per-request override
+
+ To enable debugging, we considered introducing a per-request override that disables
+ watchcache and forces the request to be served from etcd. This would allow us
+ to compare request results without impacting other requests or requiring a
+ redeploy of the whole cluster. However, as described in the KEP itself, the same
+ request served from watchcache and from etcd may legitimately return
+ different results. As a result, we decided that the proposed debugging mechanism
+ serves this purpose better.
+
+ We also considered automatic fallback. However, we expect watch starvation
+ to happen very rarely, meaning the fallback logic would need to be very simple to
+ ensure it works properly. A simple fallback will not bring much benefit over
+ what a user can do manually. It will just make the behavior harder to understand and
+ predict. APF estimates cost based only on request parameters,
+ before the request is passed to storage. If fallback were based on the state of watch cache,
+ the cost of a request would change after the APF decision, increasing the risk of overload.