@@ -49,6 +49,7 @@ read from etcd.
49
49
- [Troubleshooting](#troubleshooting)
50
50
- [Implementation History](#implementation-history)
51
51
- [Alternatives](#alternatives)
52
+ - [Per-request override](#per-request-override)
52
53
<!-- /toc -->
53
54
54
55
## Summary
@@ -253,36 +254,94 @@ Since falling back to etcd won't work, we should fail the requests and rely on
253
254
rate limiting to prevent cascading failure. I.e. `Retry-After` HTTP header (for
254
255
well-behaved clients) and [Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).
255
256
256
- For such situations we will provide users with following tools:
257
+ To mitigate such problems, the tables below describe how the system currently
258
+ handles requests in different cases. A `changed` column indicates whether this
259
+ proposal changes how the watchcache implementation handles a given case.
260
+
261
+ | ResourceVersion | ResourceVersionMatch | Continuation | Limit | etcd implementation | watchcache implementation | changed |
262
+ | -----------------| ----------------------| -------------------| ---------------| -----------------------------------------| ----------------------------------------------------| ----------|
263
+ | _unset_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
264
+ | _unset_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
265
+ | _unset_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
266
+ | _unset_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
267
+ | _unset_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
268
+ | _0_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
269
+ | _0_ | _unset_ | _token_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | |
270
+ | _0_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
271
+ | _0_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
272
+ | _0_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
273
+ | _RV_ | _unset_ | _unset_ | _unset_ | Quorum read request | Wait for cache synced to _RV_ + and list from cache | |
274
+ | _RV_ | _unset_ | _unset_ | _N_ | Read request from RV=_RV_ | Delegated to etcd | |
275
+ | _RV_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | Deferred |
276
+ | _RV_ | _Exact_ | _unset_ | _unset_ / _N_ | Read request from RV=_RV_ | Delegated to etcd | |
277
+ | _RV_ | _Exact_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
278
+ | _RV_ | _NotOlderThan_ | _unset_ | _unset_ | Quorum read request + check for _RV_ | Wait for cache synced to _RV_ + and list from cache | |
279
+ | _RV_ | _NotOlderThan_ | _unset_ | _N_ | Quorum read request + check for _RV_ | Delegated to etcd | Deferred |
280
+ | _RV_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
281
+
282
+ For watch requests, both the `Continuation` and `Limit` parameters are ignored (we should
283
+ have added validation rules for them in the past), but there is an additional
284
+ `SendInitialEvents` parameter. The table for watch requests looks like the following:
285
+
286
+ | ResourceVersion | ResourceVersionMatch | SendInitialEvents | etcd implementation | watchcache implementation | changed |
287
+ | -----------------| ----------------------| ------------------------| ------------------------------------------------| -----------------------------------------| ----------|
288
+ | _unset_ | _unset_ | _unset_ | Quorum list + watch stream | Delegate to etcd | Deferred |
289
+ | _unset_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
290
+ | _unset_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
291
+ | _unset_ | _NotOlderThan_ | false | Watch stream from etcd RV | Read etcd RV. Watch stream from it | |
292
+ | _unset_ | _NotOlderThan_ | true | Quorum list + watch stream | Wait RV > etcd RV. List + watch stream | |
293
+ | _unset_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
294
+ | _0_ | _unset_ | _unset_ | Quorum list + watch stream | List + watch stream | |
295
+ | _0_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
296
+ | _0_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
297
+ | _0_ | _NotOlderThan_ | false | Watch stream from etcd RV | Watch stream from current watchcache RV | |
298
+ | _0_ | _NotOlderThan_ | true | Quorum list + watch stream | List + watch stream | |
299
+ | _0_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
300
+ | _RV_ | _unset_ | _unset_ | Watch stream from RV | Watch stream from RV | |
301
+ | _RV_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
302
+ | _RV_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
303
+ | _RV_ | _NotOlderThan_ | false | Check RV > etcd RV. Watch stream from RV | Watch stream from RV | |
304
+ | _RV_ | _NotOlderThan_ | true | Check RV > etcd RV. Quorum list + watch stream | Wait for RV. List + watch stream | |
305
+ | _RV_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
306
+
307
+ [validation]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/validation/validation.go#L28
308
+ [etcd resolution]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L589-L627
309
+
310
+ As presented in the above tables, the semantics of a given request served from
311
+ etcd and from watchcache differ slightly. This is a consequence of the fact that:
312
+ * etcd design supports only `Exact` semantics - it allows for a consistent list
313
+ from a given resource version (either a specific value or "now").
314
+ The semantics of `NotOlderThan` is implemented as getting a consistent list from
315
+ "now" and checking if it satisfies the condition.
316
+ * watchcache design supports only `NotOlderThan` semantics - it always waits
317
+ until its resource version is at least as fresh as the requested resource version
318
+ and then returns the result from its current state.
319
+
320
+ For the above reason, sending the same request to etcd and watchcache, especially
321
+ when cluster state is changing, may legitimately return different results.
322
+
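The difference between the two semantics can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `history` and `cache` types and their methods are hypothetical names invented for illustration, not the apiserver's storage code, and "waiting" is reduced to an `ok` flag.

```go
package main

import "fmt"

// history is a toy stand-in for etcd: it keeps a snapshot per revision,
// so a list at any past revision can be answered exactly.
// (Hypothetical type, for illustration only.)
type history struct {
	snapshots map[int][]string // revision -> keys present at that revision
	latest    int
}

func newHistory() *history {
	return &history{snapshots: map[int][]string{0: nil}}
}

// put records a new key and returns the new revision.
func (h *history) put(key string) int {
	next := append(append([]string{}, h.snapshots[h.latest]...), key)
	h.latest++
	h.snapshots[h.latest] = next
	return h.latest
}

// listExact models Exact semantics: the state exactly at rev.
func (h *history) listExact(rev int) ([]string, int) {
	return h.snapshots[rev], rev
}

// cache is a toy stand-in for watchcache: it only knows its current state.
type cache struct {
	keys []string
	rev  int
}

// listNotOlderThan models NotOlderThan semantics: once the cache is at least
// as fresh as rev, it answers from its current state. ok=false means the
// caller would have to wait for the cache to catch up.
func (c *cache) listNotOlderThan(rev int) (keys []string, servedAt int, ok bool) {
	if c.rev < rev {
		return nil, c.rev, false
	}
	return c.keys, c.rev, true
}

func main() {
	h := newHistory()
	rv1 := h.put("a")
	rv2 := h.put("b")
	c := &cache{keys: []string{"a", "b"}, rev: rv2}

	// The same requested resource version yields different results.
	exactKeys, exactRV := h.listExact(rv1)
	cacheKeys, cacheRV, _ := c.listNotOlderThan(rv1)
	fmt.Println("exact:", exactKeys, "at", exactRV)
	fmt.Println("cache:", cacheKeys, "at", cacheRV)

	// Pinning the exact read to the revision the cache served from
	// makes the two results comparable.
	pinned, _ := h.listExact(cacheRV)
	fmt.Println("pinned:", pinned)
}
```

In this model, a request for the older revision returns only the first key from the exact read, while the cache returns both keys at its newer revision. Neither answer is wrong, which is why a naive side-by-side comparison is not a reliable check.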
323
+ To allow debugging results returned from watchcache in a running cluster,
324
+ the only reasonable procedure is:
325
+ * send a request that is served from watchcache
326
+ * send a request setting `ResourceVersionMatch=Exact` and `ResourceVersion` to the value
327
+ returned by the request from the previous step
328
+ * compare the two results
329
+
330
+ The existing API already allows us to achieve it.
331
+
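As a minimal sketch of the second and third steps of this procedure, the following builds the query string for the verifying request and compares two result sets by name. The `resourceVersion` and `resourceVersionMatch` query parameters are the existing list parameters; the helper names `exactListQuery` and `sameItems`, the sample path, and the sample resource version are assumptions made for illustration, and a real comparison would look at full objects rather than names.

```go
package main

import (
	"fmt"
	"net/url"
	"reflect"
	"sort"
)

// exactListQuery builds the query string for the verifying request: a list
// pinned to the resource version returned by the first (watchcache) request.
// (Hypothetical helper, for illustration only.)
func exactListQuery(returnedRV string) string {
	q := url.Values{}
	q.Set("resourceVersion", returnedRV)
	q.Set("resourceVersionMatch", "Exact")
	return q.Encode()
}

// sameItems reports whether two lists hold the same object names,
// ignoring order. The inputs are copied so callers' slices are not reordered.
func sameItems(a, b []string) bool {
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a)
	sort.Strings(b)
	return reflect.DeepEqual(a, b)
}

func main() {
	// "12345" stands in for the resourceVersion from the first response.
	fmt.Println("/api/v1/pods?" + exactListQuery("12345"))

	fromCache := []string{"pod-b", "pod-a"}
	fromEtcd := []string{"pod-a", "pod-b"}
	fmt.Println("consistent:", sameItems(fromCache, fromEtcd))
}
```

Pinning the second request with `Exact` matters because, per the tables above, such a request is delegated to etcd and read at that specific resource version, so both responses describe the same point in time.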
332
+ To further allow debugging and improve confidence we will provide users with the
333
+ following tools:
257
334
* a dedicated ` apiserver_watch_cache_read_wait ` metric to detect a problem with
258
335
watch cache.
259
- * a per-request override to disable watch cache to allow debugging.
336
+ * an `inconsistency detector` that, for requests served from watchcache, will be able
337
+ to send a request to etcd (as described above) and compare the results
260
338
261
339
Metric ` apiserver_watch_cache_read_wait ` will measure wait time experienced by
262
340
reads for watch cache to become fresh. If a user notices increased request latency,
263
341
they can use this metric to confirm that the issue is caused by watch cache.
264
342
265
- Per request override should allow user to compare request results without
266
- impacting other requests or requiring to redeploy whole cluster. The exact
267
- details of override API will be clarified during API review. In healthy
268
- situation, using this override should not cause any impact on the response,
269
- however it might increase resource usage. In our tests cpu load could increase
270
- tenfold. To prevent abuse access to it should be limited to users with
271
- ` cluster-admin ` role, rejecting the request otherwise.
272
-
273
- In case of issues with watch cache users can use the ` ConsistentListFromCache `
274
- feature flag to disable the feature or the existing ` --watch-cache ` flag to
275
- disable the whole watch cache.
276
-
277
- We prefer to provide users an explicit flag and per-request override over an
278
- automatic fallback. It gives users full control and visibility into how request
279
- are handled and ensures accurate APF cost estimates. We expect watch being
280
- starved to happen very rarely, meaning its logic needs to be very simple to
281
- ensure it works properly. A simple fallback will not bring much benefit over
282
- what user can do manually. It will just make the harder to understand and
283
- predict behavior. APF estimates cost just based on request parameters,
284
- before it is passed to storage. If fallback was based on state of watch cache,
285
- cost of request would change after the APF decision increasing the risk of overload.
343
+ The `inconsistency detector` will be enabled in our CI to detect issues with
344
+ the introduced mechanism.
286
345
287
346
## Design Details
288
347
@@ -379,7 +438,7 @@ Comparing resource usage and latency with and without consistent list from watch
379
438
380
439
- Feature is enabled by default.
381
440
- Metric ` apiserver_watch_cache_read_wait ` is implemented.
382
- - Per-request watch cache opt-out is implemented.
441
+ - Inconsistency detector is implemented and enabled in CI
383
442
- Deprecate support of etcd v3.3.X, v3.4.24 and v3.5.7
384
443
385
444
#### GA
@@ -529,7 +588,7 @@ Use per-request override to compare latency when reading from watch cache vs etc
529
588
## Implementation History
530
589
531
590
* 1.28 - Alpha
532
- * 1.30 - Beta
591
+ * 1.31 - Beta
533
592
534
593
## Alternatives
535
594
@@ -547,4 +606,22 @@ Allow clients to manage the initial resource version they provide to reflectors,
547
606
Do a dynamic fallback based on watch cache wait time.
548
607
549
608
- We expect watch being starved to happen very rarely, meaning its logic needs to be very simple to ensure it works properly.
550
- - Simple fallback will rather not do a better job then just a manual fallback.
609
+ - A simple fallback is unlikely to do a better job than a manual fallback.
610
+
611
+ ### Per-request override
612
+
613
+ To enable debugging, we considered introducing a per-request override that disables
614
+ watchcache and forces the request to be served from etcd. This would allow us
615
+ to compare request results without impacting other requests or having to
616
+ redeploy the whole cluster. However, as described in the KEP itself, the same
617
+ request served from watchcache and from etcd may legitimately return
618
+ different results. As a result, we decided that the proposed debugging mechanism
619
+ serves this purpose better.
620
+
621
+ We also considered automatic fallback. However, we expect watch starvation
622
+ to happen very rarely, meaning the fallback logic needs to be very simple to
623
+ ensure it works properly. A simple fallback will not bring much benefit over
624
+ what a user can do manually. It will just make the behavior harder to understand
625
+ and predict. APF estimates cost based only on request parameters,
626
+ before the request is passed to storage. If fallback were based on the state of watch cache,
627
+ the cost of a request would change after the APF decision, increasing the risk of overload.