- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
+ - [Future work](#future-work)
- [Alternatives](#alternatives)
- [Initialize watch cache from etcd history window](#initialize-watch-cache-from-etcd-history-window)
<!-- /toc -->
@@ -236,9 +237,6 @@ We are going to utilize this feature to solve the problems described above.
1. Change watch cache to utilize the resource version updates from Bookmark
events.

- 1. On top of recent changes that send Kubernetes Bookmark events every minute,
- we will add a support to send them also on kube-apiserver shutdown.
-
1. We will set the progress notify period to a reasonably small value.
The requirement is to ensure that in case of rolling upgrade of multiple
kube-apiservers, the next-to-be-updated one will get either a real event
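
To illustrate the mechanism above, here is a minimal sketch of an etcd watch that requests progress notifications and advances a cached resource version from them. This is not the actual watch cache code; the endpoint, the `/registry/pods/` prefix, and the `go.etcd.io/etcd/client/v3` package are assumptions made for the example.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Ask etcd to periodically send progress notifications on this watch even
	// when nothing under the prefix changes.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	var resourceVersion int64
	for resp := range ch {
		if resp.IsProgressNotify() {
			// No new events, but the store revision moved forward, so a watch
			// cache could advance its resourceVersion without any real change.
			resourceVersion = resp.Header.Revision
			log.Printf("progress notify: resourceVersion advanced to %d", resourceVersion)
			continue
		}
		for _, ev := range resp.Events {
			resourceVersion = ev.Kv.ModRevision
			log.Printf("%s %q at revision %d", ev.Type, ev.Kv.Key, resourceVersion)
		}
	}
}
```
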
@@ -325,8 +323,6 @@ n/a - watch bookmarks don't have any frequency guarantees

## Production Readiness Review Questionnaire

- TODO: Fill in before making `Implementable`.
-
### Feature Enablement and Rollback

_This section must be completed when targeting alpha to a release._
@@ -355,121 +351,80 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+ In case of bugs, etcd progress notify events may be incorrectly parsed, leading
+ to kube-apiserver crashes.
+ It cannot affect already running workloads.

* **What specific metrics should inform a rollback?**
+ Crashes of kube-apiserver.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
+ Manual tests are still to be run.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
- Even if applying deprecation policies, they may still surprise some users.
+ No

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
+ It's not a workload feature.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
- - [Optional] Aggregation method:
- - Components exposing the metric:
- - [ ] Other (treat as last resort)
- - Details:
+ - [x] Metrics
+ - Metric name: etcd_bookmark_counts
+ - Components exposing the metric: kube-apiserver
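
As an informal illustration (not part of the KEP), an operator could confirm this SLI by scraping the kube-apiserver `/metrics` endpoint and filtering for `etcd_bookmark_counts`. The URL, service-account token path, and TLS handling below are assumptions for a typical in-cluster setup.

```go
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// In-cluster defaults are assumptions; adjust the URL and credentials to
	// however you normally reach the kube-apiserver metrics endpoint.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req, err := http.NewRequest("GET", "https://kubernetes.default.svc/metrics", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	client := &http.Client{Transport: &http.Transport{
		// Sketch only: use the cluster CA bundle instead of skipping verification.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Print only the etcd_bookmark_counts series exposed by kube-apiserver.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "etcd_bookmark_counts") {
			fmt.Println(scanner.Text())
		}
	}
}
```
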

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99,9% of /health requests per day finish with 200 code
+ n/a [Bookmark and watch progress notify events are best effort in their nature]

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
+ No

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
+
+ - etcd
+ - Usage description: We rely on etcd support for ProgressNotify events, which
+ was added in release 3.3. However, we also rely on the ability to configure
+ the notification period (the default of 10m is too high), which was added in 3.5
+ and backported to 3.4.11.
- Impact of its outage on the feature:
+ An etcd outage will translate into a cluster outage anyway.
- Impact of its degraded performance or high-error rates on the feature:
+ ProgressNotify events may not be sent as expected.
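
As a rough illustration of this dependency, the sketch below opens a watch with progress notifications enabled and flags when neither progress notifications nor real events arrive within a few multiples of the expected period. The endpoint, the 10-second interval, and the `--experimental-watch-progress-notify-interval` flag named in the comment are assumptions based on our reading of etcd 3.4.11+/3.5; verify them against the etcd version in use.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Interval etcd is assumed to have been started with, e.g. via the
	// --experimental-watch-progress-notify-interval flag (an assumption for
	// etcd >= 3.4.11 / 3.5; check your etcd version and flags).
	const interval = 10 * time.Second

	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ch := cli.Watch(context.Background(), "/registry/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	var mu sync.Mutex
	last := time.Now()

	// Periodically check how long ago the watch last made progress; staleness
	// beyond a few intervals suggests degraded or misconfigured etcd.
	go func() {
		for range time.Tick(interval) {
			mu.Lock()
			stale := time.Since(last) > 3*interval
			mu.Unlock()
			if stale {
				log.Println("no progress notification or event for over 3x the expected interval")
			}
		}
	}()

	for resp := range ch {
		if resp.IsProgressNotify() || len(resp.Events) > 0 {
			mu.Lock()
			last = time.Now()
			mu.Unlock()
		}
	}
}
```
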

### Scalability

- _For alpha, this section is encouraged: reviewers should consider these questions
- and attempt to answer them._
-
- _For beta, this section is required: reviewers must answer these questions._
-
- _For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field._
-
* **Will enabling / using this feature result in any new API calls?**
- Describe them, providing:
- - API call type (e.g. PATCH pods)
- - estimated throughput
- - originating component(s) (e.g. Kubelet, Feature-X-controller)
- focusing mostly on:
- - components listing and/or watching resources they didn't before
- - API calls that may be triggered by changes of some Kubernetes resources
- (e.g. update of object X triggers new updates of object Y)
- - periodic API calls to reconcile state (e.g. periodic fetching state,
- heartbeats, leader election, etc.)
+ No, although new events are sent by etcd to kube-apiserver as part
+ of the open Watch requests.

* **Will enabling / using this feature result in introducing new API types?**
- Describe them, providing:
- - API type
- - Supported number of objects per cluster
- - Supported number of objects per namespace (for namespace-scoped objects)
+ No

* **Will enabling / using this feature result in any new calls to the cloud
provider?**

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
- Describe them, providing:
- - API type(s):
- - Estimated increase in size: (e.g., new annotation of size 32B)
- - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+ No

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
+ No

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), significant amount of data sent and/or received over network, etc.
- This through this both in small and large cases, again with respect to the
- [supported limits].
+ No

### Troubleshooting

@@ -480,20 +435,13 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
+ The feature will not work (though it is a control-plane feature, not a workload one).

* **What are other known failure modes?**
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
+ n/a

* **What steps should be taken if SLOs are not being met to determine the problem?**
+ n/a

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
@@ -502,11 +450,23 @@ _This section must be completed when targeting beta graduation to a release._

2020-06-30: KEP Proposed.
2020-08-04: KEP marked as implementable.
+ v1.20: Feature graduated to Alpha
+ 2021-01-15: KEP updated to target Beta in v1.21

## Drawbacks

n/a

+ ## Future work
+
+ The above solution doesn't address the extensive relisting case in a
+ setup with a single kube-apiserver. The reason is that we don't send
+ Kubernetes Bookmark events on kube-apiserver shutdown (which would actually be
+ beneficial on its own). However, doing that properly, while ensuring
+ that no requests are dropped in the meantime (even in the single kube-apiserver
+ scenario), isn't trivial and probably deserves its own KEP.
+ As a result, we're leaving this as future work.
+

## Alternatives
### Initialize watch cache from etcd history window