Commit e900e43

Merge pull request kubernetes#2267 from wojtek-t/efficient_watch_resumption_beta

Update "Efficient watch resumption KEP" to target Beta graduation in 1.21

2 parents: 1dedf5b + 0fab050
File tree: 2 files changed (+45 -85 lines)
keps/sig-api-machinery/1904-efficient-watch-resumption/README.md

Lines changed: 42 additions & 82 deletions
@@ -24,6 +24,7 @@
 - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
+- [Future work](#future-work)
 - [Alternatives](#alternatives)
 - [Initialize watch cache from etcd history window](#initialize-watch-cache-from-etcd-history-window)
 <!-- /toc -->
@@ -236,9 +237,6 @@ We are going to utilize this feature to solve the problems described above.
 1. Change watch cache to utilize the resource version updates from Bookmark
    events.

-1. On top of recent changes that send Kubernetes Bookmark events every minute,
-   we will add a support to send them also on kube-apiserver shutdown.
-
 1. We will set the progress notify period to reasonably small value.
    The requirement is to ensure that in case of rolling upgrade of multiple
    kube-apservers, the next-to-be-updated one will get either a real event
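The two steps above can be sketched as a toy model in Go. This is illustrative only (the types and names are hypothetical, not kube-apiserver code): it shows why advancing the watch cache's resourceVersion from bookmark/progress-notify events lets a resumed watch start near the current revision instead of falling behind and relisting.

```go
package main

import "fmt"

// Event is a real change event carrying an object key and its resourceVersion.
type Event struct {
	Key string
	RV  int64
}

// WatchCache is a minimal, hypothetical model of the apiserver watch cache.
type WatchCache struct {
	resourceVersion int64
	events          []Event
}

// ApplyEvent stores a real change event and advances the resourceVersion.
func (c *WatchCache) ApplyEvent(e Event) {
	c.events = append(c.events, e)
	if e.RV > c.resourceVersion {
		c.resourceVersion = e.RV
	}
}

// ApplyBookmark handles a bookmark/progress-notify event: it carries no
// object, only a revision, and merely advances the resourceVersion.
func (c *WatchCache) ApplyBookmark(rv int64) {
	if rv > c.resourceVersion {
		c.resourceVersion = rv
	}
}

// ResumeVersion is the resourceVersion a resumed watch would start from.
func (c *WatchCache) ResumeVersion() int64 { return c.resourceVersion }

func main() {
	c := &WatchCache{}
	c.ApplyEvent(Event{Key: "/registry/pods/p1", RV: 10})
	// A long quiet period follows: nothing under the watched prefix changes,
	// but etcd's revision keeps moving due to writes elsewhere. Without
	// bookmarks the cache would still sit at RV 10, and a resumed watch
	// could fall outside etcd's history window, forcing an expensive relist.
	c.ApplyBookmark(500)
	fmt.Println(c.ResumeVersion()) // prints 500
}
```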
@@ -325,8 +323,6 @@ n/a - watch bookmarks don't have any frequency guarantees

 ## Production Readiness Review Questionnaire

-TODO: Fill in before making `Implementable`.
-
 ### Feature Enablement and Rollback

 _This section must be completed when targeting alpha to a release._
@@ -355,121 +351,80 @@ _This section must be completed when targeting alpha to a release._
 _This section must be completed when targeting beta graduation to a release._

 * **How can a rollout fail? Can it impact already running workloads?**
-  Try to be as paranoid as possible - e.g., what if some components will restart
-  mid-rollout?
+  In case of bugs, etcd progress notify events may be incorrectly parsed, leading
+  to kube-apiserver crashes.
+  It can't affect running workloads.

 * **What specific metrics should inform a rollback?**
+  Crashes of kube-apiserver.

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling and can't do that now.
+  Manual tests are still to be run.

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
   fields of API types, flags, etc.?**
-  Even if applying deprecation policies, they may still surprise some users.
+  No

 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._

 * **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
+  It's not a workload feature.

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
   the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [ ] Other (treat as last resort)
-    - Details:
+  - [x] Metrics
+    - Metric name: etcd_bookmark_counts
+    - Components exposing the metric: kube-apiserver

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At a high level, this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99,9% of /health requests per day finish with 200 code
+  n/a [Bookmark and watch progress notify events are best-effort in nature]

 * **Are there any missing metrics that would be useful to have to improve observability
   of this feature?**
-  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-  implementation difficulties, etc.).
+  No

 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._

 * **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-
-  For each of these, fill in the following—thinking about running existing user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
+
+  - etcd
+    - Usage description: We rely on etcd support for ProgressNotify events, which
+      was added in release 3.3. However, we also rely on the ability to configure
+      the notification period (the default of 10m is too high), which was added in
+      3.5 and backported to 3.4.11.
       - Impact of its outage on the feature:
+        etcd outage will translate to a cluster outage anyway
       - Impact of its degraded performance or high-error rates on the feature:
+        ProgressNotify events may not be sent as expected


 ### Scalability

-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
-
-_For beta, this section is required: reviewers must answer these questions._
-
-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
-
 * **Will enabling / using this feature result in any new API calls?**
-  Describe them, providing:
-  - API call type (e.g. PATCH pods)
-  - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
-  focusing mostly on:
-  - components listing and/or watching resources they didn't before
-  - API calls that may be triggered by changes of some Kubernetes resources
-    (e.g. update of object X triggers new updates of object Y)
-  - periodic API calls to reconcile state (e.g. periodic fetching state,
-    heartbeats, leader election, etc.)
+  No, although new events are sent via etcd to kube-apiserver as part
+  of the open Watch requests.

 * **Will enabling / using this feature result in introducing new API types?**
-  Describe them, providing:
-  - API type
-  - Supported number of objects per cluster
-  - Supported number of objects per namespace (for namespace-scoped objects)
+  No

 * **Will enabling / using this feature result in any new calls to the cloud
   provider?**

 * **Will enabling / using this feature result in increasing size or count of
   the existing API objects?**
-  Describe them, providing:
-  - API type(s):
-  - Estimated increase in size: (e.g., new annotation of size 32B)
-  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+  No

 * **Will enabling / using this feature result in increasing time taken by any
   operations covered by [existing SLIs/SLOs]?**
-  Think about adding additional work or introducing new steps in between
-  (e.g. need to do X to start a container), etc. Please describe the details.
+  No

 * **Will enabling / using this feature result in non-negligible increase of
   resource usage (CPU, RAM, disk, IO, ...) in any components?**
-  Things to keep in mind include: additional in-memory state, additional
-  non-trivial computations, excessive access to disks (including increased log
-  volume), significant amount of data sent and/or received over network, etc.
-  Think through this both in small and large cases, again with respect to the
-  [supported limits].
+  No

 ### Troubleshooting
@@ -480,20 +435,13 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**
+  The feature will not work (though it is a control-plane feature, not a workload one).

 * **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+  n/a

 * **What steps should be taken if SLOs are not being met to determine the problem?**
+  n/a

 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
@@ -502,11 +450,23 @@ _This section must be completed when targeting beta graduation to a release._

 2020-06-30: KEP Proposed.
 2020-08-04: KEP marked as implementable.
+v1.20: Feature graduated to Alpha
+2021-01-15: KEP updated to target Beta in v1.21

 ## Drawbacks

 n/a

+## Future work
+
+The above solution doesn't address the extensive relisting case in a
+setup with a single kube-apiserver. The reason is that we don't send
+Kubernetes Bookmark events on kube-apiserver shutdown (which would actually be
+beneficial on its own). However, doing that properly, together with ensuring
+that no requests are dropped in the meantime (even in the single kube-apiserver
+scenario), isn't trivial and probably deserves its own KEP.
+As a result, we're leaving this as future work.
+
 ## Alternatives

 ### Initialize watch cache from etcd history window
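The "Future work" hunk above proposes sending a final Bookmark event on kube-apiserver shutdown. A toy Go sketch of that idea follows; every name here is hypothetical (not the actual kube-apiserver code), and it deliberately ignores the hard part the KEP calls out, namely guaranteeing no in-flight request is dropped:

```go
package main

import "fmt"

// Bookmark is a watch event that carries only a resourceVersion, no object.
type Bookmark struct{ RV int64 }

// watcher models one open watch connection (hypothetical type).
type watcher struct {
	ch chan Bookmark
}

// server models a kube-apiserver with open watchers (hypothetical type).
type server struct {
	currentRV int64
	watchers  []*watcher
}

// shutdown delivers a final bookmark at the current resourceVersion to every
// open watcher before closing it, so clients can resume from a fresh version
// against the next kube-apiserver instead of relisting.
func (s *server) shutdown() {
	for _, w := range s.watchers {
		w.ch <- Bookmark{RV: s.currentRV}
		close(w.ch)
	}
}

func main() {
	w := &watcher{ch: make(chan Bookmark, 1)}
	s := &server{currentRV: 1234, watchers: []*watcher{w}}
	s.shutdown()
	b := <-w.ch
	fmt.Println("resume from", b.RV) // prints: resume from 1234
}
```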

keps/sig-api-machinery/1904-efficient-watch-resumption/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -23,7 +23,7 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.20"
+latest-milestone: "v1.21"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:

@@ -40,5 +40,5 @@ feature-gates:
   disable-supported: true

 # The following PRR answers are required at beta release
-#metrics:
-#  - my_feature_metric
+metrics:
+  - etcd_bookmark_counts
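With `etcd_bookmark_counts` registered above as the PRR metric, an operator could confirm bookmarks are flowing with something like the following sketch (assumes kubectl access to the kube-apiserver metrics endpoint; requires a live cluster):

```shell
# Sketch: list the bookmark counters exposed by kube-apiserver.
kubectl get --raw /metrics | grep etcd_bookmark_counts
```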
