Commit 91267f2

KEP-4033: update for GA
1 parent 020f1d0 commit 91267f2

3 files changed

+73
-42
lines changed

keps/prod-readiness/sig-node/4033.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,5 @@ alpha:
66
approver: "@johnbelamaric"
77
beta:
88
approver: "@johnbelamaric"
9+
stable:
10+
approver: "@johnbelamaric"

keps/sig-node/4033-group-driver-detection-over-cri/README.md

Lines changed: 67 additions & 39 deletions
@@ -133,18 +133,18 @@ checklist items _must_ be updated for the enhancement to be released.
133133

134134
Items marked with (R) are required *prior to targeting to a milestone / release*.
135135

136-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
137-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
138-
- [ ] (R) Design details are appropriately documented
139-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
140-
- [ ] e2e Tests for all Beta API Operations (endpoints)
141-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
142-
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
143-
- [ ] (R) Graduation criteria is in place
144-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
136+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
137+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
138+
- [x] (R) Design details are appropriately documented
139+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
140+
- [x] e2e Tests for all Beta API Operations (endpoints)
141+
- [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
142+
- [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
143+
- [x] (R) Graduation criteria is in place
144+
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
145145
- [ ] (R) Production readiness review completed
146146
- [ ] (R) Production readiness review approved
147-
- [ ] "Implementation History" section is up-to-date for milestone
147+
- [x] "Implementation History" section is up-to-date for milestone
148148
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
149149
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
150150

@@ -278,19 +278,22 @@ will take precedence over cgroupDriver setting from the kubelet config (or
278278
`--cgroup-driver` command line flag). If the runtime does not provide
279279
information about the cgroup driver, then kubelet will fall back to using its
280280
own configuration (`cgroupDriver` from kubeletConfig or the `--cgroup-driver`
281-
flag). Further, the kubeletConfig field and `--cgroup-driver` flag will be
282-
marked as deprecated, to be dropped when support for the feature is adopted by
283-
CRI-O and containerd. Usage of the deprecated setting will produce a log
281+
flag). For beta, the kubeletConfig field and `--cgroup-driver` flag will be
282+
marked as deprecated. Usage of the deprecated setting will produce a log
284283
message, e.g.:
285284

286285
```
287286
cgroupDriver option has been deprecated and will be dropped in a future release. Please upgrade to a CRI implementation that supports cgroup-driver detection.
288287
```
289288

289+
The `--cgroup-driver` flag will be completely removed when support for the
290+
feature is graduated to GA and the kubelet refuses to start if the CRI runtime
291+
does not support the feature.
292+
290293
Kubelet startup is modified so that connection to the CRI server (container
291294
runtime) is established and RuntimeConfig is queried before initializing the
292295
kubelet internal container-manager which is responsible for kubelet-side cgroup
293-
management. If supported by the runtime, RuntimeConfig query is expected to
296+
management. RuntimeConfig query is expected to
294297
succeed, an error (error response or timeout) is regarded as a failed
295298
initialization of the runtime service and kubelet will exit with an error
296299
message and an error code.
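
The startup decision described in this hunk can be modelled as a small function. This is a minimal sketch in Python, not the kubelet's Go implementation: `resolve_cgroup_driver`, the three exception classes, and the `fallback_allowed` switch are hypothetical names introduced here for illustration. `fallback_allowed=True` corresponds to the alpha/beta behavior and `False` to the GA behavior.

```python
class RuntimeConfigUnsupported(Exception):
    """Hypothetical: the runtime does not implement the RuntimeConfig query."""


class RuntimeConfigFailed(Exception):
    """Hypothetical: the query returned an error response or timed out."""


class KubeletStartupError(Exception):
    """Hypothetical: stands in for the kubelet exiting with an error code."""


def resolve_cgroup_driver(query_runtime_config, configured_driver, fallback_allowed):
    """Pick the cgroup driver before the container manager is initialized.

    query_runtime_config: callable returning the driver reported by the runtime,
        or raising one of the exceptions above (an assumption of this sketch).
    configured_driver: the kubelet's own cgroupDriver / --cgroup-driver value.
    fallback_allowed: True for alpha/beta semantics, False for GA semantics.
    """
    try:
        reported = query_runtime_config()
    except RuntimeConfigUnsupported as err:
        if fallback_allowed:
            # Alpha/beta: fall back to the kubelet's own configuration.
            return configured_driver
        # GA: the runtime must support the feature, so refuse to start.
        raise KubeletStartupError("CRI runtime does not report a cgroup driver") from err
    except RuntimeConfigFailed as err:
        # An error response or timeout is a failed runtime initialization.
        raise KubeletStartupError(f"RuntimeConfig query failed: {err}") from err
    # The runtime's answer takes precedence over the kubelet setting.
    return reported
```

With a runtime that reports `systemd`, the function returns `systemd` regardless of `configured_driver`. With a runtime that lacks support, it returns `configured_driver` under alpha/beta semantics and raises under GA semantics.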
@@ -347,11 +350,7 @@ extending the production code to implement this enhancement.
347350
Kubelet unit tests that use the
348351
[fake_runtime](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/testing/fake_runtime_service.go)
349352
will be updated to verify the Kubelet is correctly inheriting the cgroup
350-
driver:
351-
352-
- If CRI returns "Not Implemented", Kubelet falls back to its own internal cgroup driver
353-
- If CRI returns an error, Kubelet fails to run
354-
- If CRI returns the cgroup driver, Kubelet overrides its cgroup driver to use the one returned by CRI.
353+
driver.
355354

356355
##### Integration tests
357356

@@ -452,21 +451,21 @@ in back-to-back releases.
452451

453452
#### Alpha
454453

455-
- [ ] Feature implemented behind a feature flag, fallback to old behavior if flag is enabled but runtime support not present.
456-
- [ ] Initial unit tests completed and enabled
454+
- Feature implemented behind a feature flag, fallback to old behavior if flag is enabled but runtime support not present.
455+
- Initial unit tests completed and enabled
457456

458457

459458
#### Beta
460459

461-
- [ ] Feature implemented, with the feature gate enabled by default.
462-
- [ ] Released versions of CRI-O and containerd runtime implementations support the feature
463-
- [ ] Drop fallback to old behavior. CRI implementations expected to have support.
460+
- Feature implemented, with the feature gate enabled by default.
461+
- Released versions of CRI-O and containerd runtime implementations support the feature
464462

465463
#### GA
466464

467-
- [ ] No bugs reported in the previous cycle.
468-
- [ ] Remove fallback behavior
469-
- [ ] Remove feature gate
465+
- No bugs reported in the previous cycle.
466+
- Drop fallback to old behavior. CRI implementations expected to have support.
467+
- Remove feature gate
468+
- All issues and gaps identified as feedback during beta are resolved
470469

471470
### Upgrade / Downgrade Strategy
472471

@@ -482,9 +481,8 @@ enhancement:
482481
cluster required to make on upgrade, in order to make use of the enhancement?
483482
-->
484483

485-
The fallback behavior specified in alpha will prevent the majority of regressions, as Kubelet will choose a cgroup driver,
484+
In alpha and beta, the fallback behavior specified in alpha will prevent the majority of regressions, as Kubelet will choose a cgroup driver,
486485
same as it used to before this KEP, even when the feature gate is on.
487-
488486
The feature gate is another layer of protection, requiring admins to specifically opt-into this behavior.
489487

490488
### Version Skew Strategy
@@ -502,7 +500,7 @@ enhancement:
502500
CRI or CNI may require updating that component before the kubelet.
503501
-->
504502

505-
If either kubelet or the container runtime running on the node does not support
503+
In alpha and beta, if either kubelet or the container runtime running on the node does not support
506504
the new field in the CRI API, they just resort to the existing behavior of
507505
respecting their individual cgroup-driver setting. That is, if the node has a
508506
container runtime that does not support this field the kubelet will use its
@@ -513,6 +511,11 @@ ignored by kubelet and it will resort to its own configuration settings. Note:
513511
this does present a configuration skew risk, but that risk is the same as
514512
currently exists today.
515513

514+
In GA, the fallback behavior will be removed, and the kubelet will rely on the
515+
container runtime to implement the feature. In practice, this means the cluster
516+
must use at least containerd v2.0 or cri-o v1.28 as a prerequisite for
517+
upgrading.
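
The prerequisite above lends itself to a pre-upgrade check. The following is a rough sketch, assuming the runtime version is available as a `name://version` string (the form in which nodes report their container runtime, for example `containerd://1.7.13`). The function name and the `MIN_VERSIONS` table are hypothetical; the minimum versions are the ones stated in the text above.

```python
import re

# Minimum runtime versions named in the KEP text: containerd v2.0, cri-o v1.28.
MIN_VERSIONS = {"containerd": (2, 0), "cri-o": (1, 28)}


def runtime_meets_ga_prerequisite(runtime_version: str) -> bool:
    """Return True if a 'name://major.minor[.patch]' string meets the minimum.

    Unknown runtimes and unparseable strings return False, because support
    cannot be confirmed from the version string alone.
    """
    match = re.match(r"^([a-z0-9-]+)://v?(\d+)\.(\d+)", runtime_version)
    if match is None:
        return False
    name = match.group(1)
    version = (int(match.group(2)), int(match.group(3)))
    minimum = MIN_VERSIONS.get(name)
    if minimum is None:
        return False
    # Tuple comparison orders by major version first, then minor.
    return version >= minimum
```

A node reporting `containerd://1.7.13` would fail this check and need a runtime upgrade before the kubelet is upgraded, while `cri-o://1.28.1` would pass.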
518+
516519
## Production Readiness Review Questionnaire
517520

518521
<!--
@@ -566,13 +569,19 @@ Any change of default behavior may be surprising to users or break existing
566569
automations, so be extremely careful here.
567570
-->
568571

569-
Yes. If/when the runtime is updated to a version that supports this, kubelet
572+
Yes.
573+
574+
In alpha and beta, when the runtime is updated to a version that supports this, kubelet
570575
will ignore the cgroupDriver config option/flag. However, this change in
571576
behavior should not cause any breakages (on the contrary, it should fix
572577
scenarios where the kubelet `--cgroup-driver` setting is incorrectly
573578
configured). With old versions of the container runtimes (that don't support
574579
the new field in the CRI API) the default behavior is not changed.
575580

581+
In GA, the fallback behavior (and the kubelet `--cgroup-driver` flag) is
582+
removed and the kubelet requires the CRI runtime to implement the feature (see
583+
[Version Skew Strategy](#version-skew-strategy)).
584+
576585
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
577586

578587
<!--
@@ -586,7 +595,9 @@ feature.
586595
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
587596
-->
588597

589-
Yes, through the feature gate.
598+
In alpha and beta, yes, through the feature gate.
599+
600+
In GA, no.
590601

591602
###### What happens if we reenable the feature if it was previously rolled back?
592603

@@ -662,9 +673,13 @@ testing of the feature gate (in addition to the unit tests) is performed.
662673
Even if applying deprecation policies, they may still surprise some users.
663674
-->
664675

665-
Yes, the CgroupDriver field of the Kubelet configuration (and the corresponding
666-
`--cgroup-driver` flag) will be marked as deprecated, but won’t be dropped
667-
until wide adoption of the field on CRI has been done.
676+
Yes.
677+
678+
In alpha and beta, the CgroupDriver field of the Kubelet configuration (and the
679+
corresponding `--cgroup-driver` flag) will be marked as deprecated.
680+
681+
In GA, the CgroupDriver configuration option and the `--cgroup-driver` flag are
682+
removed.
668683

669684
### Monitoring Requirements
670685

@@ -699,9 +714,15 @@ and operation of this feature.
699714
Recall that end users cannot usually observe component logs or access metrics.
700715
-->
701716

702-
No metrics likely will expose this. Examining kubelet logs whould inform the
717+
No metrics will expose this.
718+
719+
In alpha and beta, examining kubelet logs would show
703720
that the cgroup driver setting instructed by the runtime is being used.
704721

722+
In GA, the kubelet refuses to start if the feature is not working. This can be
723+
observed in the system logs and the node being in NotReady state or node not
724+
being registered in cluster bootstrap.
725+
705726
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
706727

707728
<!--
@@ -761,8 +782,12 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
761782
- Impact of its degraded performance or high-error rates on the feature:
762783
-->
763784

764-
A CRI (server) implementation of the correct version. However, the feature will
765-
fallback if the CRI implementation doesn’t support the feature.
785+
A CRI (server) implementation of the correct version.
786+
787+
In alpha and beta, the feature will fall back if the CRI implementation doesn’t
788+
support the feature.
789+
790+
In GA, a sufficiently recent version of the CRI runtime is a hard requirement.
766791

767792
### Scalability
768793

@@ -904,7 +929,7 @@ For each of them, fill in the following information by copying the below templat
904929
- Testing: Are there any tests for failure mode? If not, describe why.
905930
-->
906931

907-
Same that exists today: Kubelet and the CRI server (container runtime) not
932+
In alpha and beta, the same as exists today: Kubelet and the CRI server (container runtime) not
908933
agreeing on the CgroupDriver while one of them doesn’t support the feature.
909934

910935
###### What steps should be taken if SLOs are not being met to determine the problem?
@@ -924,6 +949,9 @@ Major milestones might include:
924949
- when the KEP was retired or superseded
925950
-->
926951

952+
- v1.28: alpha
953+
- v1.31: beta
954+
927955
## Drawbacks
928956

929957
<!--

keps/sig-node/4033-group-driver-detection-over-cri/kep.yaml

Lines changed: 4 additions & 3 deletions
@@ -17,25 +17,26 @@ see-also: []
1717
replaces: []
1818

1919
# The target maturity stage in the current dev cycle for this KEP.
20-
stage: beta
20+
stage: stable
2121

2222
# The most recent milestone for which work toward delivery of this KEP has been
2323
# done. This can be the current (upcoming) milestone, if it is being actively
2424
# worked on.
25-
latest-milestone: "v1.31"
25+
latest-milestone: "v1.34"
2626

2727
# The milestone at which this feature was, or is targeted to be, at each stage.
2828
milestone:
2929
alpha: "v1.28"
3030
beta: "v1.31"
31+
stable: "v1.34"
3132

3233
# The following PRR answers are required at alpha release
3334
# List the feature gate name and the components for which it must be enabled
3435
feature-gates:
3536
- name: KubeletCgroupDriverFromCRI
3637
components:
3738
- kubelet
38-
disable-supported: true
39+
disable-supported: false
3940

4041
# The following PRR answers are required at beta release
4142
# No metrics will be added for this feature, as it's an internal
