Commit e634c15

Merge pull request kubernetes#3589 from andrewsykim/kep-1965
KEP-1965: update with Beta criteria/milestone and PRR questions answered
2 parents b6bef75 + 8a9df00 commit e634c15

3 files changed: +194 -65 lines changed
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 1965
2+
beta:
3+
approver: "@deads2k" # and @wojtek-t

keps/sig-api-machinery/1965-kube-apiserver-identity/README.md

Lines changed: 181 additions & 61 deletions
@@ -4,10 +4,16 @@
44
- [Release Signoff Checklist](#release-signoff-checklist)
55
- [Summary](#summary)
66
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
79
- [Proposal](#proposal)
810
- [Caveats](#caveats)
911
- [Design Details](#design-details)
1012
- [Test Plan](#test-plan)
13+
- [Prerequisite testing updates](#prerequisite-testing-updates)
14+
- [Unit tests](#unit-tests)
15+
- [Integration tests](#integration-tests)
16+
- [e2e tests](#e2e-tests)
1117
- [Graduation Criteria](#graduation-criteria)
1218
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
1319
- [Beta -> GA Graduation](#beta---ga-graduation)
@@ -18,6 +24,7 @@
1824
- [Monitoring Requirements](#monitoring-requirements)
1925
- [Dependencies](#dependencies)
2026
- [Scalability](#scalability)
27+
- [Troubleshooting](#troubleshooting)
2128
- [Implementation History](#implementation-history)
2229
- [Alternatives](#alternatives)
2330
- [Alternative 1: new API + storage TTL](#alternative-1-new-api--storage-ttl)
@@ -30,17 +37,25 @@
3037

3138
Items marked with (R) are required *prior to targeting to a milestone / release*.
3239

33-
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
34-
- [x] (R) KEP approvers have approved the KEP status as `implementable`
35-
- [x] (R) Design details are appropriately documented
36-
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
37-
- [x] (R) Graduation criteria is in place
40+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
41+
- [X] (R) KEP approvers have approved the KEP status as `implementable`
42+
- [X] (R) Design details are appropriately documented
43+
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
44+
- [ ] e2e Tests for all Beta API Operations (endpoints)
45+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
46+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
47+
- [X] (R) Graduation criteria is in place
48+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
3849
- [ ] (R) Production readiness review completed
39-
- [ ] Production readiness review approved
50+
- [ ] (R) Production readiness review approved
4051
- [ ] "Implementation History" section is up-to-date for milestone
4152
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
4253
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
4354

55+
<!--
56+
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
57+
-->
58+
4459
[kubernetes.io]: https://kubernetes.io/
4560
[kubernetes/enhancements]: https://git.k8s.io/enhancements
4661
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
@@ -65,14 +80,24 @@ load balancer for the cluster, where the advertise IP address is set to the IP
6580
address of the load balancer, all three kube-apiservers will have the same
6681
advertise IP address.
6782

83+
### Goals
84+
85+
* Provide a mechanism by which controllers can uniquely identify kube-apiservers in a cluster.
86+
87+
### Non-Goals
88+
89+
* improving the availability of kube-apiserver
90+
6891
## Proposal
6992

7093
We will use “hostname+PID+random suffix (e.g. 6 base58 digits)” as the ID.
7194

72-
Similar to the [node heartbeat](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md),
95+
Similar to the [node heartbeats](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/589-efficient-node-heartbeats),
7396
a kube-apiserver will store its ID in a Lease object. All kube-apiserver Leases
74-
will be stored in a special namespace “kube-apiserver-lease”. A controller will
75-
garbage collect expired Leases.
97+
will be stored in a special namespace `kube-apiserver-lease`. The Lease creation
98+
and heartbeat will be managed by a controller that is started in kube-apiserver's
99+
post startup hook. A separate controller in kube-controller-manager will be responsible
100+
for garbage collecting expired Leases.
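
The ID scheme in the Proposal ("hostname+PID+random suffix") can be sketched as follows. This is a minimal illustration, not the kube-apiserver implementation: the `-` separator, the Bitcoin-style base58 alphabet, and the helper names `randomBase58` and `newIdentity` are assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"os"
)

// base58Alphabet is the commonly used Bitcoin-style alphabet (no 0, O, I, l).
// Which alphabet the real implementation uses is an assumption here.
const base58Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// randomBase58 returns n characters drawn uniformly from base58Alphabet.
func randomBase58(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(base58Alphabet)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = base58Alphabet[idx.Int64()]
	}
	return string(out), nil
}

// newIdentity joins hostname, PID and a 6-character random suffix.
// The "-" separator is hypothetical; the KEP only names the three parts.
func newIdentity(hostname string, pid int) (string, error) {
	suffix, err := randomBase58(6)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%d-%s", hostname, pid, suffix), nil
}

func main() {
	hostname, err := os.Hostname()
	if err != nil {
		panic(err)
	}
	id, err := newIdentity(hostname, os.Getpid())
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Because the suffix is regenerated on every start, a restarted kube-apiserver on the same host gets a new ID, which is why the old Lease has to be garbage collected rather than reused.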
76101

77102
### Caveats
78103

@@ -95,20 +120,21 @@ will only delay the storage migration for the same period of time.
95120

96121
## Design Details
97122

98-
The [kubelet heartbeat](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md)
123+
The [kubelet heartbeat](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/589-efficient-node-heartbeats)
99124
logic [already written](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/nodelease)
100125
will be re-used. The heartbeat controller will be added to kube-apiserver in a
101126
post-start hook.
102127

103-
Each kube-apiserver will refresh its Lease every 10s by default. A GC controller
104-
will watch the Lease API using an informer, and periodically resync its local
105-
cache. On processing an item, the controller will delete the Lease if the last
106-
`renewTime` was more than `leaseDurationSeconds` ago (default to 1h). The
107-
default `leaseDurationSeconds` is chosen to be way longer than the default
128+
Each kube-apiserver will run a lease controller in a post-start-hook to refresh
129+
its Lease every 10s by default. A separate controller named [storageversiongc](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/storageversiongc/gc_controller.go)
130+
running in kube-controller-manager will watch the Lease API using an informer, and
131+
periodically resync its local cache. On processing an item, the `storageversiongc` controller
132+
will delete the Lease if the last `renewTime` was more than `leaseDurationSeconds` ago (default to 1h).
133+
The default `leaseDurationSeconds` is chosen to be way longer than the default
108134
refresh period, to tolerate clock skew and/or accidental refresh failure. The
109135
default resync period is 1h. By default, assuming negligible clock skew, a Lease
110136
will be deleted if the kube-apiserver fails to refresh its Lease for one to two
111-
hours. The GC controller will run in kube-controller-manager, to leverage leader
137+
hours. The `storageversiongc` controller will run in kube-controller-manager, to leverage leader
112138
election and reduce conflicts.
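
The expiry rule described above (delete a Lease once its last `renewTime` is more than `leaseDurationSeconds` in the past) can be sketched as below. The `lease` struct and the `expired` function are hypothetical stand-ins for the example, not the `storageversiongc` code; only the two field names come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// lease carries the two Lease spec fields the GC decision depends on.
type lease struct {
	renewTime            time.Time
	leaseDurationSeconds int32
}

// expired reports whether the lease was last renewed more than
// leaseDurationSeconds before now.
func expired(l lease, now time.Time) bool {
	deadline := l.renewTime.Add(time.Duration(l.leaseDurationSeconds) * time.Second)
	return now.After(deadline)
}

func main() {
	now := time.Date(2022, 10, 5, 12, 0, 0, 0, time.UTC)
	// Renewed 10s ago with the default 1h duration: still live.
	fresh := lease{renewTime: now.Add(-10 * time.Second), leaseDurationSeconds: 3600}
	// Last renewed 61 minutes ago: eligible for deletion on the next pass.
	stale := lease{renewTime: now.Add(-61 * time.Minute), leaseDurationSeconds: 3600}
	fmt.Println(expired(fresh, now), expired(stale, now))
}
```

One plausible reading of the "one to two hours" figure: a Lease becomes eligible for deletion 1h after its last renewal, and the controller may not revisit it until the next resync, which can be up to 1h later.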
113139

114140
The refresh rate, lease duration will be configurable through kube-apiserver
@@ -117,12 +143,30 @@ flag.
117143

118144
### Test Plan
119145

120-
- integration test for creating the Namespace and the Lease on kube-apiserver
121-
startup
122-
- integration test for not creating the StorageVersions after creating the
123-
Lease
124-
- integration test for garbage collecting a Lease that isn't refreshed
125-
- integration test for not garbage collecting a Lease that is refreshed
146+
[X] I/we understand the owners of the involved components may require updates to
147+
existing tests to make this code solid enough prior to committing the changes necessary
148+
to implement this enhancement.
149+
150+
##### Prerequisite testing updates
151+
152+
##### Unit tests
153+
154+
- `staging/src/k8s.io/apiserver/pkg/endpoints`
155+
156+
##### Integration tests
157+
158+
[apiserver_identity_test.go](https://github.com/kubernetes/kubernetes/blob/24238425492227fdbb55c687fd4e94c8b58c1ee3/test/integration/controlplane/apiserver_identity_test.go)
159+
- integration test for creating the Namespace and the Lease on kube-apiserver startup
160+
- integration test for not creating the StorageVersions after creating the Lease
161+
- integration test for garbage collecting a Lease that isn't refreshed
162+
- integration test for not garbage collecting a Lease that is refreshed
163+
164+
##### e2e tests
165+
166+
Proposed e2e tests:
167+
- an e2e test that validates the existence of the Lease objects per kube-apiserver
168+
- an e2e test that restarts a kube-apiserver and validates that a new Lease is created
169+
with a newly generated ID and the old lease is garbage collected
126170

127171
### Graduation Criteria
128172

@@ -131,14 +175,16 @@ Alpha should provide basic functionality covered with tests described above.
131175
#### Alpha -> Beta Graduation
132176

133177
- Appropriate metrics are agreed on and implemented
134-
- An e2e test plan is agreed and implemented (e.g. chaosmonkey in a regional
135-
cluster)
178+
- Sufficient integration tests covering basic functionality of this enhancement.
179+
- e2e tests outlined in the test plan are implemented
136180

137181
#### Beta -> GA Graduation
138182

139-
- Conformance tests are agreed on and implemented
183+
- SIG consensus on whether Lease names should be unique per process (i.e. uuid) or persist across restarts (i.e. hostname)
184+
- SIG consensus on whether Lease names should include a hostname identifier (via label) if they do NOT persist across restarts.
185+
- SIG consensus on where the storageversiongc controller should run (kube-apiserver vs kube-controller-manager).
140186

141-
**For non-optional features moving to GA, the graduation criteria must include
187+
**For non-optional features moving to GA, the graduation criteria must include
142188
[conformance tests].**
143189

144190
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
@@ -154,64 +200,138 @@ Alpha should provide basic functionality covered with tests described above.
154200

155201
### Feature Enablement and Rollback
156202

157-
* **How can this feature be enabled / disabled in a live cluster?**
158-
- [x] Feature gate (also fill in values in `kep.yaml`)
159-
- Feature gate name: APIServerIdentity
160-
- Components depending on the feature gate: kube-apiserver
203+
###### How can this feature be enabled / disabled in a live cluster?
204+
205+
- [X] Feature gate (also fill in values in `kep.yaml`)
206+
- Feature gate name: APIServerIdentity
207+
- Components depending on the feature gate: kube-apiserver, kube-controller-manager
161208

162-
* **Does enabling the feature change any default behavior?**
163-
A namespace "kube-apiserver-lease" will be used to store kube-apiserver
164-
identity Leases.
209+
###### Does enabling the feature change any default behavior?
165210

166-
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
167-
the enablement)?**
168-
Yes. Stale Lease objects will remain stale (`renewTime` won't get updated)
211+
A namespace `kube-apiserver-lease` will be created to store kube-apiserver identity Leases.
212+
Old leases will be actively garbage collected by kube-controller-manager.
169213

170-
* **What happens if we reenable the feature if it was previously rolled back?**
171-
Stale Lease objects will be garbage collected.
214+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
215+
216+
Yes. Stale Lease objects will remain stale (renewTime won't get updated)
217+
218+
###### What happens if we reenable the feature if it was previously rolled back?
219+
220+
Stale Lease objects will be garbage collected.
221+
222+
###### Are there any tests for feature enablement/disablement?
223+
224+
There are some tests that require enabling the feature gate in [apiserver_identity_test.go](https://github.com/kubernetes/kubernetes/blob/24238425492227fdbb55c687fd4e94c8b58c1ee3/test/integration/controlplane/apiserver_identity_test.go).
225+
However, there are no tests validating feature enablement/disablement based on the gate. These tests should be added prior to Beta.
172226

173227
### Rollout, Upgrade and Rollback Planning
174228

175-
_This section must be completed when targeting beta graduation to a release._
229+
###### How can a rollout or rollback fail? Can it impact already running workloads?
230+
231+
Existing workloads should not be impacted by this feature, unless they were
232+
looking for Lease objects in the `kube-apiserver-lease` namespace.
233+
234+
###### What specific metrics should inform a rollback?
235+
236+
Recently added [healthcheck metrics for apiserver](https://github.com/kubernetes/kubernetes/pull/112741), which include
237+
the health of the post-start hook, can be used to inform rollback, specifically `kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-controller"}`.
238+
239+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
240+
241+
Manual testing for upgrade/rollback will be done prior to Beta. Steps taken for manual tests will be updated here.
242+
243+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
244+
245+
No.
176246

177247
### Monitoring Requirements
178248

179-
_This section must be completed when targeting beta graduation to a release._
249+
###### How can an operator determine if the feature is in use by workloads?
250+
251+
The existence of the `kube-apiserver-lease` namespace and Lease objects in the namespace
252+
will determine if the feature is working. Operators can check for clients that are accessing
253+
the Lease object to see if workloads or other controllers are relying on this feature.
254+
255+
###### How can someone using this feature know that it is working for their instance?
256+
257+
- [ ] Events
258+
- Event Reason:
259+
- [X] API .status
260+
- Condition name:
261+
- Other field: `.spec.holderIdentity`, `.spec.acquireTime`, `.spec.renewTime`, `.spec.leaseTransitions`
262+
- [X] Other (treat as last resort)
263+
- Details: audit logs for clients that are reading the Lease objects
264+
265+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
266+
267+
Some reasonable SLOs could be:
268+
* Number of (non-expired) Leases in `kube-apiserver-lease` is equal to the number of expected kube-apiservers 95% of the time.
269+
* kube-apiservers hold a lease which is not older than 2 times the lease heartbeat period 95% of the time.
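
A point-in-time check behind these two SLOs might look like the sketch below. The `leaseStatus` type, the `countFresh` helper, and the sample values are hypothetical; a real check would read Lease objects from the `kube-apiserver-lease` namespace and sample repeatedly to obtain the 95% figure.

```go
package main

import (
	"fmt"
	"time"
)

// leaseStatus is a hypothetical snapshot of one kube-apiserver Lease.
type leaseStatus struct {
	holder    string
	renewTime time.Time
}

// countFresh returns how many leases were renewed within twice the
// heartbeat period, measured at now.
func countFresh(leases []leaseStatus, now time.Time, heartbeat time.Duration) int {
	n := 0
	for _, l := range leases {
		if now.Sub(l.renewTime) <= 2*heartbeat {
			n++
		}
	}
	return n
}

func main() {
	now := time.Now()
	leases := []leaseStatus{
		{holder: "apiserver-a", renewTime: now.Add(-5 * time.Second)},
		{holder: "apiserver-b", renewTime: now.Add(-15 * time.Second)},
		{holder: "apiserver-c", renewTime: now.Add(-45 * time.Second)},
	}
	expected := 3 // number of kube-apiservers the operator expects
	fresh := countFresh(leases, now, 10*time.Second)
	fmt.Printf("fresh=%d expected=%d healthy=%t\n", fresh, expected, fresh == expected)
}
```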
270+
271+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
272+
273+
- [X] Metrics
274+
- Metric name: kubernetes_healthcheck
275+
- [Optional] Aggregation method: name="poststarthook/start-kube-apiserver-identity-lease-controller"
276+
- Components exposing the metric: kube-apiserver
277+
278+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
279+
280+
A metric measuring the last updated time for a lease could be useful, but it could introduce cardinality problems
281+
since the lease is changed on every restart of kube-apiserver.
282+
283+
We may consider adding a metric exposing the count of leases in `kube-apiserver-lease`.
180284

181285
### Dependencies
182286

183-
_This section must be completed when targeting beta graduation to a release._
287+
###### Does this feature depend on any specific services running in the cluster?
288+
289+
No
184290

185291
### Scalability
186292

187-
* **Will enabling / using this feature result in any new API calls?**
188-
Describe them, providing:
189-
- API call type (e.g. PATCH pods): UPDATE leases
190-
- estimated throughput:
191-
- originating component(s) (e.g. Kubelet, Feature-X-controller):
192-
kube-apiserver
293+
###### Will enabling / using this feature result in any new API calls?
294+
295+
Yes, kube-apiserver will be making new API calls as part of the lease controller.
296+
297+
###### Will enabling / using this feature result in introducing new API types?
298+
299+
No, the feature will use the existing Lease API.
300+
301+
###### Will enabling / using this feature result in any new calls to the cloud provider?
302+
303+
No
304+
305+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
306+
307+
Yes, it will increase the number of Leases in a cluster by the number of control plane VMs.
308+
309+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
310+
311+
No.
312+
313+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
314+
315+
The lease controller may use additional resources in kube-apiserver, but it is likely negligible.
316+
317+
### Troubleshooting
318+
319+
###### How does this feature react if the API server and/or etcd is unavailable?
193320

194-
focusing mostly on:
195-
- components listing and/or watching resources they didn't before:
196-
kube-controller-manager
197-
- periodic API calls to reconcile state (e.g. periodic fetching state,
198-
heartbeats, leader election, etc.): kube-apiserver heartbeat every 10s
321+
Lease objects for a given kube-apiserver may become stale if the kube-apiserver or etcd is non-responsive. Clients should
322+
be able to respond accordingly by checking the lease expiration.
199323

200-
* **Will enabling / using this feature result in increasing size or count of
201-
the existing API objects?**
202-
Describe them, providing:
203-
- API type(s): leases
204-
- Estimated amount of new objects: one per living kube-apiserver
324+
###### What are other known failure modes?
205325

206-
* **Will enabling / using this feature result in increasing time taken by any
207-
operations covered by [existing SLIs/SLOs]?**
208-
No.
326+
* lease objects can become stale if etcd is unavailable and clients do not check lease expiration.
327+
* kube-apiserver heartbeats consuming too many resources (unlikely but possible)
209328

210-
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
329+
###### What steps should be taken if SLOs are not being met to determine the problem?
211330

212331
## Implementation History
213332

214333
- 2020-09-18: KEP introduced
334+
- 2022-10-05: KEP updated with Beta criteria and all PRR questions answered.
215335

216336
## Alternatives
217337

keps/sig-api-machinery/1965-kube-apiserver-identity/kep.yaml

Lines changed: 10 additions & 4 deletions
@@ -2,6 +2,8 @@ title: kube-apiserver identity
22
kep-number: 1965
33
authors:
44
- "@roycaihw"
5+
- "@andrewsykim"
6+
- "@enj"
57
owning-sig: sig-api-machinery
68
status: implementable
79
creation-date: 2020-09-02
@@ -17,18 +19,17 @@ see-also:
1719
- "https://docs.google.com/document/d/1ed7miqlFY7-9lZxE7gzoyx_MFQCtFEDqtcKMpaAmHys/edit?usp=sharing"
1820

1921
# The target maturity stage in the current dev cycle for this KEP.
20-
stage: alpha
22+
stage: beta
2123

2224
# The most recent milestone for which work toward delivery of this KEP has been
2325
# done. This can be the current (upcoming) milestone, if it is being actively
2426
# worked on.
25-
latest-milestone: "v1.20"
27+
latest-milestone: "v1.26"
2628

2729
# The milestone at which this feature was, or is targeted to be, at each stage.
2830
milestone:
2931
alpha: "v1.20"
30-
beta: "v1.21"
31-
stable: "v1.22"
32+
beta: "v1.26"
3233

3334
# The following PRR answers are required at alpha release
3435
# List the feature gate name and the components for which it must be enabled
@@ -37,3 +38,8 @@ feature-gates:
3738
components:
3839
- kube-apiserver
3940
disable-supported: true
41+
42+
metrics:
43+
- kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-controller",type="healthz"}
44+
- kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-controller",type="readyz"}
45+
- kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-controller",type="livez"}
