# KEP-1965: kube-apiserver identity

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Proposal](#proposal)
  - [Caveats](#caveats)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha -> Beta Graduation](#alpha---beta-graduation)
    - [Beta -> GA Graduation](#beta---ga-graduation)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
  - [Alternative 1: new API + storage TTL](#alternative-1-new-api--storage-ttl)
  - [Alternative 2: using storage interface directly](#alternative-2-using-storage-interface-directly)
  - [Alternative 3: storage interface + Lease API](#alternative-3-storage-interface--lease-api)
  - [Alternative 4: storage interface + new API](#alternative-4-storage-interface--new-api)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

In an HA cluster, each kube-apiserver has a unique ID. Controllers have access
to the list of IDs of the live kube-apiservers in the cluster.

## Motivation

The [dynamic coordinated storage version API](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190802-dynamic-coordinated-storage-version.md#curating-a-list-of-participating-api-servers-in-ha-master)
needs such a list to garbage collect stale records. The
[API priority and fairness feature](https://github.com/kubernetes/kubernetes/pull/91389)
needs a unique identifier for an apiserver reporting its concurrency limit.

Currently, such a list is already maintained in the “kubernetes” Endpoints,
where the kube-apiservers’ advertised IP addresses serve as the IDs. However,
this does not work in all flavors of Kubernetes deployments. For example, if
the cluster sits behind a load balancer and each kube-apiserver’s advertise IP
address is set to the IP address of the load balancer, all of the
kube-apiservers will have the same advertise IP address.

## Proposal

We will use “hostname+PID+random suffix (e.g. 6 base58 digits)” as the ID.

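As a rough sketch (not the final implementation), the ID generation could look
like the following Go snippet; the `_` separator and the `apiserverID` helper
name are illustrative assumptions, since the KEP only specifies the three
components:

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
)

// Base58 alphabet (no 0, O, I, or l), matching the "6 base58 digits"
// example in the proposal.
const base58Chars = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// apiserverID is a hypothetical helper that builds a
// "hostname_PID_randomsuffix" identity; the separator is an assumption here.
func apiserverID() (string, error) {
	hostname, err := os.Hostname()
	if err != nil {
		return "", err
	}
	suffix := make([]byte, 6)
	for i := range suffix {
		suffix[i] = base58Chars[rand.Intn(len(base58Chars))]
	}
	return fmt.Sprintf("%s_%d_%s", hostname, os.Getpid(), suffix), nil
}

func main() {
	id, _ := apiserverID()
	fmt.Println(id) // e.g. "master-1_4242_x7Qp2R"
}
```
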
Similar to the [node heartbeat](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md),
a kube-apiserver will store its ID in a Lease object. All kube-apiserver Leases
will be stored in a special namespace “kube-apiserver-lease”. A controller will
garbage collect expired Leases.

### Caveats

In this proposal we focus on kube-apiservers. Aggregated apiservers don’t have
the same problem, because their records are already exposed via their services.
By listing the pods selected by the service, an aggregated server can learn the
list of living servers with distinct podIPs. A server can get its own ID via
the downward API.

We prefer that expired Leases remain for a longer duration rather than being
collected quickly, because a Lease that is falsely collected by accident can do
more damage than one that lingers. Take the storage version API scenario as an
example: if a kube-apiserver accidentally misses a heartbeat and gets its Lease
garbage collected, its StorageVersion can be falsely garbage collected as a
consequence. In that case, the storage migrator won’t be able to migrate the
storage unless this kube-apiserver gets restarted and re-registers its
StorageVersion. On the other hand, if a kube-apiserver is gone but its Lease
stays around for an hour or two, it will only delay the storage migration by
that period of time.

## Design Details

The [kubelet heartbeat](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md)
logic that is [already written](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/nodelease)
will be reused. The heartbeat controller will be added to kube-apiserver in a
post-start hook.

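For illustration, a minimal sketch of what the heartbeat loop could look like
if it were written against plain client-go (the real change would reuse the
shared kubelet lease controller and start it from a post-start hook; the
function name, lease naming, and elided error handling are simplified
assumptions):

```go
package identity

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

// runIdentityLease creates and then periodically renews this server's
// identity Lease in the special namespace proposed by this KEP.
func runIdentityLease(ctx context.Context, client kubernetes.Interface, id string) {
	leases := client.CoordinationV1().Leases("kube-apiserver-lease")
	ticker := time.NewTicker(10 * time.Second) // default refresh period
	defer ticker.Stop()
	for {
		now := metav1.NewMicroTime(time.Now())
		lease, err := leases.Get(ctx, id, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			// First heartbeat: create the Lease named after our ID.
			_, _ = leases.Create(ctx, &coordinationv1.Lease{
				ObjectMeta: metav1.ObjectMeta{Name: id},
				Spec: coordinationv1.LeaseSpec{
					HolderIdentity:       pointer.String(id),
					LeaseDurationSeconds: pointer.Int32(3600), // default 1h
					RenewTime:            &now,
				},
			}, metav1.CreateOptions{})
		case err == nil:
			// Subsequent heartbeats: bump renewTime.
			lease.Spec.RenewTime = &now
			_, _ = leases.Update(ctx, lease, metav1.UpdateOptions{})
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```
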
Each kube-apiserver will refresh its Lease every 10s by default. A GC controller
will watch the Lease API using an informer, and periodically resync its local
cache. On processing an item, the controller will delete the Lease if the last
`renewTime` was more than `leaseDurationSeconds` ago (which defaults to 1h). The
default `leaseDurationSeconds` is chosen to be much longer than the default
refresh period, to tolerate clock skew and/or accidental refresh failures. The
default resync period is 1h. By default, assuming negligible clock skew, a Lease
will be deleted if the kube-apiserver fails to refresh it for one to two hours.
The GC controller will run in kube-controller-manager, to leverage leader
election and reduce conflicts.

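A compact sketch of the GC controller's expiry check (the `isExpired` helper is
illustrative, not the actual implementation; the real controller would drive
something like it from its informer resync loop):

```go
package identity

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// isExpired reports whether an identity Lease has gone unrenewed for longer
// than its lease duration and should therefore be deleted by the GC
// controller in kube-controller-manager.
func isExpired(lease *coordinationv1.Lease, now time.Time) bool {
	if lease.Spec.RenewTime == nil {
		// Simplification: a real controller would fall back to the
		// creationTimestamp for never-renewed Leases.
		return false
	}
	duration := time.Hour // the default leaseDurationSeconds (1h)
	if lease.Spec.LeaseDurationSeconds != nil {
		duration = time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second
	}
	return now.After(lease.Spec.RenewTime.Add(duration))
}
```
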
The refresh rate and lease duration will be configurable through kube-apiserver
flags. The resync period will be configurable through a kube-controller-manager
flag.

### Test Plan

 - integration test for creating the Namespace and the Lease on kube-apiserver
   startup
 - integration test for not creating the StorageVersions after creating the
   Lease
 - integration test for garbage collecting a Lease that isn't refreshed
 - integration test for not garbage collecting a Lease that is refreshed

### Graduation Criteria

Alpha should provide basic functionality covered by the tests described above.

#### Alpha -> Beta Graduation

 - Appropriate metrics are agreed on and implemented
 - An e2e test plan is agreed on and implemented (e.g. chaosmonkey in a regional
   cluster)

#### Beta -> GA Graduation

 - Conformance tests are agreed on and implemented

**For non-optional features moving to GA, the graduation criteria must include
[conformance tests].**

[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md

### Version Skew Strategy

 - This feature is proposed for control-plane-internal use. Master-node skew is
   not considered.
 - During a rolling update, an HA cluster may have old and new masters. Old
   masters won't create or garbage collect Leases.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: APIServerIdentity
    - Components depending on the feature gate: kube-apiserver

* **Does enabling the feature change any default behavior?**
  A namespace "kube-apiserver-lease" will be used to store kube-apiserver
  identity Leases.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Yes. Existing Lease objects will become stale (`renewTime` won't get updated).

* **What happens if we reenable the feature if it was previously rolled back?**
  Stale Lease objects will be garbage collected.

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

### Dependencies

_This section must be completed when targeting beta graduation to a release._

### Scalability

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods): UPDATE leases
  - estimated throughput:
  - originating component(s) (e.g. Kubelet, Feature-X-controller):
    kube-apiserver

  focusing mostly on:
  - components listing and/or watching resources they didn't before:
    kube-controller-manager
  - periodic API calls to reconcile state (e.g. periodic fetching state,
    heartbeats, leader election, etc.): kube-apiserver heartbeat every 10s

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**
  Describe them, providing:
  - API type(s): leases
  - Estimated amount of new objects: one per living kube-apiserver

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**
  No.

[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

## Alternatives

### Alternative 1: new API + storage TTL

We define a new API for kube-apiserver identity. Similar to [Event](https://github.com/kubernetes/kubernetes/blob/9062c43b76c8562062e454a190a948f1370f8eb3/pkg/registry/core/rest/storage_core.go#L128),
we make the storage path for the new object type [tack on the TTL](https://github.com/kubernetes/kubernetes/blob/9062c43b76c8562062e454a190a948f1370f8eb3/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L1173).
Etcd will delete objects that don’t get their TTL refreshed in time.

 - Pros:
   - We don’t need to write a controller to garbage collect expired records, nor
     worry about client-server clock skew.
   - We can extend the API in the future to include more information (e.g.
     version, features, config)
 - Cons:
   - We need a new dedicated API

Note that the proposed solution doesn't prevent us from switching to a new API
in the future, similar to how node heartbeats switched from node status to
Leases.

### Alternative 2: using storage interface directly

The existing “kubernetes” Endpoints [mechanism](https://github.com/kubernetes/community/pull/939)
can be inherited to solve the kube-apiserver identity problem. There are two
parts to the mechanism:
 1. Each kube-apiserver periodically writes a lease of its ID (address) with a
    TTL to etcd through the storage interface. The lease object itself is an
    Endpoints. Leases will be deleted by etcd for servers that fail to refresh
    the TTL in time.
 2. A controller reads the leases through the storage interface, to collect the
    list of IP addresses. The controller updates the “kubernetes” Endpoints to
    match the IP address list.

We inherit the first part of the existing mechanism (the etcd TTL lease), but
change the key and value. The key will be the new ID. All the keys will be
stored under a special prefix “/apiserverleases/” (similar to the [existing mechanism](https://github.com/kubernetes/kubernetes/blob/14a11060a0775ed609f0810898ebdbe737c59441/pkg/master/master.go#L265)).
The value will be a Lease object. A kube-apiserver obtains the list of IDs by
directly listing/watching the leases through the storage interface.

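For reference, the etcd-level TTL mechanism this alternative builds on looks
roughly like the following with the etcd clientv3 API (the 15s TTL, the
function name, and passing the serialized Lease as a string are illustrative
assumptions):

```go
package identity

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// publishMasterLease writes this server's record under the special prefix,
// attached to an etcd lease; etcd removes the key automatically once the TTL
// stops being refreshed.
func publishMasterLease(ctx context.Context, cli *clientv3.Client, id, serializedLease string) error {
	grant, err := cli.Grant(ctx, 15) // TTL in seconds; 15s is an example value
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, "/apiserverleases/"+id, serializedLease,
		clientv3.WithLease(grant.ID)); err != nil {
		return err
	}
	// KeepAlive refreshes the TTL in the background while this server lives;
	// when the process dies, the key expires and the record disappears.
	_, err = cli.KeepAlive(ctx, grant.ID)
	return err
}
```
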
 - Cons:
   - We depend on a side-channel API, which is against Kubernetes philosophy
   - Clients like the kube-controller-manager cannot access the storage
     interface. For the storage version API, if we put the garbage collector in
     kube-apiserver instead of kube-controller-manager, the lack of leader
     election may cause update conflicts.

### Alternative 3: storage interface + Lease API

The kube-apiservers still write the master leases to etcd, but a controller will
watch the master leases and update an existing public API (e.g. store them in a
defined way in a Lease). Note that we cannot use the Endpoints API like the
“kubernetes” endpoints, because the Endpoints API is designed to store a list of
addresses, and our IDs are not IP addresses.

 - Cons:
   - We depend on a side-channel API, which is against Kubernetes philosophy

### Alternative 4: storage interface + new API

Similar to Alternative 1, the kube-apiservers write the master leases to etcd,
and a controller watches the master leases, but it updates a new public API
specifically designed to host information about the API servers, including
their IDs, enabled feature gates, etc.

 - Cons:
   - We depend on a side-channel API, which is against Kubernetes philosophy