Commit 37424a0

Merge pull request #4911 from g-gaston/harden-kubelet-cert-validation
KEP-4872: Harden Kubelet serving cert validation
2 parents d4b9b2e + 9084d35 commit 37424a0

3 files changed

+439
-0
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 4872
2+
alpha:
3+
approver: "@soltysh"
Lines changed: 392 additions & 0 deletions
@@ -0,0 +1,392 @@
1+
# KEP-4872: Harden Kubelet Serving Certificate Validation in Kube-API server
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Impact of node impersonation](#impact-of-node-impersonation)
8+
- [Goals](#goals)
9+
- [Non-Goals](#non-goals)
10+
- [Proposal](#proposal)
11+
- [User Stories (Optional)](#user-stories-optional)
12+
- [Story 1](#story-1)
13+
- [Story 2](#story-2)
14+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
15+
- [Risks and Mitigations](#risks-and-mitigations)
16+
- [Design Details](#design-details)
17+
- [Enabling the feature](#enabling-the-feature)
18+
- [Metrics](#metrics)
19+
- [TLS insecure](#tls-insecure)
20+
- [Test Plan](#test-plan)
21+
- [Prerequisite testing updates](#prerequisite-testing-updates)
22+
- [Unit tests](#unit-tests)
23+
- [Integration tests](#integration-tests)
24+
- [e2e tests](#e2e-tests)
25+
- [Graduation Criteria](#graduation-criteria)
26+
- [Alpha](#alpha)
27+
- [Beta](#beta)
28+
- [GA](#ga)
29+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
30+
- [Version Skew Strategy](#version-skew-strategy)
31+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
32+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
33+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
34+
- [Monitoring Requirements](#monitoring-requirements)
35+
- [Dependencies](#dependencies)
36+
- [Scalability](#scalability)
37+
- [Troubleshooting](#troubleshooting)
38+
- [Implementation History](#implementation-history)
39+
- [Drawbacks](#drawbacks)
40+
- [Alternatives](#alternatives)
41+
- [Infrastructure Needed](#infrastructure-needed)
42+
<!-- /toc -->
43+
44+
## Release Signoff Checklist
45+
46+
<!--
47+
**ACTION REQUIRED:** In order to merge code into a release, there must be an
48+
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
49+
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
50+
of the targeted release**.
51+
52+
For enhancements that make changes to code or processes/procedures in core
53+
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
54+
Signoff checklist to be completed.
55+
56+
Check these off as they are completed for the Release Team to track. These
57+
checklist items _must_ be updated for the enhancement to be released.
58+
-->
59+
60+
Items marked with (R) are required *prior to targeting to a milestone / release*.
61+
62+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
63+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
64+
- [x] (R) Design details are appropriately documented
65+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
66+
- [ ] e2e Tests for all Beta API Operations (endpoints)
67+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
68+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
69+
- [x] (R) Graduation criteria is in place
70+
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
71+
- [x] (R) Production readiness review completed
72+
- [x] (R) Production readiness review approved
73+
- [x] "Implementation History" section is up-to-date for milestone
74+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
75+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
76+
77+
[kubernetes.io]: https://kubernetes.io/
78+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
79+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
80+
[kubernetes/website]: https://git.k8s.io/website
81+
82+
## Summary
83+
84+
This proposal aims to enhance the security of the Kube API server by validating the Common Name (CN) of the kubelet's serving certificate to ensure it matches the expected node name.
85+
This validation prevents a compromised node that has obtained a certificate for an IP address it does not own from using it to impersonate another node.
86+
87+
## Motivation
88+
89+
In cloud environments, IPs can change rapidly due to the ephemeral nature of the infrastructure.
90+
If IPs or machines rotate faster than the expiration frequency of kubelet serving certificates, a certificate issued to an old node could be used to respond to requests aimed at a new node, provided they share an IP.
91+
92+
In addition, in on-premises environments, verifying that the IP addresses in a Certificate Signing Request (CSR) are owned by the requesting node can be challenging due to the lack of a reliable source of truth for IP ownership.
93+
Even when such a source exists, integration can be complex, leading to unsafe practices like auto-approval of CSRs without a strong guarantee of IP ownership.
94+
This vulnerability can be exploited through ARP poisoning or other routing attacks, allowing a rogue node to obtain a certificate for an IP it does not own and reroute traffic to itself.
95+
96+
When the Kube API server connects to a kubelet, it verifies that the serving certificate is signed by a trusted CA and that the IP or hostname it’s connecting to is included in the certificate's SANs.
97+
If a rogue node obtained a certificate for an IP it does not own and rerouted traffic to itself, it would be able to impersonate a Node that reports that IP.
98+
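To make the gap concrete, here is a minimal toy model of the current check, assuming a certificate can be reduced to three fields. `ServingCert`, `passes_current_check`, the node names, and the IP addresses are all hypothetical and exist only for illustration; this is not the kube-apiserver implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingCert:
    """Hypothetical stand-in for the certificate fields that matter here."""
    common_name: str
    ip_sans: frozenset
    signed_by_trusted_ca: bool


def passes_current_check(cert: ServingCert, dialed_ip: str) -> bool:
    """Model of today's validation: trusted CA plus a SAN match on the dialed IP."""
    return cert.signed_by_trusted_ca and dialed_ip in cert.ip_sans


# A certificate issued to node-a while it held 10.0.0.5 (illustrative values).
stale_cert = ServingCert(
    common_name="system:node:node-a",
    ip_sans=frozenset({"10.0.0.5"}),
    signed_by_trusted_ca=True,
)

# The API server now dials node-b, which reports the same IP. If the holder of
# stale_cert answers, the check passes because the node name is never consulted.
accepted = passes_current_check(stale_cert, dialed_ip="10.0.0.5")
```

In this model `accepted` is true even though the certificate names `node-a`, which is the impersonation window described above.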
99+
### Impact of node impersonation
100+
101+
If an actor who controls a node can impersonate another node, the impact would be:
102+
103+
* Break confidentiality of the requests sent by the Kube-API server to the kubelet (e.g., `kubectl exec`/`logs`). These are usually user-driven requests, so the threat actor can return incorrect or misleading output. In the exec case, the threat actor could prompt the user for credentials. In addition, exec commands might contain user secrets.
104+
* Break confidentiality of credentials if the client uses token-based authentication. This is probably more common for clients other than the Kube-API server, given that mTLS is common for Kube-API server to kubelet communication.
105+
106+
### Goals
107+
108+
* Ensure the Kube API server validates that the node’s serving certificate's CN matches the expected node name.
109+
* Prevent rogue nodes from using certificates issued for IPs they do not own.
110+
111+
### Non-Goals
112+
113+
* This proposal does not address certificate validation for clients other than the Kube API server, such as metrics scrapers. However, we'll consider an implementation in client-go that could be used by those other clients.
114+
115+
## Proposal
116+
117+
We propose modifying the Kube API server to validate that the Common Name (CN) of the kubelet's serving certificate is equal to `system:node:<nodename>`.
118+
`nodename` is the name of the Node object as reported by the kubelet. When the Kube-API server connects to the kubelet server (e.g. for logs, exec, port-forward), it always knows the Node it's connecting to.
119+
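The proposed comparison can be sketched as follows. This is a minimal illustration that assumes the check is an exact, case-sensitive string comparison; the function names are hypothetical and the real change would live in the Kube API server's TLS client code.

```python
NODE_CN_PREFIX = "system:node:"


def expected_common_name(node_name: str) -> str:
    """Build the CN the API server would expect for a given Node object name."""
    return NODE_CN_PREFIX + node_name


def cn_matches_node(cert_common_name: str, node_name: str) -> bool:
    """Return True only when the certificate CN is exactly system:node:<nodename>.

    Exact equality is an assumption of this sketch.
    """
    return cert_common_name == expected_common_name(node_name)
```

With this check, a certificate whose CN is `system:node:node-a` is rejected when the API server connects to `node-b`, regardless of which IPs appear in its SANs.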
120+
### User Stories (Optional)
121+
122+
#### Story 1
123+
124+
As a cluster administrator, I want to ensure that kubelet serving certificates are validated based on the node name, reducing the risk of IP-based impersonation attacks.
125+
126+
#### Story 2
127+
128+
As a cluster administrator using custom serving certificates for the kubelet server, I want to be able to disable the Subject's CN validation.
129+
130+
### Notes/Constraints/Caveats (Optional)
131+
132+
When the kubelet requests a certificate through a CSR, it sets the CN to `system:node:<nodename>`, enforced by the admission controller as per [PR \#126015](https://github.com/kubernetes/kubernetes/pull/126015).
133+
134+
However, certificates issued manually or through other mechanisms may not follow this convention.
135+
With the new validation, any certificate not following this `system:node:<nodename>` convention will be deemed invalid by the Kube API server.
136+
This will require cluster administrators to reissue any non-conforming certificates before enabling this feature.
137+
138+
### Risks and Mitigations
139+
140+
This could disrupt existing clusters that are using custom kubelet serving certificates.
141+
142+
In order to maintain compatibility by default with these clusters even after this feature goes GA, we will make it opt-in.
143+
144+
Before enabling this feature on clusters with custom kubelet serving certificates, cluster administrators will need to reissue those certificates.
145+
146+
## Design Details
147+
148+
### Enabling the feature
149+
150+
We will introduce a feature flag `KubeletCertCNValidation` that will gate the usage of the new validation.
151+
This gate will start disabled by default in Alpha, will be turned on by default in Beta and will be removed in GA.
152+
153+
In addition, the validation will be opt-in and enabled through a new command-line flag `--enable-kubelet-cert-cn-validation`.
154+
This flag can only be set if the `KubeletCertCNValidation` feature flag is enabled and if `--kubelet-certificate-authority` is set.
155+
156+
Making the feature opt-in maintains compatibility with existing clusters using custom kubelet serving certificates that don't follow the `system:node:<nodename>` convention even after the feature gate is removed.
157+
158+
#### Metrics
159+
160+
In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`. We will have two labels `success` and `failure`, allowing us to track the number of errors due to the new CN validation.
161+
In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates.
162+
163+
If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all.
164+
165+
If the feature gate is enabled and the kubelet CA is set (`--kubelet-certificate-authority`) but `--enable-kubelet-cert-cn-validation` is not set, we will still run the validation code to collect the metric. However, if the validation fails, we won't return an error; we will only increment the metric counter.
166+
167+
We intentionally don't add the node name to the metric to avoid a high cardinality.
168+
The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter for `failure` label is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs.
169+
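The three cases described in this section can be summarized in a short sketch. It assumes a plain in-memory counter in place of the real metric, and the names `Outcome` and `check_kubelet_cn` are hypothetical.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SKIPPED = "skipped"    # validation code did not run, nothing recorded
    ALLOWED = "allowed"    # connection proceeds
    REJECTED = "rejected"  # connection fails with a validation error


def check_kubelet_cn(gate_enabled, ca_configured, enforce, cn_ok, results):
    """Model when the CN check runs, what it records, and whether it rejects.

    `enforce` stands for --enable-kubelet-cert-cn-validation and `results`
    stands in for kube_apiserver_validation_kubelet_cert_cn_total.
    """
    if not gate_enabled or not ca_configured:
        return Outcome.SKIPPED
    # The check runs and is counted even when it is not enforced.
    results["success" if cn_ok else "failure"] += 1
    if enforce and not cn_ok:
        return Outcome.REJECTED
    return Outcome.ALLOWED
```

An administrator would read this as: run with the gate on and the flag off, and if the `failure` count stays at zero, enabling the flag should not reject any existing node.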
170+
### TLS insecure
171+
172+
Currently, if the Kube-API server is not configured with `--kubelet-certificate-authority`, the TLS client for the kubelet server skips the server certificate validation.
173+
Additionally, `logs` requests allow configuring `InsecureSkipTLSVerifyBackend` per request to skip the server certificate validation.
174+
175+
To align with this behavior, we won't allow enabling the validation if `--kubelet-certificate-authority` is not set and we won't execute the CN validation if `InsecureSkipTLSVerifyBackend` is set to true.
176+
177+
### Test Plan
178+
179+
[x] I/we understand the owners of the involved components may require updates to
180+
existing tests to make this code solid enough prior to committing the changes necessary
181+
to implement this enhancement.
182+
183+
##### Prerequisite testing updates
184+
185+
##### Unit tests
186+
187+
Unit tests will be added along with any new code introduced.
188+
189+
Existing test coverage for the packages we anticipate modifying:
190+
191+
- `k8s.io/kubernetes/pkg/kubelet/client`: `2024-10-07` - `28.2`
192+
- `k8s.io/client-go/transport`: `2024-10-07` - `59.4`
193+
194+
On top of testing the validation itself, we will test that:
195+
* An error is returned if `--enable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled.
196+
* An error is returned if the feature `KubeletCertCNValidation` is enabled, `--enable-kubelet-cert-cn-validation` is set to true but `--kubelet-certificate-authority` is not set.
197+
198+
##### Integration tests
199+
200+
Integration tests will be added to ensure the following:
201+
* Validation for custom certificates works if the feature flag is not enabled.
202+
* Validation for custom certificates works if the feature flag is enabled and `--enable-kubelet-cert-cn-validation` is not set or set to false.
203+
* Validation for custom certificates fails if the feature flag is enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true.
204+
* Validation for kubernetes issued certificates works if the feature flag is enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true.
205+
206+
##### e2e tests
207+
208+
We will update the alpha kind e2e tests job to exercise this flow to start with, and once the functionality is beta, we will update all kind e2e test jobs to run with this verification.
209+
210+
### Graduation Criteria
211+
212+
#### Alpha
213+
214+
* Add feature flag for gating usage, off by default
215+
* Add opt-in flag to enable the extra validation
216+
* Unit and integration tests
217+
218+
#### Beta
219+
* Address user reviews and iterate if needed
220+
* Feature flag on by default
221+
* Validation enabled for all kind e2e test jobs
222+
223+
#### GA
224+
* Successful adoption by at least one provider
225+
226+
### Upgrade / Downgrade Strategy
227+
228+
The feature is opt-in and it can be disabled at any time by just not setting the `--enable-kubelet-cert-cn-validation` flag.
229+
230+
### Version Skew Strategy
231+
232+
Not applicable.
233+
234+
## Production Readiness Review Questionnaire
235+
236+
### Feature Enablement and Rollback
237+
238+
###### How can this feature be enabled / disabled in a live cluster?
239+
240+
- [x] Feature gate
241+
- Feature gate name: `KubeletCertCNValidation`
242+
- Components depending on the feature gate: kube-apiserver
243+
- [x] Other
244+
- Describe the mechanism: kube-apiserver command-line flag `--enable-kubelet-cert-cn-validation`
245+
- Will enabling / disabling the feature require downtime of the control
246+
plane? No, but it requires restarting the kube-apiserver.
247+
- Will enabling / disabling the feature require downtime or reprovisioning
248+
of a node? No.
249+
250+
###### Does enabling the feature change any default behavior?
251+
252+
Enabling the feature gate doesn't change any behavior.
253+
254+
Enabling the validation does change the default certificate validation behavior.
255+
256+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
257+
258+
Yes, the feature can be disabled once enabled by not setting the command-line flag.
259+
260+
###### What happens if we reenable the feature if it was previously rolled back?
261+
262+
You just get back the new behavior with the extra cert validation, no extra considerations needed.
263+
264+
###### Are there any tests for feature enablement/disablement?
265+
266+
We will add integration tests to validate the enablement/disablement flow. The test cases are specified in the Test Plan section.
267+
268+
### Rollout, Upgrade and Rollback Planning
269+
270+
###### How can a rollout or rollback fail? Can it impact already running workloads?
271+
272+
A rollout can fail if the feature flag is not enabled but the command-line flag is set.
273+
274+
Already running workloads won't be impacted but cluster users won't be able to access the control plane if the cluster is single-node.
275+
276+
###### What specific metrics should inform a rollback?
277+
278+
`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label will require investigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until kubelet serving certificates are reissued.
279+
280+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
281+
282+
No. There is no data stored for this feature which persists between upgrade / downgrade, or between enable / disable.
283+
The feature is purely an API server configuration option.
284+
285+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
286+
287+
No.
288+
289+
### Monitoring Requirements
290+
291+
###### How can an operator determine if the feature is in use by workloads?
292+
293+
The cluster administrators can check the flags passed to the kube-apiserver if they have access to the control plane nodes.
294+
If the `--enable-kubelet-cert-cn-validation` flag is set to true, the feature is being used.
295+
Alternatively, they can check the `kubernetes_feature_enabled` metric.
296+
297+
###### How can someone using this feature know that it is working for their instance?
298+
299+
- [x] Other
300+
- Details: when the feature is enabled, the metric `kube_apiserver_validation_kubelet_cert_cn_total` will increase for the `success` label.
301+
302+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
303+
304+
The average `apiserver_request_duration_seconds` for logs/exec/port-forward requests is within reasonable limits.
305+
A rising value after enabling this feature could signal overhead introduced by the extra validation.
306+
307+
In addition, the number of TLS connections made from API server to nodes should not increase.
308+
309+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
310+
311+
- [x] Metrics
312+
- Metric name: `kube_apiserver_validation_kubelet_cert_cn_total`
313+
- Components exposing the metric: kube-apiserver
314+
- If the feature is enabled, and the metric increases for the `failure` label, it signals a problem.
315+
- If the service is healthy, the metric should increase for the `success` label.
316+
317+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
318+
319+
We could add a metric to track the time spent per request on the CN validation.
320+
321+
However, we do not consider this metric valuable enough to justify the work to maintain it.
322+
323+
### Dependencies
324+
325+
###### Does this feature depend on any specific services running in the cluster?
326+
327+
No.
328+
329+
### Scalability
330+
331+
###### Will enabling / using this feature result in any new API calls?
332+
333+
No.
334+
335+
###### Will enabling / using this feature result in introducing new API types?
336+
337+
No.
338+
339+
###### Will enabling / using this feature result in any new calls to the cloud provider?
340+
341+
No.
342+
343+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
344+
345+
No.
346+
347+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
348+
349+
No. This only affects streaming APIs and these are not covered by SLIs/SLOs.
350+
351+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
352+
353+
No.
354+
355+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
356+
357+
No.
358+
359+
### Troubleshooting
360+
361+
###### How does this feature react if the API server and/or etcd is unavailable?
362+
363+
It's part of the API server, so the feature will be unavailable.
364+
365+
###### What are other known failure modes?
366+
367+
- API server can't connect to Nodes with custom kubelet serving certificates that don't follow the `system:node:<nodename>` convention
368+
- Detection: `kubectl logs` returns a certificate validation error.
369+
- Mitigations: disable the validation by not setting `--enable-kubelet-cert-cn-validation` flag.
370+
- Diagnostics: error is returned by the API server, no additional logging needed.
371+
- Testing: We will have tests for this, since it amounts to testing that the feature works.
372+
373+
###### What steps should be taken if SLOs are not being met to determine the problem?
374+
375+
## Implementation History
376+
377+
* 2024-10-08: KEP created
378+
* 2025-05-08: Implementation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY
379+
380+
381+
## Drawbacks
382+
383+
None.
384+
385+
## Alternatives
386+
387+
None.
388+
389+
## Infrastructure Needed
390+
391+
None.
392+
****
