keps/sig-auth/4872-harden-kubelet-cert-validation/README.md
Lines changed: 9 additions & 10 deletions
@@ -157,15 +157,15 @@ Making the feature opt-in maintains compatibility with existing clusters using c
#### Metrics
-
In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation.
+
In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`. It will have two labels, `success` and `failure`, which make it possible to track the number of errors due to the new CN validation.
In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates.
If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all.
If the feature gate is enabled, the kubelet CA is set (`--kubelet-certificate-authority`) but this feature is disabled, we will still run the validation code to collect the metric. However, if the validation fails we won't return an error; we will just increment the metric counter.
We intentionally don't add the node name to the metric, to avoid high cardinality.
-
The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs.
+
The purpose of the metric is to tell administrators easily and cheaply whether they can flip the feature on. If the answer is no (the counter for the `failure` label is greater than 0), the rest of the information needed to find the offending nodes will come from logs.
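The gating behavior described in this section can be sketched as follows. This is a minimal illustration in Python, not the API server's implementation: the in-memory `Counter`, the function names, and the three boolean inputs (`gate_enabled`, `ca_configured`, `enforce`) are hypothetical, and it assumes that successful validations are counted in collection-only mode as well. The expected CN form `system:node:<node-name>` is the one named in this KEP.

```python
from collections import Counter

# Hypothetical stand-in for kube_apiserver_validation_kubelet_cert_cn_total,
# keyed by the `success` / `failure` label described in this section.
cn_validation_total = Counter()


def cn_matches(node_name: str, cert_cn: str) -> bool:
    """Return True if the serving certificate CN is system:node:<node-name>."""
    return cert_cn == f"system:node:{node_name}"


def check_kubelet_cert(node_name, cert_cn, *, gate_enabled, ca_configured, enforce):
    """Return an error string if the connection should be rejected, else None."""
    # Gate off or no kubelet CA: no validation runs and nothing is counted.
    if not gate_enabled or not ca_configured:
        return None

    if cn_matches(node_name, cert_cn):
        cn_validation_total["success"] += 1
        return None

    # Mismatch: always counted, but only rejected when enforcement is on.
    cn_validation_total["failure"] += 1
    if enforce:
        return f"invalid kubelet serving certificate CN for node {node_name!r}"
    return None
```

Keeping the node name out of the counter key mirrors the cardinality decision above; in this sketch the node name only appears in the returned error, which is where a log line would pick it up.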
### TLS insecure
@@ -272,7 +272,7 @@ Already running workloads won't be impacted but cluster users won't be able to a
###### What specific metrics should inform a rollback?
-
Not applicable.
+
`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label requires investigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until the kubelet serving certificates are reissued.
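As an illustration of how an operator might read this signal from a metrics scrape, here is a minimal Python sketch that sums the counter per label value from Prometheus text-format output. The KEP names only the `success` and `failure` values, so the label key `result` used here is an assumption, as are the function names.

```python
import re

METRIC = "kube_apiserver_validation_kubelet_cert_cn_total"
# Matches sample lines such as: metric_name{key="value"} 3
SAMPLE = re.compile(r"^" + re.escape(METRIC) + r"\{([^}]*)\}\s+(\S+)")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def cn_counts(exposition_text: str, label_key: str = "result") -> dict:
    """Sum samples of the metric per label value from Prometheus text format.

    Assumption: the label key is `result`; the KEP does not name it.
    """
    counts = {"success": 0.0, "failure": 0.0}
    for line in exposition_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comments (# HELP / # TYPE) and other metrics
        labels = dict(LABEL.findall(match.group(1)))
        value = labels.get(label_key)
        if value in counts:
            counts[value] += float(match.group(2))
    return counts


def needs_investigation(exposition_text: str) -> bool:
    """True when any CN validation failure has been recorded."""
    return cn_counts(exposition_text)["failure"] > 0
```

The input text could come from the API server's `/metrics` endpoint, for example via `kubectl get --raw /metrics`. A `True` result only says that failures exist; identifying the affected nodes still relies on the logs.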
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
@@ -294,7 +294,7 @@ Alternatively the can check the `kubernetes_feature_enabled` metric.
###### How can someone using this feature know that it is working for their instance?
- [x] Other
-
- Details: users can create a Node with a kubelet serving certificate that doesn't meet the CN requirements enforced by this validation (something different than `system:node:<node-name>`). Then run `kubectl logs` for any pod running in that node. If it returns an error for an invalid certificate, the feature is working.
+
- Details: when the feature is enabled, the metric `kube_apiserver_validation_kubelet_cert_cn_total` will increase for the `success` label.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -306,17 +306,16 @@ A raising value after enabling this feature could signal overhead introduced by
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
> TODO: should `kube_apiserver_pod_logs_backend_tls_failure_total` reflect errors due to the new CN validation?
-
> It's technically a TLS failure, but it's not part of the base TLS client validations.
+
- If the feature is enabled, and the metric increases for the `failure` label, it signals a problem.
+
- If the service is healthy, the metric should increase for the `success` label.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
-
We could add a metric specific to track the number of requests that failed due to the new CN validation. In addition, we could track the time spent per request on the CN validation.
+
We could add a metric to track the time spent per request on the CN validation.
-
However, we consider these metrics to not provide enough value to justify the work to maintain them.
+
However, we don't consider this metric valuable enough to justify the work of maintaining it.