Skip to content

Commit f268788

Browse files
committed
Improve metrics
1 parent ed8f3e2 commit f268788

File tree

1 file changed

+9
-10
lines changed
  • keps/sig-auth/4872-harden-kubelet-cert-validation

1 file changed

+9
-10
lines changed

keps/sig-auth/4872-harden-kubelet-cert-validation/README.md

Lines changed: 9 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -157,15 +157,15 @@ Making the feature opt-in maintains compatibility with existing clusters using c
157157

158158
#### Metrics
159159

160-
In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation.
160+
In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`. We will have two labels `success` and `failure`, allowing to track the number of errors due to the new CN validation.
161161
In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates.
162162

163163
If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all.
164164

165165
If the feature gate is enabled, the kubelet CA is set (`--kubelet-certificate-authority`) but this feature is disabled, we will still run the validation code to collect the metric. However, if the validation fails we won't return an error, we will just increment the metric counter.
166166

167167
We intentionally don't add the node name to the metric to avoid a high cardinality.
168-
The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs.
168+
The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter for `failure` label is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs.
169169

170170
### TLS insecure
171171

@@ -272,7 +272,7 @@ Already running workloads won't be impacted but cluster users won't be able to a
272272

273273
###### What specific metrics should inform a rollback?
274274

275-
Not applicable.
275+
`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label will require invetsigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until kuebeler serving certificates are reissued.
276276

277277
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
278278

@@ -294,7 +294,7 @@ Alternatively the can check the `kubernetes_feature_enabled` metric.
294294
###### How can someone using this feature know that it is working for their instance?
295295

296296
- [x] Other
297-
- Details: users can create a Node with a kubelet serving certificate that doesn't meet the CN requirements enforced by this validation (something different than `system:node:<node-name>`).Then run `kubectl logs` for any pod running in that node. If it returns an error for an invalid certificate, the feature is working.
297+
- Details: when the feature is enabled, the metric `kube_apiserver_validation_kubelet_cert_cn_total` will increase for the `success` label.
298298

299299
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
300300

@@ -306,17 +306,16 @@ A raising value after enabling this feature could signal overhead introduced by
306306
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
307307

308308
- [x] Metrics
309-
- Metric name: `kube_apiserver_pod_logs_backend_tls_failure_total`
309+
- Metric name: `kube_apiserver_validation_kubelet_cert_cn_total`
310310
- Components exposing the metric: kube-apiserver
311-
312-
> TODO: should `kube_apiserver_pod_logs_backend_tls_failure_total` reflect errors due to the new CN validation?
313-
> It's technically a TLS failure, but it's not part of the base TLS client validations.
311+
- If the feature is enabled, and the metric increases for the `failure` label, it signals a problem.
312+
- If the service is healthy, the metric should increase.
314313

315314
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
316315

317-
We could add a metric specific to track the number of requests that failed due to the new CN validation. In addition, we could track the time spent per request on the CN validation.
316+
We could add a metric to track the time spent per request on the CN validation.
318317

319-
However, we consider these metrics to not provide enough value to justify the work to maintain them.
318+
However, we consider this metric to not provide enough value to justify the work to maintain it.
320319

321320
### Dependencies
322321

0 commit comments

Comments
 (0)