keps/sig-node/5394-psi-node-conditions/README.md
25 additions & 7 deletions
@@ -284,7 +284,15 @@ to a different kubelet version.
### Version Skew Strategy

-N/A
+kubelet is responsible for setting the node conditions to True or False, while kube-scheduler is responsible for applying / removing taints on the nodes based on those conditions.
+
+Rolling back either or both components could result in stale node conditions and/or taints. A stale node condition can be misleading, while a stale taint can be even worse, since it may affect workload scheduling.
+
+We need to ensure that the node conditions and taints get cleaned up when
+1. the feature gate is disabled in either component, or
+2. the version is rolled back in either component,
+
+in which case the feature can be safely disabled / rolled back.

## Production Readiness Review Questionnaire
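For illustration only (not part of the diff), a minimal Go sketch of the two objects involved in this skew: a PSI-based node condition as kubelet would report it, and a taint kube-scheduler might apply in response. The condition type is one of the names proposed later in this diff; the taint key and reason string are assumptions, not names defined by the KEP.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	now := metav1.NewTime(time.Now())

	// kubelet's side: a PSI-based condition surfaced in node.status.conditions.
	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("SystemMemoryContentionPressure"),
		Status:             corev1.ConditionTrue,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
		Reason:             "MemoryPSIAboveThreshold", // hypothetical reason string
		Message:            "system-level memory PSI exceeded the configured threshold",
	}

	// kube-scheduler's side: a taint applied in response to the condition.
	// The taint key below is an assumption for illustration only.
	taint := corev1.Taint{
		Key:       "node.kubernetes.io/system-memory-contention-pressure",
		Effect:    corev1.TaintEffectNoSchedule,
		TimeAdded: &now,
	}

	fmt.Printf("condition: %+v\ntaint: %+v\n", cond, taint)
}
```

If either component is rolled back, one of these two objects can be left behind on the node, which is exactly the staleness the new text warns about.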
@@ -438,7 +446,12 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
-TBD
+
+One can check whether the proposed node conditions are set (to True or False) on all nodes in the cluster.
+
+If the cluster admin has set up metrics that monitor node conditions in the cluster, those metrics can be used to tell whether the feature is in use. Note that such metrics could show no usage even when the feature is enabled and in use, since a node condition may not be set to True if there is no node-level resource pressure.
+
+One can also check whether kubelet is surfacing PSI metrics in `/metrics/cadvisor`. Surfacing PSI metrics is a prerequisite for the PSI-based node condition feature.

###### How can someone using this feature know that it is working for their instance?
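As a rough sketch of the first check above (an assumption about tooling, not something the KEP prescribes), a client-go program could list the nodes and report whether the proposed conditions are present; the condition names are the ones listed in the `.status` section below, and kubeconfig-based access is assumed.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// The PSI-based condition types proposed by this KEP.
var psiConditions = []corev1.NodeConditionType{
	"SystemMemoryContentionPressure",
	"SystemDiskContentionPressure",
	"KubepodsMemoryContentionPressure",
	"KubepodsDiskContentionPressure",
}

func main() {
	// Build a client from the default kubeconfig location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		// Index the conditions actually reported by this node.
		present := map[corev1.NodeConditionType]corev1.ConditionStatus{}
		for _, c := range node.Status.Conditions {
			present[c.Type] = c.Status
		}
		for _, t := range psiConditions {
			if status, ok := present[t]; ok {
				fmt.Printf("%s: %s=%s\n", node.Name, t, status)
			} else {
				fmt.Printf("%s: %s not set\n", node.Name, t)
			}
		}
	}
}
```

A node where none of these conditions appear likely has the feature gate disabled or is running an older kubelet.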
@@ -451,13 +464,16 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
+- [x] Events
+  - Event Reason: kubelet should record an event when setting the node condition to True / False and record the reason (based on PSI data).
+- [x] API .status
  - Condition name:
+    - SystemMemoryContentionPressure
+    - SystemDiskContentionPressure
+    - KubepodsMemoryContentionPressure
+    - KubepodsDiskContentionPressure
  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+    - Check the actual PSI data for the nodes in the kubelet metrics

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -489,6 +505,8 @@ Pick one more of these and delete the rest.
- [ ] Other (treat as last resort)
  - Details:

+TBD. Since kubelet checks the PSI data and sets the node conditions accordingly, we should have metrics in place to monitor that kubelet is indeed performing this task and refreshing the node conditions. This can help us detect issues where a bug causes kubelet to get stuck and the node conditions to become stale.
+

###### Are there any missing metrics that would be useful to have to improve observability of this feature?
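Until such kubelet metrics exist, one possible approximation of the check sketched in the TBD above (an assumption, not part of the KEP) is to watch the `LastHeartbeatTime` of the PSI-related conditions, on the premise that kubelet refreshes it each time it reconciles the condition:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// staleConditions returns the PSI-related conditions that kubelet has not
// refreshed within maxAge, which may indicate kubelet is stuck.
func staleConditions(node *corev1.Node, maxAge time.Duration) []corev1.NodeCondition {
	var stale []corev1.NodeCondition
	for _, c := range node.Status.Conditions {
		switch c.Type {
		case "SystemMemoryContentionPressure", "SystemDiskContentionPressure",
			"KubepodsMemoryContentionPressure", "KubepodsDiskContentionPressure":
			if time.Since(c.LastHeartbeatTime.Time) > maxAge {
				stale = append(stale, c)
			}
		}
	}
	return stale
}

func main() {
	// In practice the Node object would come from the API server (see the
	// listing sketch earlier in this diff); an empty Node is used here only
	// so the example compiles on its own.
	node := &corev1.Node{}
	for _, c := range staleConditions(node, 5*time.Minute) {
		fmt.Printf("stale condition %s (last heartbeat %s)\n", c.Type, c.LastHeartbeatTime.Time)
	}
}
```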