
Commit 28250e9

Update monitoring and version skew details
1 parent 15c3c95 commit 28250e9

File tree: keps/sig-node/5394-psi-node-conditions

1 file changed: +25 -7 lines
keps/sig-node/5394-psi-node-conditions/README.md

Lines changed: 25 additions & 7 deletions
@@ -284,7 +284,15 @@ to a different kubelet version.
 
 ### Version Skew Strategy
 
-N/A
+The kubelet is responsible for setting the node conditions to True or False, while the kube-scheduler is responsible for applying and removing taints on nodes based on those conditions.
+
+Rolling back either component (or both) could leave stale node conditions and/or taints behind. A stale node condition is misleading; a stale taint is worse, since it can affect workload scheduling.
+
+We need to ensure that the node conditions and taints get cleaned up when
+1. the feature gate is disabled in either component, or
+2. the version of either component is rolled back,
+
+in which case the feature can be safely disabled / rolled back.
 
 ## Production Readiness Review Questionnaire
 
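To make the cleanup requirement in this hunk concrete, here is a minimal client-go sketch of stripping the PSI-based conditions from a node. The KEP does not prescribe this mechanism; the package, the `cleanupNodeConditions` helper, and the overall flow are illustrative assumptions, while the condition names are the ones proposed later in this diff.

```go
// Sketch only: one way the PSI node conditions could be stripped from a node
// when the feature gate is disabled or a component is rolled back, so that
// they do not go stale. Taints would be filtered out of node.Spec.Taints with
// the same pattern.
package psicleanup

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// psiConditionTypes are the node conditions proposed in this KEP.
var psiConditionTypes = map[v1.NodeConditionType]bool{
	"SystemMemoryContentionPressure":   true,
	"SystemDiskContentionPressure":     true,
	"KubepodsMemoryContentionPressure": true,
	"KubepodsDiskContentionPressure":   true,
}

// cleanupNodeConditions drops the PSI conditions from the named node's status.
func cleanupNodeConditions(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Filter the condition list in place, keeping everything that is not
	// one of the PSI-based conditions.
	kept := node.Status.Conditions[:0]
	for _, c := range node.Status.Conditions {
		if !psiConditionTypes[c.Type] {
			kept = append(kept, c)
		}
	}
	node.Status.Conditions = kept
	_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```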

@@ -438,7 +446,12 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
-TBD
+
+One can check whether the proposed node conditions are set (to True or False) on all nodes in the cluster.
+
+If the cluster admin has set up metrics that monitor node conditions in the cluster, those metrics can be used to tell whether the feature is in use. Note that such metrics could show no usage even when the feature is enabled and in use, since a node condition may not be set to True if there is no node-level resource pressure.
+
+One can also check whether the kubelet is surfacing PSI metrics at `/metrics/cadvisor`; surfacing PSI metrics is a prerequisite for the PSI-based node condition feature.
 
 ###### How can someone using this feature know that it is working for their instance?
 
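As a purely illustrative version of the check described in this hunk, an operator could list every node with client-go and print whichever of the proposed conditions are present. The program below is a sketch (it assumes an out-of-cluster kubeconfig at the default location), not tooling shipped with the feature.

```go
// Sketch: report which of the proposed PSI node conditions are set per node.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes an out-of-cluster kubeconfig at ~/.kube/config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := cs.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Conditions proposed by this KEP; absence on every node suggests the
	// feature gate is disabled or the kubelet predates the feature.
	psi := []string{
		"SystemMemoryContentionPressure",
		"SystemDiskContentionPressure",
		"KubepodsMemoryContentionPressure",
		"KubepodsDiskContentionPressure",
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			for _, want := range psi {
				if string(c.Type) == want {
					fmt.Printf("%s\t%s=%s\n", n.Name, c.Type, c.Status)
				}
			}
		}
	}
}
```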

@@ -451,13 +464,16 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 -->
 
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
+- [x] Events
+  - Event Reason: the kubelet should record an event when it sets a node condition to True or False, including the reason (based on PSI data).
+- [x] API .status
   - Condition name:
+    - SystemMemoryContentionPressure
+    - SystemDiskContentionPressure
+    - KubepodsMemoryContentionPressure
+    - KubepodsDiskContentionPressure
   - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+    - Check the actual PSI data for the nodes in the kubelet metrics
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
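The event and `.status` signals enumerated in this hunk could be produced roughly as sketched below. This is an assumed shape rather than the kubelet's actual code path: the `setPSICondition` helper and the reason/message strings are hypothetical, while the condition types are those listed above.

```go
// Illustrative sketch of updating a PSI-based node condition and emitting a
// matching event; not the kubelet's real implementation.
package kubeletpsi

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

// setPSICondition updates (or appends) the given condition on the node status
// and records an event whose reason would be derived from the PSI data.
func setPSICondition(node *v1.Node, recorder record.EventRecorder,
	condType v1.NodeConditionType, status v1.ConditionStatus, reason, message string) {

	now := metav1.NewTime(time.Now())
	cond := v1.NodeCondition{
		Type:               condType,
		Status:             status,
		Reason:             reason,
		Message:            message,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
	}

	updated := false
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == condType {
			// Keep the original transition time if the status did not change.
			if node.Status.Conditions[i].Status == status {
				cond.LastTransitionTime = node.Status.Conditions[i].LastTransitionTime
			}
			node.Status.Conditions[i] = cond
			updated = true
			break
		}
	}
	if !updated {
		node.Status.Conditions = append(node.Status.Conditions, cond)
	}

	// Record an event on every update so end users who cannot read kubelet
	// metrics can still see why the condition was set.
	eventType := v1.EventTypeNormal
	if status == v1.ConditionTrue {
		eventType = v1.EventTypeWarning
	}
	recorder.Eventf(node, eventType, reason, "%s is now %s: %s", condType, status, message)
}
```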

@@ -489,6 +505,8 @@ Pick one more of these and delete the rest.
 - [ ] Other (treat as last resort)
   - Details:
 
+TBD. Since the kubelet checks the PSI data and sets the node conditions accordingly, we should have metrics in place to monitor that the kubelet is indeed performing this task and refreshing the node conditions. This can help us detect issues where a bug causes the kubelet to get stuck and the node conditions to become stale.
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
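One possible shape for the staleness-monitoring metric proposed in this hunk, sketched with plain client_golang rather than the kubelet's actual metrics framework; the metric name and the `markSynced` helper are hypothetical and not defined by the KEP.

```go
// Sketch: a per-condition timestamp gauge the kubelet could bump each time it
// evaluates PSI data and refreshes a node condition.
package kubeletpsi

import "github.com/prometheus/client_golang/prometheus"

var psiConditionLastSync = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		// Hypothetical metric name, for illustration only.
		Name: "kubelet_psi_node_condition_last_sync_timestamp_seconds",
		Help: "Unix time of the last PSI evaluation that refreshed this node condition.",
	},
	[]string{"condition"},
)

func init() {
	prometheus.MustRegister(psiConditionLastSync)
}

// markSynced records that the named condition was just refreshed.
func markSynced(condition string) {
	psiConditionLastSync.WithLabelValues(condition).SetToCurrentTime()
}
```

Alerting on `time()` minus this gauge exceeding the expected refresh interval would then catch a kubelet that has stopped refreshing the conditions.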
