keps/sig-apps/4017-pod-index-label/README.md (42 additions & 14 deletions)
@@ -286,8 +286,7 @@ Consider including folks who also work outside the SIG or subproject.
 One thing that must be considered is how enabling this new feature will interact with existing workloads. There are a couple of options:
 
 1. Only inject the label on *newly created pods*, so an existing StatefulSet/Indexed Job may include some pods with the label and some without it.
-   This means for the user to utilize the label via the downward API, or to use the label for pod selection, they will need to recreate
-   the StatefulSet so the label is present on all pods.
+   This means that, to use the label via the downward API or for pod selection, the user will need to recreate the StatefulSet so the label is present on all pods.
 
 2. Inject the label only on pods for *newly created StatefulSets/Indexed Jobs*. We can track this by annotating newly created StatefulSets/Indexed Jobs
    to distinguish existing ones from newly created ones. Using this strategy, for a given StatefulSet/Indexed Job, either none of the pods have this label, or all
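For illustration, consuming the index via the downward API could look like the sketch below. This assumes the label key is `apps.kubernetes.io/pod-index` (the key is defined by this KEP; treat the name here as illustrative):

```yaml
# Sketch: expose the pod's index to the container as an environment
# variable via the downward API (label key assumed, see lead-in).
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      env:
        - name: POD_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
```

Once every pod carries the label, selecting a specific index is a plain label query, e.g. `kubectl get pods -l 'apps.kubernetes.io/pod-index=0'`.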
@@ -406,8 +405,10 @@ We will release the feature directly in Beta state since there is no benefit in
 existing label which other things may depend on, for example).
 
 #### Beta
-Feature implemented behind the `PodIndexLabel` feature gate.
-Unit and integration tests passing.
+- Feature implemented behind the `PodIndexLabel` feature gate.
+- Unit and integration tests passing.
+- Docs are clear that the label is managed by the workload controller(s) and is NOT guaranteed to be present on every pod.
+- Docs are clear about what happens if two pods get the same value (the label is set by the workload controllers; nothing in the API machinery prevents collisions).
 
 #### GA
 Fix any potentially reported bugs.
@@ -444,7 +445,9 @@ enhancement:
 cluster required to make on upgrade, in order to make use of the enhancement?
 -->
 
-No changes required to existing cluster to use this feature.
+After a user upgrades their cluster to a version that supports this feature (and has the feature gate
+enabled), they will need to redeploy their StatefulSets / Indexed Jobs so that all pods have the pod index label,
+since after the upgrade only newly created pods will have the label added.
 
 ### Version Skew Strategy
@@ -512,6 +515,8 @@ well as the [existing list] of feature gates.
 - [X] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: PodIndexLabel
   - Components depending on the feature gate:
+    - StatefulSet controller
+    - Job controller
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
@@ -525,7 +530,7 @@ well as the [existing list] of feature gates.
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
-No.
+Yes - once the controllers start setting the new label, any automation that does a deep-equal comparison of pod labels will start failing.
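A minimal sketch of the failure mode described above (label keys and values are illustrative; the index label key is assumed here):

```python
# An automation that compares a pod's full label set with deep equality
# breaks once the controller injects the new pod-index label.
expected_labels = {"app": "web", "statefulset.kubernetes.io/pod-name": "web-0"}

# After the feature gate is enabled, the controller adds the index label
# (key assumed for illustration: apps.kubernetes.io/pod-index).
observed_labels = {**expected_labels, "apps.kubernetes.io/pod-index": "0"}

# The deep-equal comparison now reports a difference:
assert expected_labels != observed_labels

# A more robust automation checks only the keys it owns:
assert all(observed_labels.get(k) == v for k, v in expected_labels.items())
```

The takeaway is that automations should compare the subset of labels they manage rather than the whole label map.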
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -591,6 +596,18 @@ that might indicate a serious problem?
 -->
 - Users can monitor queue related metrics (e.g., queue depth and work duration) to make sure they aren't growing.
 - For Indexed Jobs, users can also monitor `job_sync_duration_seconds`.
+- For StatefulSets: the `kube_statefulset_status_replicas` metric can be monitored against the
+  `kube_statefulset_replicas` metric to compare the expected number of replicas with
+  the actual number of pods matched by the StatefulSet's selector. A divergence
+  between these two values during steady-state operation can indicate that the
+  number of replicas being created by the StatefulSet does not match the expected
+  number of replicas.
+
+  At a large scale (across a large number of StatefulSets), the distribution of the
+  ratio of these two metrics should not change when enabling this feature. If the
+  ratio changes significantly after enabling this feature, it could indicate a problem
+  and that a rollback is necessary.
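The divergence check described above can be sketched in a few lines. The metric names come from kube-state-metrics; the 10% tolerance is an illustrative default that an operator would tune per cluster:

```python
def statefulset_replica_ratio(status_replicas: int, spec_replicas: int) -> float:
    """Ratio of observed to desired replicas; 1.0 means fully converged.

    Inputs correspond to the kube-state-metrics series
    kube_statefulset_status_replicas and kube_statefulset_replicas.
    """
    if spec_replicas == 0:
        return 1.0  # nothing desired, so nothing can be missing
    return status_replicas / spec_replicas

def diverged(status_replicas: int, spec_replicas: int, tolerance: float = 0.1) -> bool:
    """Flag a StatefulSet whose ratio drifts more than `tolerance` from 1.0."""
    return abs(1.0 - statefulset_replica_ratio(status_replicas, spec_replicas)) > tolerance

# 9 of 10 desired replicas exist: within a 10% tolerance, not yet alarming.
print(diverged(9, 10))
# 5 of 10 desired replicas exist: well outside tolerance, worth investigating.
print(diverged(5, 10))
```

In practice this comparison would be expressed as a recording rule or dashboard query over the two metric series rather than evaluated per pod in code.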
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
@@ -664,21 +681,32 @@ high level (needs more precise definitions) those may be things like:
 
 These goals will help you determine what you need to measure (SLIs) in the next question.
 -->
-- 99% percentile over day for Job syncs is <= 15s for a client-side 50 QPS
+- Jobs: 99th percentile over a day for Job syncs is <= 15s for a client-side 50 QPS
   limit.
+- StatefulSets: the ratio `kube_statefulset_status_replicas` / `kube_statefulset_replicas` should be near 1.0. Since unhealthy replicas are often an application error rather than a problem with the StatefulSet controller, the acceptable ratio will need to be tuned by an operator on a per-cluster basis.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?