keps/sig-scheduling/5598-opportunistic-batching/README.md
Lines changed: 15 additions & 10 deletions
@@ -684,10 +684,15 @@ What signals should users be paying attention to when the feature is young
684
684
that might indicate a serious problem?
685
685
-->
686
686
687
-
- Pods that fail to schedule.
688
-
- Pods that have had a node nominated, but then found that node infeasible.
689
-
- Pods that cannot be batched.
690
-
- High pod scheduling time.
687
+
Existing metrics:
688
+
- `pod_scheduling_sli_duration_seconds`
689
+
- `schedule_attempts_total` - specifically unschedulable and error cases
690
+
- `pending_pods`
691
+
- `unschedulable_pods`
692
+
693
+
New metrics:
694
+
- Pods that cannot be batched.
695
+
- Pod batch failure reasons
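
As a rough illustration of what the two new metrics would record, here is a minimal Go sketch of an in-memory tally of batched pods and of unbatchable pods keyed by failure reason. Everything in it is an assumption made for illustration: the `BatchStats` type, its methods, and the reason strings are hypothetical and are not the scheduler's actual metric names or API. A real implementation would presumably register counters with the scheduler's existing metrics machinery instead.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// BatchStats is a hypothetical in-memory tally used only for illustration.
// It is not part of kube-scheduler.
type BatchStats struct {
	mu          sync.Mutex
	batched     int
	unbatchable map[string]int // count of unbatchable pods, keyed by reason
}

// NewBatchStats returns an empty tally.
func NewBatchStats() *BatchStats {
	return &BatchStats{unbatchable: make(map[string]int)}
}

// RecordBatched counts one pod that was scheduled as part of a batch.
func (s *BatchStats) RecordBatched() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batched++
}

// RecordUnbatchable counts one pod that could not be batched, with the reason.
func (s *BatchStats) RecordUnbatchable(reason string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.unbatchable[reason]++
}

// Summary returns one line per counter, with reasons sorted for stable output.
func (s *BatchStats) Summary() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	reasons := make([]string, 0, len(s.unbatchable))
	for r := range s.unbatchable {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	out := []string{fmt.Sprintf("batched=%d", s.batched)}
	for _, r := range reasons {
		out = append(out, fmt.Sprintf("unbatchable{reason=%q}=%d", r, s.unbatchable[r]))
	}
	return out
}

func main() {
	stats := NewBatchStats()
	stats.RecordBatched()
	// Reason labels below are placeholders, not agreed label values.
	stats.RecordUnbatchable("pod_affinity")
	stats.RecordUnbatchable("pod_affinity")
	stats.RecordUnbatchable("topology_spread")
	for _, line := range stats.Summary() {
		fmt.Println(line)
	}
}
```

Keying the unbatchable count by reason keeps a single counter family while still letting an operator see which constraint blocks batching most often.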
691
696
692
697
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
693
698
@@ -697,6 +702,8 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
697
702
are missing a bunch of machinery and tooling and can't do that now.
698
703
-->
699
704
705
+
Upgrade and downgrade should be simple due to the feature being in-memory, but we will test the path before GA.
706
+
700
707
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
701
708
702
709
<!--
@@ -714,11 +721,6 @@ For GA, this section is required: approvers should be able to confirm the
714
721
previous answers based on experience in the field.
715
722
-->
716
723
717
-
We will add metrics to identify:
718
-
- How often nominated nodes are found infeasible.
719
-
- How often pods are "batched" vs not.
720
-
- Reasons for pods to be "unbatchable" (pod affinity, pod spread, etc.)
721
-
722
724
###### How can an operator determine if the feature is in use by workloads?
723
725
724
726
<!--
@@ -728,7 +730,6 @@ logs or events for this purpose.
728
730
-->
729
731
730
732
- We will log statistics about how often pods are batched vs not batched.
731
-
- We will include in the pod status information about whether it was batched or not.
732
733
733
734
###### How can someone using this feature know that it is working for their instance?
734
735
@@ -1035,6 +1036,10 @@ The issues experienced by eCache were:
1035
1036
1036
1037
See https://github.com/kubernetes/kubernetes/pull/65714#issuecomment-410016382 as a starting point on eCache.
1037
1038
1039
+
## Future work
1040
+
1041
+
Today we can determine whether a given node would still be feasible after a specific pod is added to it. This capability is powerful and will be used by this feature. However, we have no equivalent capability for scoring. Adding one would make it much easier to do batching (and many other things) on a wider range of workloads. This work is not required for this KEP, but it would increase the number of use cases where we could apply batching.