Commit d020979

More review feedback.
1 parent b3ac322 commit d020979

1 file changed

+15
-10
lines changed
keps/sig-scheduling/5598-opportunistic-batching/README.md

Lines changed: 15 additions & 10 deletions
@@ -684,10 +684,15 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
 
-- Pods that fail to schedule.
-- Pods that have had a node nominated, but then found that node infeasible.
-- Pods that cannot be batched.
-- High pod scheduling time.
+Existing metrics:
+- `pod_scheduling_sli_duration_seconds`
+- `schedule_attempts_total` - specifically unschedulable and error cases
+- `pending_pods`
+- `unschedulable_pods`
+
+New metrics:
+- Pods that cannot be batched.
+- Pod batch failure reasons
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -697,6 +702,8 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
+Upgrade and downgrade should be simple due to the feature being in-memory, but we will test the path before GA.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
@@ -714,11 +721,6 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 
-We will add metrics to identify:
-- How often nominated nodes are found infeasible.
-- How often pods are "batched" vs not.
-- Reasons for pods to be "unbatchable" (pod affinity, pod spread, etc.)
-
 ###### How can an operator determine if the feature is in use by workloads?
 
 <!--
@@ -728,7 +730,6 @@ logs or events for this purpose.
 -->
 
 - We will log statistics about how often pods are batched vs not batched.
-- We will include in the pod status information about whether it was batched or not.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -1035,6 +1036,10 @@ The issues experienced by eCache were:
 
 See https://github.com/kubernetes/kubernetes/pull/65714#issuecomment-410016382 as starting point on eCache.
 
+## Future work
+
+Today we have the ability to determine if a given node would still be feasible after we added a specific pod to it. This is powerful and will be used by this feature. However, we do not have the same capability when it comes to scoring. Adding this capability would make it much easier for us to do batching (and many other things) on a wider range of workloads. This work is not required for this KEP, but would increase the number of use cases where we could apply batching.
+
 ## Infrastructure Needed (Optional)
 
 <!--
