Commit 665f217

added rollout, upgrade, and rollback
1 parent 0478769 commit 665f217

3 files changed: 41 additions and 7 deletions

keps/prod-readiness/sig-scheduling/5004.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 5004
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 29 additions & 4 deletions
@@ -901,15 +901,37 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
-Will be considered for beta.
+Workloads that do not use the DRA Extended Resource feature should not be impacted,
+since the functionality is unchanged.
+
+If the feature is used in pods before support for it has been fully rolled out
+across the cluster (API server and scheduler in the control plane, kubelet on the
+nodes), pods can fail to be scheduled or fail to run on the nodes.
+This will not affect already running workloads unless they have to be restarted.
+
+Device plugin drivers can be replaced with DRA drivers for the same devices on a
+per-node basis, one node at a time.

 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-Will be considered for beta.
+One indicator is unexpected restarts of the cluster control plane components
+(kube-scheduler, kube-apiserver) or the kubelet.
+
+If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, it can
+suggest that pods are no longer getting scheduled, which might be due to a problem with
+the DRA scheduler plugin. Another indicator is an increase in the number of pods that fail to start,
+as indicated by the kubelet_started_containers_errors_total metric.
+
+If the node.status.Capacity for the extended resources for the devices does not decrease to zero,
+or a pod fails to be scheduled or to run on the node, it may indicate that the device plugin driver
+on the node for the devices was not properly replaced by the DRA driver.
+
+In all cases, further analysis of logs and pod events is needed to determine whether
+errors are related to this feature.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -918,14 +940,17 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
-Will be considered for beta.
+This will be covered by automated tests before transition to beta by bringing up a KinD cluster and
+changing the feature gate for individual components.
+
+Roundtripping of API types is covered by unit tests.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
-Will be considered for beta.
+No

 ### Monitoring Requirements

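The rollback signals named in the README change above (a jump in scheduler_pending_pods, growth in kubelet_started_containers_errors_total, and extended-resource capacity that stays above zero after a driver swap) could be checked roughly as sketched below. This is a minimal illustration and not part of the commit. It assumes metrics arrive as Prometheus text-format scrapes and that nodes are plain dictionaries shaped like the Node objects in `kubectl get nodes -o json` output. The helper names, the thresholds, the sample values, and the resource name `example.com/gpu` are all hypothetical.

```python
import re

# One sample line of the Prometheus text exposition format:
# metric name, optional {labels}, whitespace, value.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def metric_total(exposition_text, metric_name):
    """Sum all samples of metric_name across every label set."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric_name:
            total += float(match.group(3))
    return total


def metric_signals(before, after, pending_factor=2.0, min_pending=10.0):
    """Compare two scrapes; the thresholds are illustrative assumptions."""
    signals = []
    pending_before = metric_total(before, "scheduler_pending_pods")
    pending_after = metric_total(after, "scheduler_pending_pods")
    if pending_after >= min_pending and pending_after > pending_factor * pending_before:
        signals.append(
            f"scheduler_pending_pods rose from {pending_before:g} to {pending_after:g}"
        )
    name = "kubelet_started_containers_errors_total"
    errors_before = metric_total(before, name)
    errors_after = metric_total(after, name)
    if errors_after > errors_before:
        signals.append(f"{name} rose from {errors_before:g} to {errors_after:g}")
    return signals


def nodes_still_advertising(nodes, resource_name):
    """Names of nodes whose status.capacity for resource_name is above zero."""
    names = []
    for node in nodes:
        capacity = node.get("status", {}).get("capacity", {})
        # A missing entry is treated as zero (an assumption of this sketch).
        if int(capacity.get(resource_name, "0")) > 0:
            names.append(node["metadata"]["name"])
    return names


# Made-up sample data, for illustration only.
BEFORE = """\
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 1
scheduler_pending_pods{queue="unschedulable"} 2
kubelet_started_containers_errors_total{code="RunContainerError",container_type="container"} 4
"""
AFTER = """\
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 3
scheduler_pending_pods{queue="unschedulable"} 40
kubelet_started_containers_errors_total{code="RunContainerError",container_type="container"} 9
"""
NODES = [
    {"metadata": {"name": "node-a"}, "status": {"capacity": {"example.com/gpu": "2"}}},
    {"metadata": {"name": "node-b"}, "status": {"capacity": {"example.com/gpu": "0"}}},
    {"metadata": {"name": "node-c"}, "status": {"capacity": {"cpu": "8"}}},
]

if __name__ == "__main__":
    for signal in metric_signals(BEFORE, AFTER):
        print(signal)
    print(nodes_still_advertising(NODES, "example.com/gpu"))
```

A non-empty result from either helper is only a prompt to look further. As the README text says, logs and pod events are still needed to decide whether the errors relate to this feature. In a real cluster the scheduler and kubelet metrics come from separate endpoints, so the two scrapes would have to be collected per component.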
keps/sig-scheduling/5004-dra-extended-resource/kep.yaml

Lines changed: 10 additions & 3 deletions
@@ -21,12 +21,12 @@ see-also:
 - "/keps/sig-node/3573-device-plugin"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.34"
+latest-milestone: "v1.35"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
@@ -46,4 +46,11 @@ disable-supported: true

 # The following PRR answers are required at beta release
 metrics:
-#- my_feature_metric
+- kube_pod_resource_limit
+- kube_pod_resource_request
+- kubelet_started_containers_errors_total
+- resourceclaim_controller_resource_claims
+- scheduler_pending_pods
+- scheduler_plugin_execution_duration_seconds
+- scheduler_pod_scheduling_sli_duration_seconds
+- scheduler_resourceclaim_creates_total
