Commit aa5461c

Introducing soft-tainter controller (#464)
Introduce a new controller that dynamically sets and removes a soft taint (effect: PreferNoSchedule) on nodes that the descheduler identifies as overutilized according to the nodeutilization plugin.

The nodeutilization plugin in the descheduler can now consume utilization data from actual Kubernetes metrics and not only static reservations (requests). The default kube-scheduler, on the other hand, only ensures that a Pod's specific resource requests can be satisfied. The soft taint simply gives the scheduler a hint to try to avoid nodes that look overutilized in the descheduler's eyes. Since it is just a soft taint (honored only if possible), there is no need to restrict it to a certain number of nodes in the cluster or to be strict about error handling.

This could potentially be implemented directly in the descheduler, but it has been judged out of scope there, so an external controller seems a more manageable solution.

Since we want to limit the rights to amend nodes to a dedicated service account, the new controller runs in a new pod deployed by the operator. The operator also deploys the ClusterRole and the RoleBindings for this additional operand. With a fine-grained ValidatingAdmissionPolicy (GA in k8s 1.30) we can enforce that the softTainter service account is only allowed to apply/remove soft taints (enforcing the PreferNoSchedule effect) with a predefined key prefix. The operator also deploys the ValidatingAdmissionPolicy.

Finally, smarter node classification requires more complex logic, which is implemented on the Prometheus side with recording rules. The operator therefore now also deploys the PrometheusRule, checking that the containing namespace is correctly labelled to expose them.

Signed-off-by: Simone Tiraboschi <[email protected]>
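
The set/remove behavior described in the commit message can be sketched roughly as follows. This is only an illustration using plain structs instead of the Kubernetes client types: the `Taint` struct, the `SyncSoftTaint` helper and the taint key `example.io/soft-tainter-overutilized` are hypothetical names, not taken from the commit.

```go
package main

import "fmt"

// Taint mirrors the fields of a Kubernetes node taint that matter for this sketch.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

const (
	// softTaintKey is a hypothetical key; the commit only states that a
	// predefined key prefix is used.
	softTaintKey     = "example.io/soft-tainter-overutilized"
	preferNoSchedule = "PreferNoSchedule"
)

// SyncSoftTaint returns the taints a node should carry, plus a flag telling
// whether the list differs from the input. Taints with other keys or effects
// are always left untouched.
func SyncSoftTaint(taints []Taint, overutilized bool) ([]Taint, bool) {
	out := make([]Taint, 0, len(taints)+1)
	found := false
	for _, t := range taints {
		if t.Key == softTaintKey && t.Effect == preferNoSchedule {
			found = true
			if !overutilized {
				continue // node recovered: drop the soft taint
			}
		}
		out = append(out, t)
	}
	if overutilized && !found {
		out = append(out, Taint{Key: softTaintKey, Value: "true", Effect: preferNoSchedule})
	}
	// The list changes only when the desired state and the observed state disagree.
	return out, overutilized != found
}

func main() {
	taints := []Taint{{Key: "node-role.kubernetes.io/infra", Effect: "NoSchedule"}}
	taints, changed := SyncSoftTaint(taints, true)
	fmt.Println(len(taints), changed)
	taints, changed = SyncSoftTaint(taints, false)
	fmt.Println(len(taints), changed)
}
```

Returning the `changed` flag lets a caller skip the node update entirely when nothing needs to move, which fits the relaxed error-handling stance of a soft taint.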
1 parent 882e79f commit aa5461c

File tree

359 files changed

+58176
-3022
lines changed

Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -1,10 +1,11 @@
-FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:rhel_9_1.22 as builder
+FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:rhel_9_1.23 as builder
 WORKDIR /go/src/github.com/openshift/cluster-kube-descheduler-operator
 COPY . .
 RUN make build --warn-undefined-variables
 
 FROM registry.redhat.io/rhel9-4-els/rhel-minimal:9.4
 COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/cluster-kube-descheduler-operator /usr/bin/
+COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/soft-tainter /usr/bin/
 RUN mkdir /licenses
 COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/LICENSE /licenses/.

Dockerfile.rhel7

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ RUN make build --warn-undefined-variables
 
 FROM registry.ci.openshift.org/ocp/builder:rhel-9-base-openshift-4.19
 COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/cluster-kube-descheduler-operator /usr/bin/
+COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/soft-tainter /usr/bin/
 COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/manifests /manifests
 COPY --from=builder /go/src/github.com/openshift/cluster-kube-descheduler-operator/metadata /metadata
 LABEL io.k8s.display-name="OpenShift Descheduler Operator" \

README.md

Lines changed: 9 additions & 1 deletion
@@ -168,10 +168,17 @@ as they are commonly encountered during VM eviction and migration.
 Equivalent to enabling both `EvictPodsWithPVC` and `EvictPodsWithLocalStorage` profiles in parallel.
 In the future, additional configurations may be introduced through the operator based on user feedback.
 
+This profile sets a nodeSelector (`kubevirt.io/schedulable=true`)
+in the Descheduler policy to limit its action to nodes that are considered schedulable for `KubeVirt`.
+That nodeSelector is a top level configuration option affecting all the active descheduler profiles:
+this profile is not expected to be combined with other profiles
+unless all profiles are expected to operate over the same set of nodes.
+
 The profile exposes the following customization:
 - `devLowNodeUtilizationThresholds`: Sets experimental thresholds for the LowNodeUtilization strategy.
 - `devActualUtilizationProfile`: Enable load-aware descheduling.
 - `devDeviationThresholds`: Have the thresholds be based on the average utilization.
+- `devEnableSoftTainter`: Have the operator deploying the soft-tainter component to dynamically set/remove soft taints according to the same criteria used for load aware descheduling. Scheduling and descheduling decisions are in general fully independent. In case of load-aware descheduling we can potentially have a relevant asymmetry between the descheduling and successive scheduling decisions. The soft taints set by the descheduler soft-tainter act as a hint for the scheduler to mitigate this asymmetry and foster a quicker convergence.
 
 ### EvictPodsWithPVC
 By default, the operator prevents pods with PVCs from being evicted. Enabling this
@@ -199,7 +206,8 @@ the `profileCustomizations` field:
 |`devEnableEvictionsInBackground`|`bool`| Enables descheduler's EvictionsInBackground alpha feature. The EvictionsInBackground alpha feature is a subject to change. Currently provided as an experimental feature.|
 | `devHighNodeUtilizationThresholds` | `string` | Sets thresholds for the [HighNodeUtilization](https://github.com/kubernetes-sigs/descheduler#highnodeutilization) strategy of the `CompactAndScale` profile in the following ratios: `Minimal` for 10%, `Modest` for 20%, `Moderate` for 30%. Currently provided as an experimental feature.|
 |`devActualUtilizationProfile`|`string`| Sets a profile that gets translated into a predefined prometheus query |
-| `devDeviationThresholds` | `string` | Have the thresholds be based on the average utilization. Thresholds signify the distance from the average node utilization in the following setting: `Low`: 10%:10%, `Medium`: 20%:20%, `High`: 30%:30% |
+| `devDeviationThresholds` | `string` | Have the thresholds be based on the average utilization. Thresholds signify the distance from the average node utilization. An AsymmetricDeviationThreshold will force all nodes below the average to be considered as underutilized to help rebalancing overutilized outliers. Supported settings: `Low`: 10%:10%, `Medium`: 20%:20%, `High`: 30%:30%, `AsymmetricLow`: 0%:10%, `AsymmetricMedium`: 0%:20%, `AsymmetricHigh`: 0%:30% |
+| `devEnableSoftTainter` | `bool` | Have the operator deploying the soft-tainter component to dynamic set/remove soft taints as a hint for the scheduler based on the same criteria used for the descheduling |
 
 ## Prometheus query profiles
 The operator provides the following profiles:
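
As an illustration of the `devDeviationThresholds` settings listed in the table of the hunk above, the sketch below classifies a node by its distance from the average utilization. It assumes that the first number of each pair is the deviation below the average and the second the deviation above it, both in percentage points, and that the comparisons are strict; the `Deviation` type and the `Classify` function are hypothetical names, not the descheduler's API.

```go
package main

import "fmt"

// Deviation holds the allowed distance from the average utilization,
// in percentage points, below (Low) and above (High) the average.
type Deviation struct {
	Low  float64
	High float64
}

// deviationThresholds lists the settings documented in the README table.
var deviationThresholds = map[string]Deviation{
	"Low":              {Low: 10, High: 10},
	"Medium":           {Low: 20, High: 20},
	"High":             {Low: 30, High: 30},
	"AsymmetricLow":    {Low: 0, High: 10},
	"AsymmetricMedium": {Low: 0, High: 20},
	"AsymmetricHigh":   {Low: 0, High: 30},
}

// Classify compares a node's utilization (in percent) with the cluster average.
// Strict comparisons are an assumption of this sketch.
func Classify(utilization, average float64, d Deviation) string {
	switch {
	case utilization < average-d.Low:
		return "underutilized"
	case utilization > average+d.High:
		return "overutilized"
	default:
		return "appropriately utilized"
	}
}

func main() {
	average := 50.0
	for _, name := range []string{"Low", "AsymmetricLow"} {
		d := deviationThresholds[name]
		fmt.Printf("%s: node at 45%% is %s\n", name, Classify(45, average, d))
	}
}
```

With an asymmetric setting the lower deviation is zero, so any node below the average counts as underutilized and becomes a candidate target for pods evicted from overutilized outliers.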

bindata/assets/kube-descheduler/deployment.yaml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ spec:
             readOnlyRootFilesystem: true
             capabilities:
               drop: ["ALL"]
-          image: ${IMAGE}
+          image: ${OPERAND_IMAGE}
           resources:
             requests:
               cpu: "100m"
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+---
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: descheduler-rules
+  namespace: openshift-kube-descheduler-operator
+spec:
+  groups:
+    - name: recordingRules.rules
+      rules:
+        - record: descheduler:nodeutilization:cpu:avg1m
+          expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode='idle'}[1m]))
+
+        - record: descheduler:averageworkersutilization:cpu:avg1m
+          expr: avg(descheduler:nodeutilization:cpu:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))
+
+        - record: descheduler:nodepressure:cpu:avg1m
+          # return the cpu pressure if the cpu usage is over 70% otherwise
+          # return cpu pressure as zero to (partially) filter out false
+          # positives pressure spikes due to CPU limited pods.
+          # See: https://github.com/kubernetes/enhancements/issues/5062
+          expr: |-
+            avg by (instance) (
+              rate(node_pressure_cpu_waiting_seconds_total[1m])
+            ) and (
+              1 - avg by (instance) (
+                rate(node_cpu_seconds_total{mode='idle'}[1m])
+              )
+            ) > 0.7
+            or
+            avg by (instance) (
+              rate(node_pressure_cpu_waiting_seconds_total[1m])
+            ) * 0
+
+        - record: descheduler:combined_utilization_and_pressure:avg1m
+          expr: |-
+            (descheduler:nodeutilization:cpu:avg1m and on() descheduler:averageworkersutilization:cpu:avg1m < 0.8)
+            or
+            (descheduler:nodepressure:cpu:avg1m)
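
The selection logic of the recording rules above can be paraphrased in Go. This is one reading of the PromQL, not code from the commit: while the average worker utilization stays below 0.8 every node reports its CPU utilization, otherwise it reports its CPU pressure, and that pressure is zeroed for nodes at or below 70% utilization. `NodeSample`, `gatedPressure` and `Combined` are hypothetical names.

```go
package main

import "fmt"

// NodeSample carries the two per-node inputs of the recording rules.
type NodeSample struct {
	Utilization float64 // CPU utilization as a ratio in [0, 1]
	Pressure    float64 // rate of CPU pressure waiting seconds
	Worker      bool    // whether the node has the worker role
}

// gatedPressure mimics descheduler:nodepressure:cpu:avg1m: pressure is kept
// only when utilization exceeds 0.7, otherwise it is reported as zero.
func gatedPressure(s NodeSample) float64 {
	if s.Utilization > 0.7 {
		return s.Pressure
	}
	return 0
}

// Combined mimics descheduler:combined_utilization_and_pressure:avg1m.
func Combined(nodes map[string]NodeSample) map[string]float64 {
	sum, workers := 0.0, 0
	for _, s := range nodes {
		if s.Worker {
			sum += s.Utilization
			workers++
		}
	}
	// Without any worker sample the left-hand side of the "or" is empty,
	// so the pressure branch is used.
	useUtilization := workers > 0 && sum/float64(workers) < 0.8

	out := make(map[string]float64, len(nodes))
	for name, s := range nodes {
		if useUtilization {
			out[name] = s.Utilization
		} else {
			out[name] = gatedPressure(s)
		}
	}
	return out
}

func main() {
	nodes := map[string]NodeSample{
		"node-a": {Utilization: 0.95, Pressure: 0.4, Worker: true},
		"node-b": {Utilization: 0.90, Pressure: 0.2, Worker: true},
	}
	fmt.Println(Combined(nodes))
}
```

Switching to pressure once the cluster is saturated keeps the metric discriminating between nodes even when all of them show similarly high utilization.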
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: openshift-descheduler-softtainter
+rules:
+  - apiGroups:
+      - "events.k8s.io"
+    resources:
+      - "events"
+    verbs:
+      - "get"
+      - "watch"
+      - "list"
+      - "create"
+      - "update"
+      - "patch"
+      - "delete"
+  - apiGroups:
+      - ""
+    resources:
+      - "nodes"
+    verbs:
+      - "get"
+      - "watch"
+      - "list"
+      - "patch"
+      - "update"
+  - apiGroups:
+      - "coordination.k8s.io"
+    resources:
+      - "leases"
+    verbs:
+      - "create"
+  - apiGroups:
+      - "coordination.k8s.io"
+    resources:
+      - "leases"
+    resourceNames:
+      - "soft-tainter-lock"
+    verbs:
+      - "get"
+      - "patch"
+      - "update"
+      - "delete"
+  - apiGroups:
+      - "operator.openshift.io"
+    resources:
+      - "kubedeschedulers"
+    verbs:
+      - "get"
+      - "watch"
+      - "list"
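
The ClusterRole above grants `patch` and `update` on whole Node objects, which is why the commit message pairs it with a ValidatingAdmissionPolicy. The policy manifest is not visible in this part of the diff, and such policies are written in CEL rather than Go, so the following is only a sketch of the kind of rule it is described as enforcing: every taint that differs between the old and the new node must be a soft taint with the expected key prefix. The prefix `example.io/soft-` and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Taint mirrors the fields of a Kubernetes node taint that matter for this sketch.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// softTaintPrefix is a hypothetical prefix; the real one is defined by the operator.
const softTaintPrefix = "example.io/soft-"

// isSoftTaint reports whether a taint is one the soft-tainter may manage.
func isSoftTaint(t Taint) bool {
	return strings.HasPrefix(t.Key, softTaintPrefix) && t.Effect == "PreferNoSchedule"
}

// toSet indexes taints by value so that additions and removals can be detected.
func toSet(taints []Taint) map[Taint]bool {
	set := make(map[Taint]bool, len(taints))
	for _, t := range taints {
		set[t] = true
	}
	return set
}

// AllowedTaintChange returns true when every added or removed taint is a soft taint.
func AllowedTaintChange(oldTaints, newTaints []Taint) bool {
	oldSet, newSet := toSet(oldTaints), toSet(newTaints)
	for t := range oldSet {
		if !newSet[t] && !isSoftTaint(t) {
			return false // a non-soft taint was removed or modified
		}
	}
	for t := range newSet {
		if !oldSet[t] && !isSoftTaint(t) {
			return false // a non-soft taint was added or modified
		}
	}
	return true
}

func main() {
	hard := Taint{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}
	soft := Taint{Key: softTaintPrefix + "overutilized", Value: "true", Effect: "PreferNoSchedule"}
	fmt.Println(AllowedTaintChange([]Taint{hard}, []Taint{hard, soft}))
	fmt.Println(AllowedTaintChange([]Taint{hard}, []Taint{soft}))
}
```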
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: openshift-descheduler-softtainter
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: openshift-descheduler-softtainter
+subjects:
+  - kind: ServiceAccount
+    name: openshift-descheduler-softtainter
+    namespace: openshift-kube-descheduler-operator
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: openshift-descheduler-softtainter-monitoring
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: cluster-monitoring-view
+subjects:
+  - kind: ServiceAccount
+    name: openshift-descheduler-softtainter
+    namespace: openshift-kube-descheduler-operator
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: "softtainer"
+  namespace: "openshift-kube-descheduler-operator"
+  labels:
+    app: "softtainer"
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: "softtainer"
+  template:
+    metadata:
+      annotations:
+        kubectl.kubernetes.io/default-container: openshift-softtainer
+      labels:
+        app: "softtainer"
+    spec:
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
+      volumes:
+        - name: "policy-volume"
+          configMap:
+            name: "descheduler"
+        - name: certs-dir
+          secret:
+            secretName: kube-descheduler-serving-cert
+      priorityClassName: "system-cluster-critical"
+      restartPolicy: "Always"
+      containers:
+        - name: "openshift-softtainer"
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
+          image: ${SOFTTAINTER_IMAGE}
+          livenessProbe:
+            failureThreshold: 1
+            httpGet:
+              path: /livez
+              port: 6060
+              scheme: HTTP
+            initialDelaySeconds: 30
+            periodSeconds: 5
+          readinessProbe:
+            failureThreshold: 1
+            httpGet:
+              path: /readyz
+              port: 6060
+              scheme: HTTP
+            initialDelaySeconds: 5
+            periodSeconds: 5
+          resources:
+            requests:
+              cpu: "100m"
+              memory: "500Mi"
+          command: ["/usr/bin/soft-tainter"]
+          args:
+            - --policy-config-file=/policy-dir/policy.yaml
+          volumeMounts:
+            - mountPath: "/policy-dir"
+              name: "policy-volume"
+            - mountPath: "/certs-dir"
+              name: certs-dir
+      serviceAccountName: "openshift-descheduler-softtainter"
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: openshift-descheduler-softtainter
+  namespace: openshift-kube-descheduler-operator
