|
1 |
| -# Kube Descheduler Operator |
| 1 | +# README |
2 | 2 |
|
3 |
| -Run the descheduler in your OpenShift cluster to move pods based on specific strategies. |
| 3 | +## FBC catalog rendering |
4 | 4 |
|
5 |
| -## Releases |
| 5 | +To initialize catalog-template.json |
6 | 6 |
|
7 |
| -| kdo version | ocp version | k8s version | golang | |
8 |
| -| ----------- | ----------- | ----------- | ------ | |
9 |
| -| 5.0.0 | 4.15, 4.16 | 1.28 | 1.20 | |
10 |
| -| 5.0.1 | 4.15, 4.16 | 1.29 | 1.21 | |
11 |
| -| 5.0.2 | 4.15, 4.16 | 1.29 | 1.21 | |
12 |
| -| 5.1.0 | 4.17, 4.18 | 1.30 | 1.22 | |
13 |
| -| 5.1.1 | 4.17, 4.18 | 1.31 | 1.22 | |
14 |
| - |
15 |
| -## Deploy the operator |
16 |
| - |
17 |
| -### Quick Development |
18 |
| - |
19 |
| -1. Build and push the operator image to a registry: |
20 |
| -2. Ensure the `image` spec in `deploy/05_deployment.yaml` refers to the operator image you pushed |
21 |
| -3. Run `oc create -f deploy/.` |
22 |
| - |
23 |
| -### OperatorHub install with custom index image |
24 |
| - |
25 |
| -This process refers to building the operator in a way that it can be installed locally via the OperatorHub with a custom index image |
26 |
| - |
27 |
| -1. Build and push the operator image to a registry: |
28 |
| - ```sh |
29 |
| - export QUAY_USER=${your_quay_user_id} |
30 |
| - export IMAGE_TAG=${your_image_tag} |
31 |
| - podman build -t quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG} -f Dockerfile.rhel7 |
32 |
| - podman login quay.io -u ${QUAY_USER} |
33 |
| - podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG} |
34 |
| - ``` |
35 |
| - |
36 |
| -1. Update the `.spec.install.spec.deployments[0].spec.template.spec.containers[0].image` field in the KDO CSV under `./manifests/cluster-kube-descheduler-operator.clusterserviceversion.yaml` to point to the newly built image. |
37 |
| - |
38 |
| -1. build and push the metadata image to a registry (e.g. https://quay.io): |
39 |
| - ```sh |
40 |
| - podman build -t quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG} -f Dockerfile.metadata . |
41 |
| - podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG} |
42 |
| - ``` |
43 |
| - |
44 |
| -1. build and push image index for operator-registry (pull and build https://github.com/operator-framework/operator-registry/ to get the `opm` binary) |
45 |
| - ```sh |
46 |
| - opm index add --bundles quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG} --tag quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-index:${IMAGE_TAG} |
47 |
| - podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-index:${IMAGE_TAG} |
48 |
| - ``` |
49 |
| - |
50 |
| - Don't forget to increase the number of open files, .e.g. `ulimit -n 100000` in case the current limit is insufficient. |
51 |
| - |
52 |
| -1. create and apply catalogsource manifest (remember to change <<QUAY_USER>> and <<IMAGE_TAG>> to your own values): |
53 |
| - ```yaml |
54 |
| - apiVersion: operators.coreos.com/v1alpha1 |
55 |
| - kind: CatalogSource |
56 |
| - metadata: |
57 |
| - name: cluster-kube-descheduler-operator |
58 |
| - namespace: openshift-marketplace |
59 |
| - spec: |
60 |
| - sourceType: grpc |
61 |
| - image: quay.io/<<QUAY_USER>>/cluster-kube-descheduler-operator-index:<<IMAGE_TAG>> |
62 |
| - ``` |
63 |
| -
|
64 |
| -1. create `openshift-kube-descheduler-operator` namespace: |
65 |
| - ``` |
66 |
| - $ oc create ns openshift-kube-descheduler-operator |
67 |
| - ``` |
68 |
| -
|
69 |
| -1. open the console Operators -> OperatorHub, search for `descheduler operator` and install the operator |
70 |
| -
|
71 |
| -
|
72 |
| -## Sample CR |
73 |
| -
|
74 |
| -A sample CR definition looks like below (the operator expects `cluster` CR under `openshift-kube-descheduler-operator` namespace): |
75 |
| -
|
76 |
| -```yaml |
77 |
| -apiVersion: operator.openshift.io/v1 |
78 |
| -kind: KubeDescheduler |
79 |
| -metadata: |
80 |
| - name: cluster |
81 |
| - namespace: openshift-kube-descheduler-operator |
82 |
| -spec: |
83 |
| - deschedulingIntervalSeconds: 1800 |
84 |
| - profiles: |
85 |
| - - AffinityAndTaints |
86 |
| - - LifecycleAndUtilization |
87 |
| - profileCustomizations: |
88 |
| - podLifetime: 5m |
89 |
| - namespaces: |
90 |
| - included: |
91 |
| - - ns1 |
92 |
| - - ns2 |
| 7 | +```sh |
| 8 | +$ opm migrate registry.redhat.io/redhat/redhat-operator-index:v4.17 ./catalog-migrate |
| 9 | +$ mkdir -p v4.18/catalog/cluster-kube-descheduler-operator |
| 10 | +$ opm alpha convert-template basic ./catalog-migrate/cluster-kube-descheduler-operator/catalog.json > v4.18/catalog-template.json |
93 | 11 | ```
|
94 | 12 |
|
95 |
| -The operator spec provides a `profiles` field, which allows users to set one or more descheduling profiles to enable. |
96 |
| - |
97 |
| -These profiles map to preconfigured policy definitions, enabling several descheduler strategies grouped by intent, and |
98 |
| -any that are enabled will be merged. |
99 |
| - |
100 |
| -## Profiles |
101 |
| - |
102 |
| -The following profiles are currently provided: |
103 |
| -* [`AffinityAndTaints`](#AffinityAndTaints) |
104 |
| -* [`TopologyAndDuplicates`](#TopologyAndDuplicates) |
105 |
| -* [`SoftTopologyAndDuplicates`](#SoftTopologyAndDuplicates) |
106 |
| -* [`LifecycleAndUtilization`](#LifecycleAndUtilization) |
107 |
| -* [`LongLifecycle`](#LongLifecycle) |
108 |
| -* [`CompactAndScale`](#compactandscale-techpreview) |
109 |
| -* [`EvictPodsWithPVC`](#EvictPodsWithPVC) |
110 |
| -* [`EvictPodsWithLocalStorage`](#EvictPodsWithLocalStorage) |
111 |
| - |
112 |
| -Each of these enables cluster-wide descheduling (excluding openshift and kube-system namespaces) based on certain goals. |
113 |
| - |
114 |
| -### AffinityAndTaints |
115 |
| -This is the most basic descheduling profile and it removes running pods which violate node and pod affinity, and node |
116 |
| -taints. |
117 |
| - |
118 |
| -This profile enables the [`RemovePodsViolatingInterPodAntiAffinity`](https://github.com/kubernetes-sigs/descheduler/#removepodsviolatinginterpodantiaffinity), |
119 |
| -[`RemovePodsViolatingNodeAffinity`](https://github.com/kubernetes-sigs/descheduler/#removepodsviolatingnodeaffinity), and |
120 |
| -[`RemovePodsViolatingNodeTaints`](https://github.com/kubernetes-sigs/descheduler/#removepodsviolatingnodeaffinity) strategies. |
| 13 | +To update the catalog |
121 | 14 |
|
122 |
| -### TopologyAndDuplicates |
123 |
| -This profile attempts to balance pod distribution based on topology constraint definitions and evicting duplicate copies |
124 |
| -of the same pod running on the same node. It enables the [`RemovePodsViolatingTopologySpreadConstraints`](https://github.com/kubernetes-sigs/descheduler/#removepodsviolatingtopologyspreadconstraint) |
125 |
| -and [`RemoveDuplicates`](https://github.com/kubernetes-sigs/descheduler/#removeduplicates) strategies. |
126 |
| - |
127 |
| -### SoftTopologyAndDuplicates |
128 |
| -This profile is the same as `TopologyAndDuplicates`, however it will also consider pods with "soft" topology constraints |
129 |
| -for eviction (ie, `whenUnsatisfiable: ScheduleAnyway`) |
130 |
| - |
131 |
| -### LifecycleAndUtilization |
132 |
| -This profile focuses on pod lifecycles and node resource consumption. It will evict any running pod older than 24 hours |
133 |
| -and attempts to evict pods from "high utilization" nodes that can fit onto "low utilization" nodes. A high utilization |
134 |
| -node is any node consuming more than 50% its available cpu, memory, *or* pod capacity. A low utilization node is any node |
135 |
| -with less than 20% of its available cpu, memory, *and* pod capacity. |
136 |
| - |
137 |
| -This profile enables the [`LowNodeUtilizaition`](https://github.com/kubernetes-sigs/descheduler/#lownodeutilization), |
138 |
| -[`RemovePodsHavingTooManyRestarts`](https://github.com/kubernetes-sigs/descheduler/#removepodshavingtoomanyrestarts) and |
139 |
| -[`PodLifeTime`](https://github.com/kubernetes-sigs/descheduler/#podlifetime) strategies. In the future, more configuration |
140 |
| -may be made available through the operator for these strategies based on user feedback. |
141 |
| - |
142 |
| -### LongLifecycle |
143 |
| -This profile provides cluster resource balancing similar to [LifecycleAndUtilization](#LifecycleAndUtilization) for longer-running |
144 |
| -clusters. It does not evict pods based on the 24 hour lifetime used by LifecycleAndUtilization. |
145 |
| - |
146 |
| -### CompactAndScale |
147 |
| -This profile seeks to evict pods to enable the same workload to run on a smaller set of nodes. |
148 |
| -It will attempts to evict pods from "under utilized" nodes that can fit into fewer nodes. |
149 |
| -An under utilized node is any node consuming less than 20% of its available cpu, memory, *and* pod capacity. |
150 |
| - |
151 |
| -This profile enables the [`HighNodeUtilization`](https://github.com/kubernetes-sigs/descheduler/#highnodeutilization) strategy. |
152 |
| -In the future, more configuration may be made available through the operator based on user feedback. |
153 |
| - |
154 |
| -### EvictPodsWithPVC |
155 |
| -By default, the operator prevents pods with PVCs from being evicted. Enabling this |
156 |
| -profile in combination with any of the above profiles allows pods with PVCs to be |
157 |
| -eligible for eviction. |
158 |
| - |
159 |
| -### EvictPodsWithLocalStorage |
160 |
| -By default, pods with local storage are not eligible to be considered for eviction by any |
161 |
| -profile. Using this profile allows them to be evicted if necessary. A pod is defined as using |
162 |
| -local storage if any of its volumes have `HostPath` or `EmptyDir` set (note that a pod that only |
163 |
| -uses PVCs does not fit this definition, and will need the `EvictPodsWithPVC` profile instead. Pods |
164 |
| -that use both will need both profiles to be evicted). |
165 |
| - |
166 |
| -## Profile Customizations |
167 |
| -Some profiles expose options which may be used to configure the underlying Descheduler strategy parameters. These are available under |
168 |
| -the `profileCustomizations` field: |
169 |
| - |
170 |
| -|Name|Type|Description| |
171 |
| -|---|---|---| |
172 |
| -|`podLifetime`|`time.Duration`|Sets the lifetime value for pods evicted by the `LifecycleAndUtilization` profile| |
173 |
| -|`thresholdPriorityClassName`|`string`|Sets the priority class threshold by name for all strategies| |
174 |
| -|`thresholdPriority`|`string`|Sets the priority class threshold by value for all strategies| |
175 |
| -|`namespaces.included`, `namespaces.excluded`|`[]string`| Sets the included/excluded namespaces for all strategies (included namespaces are not allowed to include protected namespaces which consist of `kube-system`, `hypershift` and all `openshift-` prefixed namespaces)| |
176 |
| -| `devLowNodeUtilizationThresholds` | `string` | Sets experimental thresholds for the [LowNodeUtilization](https://github.com/kubernetes-sigs/descheduler#lownodeutilization) strategy of the `LifecycleAndUtilization` profile in the following ratios: `Low` for 10%:30%, `Medium` for 20%:50%, `High` for 40%:70%| |
177 |
| -|`devEnableEvictionsInBackground`|`bool`| Enables descheduler's EvictionsInBackground alpha feature. The EvictionsInBackground alpha feature is a subject to change. Currently provided as an experimental feature.| |
178 |
| -| `devHighNodeUtilizationThresholds` | `string` | Sets thresholds for the [HighNodeUtilization](https://github.com/kubernetes-sigs/descheduler#highnodeutilization) strategy of the `CompactAndScale` profile in the following ratios: `Minimal` for 10%, `Modest` for 20%, `Moderate` for 30%. Currently provided as an experimental feature.| |
179 |
| -|`devActualUtilizationProfile`|`string`| Sets a profile that gets translated into a predefined prometheus query | |
180 |
| - |
181 |
| -## Prometheus query profiles |
182 |
| -The operator provides the following profiles: |
183 |
| -- `PrometheusCPUUsage`: `instance:node_cpu:rate:sum` (metric available in OpenShift by default) |
184 |
| -- `PrometheusCPUPSIPressure`: `rate(node_pressure_cpu_waiting_seconds_total[1m])` (`node_pressure_cpu_waiting_seconds_total` is a custom metric that needs to be provided) |
185 |
| -- `PrometheusMemoryPSIPressure`: `rate(node_pressure_memory_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is a custom metric that needs to be provided) |
186 |
| -- `PrometheusIOPSIPressure`: `rate(node_pressure_io_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is a custom metric that needs to be provided) |
187 |
| - |
188 |
| -```yaml |
189 |
| -apiVersion: operator.openshift.io/v1 |
190 |
| -kind: KubeDescheduler |
191 |
| -metadata: |
192 |
| - name: cluster |
193 |
| - namespace: openshift-kube-descheduler-operator |
194 |
| -spec: |
195 |
| - managementState: Managed |
196 |
| - deschedulingIntervalSeconds: 3600 |
197 |
| - profiles: |
198 |
| - - LongLifecycle |
199 |
| - profileCustomizations: |
200 |
| - devActualUtilizationProfile: PrometheusCPUUsage |
201 | 15 | ```
|
202 |
| -
|
203 |
| -## Descheduling modes |
204 |
| -The operator provides two modes of eviction: |
205 |
| -- `Predictive`: configures the descheduler to only simulate eviction |
206 |
| -- `Automatic`: configures the descheduler to evict pods |
207 |
| - |
208 |
| -The predictive mode is the default mode. |
209 |
| -The descheduler in either of the modes still produces metrics (unless the metrics are disabled). |
210 |
| -When the predictive mode is configured, the reported metrics can serve as an estimation |
211 |
| -of evicted pods in the cluster. |
212 |
| - |
213 |
| - |
214 |
| -## How does the descheduler operator work? |
215 |
| - |
216 |
| -Descheduler operator at a high level is responsible for watching the above CR |
217 |
| -- Create a configmap that could be used by descheduler. |
218 |
| -- Run descheduler as a deployment mounting the configmap as a policy file in the pod. |
219 |
| - |
220 |
| -The configmap created from above sample CR definition looks like this: |
221 |
| - |
222 |
| -```yaml |
223 |
| -apiVersion: descheduler/v1alpha1 |
224 |
| - kind: DeschedulerPolicy |
225 |
| - strategies: |
226 |
| - RemovePodsViolatingInterPodAntiAffinity: |
227 |
| - enabled: true |
228 |
| - ... |
229 |
| - RemovePodsViolatingNodeAffinity: |
230 |
| - enabled: true |
231 |
| - params: |
232 |
| - ... |
233 |
| - nodeAffinityType: |
234 |
| - - requiredDuringSchedulingIgnoredDuringExecution |
235 |
| - RemovePodsViolatingNodeTaints: |
236 |
| - enabled: true |
237 |
| - ... |
| 16 | +$ cd v4.18 |
| 17 | +$ opm alpha render-template basic catalog-template.json --migrate-level bundle-object-to-csv-metadata > catalog/cluster-kube-descheduler-operator/catalog.json |
238 | 18 | ```
|
239 |
| -(Some generated parameters omitted.) |
240 |
| - |
241 |
| - |
242 |
| -## Parameters |
243 |
| -The Descheduler operator exposes the following parameters in its CRD: |
244 |
| - |
245 |
| -|Name|Type|Description| |
246 |
| -|---|---|---| |
247 |
| -|`deschedulingIntervalSeconds`|`int32`|Sets the number of seconds between descheduler runs| |
248 |
| -|`profiles`|`[]string`|Sets which descheduler strategy profiles are enabled| |
249 |
| -|`profileCustomizations`|`map`|Contains various parameters for modifying the default behavior of certain profiles| |
250 |
| -|`mode`|`string`|Configures the descheduler to either evict pods or to simulate the eviction| |
251 |
| -|`evictionLimits`|`map`|Restrict the number of evictions during each descheduling run. Available fields are: `total`| |
252 |
| -|`evictionLimits.total`|`int32`|Restricts the maximum number of overall evictions| |
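
The "To update the catalog" step added in this change could be wrapped in a small helper so it can be repeated per OCP version directory. The following is only a sketch under stated assumptions: the function name `render_catalog`, the `PACKAGE` and `DRY_RUN` variables, and the temporary-file step are illustrative and not part of the repository. The `opm` invocation and paths are the ones shown in the README, prefixed with the version directory so the helper runs from the repository root instead of after `cd v4.18`.

```sh
# Hypothetical helper (not part of the repository): re-render the FBC
# catalog for one version directory, e.g. v4.18.
PACKAGE="cluster-kube-descheduler-operator"

render_catalog() {
    version_dir="$1"
    template="${version_dir}/catalog-template.json"
    out_dir="${version_dir}/catalog/${PACKAGE}"
    out="${out_dir}/catalog.json"

    # With DRY_RUN=1, only print the command that would be run.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "opm alpha render-template basic ${template} --migrate-level bundle-object-to-csv-metadata > ${out}"
        return 0
    fi

    mkdir -p "${out_dir}"
    # Render to a temporary file first so a failed run does not
    # truncate the existing catalog.json (an added precaution).
    tmp="${out}.tmp"
    opm alpha render-template basic "${template}" \
        --migrate-level bundle-object-to-csv-metadata > "${tmp}" &&
        mv "${tmp}" "${out}"
}
```

For example, `DRY_RUN=1 render_catalog v4.18` prints the command without touching any files, while `render_catalog v4.18` requires `opm` on the `PATH` and an existing `v4.18/catalog-template.json`.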