<!--
Copyright The Shipwright Contributors

SPDX-License-Identifier: Apache-2.0
-->

---
title: build-scheduler-options
authors:
  - "@adambkaplan"
reviewers:
  - "@apoorvajagtap"
  - "@HeavyWombat"
approvers:
  - "@qu1queee"
  - "@SaschaSchwarze0"
creation-date: 2024-05-15
last-updated: 2024-06-20
status: Implementable
see-also: []
replaces: []
superseded-by: []
---

# Build Scheduler Options

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [docs](/docs/)

## Open Questions [optional]

- Should this be enabled always? Should we consider an alpha -> beta lifecycle for this feature? (ex: off by default -> on by default)

## Summary

Add API options that influence where `BuildRun` pods are scheduled on Kubernetes. This can be
accomplished through the following mechanisms:

- [Node Selectors](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)
- [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
- [Custom Schedulers](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)

## Motivation

Today, `BuildRun` pods run on arbitrary nodes - developers, platform engineers, and admins do
not have the ability to control where a specific build pod will be scheduled. Teams may have
several motivations for controlling where a build pod is scheduled:

- Builds can be CPU/memory/storage intensive. Scheduling on larger worker nodes with additional
  memory or compute can help ensure builds succeed.
- Clusters may have multiple worker node architectures and even operating systems (ex: Windows
  nodes). Container images are by their nature specific to the OS and CPU architecture, and
  default to the host operating system and architecture. Builds may need to specify OS and
  architecture through node selectors.
- The default Kubernetes scheduler may not efficiently schedule build workloads - especially
  considering how Tekton implements step containers and sidecars. A custom scheduler optimized for
  Tekton or other batch workloads may lead to better cluster utilization.

### Goals

- Allow build pods to run on specific nodes using node selectors.
- Allow build pods to tolerate node taints.
- Allow build pods to use a custom scheduler.

### Non-Goals

- Primary feature support for multi-arch builds.
- Allow node selection, pod affinity, and taint toleration to be set at the cluster level.
  While this may be desirable, it requires a more sophisticated means of configuring the build
  controller. Setting default values for scheduling options can be considered as a follow-up
  feature.
- Prevent use of build pod scheduling fields. This is best left to an admission controller like
  [OPA Gatekeeper](https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/) or
  [Kyverno](https://kyverno.io/).
- Allow build pods to set node affinity/anti-affinity rules. Affinity/anti-affinity is an
  incredibly rich and complex API (see [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
  for more information). We should strive to provide a simpler interface that is tailored
  specifically to builds. This feature is being dropped to narrow the scope of this SHIP. Build
  affinity rules can/should be addressed in a follow-up feature.

## Proposal

### User Stories

#### Node Selection - platform engineer

As a platform engineer, I want builds to use node selectors to ensure they are scheduled on nodes
optimized for builds so that builds are more likely to succeed.

#### Node Selection - arch-specific images

As a developer, I want to select the OS and architecture of my build's node so that I can run
builds on worker nodes with multiple architectures.

#### Taint toleration - cluster admin

As a cluster admin, I want builds to be able to tolerate provided node taints so that they can
be scheduled on nodes that are not suitable/designated for application workloads.

#### Custom Scheduler

As a platform engineer/cluster admin, I want builds to use a custom scheduler so that I can provide
my own scheduler that is optimized for my build workloads.

### Implementation Notes

#### API Updates

The `BuildSpec` API for Build and BuildRun will be updated to add the following fields:

```yaml
spec:
  ...
  nodeSelector: # map[string]string
    <node-label>: "label-value"
  tolerations: # []Toleration
    - key: "taint-key"
      operator: Exists|Equal
      value: "taint-value"
  schedulerName: "custom-scheduler-name" # string
```

The `nodeSelector` and `schedulerName` fields will use golang primitives that match their k8s
equivalents.

#### Tolerations

The Tolerations API for Shipwright will support a limited subset of the upstream Kubernetes
Tolerations API. For simplicity, any Shipwright Build or BuildRun with a toleration set will use
the `NoSchedule` [taint effect](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

```yaml
spec:
  tolerations: # Optional array
    - key: "taint-key" # Aligns with upstream k8s taint keys. Required.
      operator: Exists|Equal # Aligns with upstream k8s - key exists, or key = value. Required.
      value: "taint-value" # Aligns with upstream k8s taint value. Optional.
```

As with upstream k8s, the Shipwright Tolerations API array should support
[strategic merge JSON patching](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#notes-on-the-strategic-merge-patch).

#### Precedence Ordering and Value Merging

Values in `BuildRun` will override those in the referenced `Build` object (if present). Values for
`nodeSelector` and `tolerations` should use strategic merge logic when possible:

- `nodeSelector` merges using map keys. If the map key is present in both the `Build` and
  `BuildRun` object, the `BuildRun` overrides the value.
- `tolerations` merges using the taint key. If the taint key is present in both the `Build` and
  `BuildRun` object, the `BuildRun` overrides the value.

This allows the `BuildRun` object to "inherit" values from a parent `Build` object.

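As an illustration of these merge semantics (the label and taint keys below are hypothetical, chosen for the example), a `Build` and `BuildRun` pair might combine as follows:

```yaml
# Build (parent)
spec:
  nodeSelector:
    kubernetes.io/arch: amd64
    team: builds
  tolerations:
    - key: dedicated
      operator: Equal
      value: builds
---
# BuildRun (child) - overrides by map key / taint key
spec:
  nodeSelector:
    kubernetes.io/arch: arm64   # overrides the Build's value for this key
  tolerations:
    - key: dedicated
      operator: Exists          # replaces the Build's toleration with the same key
---
# Effective values used for the build pod
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    team: builds                # inherited from the Build
  tolerations:
    - key: dedicated
      operator: Exists
```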
#### Impact on Tekton TaskRun

Tekton supports tuning the pod of the `TaskRun` using the
[podTemplate](https://tekton.dev/docs/pipelines/taskruns/#specifying-a-pod-template) field. When
Shipwright creates the `TaskRun` for a build, the respective node selector, tolerations, and
scheduler name can be passed through.

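A sketch of the pass-through (names and values here are illustrative; `nodeSelector`, `tolerations`, and `schedulerName` are documented fields of Tekton's pod template API):

```yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: my-buildrun-   # hypothetical BuildRun name
spec:
  podTemplate:
    nodeSelector:
      kubernetes.io/arch: arm64
    tolerations:
      - key: dedicated
        operator: Equal
        value: builds
        effect: NoSchedule     # always NoSchedule per this proposal
    schedulerName: custom-scheduler-name
  taskSpec:
    ...
```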
#### Command Line Enhancements

The `shp` CLI _may_ be enhanced to add flags that set the node selector, tolerations, and custom
scheduler for a `BuildRun`. For example, `shp build run` can have the following new options:

- `--node=<key>=<value>`: Use the node label key/value pair in the selector. Can be set more than
  once for multiple key/value pairs.
- `--tolerate=<key>` or `--tolerate=<key>=<value>`: Tolerate the taint key, in one of two ways:
  - First form: the taint key `Exists`.
  - Second form: the taint key is `Equal` to the provided value.
  - In both cases, this flag can be set more than once.
- `--scheduler=<name>`: Use the custom scheduler with the given name. Can only be set once.

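If implemented, the proposed flags might translate into a `BuildRun` spec along these lines (flag names are proposals from this SHIP, not shipped CLI options):

```yaml
# shp build run my-build \
#   --node=kubernetes.io/arch=arm64 \
#   --tolerate=dedicated=builds \
#   --scheduler=custom-scheduler-name
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: dedicated
      operator: Equal
      value: builds
  schedulerName: custom-scheduler-name
```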
#### Hardening Guidelines

Exposing `nodeSelector` and `tolerations` to end developers adds risk with respect to overall
system availability. Some platform teams may not want these Kubernetes internals exposed or
modifiable by end developers at all. To address these concerns, a hardening guideline for
Shipwright Builds should also be published alongside documentation for this feature. This guideline
should recommend the use of third party admission controllers (ex: OPA, Kyverno) to prevent builds
from using values that impact system availability and performance. For example:

- Block toleration of `node.kubernetes.io/*` taints. These are reserved for nodes that are not
  ready to receive workloads for scheduling.
- Block node selectors with the `node-role.kubernetes.io/control-plane` label key. This is reserved
  for control plane components (`kube-apiserver`, `kube-controller-manager`, etc.).
- Block toleration of the `node-role.kubernetes.io/control-plane` taint key. Same as above.

See the [well known labels](https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane)
documentation for more information.

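As a sketch only (the policy name is hypothetical, and Kyverno's pattern anchor syntax should be verified against its documentation before use), a rule blocking control plane node selection might look like:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-build-scheduling   # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-control-plane-node-selector
      match:
        any:
          - resources:
              kinds:
                - Build
                - BuildRun
      validate:
        message: "Builds may not be scheduled on control plane nodes."
        pattern:
          spec:
            # =() makes the check conditional on nodeSelector being set;
            # X() denies the presence of the control plane label key.
            =(nodeSelector):
              X(node-role.kubernetes.io/control-plane): "null"
```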
### Test Plan

- Unit testing can verify that the generated `TaskRun` object for a build contains the desired pod
  template fields.
- End to end tests using `KinD` are possible for the `nodeSelector` and `tolerations` fields:
  - KinD has support for configuring multiple [nodes](https://kind.sigs.k8s.io/docs/user/configuration/#nodes).
  - Once set up, KinD nodes can simulate real nodes when it comes to pod scheduling, node labeling,
    and node taints.
- End to end testing for the `schedulerName` field requires the deployment of a custom scheduler,
  plus code to verify that the given scheduler was used. This is non-trivial (see the
  [upstream example](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods))
  and adds a potential failure point to the test suite. Relying on unit testing alone is our best
  option.

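The multi-node e2e setup described above could use a KinD configuration along these lines (the worker label is hypothetical; taints could alternatively be applied after cluster creation with `kubectl taint`):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    labels:
      build-node: "true"   # hypothetical label targeted by nodeSelector tests
  - role: worker
# After creation, taint a worker to exercise the tolerations field, e.g.:
#   kubectl taint nodes <node-name> dedicated=builds:NoSchedule
```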
### Release Criteria

TBD

**Note:** *Section not required until targeted at a release.*

#### Removing a deprecated feature [if necessary]

Not applicable.

#### Upgrade Strategy [if necessary]

The top-level API fields will be optional and default to Golang empty values.
On upgrade, these values will remain empty on existing `Build`/`BuildRun` objects.

### Risks and Mitigations

**Risk:** The node selector field allows disruptive workloads (builds) to be scheduled on control
plane nodes.

*Mitigation*: A hardening guideline is added as a requirement for this feature. There may be some
cluster topologies (ex: single node clusters) where scheduling builds on the "control plane" is not
only desirable, but necessary. Hardening guidelines referencing third party admission controllers
preserve flexibility while giving cluster administrators/platform teams the knowledge needed to
harden their environments as they see fit.

## Drawbacks

Exposing these fields leaks - to a certain extent - our abstraction over Kubernetes. This proposal
places k8s pod scheduling fields up front in the API for `Build` and `BuildRun`, a deviation from
Tekton, which exposes the fields through a `PodTemplate` sub-field. Cluster administrators may not
want end developers to have control over where these pods are scheduled - they may instead wish to
control pod scheduling through Tekton's
[default pod template](https://github.com/tektoncd/pipeline/blob/main/docs/podtemplates.md#supported-fields)
mechanism at the controller level.

Exposing `nodeSelector` may also conflict with future enhancements to support
[multi-architecture image builds](https://github.com/shipwright-io/build/issues/1119). A
hypothetical build that fans out individual image builds to nodes with desired OS/architecture
pairs may need to explicitly set the `kubernetes.io/os` and `kubernetes.io/arch` node selector
keys on generated `TaskRuns`. With that said, there is currently no mechanism for Shipwright to
control where builds execute on clusters with multiple worker node architectures and operating
systems.

## Alternatives

An earlier draft of this proposal included `affinity` for setting pod affinity/anti-affinity rules.
This was rejected due to the complexities of Kubernetes pod affinity and anti-affinity. We need
more concrete user stories from the community to understand what - if anything - we should do with
respect to distributing build workloads through affinity rules. This may also conflict with
Tekton's [affinity assistant](https://tekton.dev/docs/pipelines/affinityassistants/) feature - an
optional configuration that is enabled by default in upstream Tekton.

An earlier draft also included the ability to set default values for these fields at the cluster
level. This would be similar to Tekton's capability with the Pipeline controller configuration.
Since this option is available at the Tekton pipeline level, adding nearly identical features to
Shipwright is being deferred. Tuning pod template values with the Tekton pipeline controller may
also be an acceptable alternative to this feature in some circumstances.

## Infrastructure Needed [optional]

No additional infrastructure anticipated.
Test KinD clusters may need to deploy with additional nodes where these features can be verified.

## Implementation History

- 2024-05-15: Created as `provisional`
- 2024-06-20: Draft updated to `implementable`