Commit afea4f8 (2 parents: cd27374 + f17810f)

Merge pull request #205 from adambkaplan/ship-0039-provisional

SHIP-0039: Build Scheduler Options

1 file changed: ships/0039-build-scheduler-opts.md (+282 lines)

<!--
Copyright The Shipwright Contributors

SPDX-License-Identifier: Apache-2.0
-->

---
title: build-scheduler-options
authors:
  - "@adambkaplan"
reviewers:
  - "@apoorvajagtap"
  - "@HeavyWombat"
approvers:
  - "@qu1queee"
  - "@SaschaSchwarze0"
creation-date: 2024-05-15
last-updated: 2024-06-20
status: Implementable
see-also: []
replaces: []
superseded-by: []
---

# Build Scheduler Options

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [docs](/docs/)

## Open Questions [optional]

- Should this always be enabled? Should we consider an alpha -> beta lifecycle for this feature? (ex: off by default -> on by default)

## Summary

Add API options that influence where `BuildRun` pods are scheduled on Kubernetes. This can be
accomplished through the following mechanisms:

- [Node Selectors](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)
- [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
- [Custom Schedulers](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)

## Motivation

Today, `BuildRun` pods run on arbitrary nodes - developers, platform engineers, and admins do
not have the ability to control where a specific build pod will be scheduled. Teams may have
several motivations for controlling where a build pod is scheduled:

- Builds can be CPU/memory/storage intensive. Scheduling on larger worker nodes with additional
  memory or compute can help ensure builds succeed.
- Clusters may have multiple worker node architectures and even operating systems (ex: Windows
  nodes). Container images are by their nature specific to the OS and CPU architecture, and
  default to the host operating system and architecture. Builds may need to specify OS and
  architecture through node selectors.
- The default Kubernetes scheduler may not efficiently schedule build workloads - especially
  considering how Tekton implements step containers and sidecars. A custom scheduler optimized for
  Tekton or other batch workloads may lead to better cluster utilization.

### Goals

- Allow build pods to run on specific nodes using node selectors.
- Allow build pods to tolerate node taints.
- Allow build pods to use a custom scheduler.

### Non-Goals

- Primary feature support for multi-arch builds.
- Allow node selection, pod affinity, and taint toleration to be set at the cluster level.
  While this may be desirable, it requires a more sophisticated means of configuring the build
  controller. Setting default values for scheduling options can be considered as a follow-up
  feature.
- Prevent use of build pod scheduling fields. This is best left to an admission controller like
  [OPA Gatekeeper](https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/) or
  [Kyverno](https://kyverno.io/).
- Allow build pods to set node affinity/anti-affinity rules. Affinity/anti-affinity is an
  incredibly rich and complex API (see [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
  for more information). We should strive to provide a simpler interface that is tailored
  specifically to builds. This feature is being dropped to narrow the scope of this SHIP. Build
  affinity rules can/should be addressed in a follow-up feature.

## Proposal

### User Stories

#### Node Selection - platform engineer

As a platform engineer, I want builds to use node selectors to ensure they are scheduled on nodes
optimized for builds so that builds are more likely to succeed.

#### Node Selection - arch-specific images

As a developer, I want to select the OS and architecture of my build's node so that I can run
builds on clusters with worker nodes of multiple architectures.

#### Taint toleration - cluster admin

As a cluster admin, I want builds to be able to tolerate provided node taints so that they can
be scheduled on nodes that are not suitable/designated for application workloads.

#### Custom Scheduler

As a platform engineer/cluster admin, I want builds to use a custom scheduler so that I can provide
my own scheduler that is optimized for my build workloads.

### Implementation Notes

#### API Updates

The `BuildSpec` API for Build and BuildRun will be updated to add the following fields:

```yaml
spec:
  ...
  nodeSelector: # map[string]string
    <node-label>: "label-value"
  tolerations: # []Toleration
  - key: "taint-key"
    operator: Exists|Equal
    value: "taint-value"
  schedulerName: "custom-scheduler-name" # string
```

The `nodeSelector` and `schedulerName` fields will use golang primitives that match their k8s
equivalents.

#### Tolerations

The Tolerations API for Shipwright will support a limited subset of the upstream Kubernetes
Tolerations API. For simplicity, any Shipwright Build or BuildRun with a toleration set will use
the `NoSchedule` [taint effect](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

```yaml
spec:
  tolerations: # Optional array
  - key: "taint-key" # Aligns with upstream k8s taint labels. Required.
    operator: Exists|Equal # Aligns with upstream k8s - key exists or node label key = value. Required.
    value: "taint-value" # Aligns with upstream k8s taint value. Optional.
```

As with upstream k8s, the Shipwright Tolerations API array should support
[strategic merge JSON patching](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#notes-on-the-strategic-merge-patch).

#### Precedence Ordering and Value Merging

Values in `BuildRun` will override those in the referenced `Build` object (if present). Values for
`nodeSelector` and `tolerations` should use strategic merge logic when possible:

- `nodeSelector` merges using map keys. If the map key is present in the `Build` and `BuildRun`
  object, the `BuildRun` overrides the value.
- `tolerations` merges using the taint key. If the taint key is present in the `Build` and
  `BuildRun` object, the `BuildRun` overrides the value.

This allows the `BuildRun` object to "inherit" values from a parent `Build` object.

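The merge rules above can be sketched in Go (function names are hypothetical, not the controller's actual implementation):

```go
package main

import "fmt"

// Toleration is a minimal stand-in for the proposed Shipwright toleration type.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
}

// mergeNodeSelector merges map entries by key; BuildRun entries override
// Build entries, and all other Build entries are inherited.
func mergeNodeSelector(build, buildRun map[string]string) map[string]string {
	merged := make(map[string]string, len(build)+len(buildRun))
	for k, v := range build {
		merged[k] = v
	}
	for k, v := range buildRun {
		merged[k] = v
	}
	return merged
}

// mergeTolerations merges by taint key; a BuildRun toleration replaces a
// Build toleration with the same key.
func mergeTolerations(build, buildRun []Toleration) []Toleration {
	overridden := make(map[string]bool, len(buildRun))
	for _, t := range buildRun {
		overridden[t.Key] = true
	}
	merged := make([]Toleration, 0, len(build)+len(buildRun))
	for _, t := range build {
		if !overridden[t.Key] {
			merged = append(merged, t)
		}
	}
	return append(merged, buildRun...)
}

func main() {
	sel := mergeNodeSelector(
		map[string]string{"kubernetes.io/arch": "amd64", "disktype": "ssd"},
		map[string]string{"kubernetes.io/arch": "arm64"},
	)
	fmt.Println(sel["kubernetes.io/arch"], sel["disktype"]) // arm64 ssd
}
```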
#### Impact on Tekton TaskRun

Tekton supports tuning the pod of the `TaskRun` using the
[podTemplate](https://tekton.dev/docs/pipelines/taskruns/#specifying-a-pod-template) field. When
Shipwright creates the `TaskRun` for a build, the respective node selector, tolerations, and
scheduler name can be passed through.

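For example, a BuildRun that sets all three fields could yield a `TaskRun` along these lines (a sketch with hypothetical values; Tekton's pod template documentation lists the actual supported fields):

```yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: example-buildrun-xyz # hypothetical generated name
spec:
  # (taskRef, params, workspaces, etc. omitted)
  podTemplate:
    nodeSelector:
      kubernetes.io/arch: arm64
    tolerations:
    - key: "taint-key"
      operator: Equal
      value: "taint-value"
      effect: NoSchedule # always NoSchedule, per the Tolerations section above
    schedulerName: "custom-scheduler-name"
```
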
#### Command Line Enhancements

The `shp` CLI _may_ be enhanced to add flags that set the node selector, tolerations, and custom
scheduler for a `BuildRun`. For example, `shp build run` can have the following new options:

- `--node=<key>=<value>`: Use the node label key/value pair in the selector. Can be set more than
  once for multiple key/value pairs.
- `--tolerate=<key>` or `--tolerate=<key>=<value>`: Tolerate the taint key, in one of two ways:
  - First form: taint key `Exists`.
  - Second form: taint key `Equal` to the provided value.
  - In both cases, this flag can be set more than once.
- `--scheduler=<name>`: Use the custom scheduler with the given name. Can only be set once.

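A combined invocation might look like the following (hypothetical; none of these flags exist in `shp` yet, and the build name and values are illustrative):

```shell
shp build run my-build \
  --node=kubernetes.io/arch=arm64 \
  --tolerate=dedicated=builds \
  --scheduler=custom-scheduler-name
```
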
#### Hardening Guidelines

Exposing `nodeSelector` and `tolerations` to end developers adds risk with respect to overall
system availability. Some platform teams may not want these Kubernetes internals exposed or
modifiable by end developers at all. To address these concerns, a hardening guideline for
Shipwright Builds should also be published alongside documentation for this feature. This guideline
should recommend the use of third-party admission controllers (ex: OPA Gatekeeper, Kyverno) to
prevent builds from using values that impact system availability and performance. For example:

- Block toleration of `node.kubernetes.io/*` taints. These are reserved for nodes that are not
  ready to receive workloads for scheduling.
- Block node selectors with the `node-role.kubernetes.io/control-plane` label key. This is reserved
  for control plane components (`kube-apiserver`, `kube-controller-manager`, etc.).
- Block toleration of the `node-role.kubernetes.io/control-plane` taint key. Same as above.

See the [well known labels](https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane)
documentation for more information.

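As a rough illustration only (the exact syntax should be verified against Kyverno's documentation; the policy name and message are invented), a Kyverno policy denying control-plane node selectors might look like:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-build-scheduling
spec:
  validationFailureAction: Enforce
  rules:
  - name: block-control-plane-node-selector
    match:
      any:
      - resources:
          kinds:
          - Build
          - BuildRun
    validate:
      message: "Builds may not be scheduled on control plane nodes."
      pattern:
        spec:
          # If nodeSelector is present, the control-plane label key must not be.
          =(nodeSelector):
            X(node-role.kubernetes.io/control-plane): "null"
```
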
### Test Plan

- Unit testing can verify that the generated `TaskRun` object for a build contains the desired pod
  template fields.
- End to end tests using `KinD` are possible for the `nodeSelector` and `tolerations` fields:
  - KinD has support for configuring multiple [nodes](https://kind.sigs.k8s.io/docs/user/configuration/#nodes).
  - Once set up, KinD nodes can simulate real nodes when it comes to pod scheduling, node labeling,
    and node taints.
- End to end testing for the `schedulerName` field requires the deployment of a custom scheduler,
  plus code to verify that the given scheduler was used. This is non-trivial (see
  [upstream example](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/#specify-schedulers-for-pods))
  and adds a potential failure point to the test suite. Relying on unit testing alone is our best
  option.

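For reference, a multi-node KinD cluster with a labeled worker can be declared like this (a sketch; the `disktype` label is an illustrative value for scheduling tests):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  labels:
    disktype: ssd # hypothetical label to target with a nodeSelector
- role: worker
```
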
### Release Criteria

TBD

**Note:** *Section not required until targeted at a release.*

#### Removing a deprecated feature [if necessary]

Not applicable.

#### Upgrade Strategy [if necessary]

The top-level API fields will be optional and default to Golang empty values.
On upgrade, these values will remain empty on existing `Build`/`BuildRun` objects.

### Risks and Mitigations

**Risk:** Node selector field allows disruptive workloads (builds) to be scheduled on control plane
nodes.

*Mitigation:* Hardening guideline added as a requirement for this feature. There may be some
cluster topologies (ex: single node clusters) where scheduling builds on the "control plane" is not
only desirable, but necessary. Hardening guidelines referencing third-party admission controllers
preserve flexibility while giving cluster administrators/platform teams the knowledge needed to
harden their environments as they see fit.

## Drawbacks

Exposing these fields leaks - to a certain extent - our abstraction over Kubernetes. This proposal
places k8s pod scheduling fields up front in the API for `Build` and `BuildRun`, a deviation from
Tekton, which exposes the fields through a `PodTemplate` sub-field. Cluster administrators may not
want end developers to have control over where these pods are scheduled - they may instead wish to
control pod scheduling through Tekton's
[default pod template](https://github.com/tektoncd/pipeline/blob/main/docs/podtemplates.md#supported-fields)
mechanism at the controller level.

Exposing `nodeSelector` may also conflict with future enhancements to support
[multi-architecture image builds](https://github.com/shipwright-io/build/issues/1119). A
hypothetical build that fans out individual image builds to nodes with desired OS/architecture
pairs may need to explicitly set the `kubernetes.io/os` and `kubernetes.io/arch` node selector
labels on generated `TaskRuns`. With that said, there is currently no mechanism for Shipwright to
control where builds execute on clusters with multiple worker node architectures and operating
systems.

## Alternatives

An earlier draft of this proposal included `affinity` for setting pod affinity/anti-affinity rules.
This was rejected due to the complexities of Kubernetes pod affinity and anti-affinity. We need
more concrete user stories from the community to understand what - if anything - we should do with
respect to distributing build workloads through affinity rules. This may also conflict with
Tekton's [affinity assistant](https://tekton.dev/docs/pipelines/affinityassistants/) feature - an
optional configuration that is enabled by default in upstream Tekton.

An earlier draft also included the ability to set default values for these fields at the cluster
level. This would be similar to Tekton's capability with the Pipeline controller configuration.
Since this option is available at the Tekton pipeline level, adding nearly identical features to
Shipwright is being deferred. Tuning pod template values with the Tekton pipeline controller may
also be an acceptable alternative to this feature in some circumstances.

## Infrastructure Needed [optional]

No additional infrastructure is anticipated.
Test KinD clusters may need to be deployed with additional nodes where these features can be
verified.

## Implementation History

- 2024-05-15: Created as `provisional`
- 2024-06-20: Draft updated to `implementable`
