Commit a3c3876
feat(supervisor): add node affinity rules for large machine worker pool scheduling (#2869)
**Background**
Runs with `large-1x` or `large-2x` machine presets are disproportionately
affected by scheduling delays during peak times. This is in part caused
by the fact that the worker pool is shared for all runs, meaning large
runs compete with smaller runs for available capacity. Because large
runs require significantly more CPU and memory, they are harder for the
scheduler to bin-pack onto existing nodes, often requiring a node with a
significant amount of free resources or waiting for a new node to spin
up entirely. This effect is amplified during peak times when nodes are
already densely packed with smaller workloads, leaving insufficient
contiguous resources for large runs. Also, large runs make up a small
percentage of the total runs.
**Changes**
This PR adds Kubernetes node affinity settings to separate large and
standard machine workloads across node pools.
- Controlled via `KUBERNETES_LARGE_MACHINE_POOL_LABEL` env var (disabled
when not set)
- Large machine presets (`large-*`) get a soft preference to schedule on
the large pool, with fallback to standard nodes
- Non-large machines are excluded from the large pool via required
anti-affinity
- This ensures the large machine pool is reserved for large workloads
while allowing large workloads to spill over to standard nodes if needed

1 parent 7a94908 commit a3c3876
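The affinity rules described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `parsePoolLabel` and `buildNodeAffinity`, the `key=value` format of `KUBERNETES_LARGE_MACHINE_POOL_LABEL`, the weight of 100, and the example label are all assumptions. Only the field names (`preferredDuringSchedulingIgnoredDuringExecution`, `requiredDuringSchedulingIgnoredDuringExecution`, `nodeSelectorTerms`, `matchExpressions`) come from the Kubernetes pod spec.

```typescript
// Local types mirroring the relevant part of the Kubernetes NodeAffinity schema.
type NodeSelectorRequirement = { key: string; operator: "In" | "NotIn"; values: string[] };
type NodeSelectorTerm = { matchExpressions: NodeSelectorRequirement[] };
type NodeAffinity = {
  preferredDuringSchedulingIgnoredDuringExecution?: { weight: number; preference: NodeSelectorTerm }[];
  requiredDuringSchedulingIgnoredDuringExecution?: { nodeSelectorTerms: NodeSelectorTerm[] };
};

// Assumption: the env var value is a single node label written as "key=value".
function parsePoolLabel(raw: string | undefined): { key: string; value: string } | undefined {
  if (!raw) return undefined;
  const idx = raw.indexOf("=");
  if (idx <= 0 || idx === raw.length - 1) return undefined;
  return { key: raw.slice(0, idx), value: raw.slice(idx + 1) };
}

// Hypothetical helper: returns the node affinity for a machine preset,
// or undefined when the pool label is not set (feature disabled).
function buildNodeAffinity(presetName: string, rawLabel: string | undefined): NodeAffinity | undefined {
  const label = parsePoolLabel(rawLabel);
  if (!label) return undefined;

  const { key, value } = label;

  if (presetName.startsWith("large-")) {
    // Soft preference: favor the large pool, but allow fallback to standard nodes.
    return {
      preferredDuringSchedulingIgnoredDuringExecution: [
        { weight: 100, preference: { matchExpressions: [{ key, operator: "In", values: [value] }] } },
      ],
    };
  }

  // Hard rule: non-large presets must not land on nodes carrying the large pool label.
  return {
    requiredDuringSchedulingIgnoredDuringExecution: {
      nodeSelectorTerms: [{ matchExpressions: [{ key, operator: "NotIn", values: [value] }] }],
    },
  };
}
```

With a hypothetical label such as `node-pool=large`, `buildNodeAffinity("large-1x", "node-pool=large")` yields only a preferred term, while `buildNodeAffinity("small-1x", "node-pool=large")` yields only a required `NotIn` term. The asymmetry matters: a required rule on large runs would leave them unschedulable whenever the large pool is full, whereas the soft preference lets them spill over to standard nodes.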
2 files changed: 53 additions, 0 deletions

(Diff content not captured. Per the line-number columns, the first file gains 1 line at new line 94; the second file gains 1 line at new line 98 and 51 lines at new lines 360-410.)