Commit 76ea8df

Add recommendations for Kubernetes cluster administrators
This change comes out of a recent discussion that took place on Slack between Alex Fortune, Benjamin Ingberg, Gabriel Figueira and me about bb_runner getting killed by the OOM killer. It turns out recent versions of Kubernetes require a change to Kubelet's configuration to prevent it from killing cgroups as a whole.
1 parent 67f7995 commit 76ea8df

1 file changed: +23 -0 lines changed

kubernetes/README.md

Lines changed: 23 additions & 0 deletions
@@ -9,3 +9,26 @@ kubectl apply -k .

These files assume that the cluster needs to be created in the
`buildbarn` namespace. Storage is backed by persistent volumes.

## Recommendations for cluster operators

It is desirable to set the
[CPU requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
for bb_runner containers to a fixed value, so that the running times of
actions remain consistent. However, even if these are set, functions
like Python's [`os.process_cpu_count()`](https://docs.python.org/3/library/os.html#os.process_cpu_count)
and Go's [`runtime.NumCPU()`](https://pkg.go.dev/runtime#NumCPU) may
report the number of CPU cores present on the system itself. This causes
applications that launch thread pools based on CPU core count to exhibit
heavy throttling. Cluster operators are therefore advised to enable the
[`static` CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-configuration),
so that bb_runner containers can be assigned to dedicated CPU cores.
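
As a rough sketch of the recommendation above, a Kubelet configuration
fragment that enables the `static` policy might look like the following.
The `reservedSystemCPUs` value is an assumption made for illustration (the
static policy requires that some CPU reservation is configured); it is not
part of this change and should be adapted to the node.

```yaml
# Fragment of a Kubelet configuration file; merge into the node's existing config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Assumption: reserve core 0 for system daemons. The static policy
# refuses to start without a non-zero CPU reservation.
reservedSystemCPUs: "0"
```

The static policy only hands out exclusive cores to containers that request
a whole number of CPUs and belong to a Pod in the Guaranteed QoS class,
which means every container in the Pod needs requests equal to limits for
both CPU and memory. A hypothetical `resources` stanza for the bb_runner
container, with placeholder sizes, could look like this:

```yaml
# Fragment of a container spec; the sizes are placeholders, not recommendations.
resources:
  requests:
    cpu: "4"        # integer CPU count, required for exclusive cores
    memory: 8Gi
  limits:
    cpu: "4"        # must equal the request
    memory: 8Gi     # must equal the request
```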

Recent versions of Kubernetes have migrated to
[cgroups v2](https://kubernetes.io/docs/concepts/architecture/cgroups/).
This caused a subtle change where if a process causes a container to
reach its memory limit, all processes belonging to that container are
killed by the Out Of Memory (OOM) killer. For Buildbarn this is
problematic, as it means that bb_runner gets killed as well. Kubernetes
1.32 and later make it possible to restore the old behavior by enabling
[the `singleProcessOOMKill` option in the Kubelet configuration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/).
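
For illustration, a minimal Kubelet configuration fragment with this option
enabled might look as follows. This is a sketch that assumes Kubernetes 1.32
or later on a cgroups v2 node; how the file is delivered to the nodes depends
on the cluster's provisioning tooling and is not covered here.

```yaml
# Fragment of a Kubelet configuration file; merge into the node's existing config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Kill only the process that exceeded the memory limit instead of
# every process in the container's cgroup.
singleProcessOOMKill: true
```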
