TL;DR
You can't use the perf collector in Kubernetes without root. Adding CAP_PERFMON to the container is not enough; the container needs a file that carries the CAP_PERFMON file capability. Giving the node_exporter binary CAP_PERFMON works as expected.
Desired outcome
Enable perf collector for instructions and cycles metrics while running node-exporter:
- in kubernetes (1.32, cri-o 1.32)
- as non-root (e.g. nobody)
- with CAP_PERFMON
In other words: use minimal privileges for node_exporter.
The sysctl change (kernel.perf_event_paranoid) is still required; however, a value of 1 is sufficient (no unprivileged access is needed; higher values may work, but I haven't tested this yet).
Testing the status quo
I am testing on Ubuntu 24.04 LTS with Kubernetes and cri-o 1.32 from the official deb packages, set up via kubeadm with default settings. The only adjustments are setting the perf_event_paranoid sysctl to 1 and disabling swap.
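For reference, a minimal sketch of how the sysctl can be inspected and changed on the node. The helper name `read_paranoid` and the drop-in file name are illustrative assumptions, not something node_exporter or the distribution prescribes.

```shell
# read_paranoid [FILE]: print the current perf_event_paranoid level.
# FILE defaults to the kernel's procfs entry; reading it needs no root.
read_paranoid() {
    cat "${1:-/proc/sys/kernel/perf_event_paranoid}"
}

if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    echo "kernel.perf_event_paranoid = $(read_paranoid)"
fi

# To change it (as root on the node, not inside the container):
#   sysctl -w kernel.perf_event_paranoid=1
# To persist it across reboots (the file name is an arbitrary choice):
#   echo 'kernel.perf_event_paranoid = 1' > /etc/sysctl.d/99-perf.conf
#   sysctl --system
```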
The test DaemonSet YAML (derived from the community Helm chart) is:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          imagePullPolicy: Always
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.disable-defaults
            - --collector.perf
            - --collector.perf.hardware-profilers=CpuCycles,CpuInstr,CacheRef,CacheMisses,BranchInstr,BranchMisses,StalledCyclesBackend,StalledCyclesFrontend,RefCpuCycles
            - --collector.perf.software-profilers=PageFault,ContextSwitch,CpuMigration,MinorFault,MajorFault
            - --collector.perf.cache-profilers=InstrTLBReadHit
          securityContext:
            capabilities:
              add: [ "PERFMON" ]
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
The tolerations are needed to run on the local single control plane node.
Capabilities as non-root
This setup leads to the following permissions:
# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid: 65534 65534 65534 65534
Gid: 65534 65534 65534 65534
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs: 0
The privileges are locked down: the bounding set includes CAP_PERFMON (00000040000005fb), and the container is running as non-root (nobody / 65534).
However, the permitted and effective sets are empty, so node_exporter cannot export perf collector metrics: the process has no way to bring CAP_PERFMON into effect.
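The hex masks can be checked by hand: CAP_PERFMON is capability number 38 in linux/capability.h, so it corresponds to the bit 0x4000000000. A small sketch using shell arithmetic, where `has_cap` is a hypothetical helper name:

```shell
# CAP_PERFMON is capability number 38 (see linux/capability.h).
CAP_PERFMON=38

# has_cap MASK BIT: succeed if bit BIT is set in the hex MASK (no 0x prefix),
# as printed in the Cap* lines of /proc/<pid>/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Values taken from the non-root status output.
if has_cap 00000040000005fb "$CAP_PERFMON"; then
    echo "CapBnd: CAP_PERFMON present"
fi
if ! has_cap 0000000000000000 "$CAP_PERFMON"; then
    echo "CapEff: CAP_PERFMON missing"
fi
```

Where libcap's tools are installed, `capsh --decode=00000040000005fb` should give a similar breakdown by name.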
Capabilities as root
Adding a security context that runs the pod as root
securityContext:
  runAsNonRoot: false
  runAsUser: 0
leads to the following permissions:
# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid: 0 0 0 0
Gid: 0 0 0 0
CapInh: 0000000000000000
CapPrm: 00000040000005fb
CapEff: 00000040000005fb
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Note that the effective capabilities are now in line with the bounding set. perf collector metrics work.
No capabilities and root
Changing the container security context to
securityContext:
  capabilities:
    add: [ ]
and keeping root leads to:
# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid: 0 0 0 0
Gid: 0 0 0 0
CapInh: 0000000000000000
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
NoNewPrivs: 0
There are no perf collector metrics. root in a container alone is not enough to get the metrics (this is good).
Results of the status quo
- CAP_PERFMON makes it possible to fetch performance metrics, even under restricted (<4, >0) conditions
- root is required to get CAP_PERFMON into effect
Overcoming root
What we have:
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb
What we want:
CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb
man capabilities has extensive documentation on how CapPrm and CapEff are calculated when executing a process.
There are two ways in which we could gain CAP_PERFMON:
- Set a file capability (setcap cap_perfmon=ep node_exporter)
- Use a wrapper that configures the ambient and inheritable sets and then executes the binary
TL;DR: non-root usage requires some file with CAP_PERFMON permissions
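The exec-time rules from capabilities(7) can be sketched in shell arithmetic to see why this is the case. This is a simplified model for a non-root process only: it treats a file as privileged whenever one of its capability sets is non-empty and ignores set-user-ID handling. `exec_caps` is a hypothetical helper name.

```shell
# exec_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
# Applies the execve(2) rules from capabilities(7) for a non-root process:
#   P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(amb)
#   P'(effective) = F(eff) ? P'(permitted) : P'(amb)
# Masks are hex strings without 0x; F_EFF is 0 or 1.
exec_caps() {
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6
    # Simplification: a file with capabilities clears the ambient set.
    if [ "$f_inh" -ne 0 ] || [ "$f_prm" -ne 0 ]; then p_amb=0; fi
    prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | p_amb ))
    # Safety check: with the file effective bit set, execve fails with EPERM
    # unless every file permitted capability was actually obtained.
    if [ "$f_eff" -eq 1 ] && [ $(( prm & f_prm )) -ne "$f_prm" ]; then
        echo "EPERM"; return 1
    fi
    if [ "$f_eff" -eq 1 ]; then eff=$prm; else eff=$p_amb; fi
    printf 'CapPrm: %016x\nCapEff: %016x\n' "$prm" "$eff"
}

BND=00000040000005fb
PERFMON=0000004000000000
# Plain binary, non-root: nothing is gained from the bounding set alone.
exec_caps 0 "$BND" 0 0 0 0
# Binary with cap_perfmon=ep: CAP_PERFMON becomes permitted and effective.
exec_caps 0 "$BND" 0 0 "$PERFMON" 1
# Wrapper that raised inheritable and ambient before exec: same result.
exec_caps "$PERFMON" "$BND" "$PERFMON" 0 0 0
# cap_perfmon=ep binary without CAP_PERFMON in the bounding set: exec fails.
exec_caps 0 00000000000005fb 0 0 "$PERFMON" 1 || true
```

The wrapper case only works if the wrapper already holds CAP_PERFMON in its permitted set, which for a non-root user again means a file capability on the wrapper itself. For root, capabilities(7) treats the file inheritable and permitted sets as all ones, which is why CapPrm equals CapBnd in the root runs.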
Note that busybox does not support the PERFMON capability, at least not by name. I also could not find any binaries in /bin/ that would have that capability or special privileges (overall good).
Let's try the file capability approach instead by building a node-exporter:perfmon image.
# cat Dockerfile
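# Global build arguments, usable in the FROM lines below.
ARG ARCH="amd64"
ARG OS="linux"
# Stage 0: upstream image, used only as the source of the binary.
FROM quay.io/prometheus/node-exporter:latest
# Stage 1: attach the CAP_PERFMON file capability using setcap from libcap2-bin.
FROM ubuntu:24.04
COPY --from=0 /bin/node_exporter /srv/node_exporter
RUN apt-get update && apt-get install -y libcap2-bin && \
    setcap cap_perfmon=ep /srv/node_exporter
# Final stage: minimal busybox base with the patched binary.
FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
COPY --from=1 /srv/node_exporter /bin/node_exporter
EXPOSE 9100
USER nobody
ENTRYPOINT [ "/bin/node_exporter" ]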
# podman build -t localhost/node-exporter:perfmon .
Running the patched image as non-root leads to the following permissions:
# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid: 65534 65534 65534 65534
Gid: 65534 65534 65534 65534
CapInh: 0000000000000000
CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs: 0
as well as working perf metrics.
Removing CAP_PERFMON from the container's capabilities would lead to a startup error.
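To diagnose such a startup failure, one option is to check whether CAP_PERFMON (bit 38) is in the bounding set before the binary is executed. The following is only a sketch: `bounding_has_perfmon` is a hypothetical helper, and it assumes a shell with awk is available in the container.

```shell
# bounding_has_perfmon [STATUS_FILE]: succeed if CAP_PERFMON (bit 38) is in
# the bounding set listed in STATUS_FILE (default: /proc/self/status).
bounding_has_perfmon() {
    bnd=$(awk '/^CapBnd:/ { print $2 }' "${1:-/proc/self/status}")
    [ -n "$bnd" ] && [ $(( (0x$bnd >> 38) & 1 )) -eq 1 ]
}

if bounding_has_perfmon; then
    echo "CAP_PERFMON is in the bounding set"
else
    echo "CAP_PERFMON is not in the bounding set" >&2
fi
```

A check like this does not make the binary start; it only turns an opaque exec failure into a readable message.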
Question / Path forward
How should we proceed with this? I am happy to help, but what would be the desired path forward?
- CAP_PERFMON was added with Linux 5.8 (2020), older kernels with long term support exist
- This may exclude using CAP_PERFMON by default or shipping images with this enabled by default (TODO: test this)
- Startup failures without CAP_PERFMON would be annoying for users :-/
- Running node_exporter as non-root with working perf metrics seems highly desirable as instructions per second and instructions per cycle are important performance metrics
- Usefulness might be limited if the sysctl change is always needed....