node_exporter OCI images can't run as non-root and use CAP_PERFMON in kubernetes #3477

@rtreffer

TL;DR

You can't use the perf collector in kubernetes without root when running the official image. Adding CAP_PERFMON to the container is not enough: a non-root process only gains the capability if the executable carries it as a file capability. Giving the node_exporter binary CAP_PERFMON works as expected.

Desired outcome

Enable perf collector for instructions and cycles metrics while running node-exporter:

  • in kubernetes (1.32, cri-o 1.32)
  • as non-root (e.g. nobody)
  • with CAP_PERFMON

In other words: use minimal privileges for node_exporter.
The perf_event_paranoid sysctl change is still required, but a value of 1 is sufficient (no unprivileged access is needed; higher values may work, but I haven't tested this yet).

Testing the status quo

I am testing on Ubuntu 24.04 LTS with kubernetes and cri-o 1.32 from the official deb packages, set up by kubeadm with default settings. The only adjustments are setting the perf_event_paranoid sysctl to 1 and disabling swap.

The test DaemonSet YAML (derived from the community helm chart) is:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          imagePullPolicy: Always
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.disable-defaults
            - --collector.perf
            - --collector.perf.hardware-profilers=CpuCycles,CpuInstr,CacheRef,CacheMisses,BranchInstr,BranchMisses,StalledCyclesBackend,StalledCyclesFrontend,RefCpuCycles
            - --collector.perf.software-profilers=PageFault,ContextSwitch,CpuMigration,MinorFault,MajorFault
            - --collector.perf.cache-profilers=InstrTLBReadHit
          securityContext:
            capabilities:
              add: [ "PERFMON" ]
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys

The tolerations are needed to schedule the pod onto the single local control plane node.

Capabilities as non-root

The setup leads to the following permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    65534   65534   65534   65534
Gid:    65534   65534   65534   65534
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

The privileges are locked down: the bounding set (00000040000005fb) includes CAP_PERFMON (bit 38, 0x4000000000), and the container is running as non-root (nobody / 65534).

node_exporter is not able to export perf collector metrics because the process has no way to get CAP_PERFMON into its permitted and effective sets.
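The hex masks in these status dumps can be checked with a small decoder. This is only a sketch and not part of node_exporter. The bit numbers are those from linux/capability.h, and the table is deliberately limited to the bits that appear in this issue. `decode_caps` is a name made up for this example.

```python
# Capability numbers as defined in <linux/capability.h>.
# Only the bits that occur in the masks from this issue are named.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE",
    38: "CAP_PERFMON",
}


def decode_caps(mask_hex):
    """Turn a Cap* value from /proc/<pid>/status into a list of names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Fall back to the raw bit number for capabilities not in the table.
            names.append(CAP_NAMES.get(bit, f"bit {bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    for label, value in [
        ("CapBnd with PERFMON added", "00000040000005fb"),
        ("CapBnd without PERFMON", "00000000000005fb"),
        ("CapEff as nobody", "0000000000000000"),
    ]:
        print(f"{label}: {decode_caps(value)}")
```

If libcap's tools are installed, `capsh --decode=00000040000005fb` should give equivalent output.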

Capabilities as root

Adding a security context that runs the pod as root

      securityContext:
        runAsNonRoot: false
        runAsUser: 0

leads to the following permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    0       0       0       0
Gid:    0       0       0       0
CapInh: 0000000000000000
CapPrm: 00000040000005fb
CapEff: 00000040000005fb
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

Note that the effective capabilities are now in line with the bounding set. perf collector metrics work.

No capabilities and root

Changing the container security context to

          securityContext:
            capabilities:
              add: [ ]

and keeping root leads to

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    0       0       0       0
Gid:    0       0       0       0
CapInh: 0000000000000000
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

There are no perf collector metrics. root in a container alone is not enough to get the metrics (this is good).

Results of the status quo

  • CAP_PERFMON makes it possible to fetch performance metrics
    • even under restricted perf_event_paranoid settings (<4, >0)
  • with the stock image, root is required to get CAP_PERFMON into effect

Overcoming root

What we have:

CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb

What we want:

CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb

man capabilities has extensive documentation on how CapPrm and CapEff are calculated when a process executes a binary.
There are two ways in which we could gain CAP_PERFMON:

  • Set file capability (setcap cap_perfmon=ep node_exporter)
  • Use a wrapper that configures the ambient and inheritable set and then executes the binary

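TL;DR: non-root usage requires some file with CAP_PERFMON permissions. This also applies to the wrapper: a non-root process can only raise a capability into its ambient set if it already holds it as permitted and inheritable, so the wrapper binary itself would need the file capability.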
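The execve transformation described in man capabilities can be written down as a small model to check the cases in this issue. This is a simplified sketch, not kernel code. It ignores set-user-ID binaries, securebits and no_new_privs, and it treats "file has capabilities" as the only way a file becomes privileged. The class and function names are made up for this example.

```python
from dataclasses import dataclass

CAP_PERFMON = 1 << 38               # capability number 38
ALL_CAPS = (1 << 64) - 1
POD_BOUNDING = 0x00000040000005FB   # CapBnd reported in this issue


@dataclass
class ProcCaps:
    permitted: int = 0
    inheritable: int = 0
    bounding: int = 0
    ambient: int = 0
    effective: int = 0


@dataclass
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False


def execve_caps(proc, file, is_root=False):
    """Model the capability sets of a process after execve()."""
    # A file that carries capabilities is "privileged" and clears the ambient set.
    privileged = bool(file.permitted or file.inheritable)
    # For UID 0 the file sets are notionally all ones and the effective bit is on.
    f = FileCaps(ALL_CAPS, ALL_CAPS, True) if is_root else file

    ambient = 0 if privileged else proc.ambient
    permitted = (
        (proc.inheritable & f.inheritable)
        | (f.permitted & proc.bounding)
        | ambient
    )
    # Safety check: with the effective bit set, execve fails with EPERM unless
    # every file-permitted capability was actually obtained.
    if not is_root and file.effective and (file.permitted & ~permitted):
        raise PermissionError("execve: EPERM, file capability outside bounding set")
    effective = permitted if f.effective else ambient
    return ProcCaps(permitted, proc.inheritable, proc.bounding, ambient, effective)


if __name__ == "__main__":
    start = ProcCaps(bounding=POD_BOUNDING)
    perfmon_file = FileCaps(permitted=CAP_PERFMON, effective=True)
    print("nobody, plain binary:", hex(execve_caps(start, FileCaps()).effective))
    print("root, plain binary:  ", hex(execve_caps(start, FileCaps(), True).effective))
    print("nobody, setcap =ep:  ", hex(execve_caps(start, perfmon_file).effective))
```

Under these assumptions the model reproduces the CapPrm/CapEff values reported for the three setups. It also shows why a wrapper that has already raised CAP_PERFMON into its ambient set would pass it on to an unmodified binary.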

Note that busybox does not support the PERFMON capability, at least not by name. I also could not find any binaries in /bin/ that would have that capability or special privileges (overall good).

Let's try the file capability approach by building a node-exporter:perfmon image.

# cat Dockerfile
ARG ARCH="amd64"
ARG OS="linux"

FROM quay.io/prometheus/node-exporter:latest

FROM ubuntu:24.04
COPY --from=0 /bin/node_exporter /srv/node_exporter
RUN  apt update && apt install -y libcap2-bin && \
     setcap cap_perfmon=ep /srv/node_exporter

FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
COPY --from=1 /srv/node_exporter /bin/node_exporter

EXPOSE      9100
USER        nobody
ENTRYPOINT  [ "/bin/node_exporter" ]
# podman build -t localhost/node-exporter:perfmon .

Running the patched image as non-root leads to the following permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    65534   65534   65534   65534
Gid:    65534   65534   65534   65534
CapInh: 0000000000000000
CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

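as well as working perf collector metrics.
Removing CAP_PERFMON from the container would lead to a startup error: when a binary has the effective file capability bit set and one of its permitted file capabilities is not in the bounding set, execve fails with EPERM.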

Question / Path forward

How should we proceed with this? I am happy to help, but what would be the desired path forward?

  • CAP_PERFMON was added with Linux 5.8 (2020); older kernels with long-term support are still in use
  • This may exclude using CAP_PERFMON by default or shipping images with this enabled by default (TODO: test this)
  • Startup failures without CAP_PERFMON would be annoying for users :-/
  • Running node_exporter as non-root with working perf metrics seems highly desirable, as instructions per second and instructions per cycle are important performance metrics
  • Usefulness might be limited if the sysctl change is always needed
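One way to avoid surprising users on older kernels would be to detect CAP_PERFMON support before relying on it. The sketch below only illustrates the logic. node_exporter is written in Go, so this is not a proposed patch. The helper names are made up for this example. The second check assumes that /proc/sys/kernel/cap_last_cap reports the highest capability number the running kernel knows about.

```python
import platform
import re

CAP_PERFMON_NUMBER = 38   # from <linux/capability.h>
MIN_KERNEL = (5, 8)       # CAP_PERFMON was added with Linux 5.8


def kernel_has_perfmon(release):
    """Check a kernel release string such as '6.8.0-31-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= MIN_KERNEL


def cap_last_cap_has_perfmon(cap_last_cap_text):
    """Check the content of /proc/sys/kernel/cap_last_cap."""
    return int(cap_last_cap_text.strip()) >= CAP_PERFMON_NUMBER


if __name__ == "__main__":
    release = platform.release()
    print(f"kernel {release}: CAP_PERFMON by version = {kernel_has_perfmon(release)}")
    try:
        with open("/proc/sys/kernel/cap_last_cap") as fh:
            print(f"CAP_PERFMON by cap_last_cap = {cap_last_cap_has_perfmon(fh.read())}")
    except OSError as err:
        print(f"could not read cap_last_cap: {err}")
```

The cap_last_cap check is probably the more robust of the two, since it does not depend on parsing distribution-specific version strings.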
