Add bubblewrap-in-kubernetes post

TristanCacqueray · TristanCacqueray · commit c3619ec1d2e6 · 2024-12-10T16:23:26.000-05:00
Change-Id: I4f65aa8f2fe7e52614c48de4cee5bd8a19250aad
diff --git a/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.md b/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.md
@@ -0,0 +1,151 @@
+This post explores how to create nested containers securely inside Kubernetes.
+In the previous post titled [Recursive namespaces to run containers inside a container][prev-post]
+I showed how to create nested containers using a rootless container runtime like Podman.
+In this post, I'll demonstrate how to run the same workload with [Kubernetes][k8s].
+
+In two parts, I will present:
+
+- How to run Kubernetes from source.
+- The ProcMountType feature to work around the original issue.
+
+
+## Context and problem statement
+
+The goal of this post is to deploy a service named zuul-executor for running CI builds securely inside Kubernetes,
+without requiring a privileged security context.
+
+The problem is that this service performs build isolation locally using [Bubblewrap][bwrap],
+which is similar to running a container inside a container.
+
+
+## Run Kubernetes locally
+
+In this section, let's set up Kubernetes locally.
+On a fresh Fedora 41 system, install the following requirements:
+
+```ShellSession
+$ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins
+$ sudo systemctl start crio
+```
+
+Then, start Kubernetes using the *local-up-cluster* script as follows:
+
+```ShellSession
+$ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes
+$ git clone https://github.com/kubernetes/kubernetes/
+$ cd kubernetes
+$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
+    ./hack/local-up-cluster.sh
+...
+Local Kubernetes cluster is running. Press Ctrl-C to shut it down.
+```
+
+Let's try running Bubblewrap in a pod using the following test resource:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-bwrap
+spec:
+  containers:
+    - name: test
+      image: quay.io/zuul-ci/zuul-executor
+      command: ["/bin/sleep", "infinity"]
+      securityContext:
+        capabilities:
+          add: ["SETFCAP"]
+```
+
+> As seen previously, we need *CAP_SETFCAP* to create the user namespace; otherwise, bwrap fails early with the following error:
+>
+> ```
+> bwrap: setting up uid map: Operation not permitted
+> ```
+
+Apply the test resource with the following commands:
+
+```ShellSession
+$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
+$ kubectl apply -f test-bwrap.yaml
+$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+bwrap: Can't mount proc on /newroot/proc: Operation not permitted
+```
+
+This produces the same error we encountered in the [previous post][prev-post]: the /proc filesystem is tainted (partially masked) in the pod, which prevents Bubblewrap from mounting a new procfs for the new PID namespace.
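+
+One way to observe this, sketched here under the assumption that `grep` is available in the image, is to list the mounts that sit below /proc inside the pod. With the default settings, the runtime typically covers a few /proc entries with extra mounts, and such over-mounts are what makes the kernel refuse a fresh procfs mount from a nested user namespace:
+
+```ShellSession
+$ kubectl exec test-bwrap -- grep ' /proc/' /proc/self/mountinfo
+```
+
+Each line printed is a mount point under /proc; the exact list depends on the container runtime and its configuration.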
+
+The next section introduces the *ProcMountType* feature to work around this issue.
+
+## The ProcMountType feature
+
+The *ProcMountType* feature can be enabled by adding the following environment variable to the *local-up-cluster* command: `FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'`.
+To make use of this feature, we also need to enable *UserNamespacesSupport*, as explained in the [documentation](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access).
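+
+As a sketch, the start command from the previous section would then look like the following. It only combines the variables already shown in this post, and it assumes the previous cluster was stopped first so that the script starts again with the new gates:
+
+```ShellSession
+$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
+    FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true' \
+    ./hack/local-up-cluster.sh
+```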
+
+With these features, we can update the resource like this:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-bwrap
+spec:
+  hostUsers: false
+  containers:
+    - name: test
+      image: quay.io/zuul-ci/zuul-executor
+      command: ["/bin/sleep", "infinity"]
+      securityContext:
+        procMount: Unmasked
+        capabilities:
+          add: ["SETFCAP"]
+```
+
+… using the following commands:
+
+```ShellSession
+$ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml
+pod/test-bwrap created
+$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+bwrap: Can't mount proc on /newroot/proc: Permission denied
+```
+
+This time we get a different error, *Permission denied*, which is caused by SELinux. Using *audit2allow*, we can see that the following policy needs to be installed:
+
+```
+module nestedcontainers 1.0;
+
+require {
+    type proc_t;
+    type devpts_t;
+    type container_t;
+    class filesystem mount;
+}
+
+#============= container_t ==============
+allow container_t devpts_t:filesystem mount;
+allow container_t proc_t:filesystem mount;
+```
+
+… which lets us run Bubblewrap inside an unprivileged pod:
+
+```ShellSession
+$ sudo semodule -i nestedcontainers.pp
+$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+    PID TTY      STAT   TIME COMMAND
+      1 ?        Ss     0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx
+      2 ?        R      0:00 ps afx
+```
+
+Notice how the `sleep infinity` process is not visible in the ps output, confirming that we are indeed running in a nested container.
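+
+The `nestedcontainers.pp` file installed above is the compiled form of the policy. Here is a minimal sketch of how it could be built, assuming the policy source shown earlier is saved as `nestedcontainers.te` (a file name chosen for illustration) and that the `checkmodule` and `semodule_package` tools are installed:
+
+```ShellSession
+$ checkmodule -M -m -o nestedcontainers.mod nestedcontainers.te
+$ semodule_package -o nestedcontainers.pp -m nestedcontainers.mod
+```
+
+The first command compiles the source into a binary module, and the second wraps it into the package that `semodule -i` expects.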
+
+## Conclusion
+
+This post demonstrates that we can run a container inside a container with Kubernetes thanks to the following settings:
+
+- The SETFCAP capability to create the user namespace,
+- The ProcMountType and UserNamespacesSupport feature gates to unmask the /proc filesystem, and
+- A SELinux policy to enable mounting filesystems inside the new namespace.
+
+[prev-post]: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
+[k8s]: https://kubernetes.io/
+[bwrap]: https://github.com/containers/bubblewrap
diff --git a/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst b/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst
@@ -0,0 +1,18 @@
+Secure Bubblewrap inside Kubernetes with ProcMount
+##################################################
+
+:date: 2024-12-09
+:category: blog
+:authors: tristanC
+
+.. raw:: html
+
+   <style type="text/css">
+
+     .literal {
+       border-radius: 6px;
+       padding: 1px 1px;
+       background-color: rgba(27,31,35,.05);
+     }
+
+   </style>
diff --git a/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.sh b/src/blog-bubblewrap-in-kubernetes-pod-with-procmount.sh
@@ -0,0 +1,11 @@
+#! /usr/bin/env nix-shell
+#! nix-shell -i bash -p pandoc
+#! nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4d2b37a84fad1091b9de401eb450aae66f1a741e.tar.gz
+
+NAME="blog-bubblewrap-in-kubernetes-pod-with-procmount"
+
+pandoc --include-in-header=./$NAME.rst \
+       -f gfm --reference-links  \
+       -t rst ./$NAME.md -o ../website/content/$NAME.rst
+
+sed -e 's|^.. code::|.. code-block::|' -i ../website/content/$NAME.rst
diff --git a/website/content/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst b/website/content/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst
@@ -0,0 +1,192 @@
+Secure Bubblewrap inside Kubernetes with ProcMount
+##################################################
+
+:date: 2024-12-09
+:category: blog
+:authors: tristanC
+
+.. raw:: html
+
+   <style type="text/css">
+
+     .literal {
+       border-radius: 6px;
+       padding: 1px 1px;
+       background-color: rgba(27,31,35,.05);
+     }
+
+   </style>
+
+This post explores how to create nested containers securely inside
+Kubernetes. In the previous post titled `Recursive namespaces to run
+containers inside a container`_ I showed how to create nested containers
+using a rootless container runtime like Podman. In this post, I'll
+demonstrate how to run the same workload with `Kubernetes`_.
+
+In two parts, I will present:
+
+-  How to run Kubernetes from source.
+-  The ProcMountType feature to work around the original issue.
+
+Context and problem statement
+=============================
+
+The goal of this post is to deploy a service named zuul-executor for
+running CI builds securely inside Kubernetes, without requiring a
+privileged security context.
+
+The problem is that this service performs build isolation locally using
+`Bubblewrap`_, which is similar to running a container inside a
+container.
+
+Run Kubernetes locally
+======================
+
+In this section, let's set up Kubernetes locally. On a fresh Fedora 41
+system, install the following requirements:
+
+.. code-block:: ShellSession
+
+   $ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins
+   $ sudo systemctl start crio
+
+Then, start Kubernetes using the *local-up-cluster* script as follows:
+
+.. code-block:: ShellSession
+
+   $ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes
+   $ git clone https://github.com/kubernetes/kubernetes/
+   $ cd kubernetes
+   $ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
+       ./hack/local-up-cluster.sh
+   ...
+   Local Kubernetes cluster is running. Press Ctrl-C to shut it down.
+
+Let's try running Bubblewrap in a pod using the following test resource:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: test-bwrap
+   spec:
+     containers:
+       - name: test
+         image: quay.io/zuul-ci/zuul-executor
+         command: ["/bin/sleep", "infinity"]
+         securityContext:
+           capabilities:
+             add: ["SETFCAP"]
+
+..
+
+   As seen previously, we need *CAP_SETFCAP* to create the user
+   namespace; otherwise, bwrap fails early with the following error:
+
+   ::
+
+      bwrap: setting up uid map: Operation not permitted
+
+Apply the test resource with the following commands:
+
+.. code-block:: ShellSession
+
+   $ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
+   $ kubectl apply -f test-bwrap.yaml
+   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+   bwrap: Can't mount proc on /newroot/proc: Operation not permitted
+
+This produces the same error we encountered in the `previous post`_: the
+/proc filesystem is tainted (partially masked) in the pod, which prevents
+Bubblewrap from mounting a new procfs for the new PID namespace.
+
+The next section introduces the *ProcMountType* feature to work around
+this issue.
+
+The ProcMountType feature
+=========================
+
+The *ProcMountType* feature can be enabled by adding the following
+environment variable to the *local-up-cluster* command:
+``FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'``. To
+make use of this feature, we also need to enable
+*UserNamespacesSupport*, as explained in the `documentation`_.
+
+With these features, we can update the resource like this:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: test-bwrap
+   spec:
+     hostUsers: false
+     containers:
+       - name: test
+         image: quay.io/zuul-ci/zuul-executor
+         command: ["/bin/sleep", "infinity"]
+         securityContext:
+           procMount: Unmasked
+           capabilities:
+             add: ["SETFCAP"]
+
+… using the following commands:
+
+.. code-block:: ShellSession
+
+   $ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml
+   pod/test-bwrap created
+   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+   bwrap: Can't mount proc on /newroot/proc: Permission denied
+
+This time we get a different error, *Permission denied*, which is caused
+by SELinux. Using *audit2allow*, we can see that the following policy
+needs to be installed:
+
+::
+
+   module nestedcontainers 1.0;
+
+   require {
+       type proc_t;
+       type devpts_t;
+       type container_t;
+       class filesystem mount;
+   }
+
+   #============= container_t ==============
+   allow container_t devpts_t:filesystem mount;
+   allow container_t proc_t:filesystem mount;
+
+… which lets us run Bubblewrap inside an unprivileged pod:
+
+.. code-block:: ShellSession
+
+   $ sudo semodule -i nestedcontainers.pp
+   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
+       PID TTY      STAT   TIME COMMAND
+         1 ?        Ss     0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx
+         2 ?        R      0:00 ps afx
+
+Notice how the ``sleep infinity`` process is not visible in the ps
+output, confirming that we are indeed running in a nested container.
+
+Conclusion
+==========
+
+This post demonstrates that we can run a container inside a container
+with Kubernetes thanks to the following settings:
+
+-  The SETFCAP capability to create the user namespace,
+-  The ProcMountType and UserNamespacesSupport feature gates to unmask the
+   /proc filesystem, and
+-  A SELinux policy to enable mounting filesystems inside the new
+   namespace.
+
+.. _Recursive namespaces to run containers inside a container: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
+.. _Kubernetes: https://kubernetes.io/
+.. _Bubblewrap: https://github.com/containers/bubblewrap
+.. _previous post: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
+.. _documentation: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access