|  | 
|  | 1 | +Secure Bubblewrap inside Kubernetes with ProcMount | 
|  | 2 | +################################################## | 
|  | 3 | + | 
|  | 4 | +:date: 2024-12-09 | 
|  | 5 | +:category: blog | 
|  | 6 | +:authors: tristanC | 
|  | 7 | + | 
|  | 8 | +.. raw:: html | 
|  | 9 | + | 
|  | 10 | +   <style type="text/css"> | 
|  | 11 | +
 | 
|  | 12 | +     .literal { | 
|  | 13 | +       border-radius: 6px; | 
|  | 14 | +       padding: 1px 1px; | 
|  | 15 | +       background-color: rgba(27,31,35,.05); | 
|  | 16 | +     } | 
|  | 17 | +
 | 
|  | 18 | +   </style> | 
|  | 19 | + | 
|  | 20 | +This post explores how to create nested containers securely inside | 
|  | 21 | +Kubernetes. In the previous post titled `Recursive namespaces to run | 
|  | 22 | +containers inside a container`_ I showed how to create nested containers | 
|  | 23 | +using a rootless container runtime like Podman. In this post, I'll | 
|  | 24 | +demonstrate how to run the same workload with `Kubernetes`_. | 
|  | 25 | + | 
|  | 26 | +In two parts, I will present: | 
|  | 27 | + | 
|  | 28 | +-  How to run Kubernetes from source. | 
|  | 29 | +-  The ProcMountType feature to work around the original issue. | 
|  | 30 | + | 
|  | 31 | +Context and problem statement | 
|  | 32 | +============================= | 
|  | 33 | + | 
|  | 34 | +The context of this post is to deploy a service named zuul-executor for | 
|  | 35 | +running CI builds securely inside Kubernetes, without requiring a | 
|  | 36 | +privileged security context. | 
|  | 37 | + | 
|  | 38 | +The problem is that this service performs build isolation locally using | 
|  | 39 | +`Bubblewrap`_, which is similar to running a container inside a | 
|  | 40 | +container. | 
|  | 41 | + | 
|  | 42 | +Run Kubernetes locally | 
|  | 43 | +====================== | 
|  | 44 | + | 
|  | 45 | +In this section, let's set up Kubernetes locally. On a fresh Fedora 41 | 
|  | 46 | +system, install the following requirements: | 
|  | 47 | + | 
|  | 48 | +.. code-block:: ShellSession | 
|  | 49 | +
 | 
|  | 50 | +   $ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins | 
|  | 51 | +   $ sudo systemctl start crio | 
|  | 52 | +
 | 
|  | 53 | +Then, start Kubernetes using the *local-up-cluster* script as follows: | 
|  | 54 | + | 
|  | 55 | +.. code-block:: ShellSession | 
|  | 56 | +
 | 
|  | 57 | +   $ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes | 
|  | 58 | +   $ git clone https://github.com/kubernetes/kubernetes/ | 
|  | 59 | +   $ cd kubernetes | 
|  | 60 | +   $ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \ | 
|  | 61 | +       ./hack/local-up-cluster.sh | 
|  | 62 | +   ... | 
|  | 63 | +   Local Kubernetes cluster is running. Press Ctrl-C to shut it down. | 
|  | 64 | +
 | 
|  | 65 | +With the cluster running, save the following test resource as ``test-bwrap.yaml``: | 
|  | 66 | + | 
|  | 67 | +.. code-block:: yaml | 
|  | 68 | +
 | 
|  | 69 | +   apiVersion: v1 | 
|  | 70 | +   kind: Pod | 
|  | 71 | +   metadata: | 
|  | 72 | +     name: test-bwrap | 
|  | 73 | +   spec: | 
|  | 74 | +     containers: | 
|  | 75 | +       - name: test | 
|  | 76 | +         image: quay.io/zuul-ci/zuul-executor | 
|  | 77 | +         command: ["/bin/sleep", "infinity"] | 
|  | 78 | +         securityContext: | 
|  | 79 | +           capabilities: | 
|  | 80 | +             add: ["SETFCAP"] | 
|  | 81 | +
 | 
|  | 82 | +.. | 
|  | 83 | +
 | 
|  | 84 | +   As seen previously, we need *CAP_SETFCAP* to create the user | 
|  | 85 | +   namespace, otherwise bwrap fails early with the following error: | 
|  | 86 | + | 
|  | 87 | +   :: | 
|  | 88 | + | 
|  | 89 | +      bwrap: setting up uid map: Operation not permitted | 
|  | 90 | + | 
|  | 91 | +Apply the test resource with the following commands: | 
|  | 92 | + | 
|  | 93 | +.. code-block:: ShellSession | 
|  | 94 | +
 | 
|  | 95 | +   $ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig | 
|  | 96 | +   $ kubectl apply -f test-bwrap.yaml | 
|  | 97 | +   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | 
|  | 98 | +   bwrap: Can't mount proc on /newroot/proc: Operation not permitted | 
|  | 99 | +
 | 
|  | 100 | +This produces the same error we encountered in the `previous post`_: the | 
|  | 101 | +/proc filesystem is partially masked in the pod, which prevents Bubblewrap | 
|  | 102 | +from mounting a fresh procfs for the new PID namespace. | 
|  | 103 | + | 
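The masking can be observed from inside the pod: container runtimes typically hide sensitive ``/proc`` entries by mounting something over them, and those mounts show up in ``/proc/self/mountinfo``. A minimal sketch, assuming a POSIX shell with ``awk``; the helper name ``list_proc_overmounts`` is hypothetical and not part of any tool mentioned in this post:

```shell
# List the mount points that sit below /proc.
# Reads mountinfo-formatted input, where field 5 is the mount point.
# With no argument it inspects the current process; pass "-" to read stdin.
list_proc_overmounts() {
    awk '$5 ~ /^\/proc\// { print $5 }' "${1:-/proc/self/mountinfo}"
}
```

For example, ``kubectl exec test-bwrap -- cat /proc/self/mountinfo | list_proc_overmounts -`` would list the paths mounted over ``/proc`` in the test pod. The exact list depends on the container runtime.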
|  | 104 | +The next section introduces the *ProcMountType* feature to work around | 
|  | 105 | +this issue. | 
|  | 106 | + | 
|  | 107 | +The ProcMountType feature | 
|  | 108 | +========================= | 
|  | 109 | + | 
|  | 110 | +The *ProcMountType* feature can be enabled by passing the following | 
|  | 111 | +environment variable to the *local-up-cluster* script: | 
|  | 112 | +``FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'``. | 
|  | 113 | +*ProcMountType* only works together with *UserNamespacesSupport*, so | 
|  | 114 | +both gates must be activated, as explained in the `documentation`_. | 
|  | 115 | + | 
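Combined with the invocation from the previous section, the command might look like the sketch below. The ``start_cluster`` wrapper is a hypothetical name used only for illustration; every variable and value comes from the commands shown in this post, and it assumes the current directory is the Kubernetes checkout.

```shell
# Start the local cluster with both feature gates enabled.
# Assumes the current directory is the root of the kubernetes checkout.
start_cluster() {
    sudo env \
        CGROUP_DRIVER=systemd \
        CONTAINER_RUNTIME=remote \
        CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
        FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true' \
        ./hack/local-up-cluster.sh
}
```

Calling ``start_cluster`` then blocks until Ctrl-C, like the original invocation.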
|  | 116 | +With these features enabled, we can update the resource as follows: | 
|  | 117 | + | 
|  | 118 | +.. code-block:: yaml | 
|  | 119 | +
 | 
|  | 120 | +   apiVersion: v1 | 
|  | 121 | +   kind: Pod | 
|  | 122 | +   metadata: | 
|  | 123 | +     name: test-bwrap | 
|  | 124 | +   spec: | 
|  | 125 | +     hostUsers: false | 
|  | 126 | +     containers: | 
|  | 127 | +       - name: test | 
|  | 128 | +         image: quay.io/zuul-ci/zuul-executor | 
|  | 129 | +         command: ["/bin/sleep", "infinity"] | 
|  | 130 | +         securityContext: | 
|  | 131 | +           procMount: Unmasked | 
|  | 132 | +           capabilities: | 
|  | 133 | +             add: ["SETFCAP"] | 
|  | 134 | +
 | 
|  | 135 | +Recreate the pod with the following commands: | 
|  | 136 | + | 
|  | 137 | +.. code-block:: ShellSession | 
|  | 138 | + | 
|  | 139 | +   $ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml | 
|  | 140 | +   pod/test-bwrap created | 
|  | 141 | +   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | 
|  | 142 | +   bwrap: Can't mount proc on /newroot/proc: Permission denied | 
|  | 143 | + | 
|  | 144 | +This time the error is *Permission denied*, which is caused by SELinux. | 
|  | 145 | +Using *audit2allow*, we can see that the following policy needs to be | 
|  | 146 | +installed: | 
|  | 147 | + | 
|  | 148 | +:: | 
|  | 149 | + | 
|  | 150 | +   module nestedcontainers 1.0; | 
|  | 151 | + | 
|  | 152 | +   require { | 
|  | 153 | +       type proc_t; | 
|  | 154 | +       type devpts_t; | 
|  | 155 | +       type container_t; | 
|  | 156 | +       class filesystem mount; | 
|  | 157 | +   } | 
|  | 158 | + | 
|  | 159 | +   #============= container_t ============== | 
|  | 160 | +   allow container_t devpts_t:filesystem mount; | 
|  | 161 | +   allow container_t proc_t:filesystem mount; | 
|  | 162 | + | 
|  | 163 | +Once this policy is compiled and installed, Bubblewrap runs inside an unprivileged pod: | 
|  | 164 | + | 
|  | 165 | +.. code-block:: ShellSession | 
|  | 166 | +
 | 
|  | 167 | +   $ sudo semodule -i nestedcontainers.pp | 
|  | 168 | +   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | 
|  | 169 | +       PID TTY      STAT   TIME COMMAND | 
|  | 170 | +         1 ?        Ss     0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx | 
|  | 171 | +         2 ?        R      0:00 ps afx | 
|  | 172 | +
 | 
|  | 173 | +Notice how the ``sleep infinity`` process is not visible in the ps | 
|  | 174 | +output, confirming that we are indeed running in a nested container. | 
|  | 175 | + | 
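The ``nestedcontainers.pp`` package installed above has to be built from the policy source first. A minimal sketch, assuming the policy is saved as ``nestedcontainers.te`` (a file name chosen here, not given in the post) and that the ``checkmodule`` and ``semodule_package`` tools are installed; ``build_policy`` is a hypothetical helper name:

```shell
# Compile an SELinux policy source (<name>.te) into an installable
# package (<name>.pp). The module name defaults to nestedcontainers.
build_policy() {
    name="${1:-nestedcontainers}"
    # -M enables MLS support, -m builds a non-base (loadable) module.
    checkmodule -M -m -o "$name.mod" "$name.te" &&
        semodule_package -o "$name.pp" -m "$name.mod"
}
# Usage (not executed here): build_policy && sudo semodule -i nestedcontainers.pp
```
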
|  | 176 | +Conclusion | 
|  | 177 | +========== | 
|  | 178 | + | 
|  | 179 | +This post demonstrates that we can run a container inside a container | 
|  | 180 | +with Kubernetes thanks to the following settings: | 
|  | 181 | + | 
|  | 182 | +-  The SETFCAP capability, to create the user namespace, | 
|  | 183 | +-  The ProcMountType and UserNamespacesSupport feature gates, with | 
|  | 184 | +   ``hostUsers: false`` and ``procMount: Unmasked``, to unmask the /proc filesystem, and | 
|  | 185 | +-  A SELinux policy to enable mounting filesystems inside the new | 
|  | 186 | +   namespace. | 
|  | 187 | + | 
|  | 188 | +.. _Recursive namespaces to run containers inside a container: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html | 
|  | 189 | +.. _Kubernetes: https://kubernetes.io/ | 
|  | 190 | +.. _Bubblewrap: https://github.com/containers/bubblewrap | 
|  | 191 | +.. _previous post: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html | 
|  | 192 | +.. _documentation: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access | 