# nodecryptor: better wireguard for cilium

> [!WARNING]
> This is experimental / works-for-the-author-grade software!
> Tested with cilium 1.18.2

## Motivation

Cilium's [transparent wireguard encryption](https://docs.cilium.io/en/stable/security/network/encryption-wireguard/)
encrypts (most of) the traffic flowing between nodes of a cluster. While this
undoubtedly improves the security of your cluster network, its current implementation
has two important limitations:

1) Cilium treats tunneling and encryption as orthogonal problems: with transparent
   encryption on, it sends vxlan tunnel traffic through the wg tunnel. That
   means every packet is encapsulated twice. Unless you need L2 connectivity between
   pods, this is wasteful not only in terms of compute; more importantly, you incur
   the MTU overhead of both wg (60 bytes) and vxlan (50 bytes) instead of only the former.

2) Traffic between the root network namespaces of nodes can be encrypted (Node-to-Node
   encryption, considered beta), but this creates a bootstrapping problem for
   control-plane nodes, which cilium solves by excluding control-plane nodes from
   Node-to-Node encryption. (See "The bootstrap problem" below.)

### Why these are relevant limitations

Given that cilium currently does not support different MTUs for East-West and
North-South traffic, most setups will be working with an MTU of about 1500, so
ditching vxlan removes roughly 3% of overhead. The downside is that there is no
L2 connectivity between pods on different nodes, but that is neither mandated by k8s
nor a common requirement for cluster workloads.
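
As a quick sanity check of that overhead figure, the arithmetic with the per-packet overheads quoted above (wg: 60 bytes, vxlan: 50 bytes) at a 1500 byte link MTU:

```sh
# Effective payload MTU at a 1500 byte link MTU, using the overheads
# quoted in the list above (wg: 60 bytes, vxlan: 50 bytes)
LINK_MTU=1500
WG_OVERHEAD=60
VXLAN_OVERHEAD=50

echo $((LINK_MTU - WG_OVERHEAD - VXLAN_OVERHEAD))  # vxlan over wg: 1390
echo $((LINK_MTU - WG_OVERHEAD))                   # wg only:       1440
```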

In my experience, it is surprisingly easy to overlook that pods on different nodes
communicate via the root network namespaces of their nodes. Arguably, all workloads
should be encrypted end-to-end, and considering the cluster network trusted is perhaps
a bit foolish; but it would be even more foolish to assume that every organization has
the capability to implement encryption on all services. Especially in multi-cloud
clusters, overlooking these limitations of node-to-node encryption may mean transmitting
credentials or other sensitive information in the clear. Setups without dedicated
control-plane nodes are especially susceptible to such "accidents".

## Approach

The first issue is already almost entirely solved by setting cilium's
`routingMode` to `native`, configuring `ipv4NativeRoutingCIDR` to cover the entire
pod CIDR, and simultaneously enabling encryption with `type: wireguard`. With these
settings, pod-to-pod traffic is sent through `cilium_wg0`.
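
For illustration, these settings expressed as Helm values might look roughly like this (a sketch using the option names from the text above; the CIDR is a placeholder you would replace with your cluster's pod range):

```yaml
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8   # placeholder: must cover the entire pod CIDR
encryption:
  enabled: true
  type: wireguard
```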
| 48 | + |
| 49 | +However, traffic between your nodes' root namespaces and remote pods, as well as |
| 50 | +node-to-node traffic will be treated like north-south traffic and simply go through the |
| 51 | +main routing table and thus likely the default interface, unencrypted. This issue is |
| 52 | +rather easy to fix, especially since cilium already configures all relevant ips |
| 53 | +as allowed ips on the peers of the `cilium_wg0` interface. We merely need to route |
| 54 | +traffic between pods and remote nodes (and vice versa) through that interface. The |
| 55 | +fact that cilium doesn't already do that in the above description may arguably |
| 56 | +be considered a bug. |

The second issue, allowing node-to-node encryption on control-plane nodes, is a bit
more tricky, since we need to solve the bootstrapping problem. If you are not familiar
with this bootstrapping problem, I recommend reading the description below
and returning here.

Now, the way this PoC solves - or rather: avoids - this bootstrapping problem is
that it exempts only the traffic necessary for bootstrapping from encryption, rather
than the entire node. The key circumstance that allows us to do this is that the
kubernetes-api and etcd connections between control-plane nodes are usually already
encrypted!

Let's walk through how a control-plane node can connect to the wireguard network
even if the other nodes have an outdated public key set for it:

0) The node went offline; traffic from other nodes destined to it is encrypted with
   the old key. While this traffic will never be decrypted, it at least remains
   encrypted.
1) The node boots and connects to the kubernetes api and etcd (exempted from
   encryption).
2) The node can now update its public key on the CiliumNode CRD.
3) The other nodes update their wireguard interfaces: full, encrypted connectivity
   is restored.

## Implementation

The key novelty is to selectively exempt traffic between nodes from wireguard: the
traffic that is required for bootstrapping and is encrypted by other means.
This can be achieved using Linux's policy-based routing facilities (ip rule), and
certainly also with BPF in cilium's data path.

I will illustrate this with control-plane nodes that run the kube api on port 6443
and etcd on 2379 and 2380. The shell script below illustrates what nodecryptor does:
| 89 | + |
| 90 | +```sh |
| 91 | +# Ensure traffic already encrypted by wg goes through the main routing table |
| 92 | +# cilium configures the cilium_wg0 interface to emit packets with the 0xe00 mark |
| 93 | +ip rule add fwmark 0xe00 lookup main priority 0 |
| 94 | + |
| 95 | +# Add a routing table (id 100) that sends everything through the wg interface |
| 96 | +ip route add default dev cilium_wg0 scope link table 100 |
| 97 | + |
| 98 | +# On Control-plane nodes: Force bootstrap traffic through the main table: |
| 99 | +# Reply traffic from exempted ports, the "more correct" solution here would be to |
| 100 | +# match the src address against the Nodes InternalIP rather than iif lo |
| 101 | +ip rule add iif lo sport 2379-2380 lookup main priority 200 |
| 102 | +ip rule add iif lo sport 6443 lookup main priority 200 |
| 103 | + |
| 104 | +# For every $CONTROL_PLANE_NODE |
| 105 | +# Ensure that exempted ports remain unencrypted |
| 106 | +ip rule add to $CONTROL_PLANE_NODE dport 2379-2380 lookup main priority 200 |
| 107 | +ip rule add to $CONTROL_PLANE_NODE dport 6443 lookup main priority 200 |
| 108 | +# Now send the rest of traffic to the table that puts everything to the wg interface |
| 109 | +ip rule add to $CONTROL_PLANE_NODE lookup 100 priority 201 |
| 110 | + |
| 111 | +# For every $WORKER_NODE |
| 112 | +# Regular nodes don't need exemptions |
| 113 | +ip rule add to $WORKER_NODE lookup 100 priority 201 |
| 114 | +``` |
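
To make the precedence of these rules concrete, here is a self-contained simulation of the lookup logic in plain shell (a stand-in for illustration, not iproute2; the addresses are hypothetical placeholders):

```sh
# Simulates which routing table the rule set above selects for a packet.
# Plain-shell illustration only; addresses are hypothetical.
CONTROL_PLANE_NODE=10.0.0.10
WORKER_NODE=10.0.0.20

select_table() {
  dst=$1; dport=$2; fwmark=$3
  # priority 0: packets already encrypted by wg carry the 0xe00 mark
  if [ "$fwmark" = "0xe00" ]; then echo main; return; fi
  # priority 200: exempted bootstrap ports to control-plane nodes stay unencrypted
  if [ "$dst" = "$CONTROL_PLANE_NODE" ]; then
    case $dport in 2379|2380|6443) echo main; return ;; esac
  fi
  # priority 201: all other node-bound traffic hits table 100 (cilium_wg0)
  case $dst in "$CONTROL_PLANE_NODE"|"$WORKER_NODE") echo 100; return ;; esac
  # everything else is north-south traffic: main table
  echo main
}

select_table "$CONTROL_PLANE_NODE" 6443 ""   # prints "main" (exempted api traffic)
select_table "$CONTROL_PLANE_NODE" 443  ""   # prints "100"  (encrypted via cilium_wg0)
select_table "$WORKER_NODE"        22   ""   # prints "100"
```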

## The bootstrap problem

Node-to-node encryption keys are distributed through the kubernetes api: each node
publishes its wireguard public key on its CiliumNode resource, and the other nodes
pick it up from there. On control-plane nodes this becomes circular: when a node
comes back with a new key, it must reach the api server (or etcd) to publish it,
but with node-to-node encryption enabled, that very traffic would be sent to
wireguard peers that still hold the old key and could never be decrypted. Cilium's
upstream answer is to exempt control-plane nodes from node-to-node encryption
entirely; nodecryptor instead exempts only the bootstrap traffic itself.

| 117 | + |
| 118 | +## Limitations of the PoC / Further Ideas |
| 119 | + |
| 120 | +It matches reply traffic from exempted ports using `iif lo`, matching for the |
| 121 | +InternalIPs and ExternalIPs would be better. |
| 122 | + |
| 123 | +It only distinguishes two types of nodes: control-plane and workers. A more general |
| 124 | +mechanism would be a custom resource that specifies exempted ports (or other traffic |
| 125 | +matchers) and label-selectors to support arbitrary exemptions on arbitrary subsets |
| 126 | +of nodes. |
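
Such a custom resource could look roughly like this (an entirely hypothetical sketch; neither the API group nor the kind exists today):

```yaml
apiVersion: nodecryptor.example.dev/v1alpha1   # hypothetical group/version
kind: EncryptionExemption                      # hypothetical kind
metadata:
  name: control-plane-bootstrap
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ports:
    - protocol: TCP
      port: 6443
    - protocol: TCP
      portRange: 2379-2380
```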