Commit d259514 (parent 4b95eed): WIP: Readme

1 file changed: Readme.md, 127 additions and 0 deletions

# nodecryptor: better wireguard for cilium

> [!WARNING]
> This is experimental / works-for-the-author-grade software!
> Tested with cilium 1.18.2

## Motivation

Cilium's [transparent wireguard encryption](https://docs.cilium.io/en/stable/security/network/encryption-wireguard/)
encrypts (most of) the traffic flowing between the nodes of a cluster. While this
undoubtedly improves the security of your cluster network, its current implementation
has two important limitations:

1) Cilium treats tunneling and encryption as orthogonal problems: with transparent
encryption on, it sends vxlan tunnel traffic through the wg tunnel. That means
every packet is encapsulated twice. Unless you need L2 connectivity between pods,
this is wasteful not only in terms of compute; more importantly, you pay the MTU
overhead of both wg (60 bytes) and vxlan (50 bytes) instead of only the former.

2) Traffic between the root network namespaces of nodes can be encrypted
(node-to-node encryption, considered beta), but this creates a bootstrapping
problem for control-plane nodes, which cilium solves by excluding control-plane
nodes from node-to-node encryption. (See "The bootstrap problem" below.)

### Why these are relevant limitations

Given that cilium currently does not support different MTUs for East-West and
North-South traffic, most setups will be working with an MTU of about 1500. That
means ditching vxlan removes roughly a 3% overhead. The downside is that there is
no more L2 connectivity between pods on different nodes, but that is neither
mandated by k8s, nor a common requirement for cluster workloads.

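To make the overhead concrete, here is the arithmetic behind the ~3% figure, in plain shell, using the per-packet overheads quoted above:

```sh
# Per-packet headroom at an outer MTU of 1500, using the overheads above
MTU=1500; WG=60; VXLAN=50
echo "inner MTU, wg only:    $((MTU - WG))"           # 1440 bytes
echo "inner MTU, wg + vxlan: $((MTU - WG - VXLAN))"   # 1390 bytes
echo "vxlan share of frame:  $((VXLAN * 100 / MTU))%" # ~3%
```
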
In my experience, it is surprisingly easy to overlook that pods on different nodes
communicate via the root network namespaces. While arguably all workloads should be
encrypted end-to-end, and considering the cluster network trusted is perhaps a bit
foolish, it would be even more foolish to assume every organization has the
capability to implement encryption on all services. Especially in multi-cloud
clusters, overlooking these limitations of node-to-node encryption may mean
transmitting credentials or other sensitive information in the clear. Setups
without dedicated control-plane nodes are especially susceptible to such
"accidents".

## Approach

The first issue is already almost entirely solved by setting cilium's
`routingMode` to `native`, configuring `ipv4NativeRoutingCIDR` to cover the entire
pod CIDR, and simultaneously enabling encryption with `type: wireguard`. With these
settings, pod-to-pod traffic is sent through `cilium_wg0`.
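
For reference, these settings as a Helm values sketch (the CIDR below is an example; use one that covers your cluster's pod CIDR):

```yaml
# cilium Helm values (sketch; the CIDR is an example)
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
encryption:
  enabled: true
  type: wireguard
```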

However, traffic between your nodes' root namespaces and remote pods, as well as
node-to-node traffic, will be treated like north-south traffic and simply go
through the main routing table, and thus likely out the default interface,
unencrypted. This issue is rather easy to fix, especially since cilium already
configures all relevant IPs as allowed IPs on the peers of the `cilium_wg0`
interface. We merely need to route traffic between pods and remote nodes (and vice
versa) through that interface. The fact that cilium doesn't already do this in the
configuration described above may arguably be considered a bug.

The second issue, allowing node-to-node encryption on control-plane nodes, is a bit
more tricky, since we need to solve the bootstrapping problem. If you are not
familiar with this problem, I recommend reading the description below and returning
here.

Now, the way this PoC solves - or rather: avoids - the bootstrap problem is that it
exempts only the traffic necessary for bootstrapping from encryption, rather than
the entire node. The key circumstance that allows us to do this is that the
kubernetes-api and etcd connections between control-plane nodes are usually already
encrypted!

Let's walk through how a control-plane node can connect to the wireguard network
even if the other nodes have an outdated public key set for it:

0) The node went offline; traffic from other nodes destined to it is encrypted with
the old key. While this traffic will never be decrypted, it at least remains
encrypted.
1) The node boots and connects to the kubernetes api and etcd (exempted from
encryption).
2) The node can now update its public key on the CiliumNode CRD.
3) The other nodes update their wireguard interface: full, encrypted connectivity
is restored.

## Implementation

The key novelty is to selectively exempt from wireguard exactly the traffic that is
required for bootstrapping and is already encrypted by other means. This can be
achieved using Linux's policy-based routing facilities (`ip rule`), and certainly
also with BPF in cilium's data path.

I will illustrate this with control-plane nodes that run the kube api on port 6443
and etcd on ports 2379 and 2380. The shell script below shows what nodecryptor
does:

```sh
# Ensure traffic already encrypted by wg goes through the main routing table.
# cilium configures the cilium_wg0 interface to emit packets with the 0xe00 mark.
ip rule add fwmark 0xe00 lookup main priority 0

# Add a routing table (id 100) that sends everything through the wg interface.
ip route add default dev cilium_wg0 scope link table 100

# On control-plane nodes: force bootstrap traffic through the main table.
# This matches reply traffic from the exempted ports; the "more correct" solution
# would be to match the src address against the node's InternalIP rather than iif lo.
ip rule add iif lo sport 2379-2380 lookup main priority 200
ip rule add iif lo sport 6443 lookup main priority 200

# For every $CONTROL_PLANE_NODE:
# ensure that traffic to the exempted ports remains unencrypted, ...
ip rule add to $CONTROL_PLANE_NODE dport 2379-2380 lookup main priority 200
ip rule add to $CONTROL_PLANE_NODE dport 6443 lookup main priority 200
# ... then send the rest of the traffic to the table that routes everything
# through the wg interface.
ip rule add to $CONTROL_PLANE_NODE lookup 100 priority 201

# For every $WORKER_NODE:
# regular nodes don't need exemptions.
ip rule add to $WORKER_NODE lookup 100 priority 201
```
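
To make the interplay of the priorities concrete, here is a toy shell function that mimics how the kernel walks these rules in priority order. It is an illustration only; the node IPs are made up, and a real lookup would also consult the selected table's routes:

```sh
#!/bin/sh
# Toy model of the rule set above: given a destination, dport and fwmark,
# report which routing table a packet would hit. CP and WORKER are made-up
# example node IPs, not addresses from a real cluster.
CP=192.168.0.10
WORKER=192.168.0.20

lookup() { # usage: lookup <dst> <dport> <fwmark>
  dst=$1; dport=$2; fwmark=$3
  # priority 0: packets already encrypted by wg carry mark 0xe00 -> main table
  if [ "$fwmark" = "0xe00" ]; then echo main; return; fi
  # priority 200: exempted bootstrap ports on control-plane nodes -> main table
  if [ "$dst" = "$CP" ]; then
    case $dport in 2379|2380|6443) echo main; return;; esac
  fi
  # priority 201: any other traffic to a known node -> table 100 (cilium_wg0)
  if [ "$dst" = "$CP" ] || [ "$dst" = "$WORKER" ]; then
    echo "table 100 (cilium_wg0)"; return
  fi
  # no rule matched -> default main table
  echo main
}

lookup "$CP" 6443 none       # exempted kube-api traffic stays unencrypted
lookup "$CP" 443 none        # everything else to the node is encrypted
lookup "$WORKER" 22 0xe00    # wg's own output re-enters the main table
```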

## The bootstrap problem

A rebooted or reinstalled control-plane node may come up with a fresh wireguard key
pair while its peers still hold its old public key. To publish the new key (via the
CiliumNode CRD), the node must reach the kubernetes api and etcd, but with full
node-to-node encryption that very traffic would be encrypted against the stale key
and dropped: the node cannot announce its key without the api, and cannot reach the
api until its key is known. Cilium currently sidesteps this by excluding
control-plane nodes from node-to-node encryption entirely.

## Limitations of the PoC / Further Ideas

It matches reply traffic from exempted ports using `iif lo`; matching on the nodes'
InternalIPs and ExternalIPs would be better.

It only distinguishes two types of nodes: control-plane and workers. A more general
mechanism would be a custom resource that specifies exempted ports (or other
traffic matchers) and label-selectors, to support arbitrary exemptions on arbitrary
subsets of nodes.
