---
layout: blog
title: "NFTables mode for kube-proxy"
date: 2025-02-28
draft: true
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
---

A new nftables mode for kube-proxy was introduced as an alpha feature
in Kubernetes 1.29. Currently in beta, it is expected to be GA as of
1.33. The new mode fixes long-standing performance problems with the
iptables mode, and all users running on systems with reasonably recent
kernels are encouraged to try it out. (For compatibility reasons, even
once nftables becomes GA, iptables will still be the _default_.)

## Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and
has problems scaling up to support Service proxying in a large
Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a
number of iptables rules proportional to the sum of the number of
Services and the total number of endpoints. In particular, at the top
level of the ruleset, there is one rule to test each possible Service
IP (and port) that a packet might be addressed to:

```
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```

This means that when a packet comes in, the time it takes the kernel
to check it against all of the Service rules is **O(n)** in the number
of Services. As the number of Services increases, both the average and
the worst-case latency for the first packet of a new connection
increase (with the difference between best-case, average, and
worst-case being mostly determined by whether a given Service IP
address appears earlier or later in the `KUBE-SERVICES` chain).

{{< figure src="iptables-only.svg" alt="kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes" >}}

By contrast, with nftables, the normal way to write a ruleset like
this is to have a _single_ rule, using a "verdict map" to do the
dispatch:

```
table ip kube-proxy {

    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in the service-ips map; and if there's a match, execute
    # the associated verdict".)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }

    ...
}
```

Since there's only a single rule, with a roughly **O(1)** map lookup,
packet processing time is more or less constant regardless of cluster
size, and the best/average/worst cases are very similar:

{{< figure src="nftables-only.svg" alt="kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

But note the huge difference in the vertical scale between the
iptables and nftables graphs! In the clusters with 5000 and 10,000
Services, the p50 (median) latency for nftables is about the same as
the p01 (approximately best-case) latency for iptables. In the 30,000
Service cluster, the p99 (approximately worst-case) latency for
nftables manages to beat out the p01 latency for iptables by a few
microseconds! Here are both sets of data together, but you may have to
squint to see the nftables results:

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

## Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are
great, there's another problem with iptables kube-proxy that often
keeps users from even being able to grow their clusters to that size:
the time it takes kube-proxy to program new iptables rules when
Services and their endpoints change.

With both iptables and nftables, the total size of the ruleset as a
whole (actual rules, plus associated data) is **O(n)** in the combined
number of Services and their endpoints. Originally, the iptables
backend would rewrite every rule on every update, and with tens of
thousands of Services, this could grow to be hundreds of thousands of
iptables rules. Starting in Kubernetes 1.26, we began improving
kube-proxy so that it could skip updating _most_ of the unchanged
rules in each update, but the limitations of `iptables-restore` as an
API meant that it was still always necessary to send an update that's
**O(n)** in the number of Services (though with a noticeably smaller
constant than it used to be). Even with those optimizations, it can
still be necessary to make use of kube-proxy's `minSyncPeriod` config
option to ensure that it doesn't spend every waking second trying to
push iptables updates.
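
(For reference, `minSyncPeriod` is set in the kube-proxy configuration
file; the sketch below shows where it goes, with an arbitrary example
value rather than a recommendation.)

```
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  # Don't re-push the ruleset more often than once every 10 seconds,
  # even if Services and endpoints are changing constantly.
  minSyncPeriod: 10s
```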

The nftables APIs allow for much more incremental updates: when
kube-proxy in nftables mode does an update, the size of the update is
only **O(n)** in the number of Services and endpoints that have
changed since the last sync, regardless of the total number of
Services and endpoints. The fact that the nftables API allows each
nftables-using component to have its own private table also means that
there is no global lock contention between components, as there is
with iptables. As a result, kube-proxy's nftables updates can be done
much more efficiently than with iptables.
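
To make that concrete, here is a rough sketch (with made-up Service
IPs and chain names) of the kind of change the nftables API allows.
kube-proxy actually batches its changes into a single atomic
transaction rather than running individual commands, but the shape of
the update is the same: a few map elements are added or deleted, and
every other rule in the table is left untouched.

```
# A new Service appears and an old one goes away; a change like this
# can be applied atomically with something like "nft -f update.nft".
add element ip kube-proxy service-ips { 172.30.0.44 . tcp . 80 : goto service-EXAMPLE0-namespace4/service4/tcp/p80 }
delete element ip kube-proxy service-ips { 172.30.0.43 . tcp . 80 }
```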

(Unfortunately I don't have cool graphs for this part.)

## Why _not_ nftables? {#why-not-nftables}

All that said, there are a few reasons why you might not want to jump
right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit
tests, performs correctly in our CI system, and has now been used in
the real world by multiple users, it has not seen nearly as much
real-world usage as the iptables backend has, so we can't promise that
it is as stable and bug-free.

Second, the nftables mode will not work on older Linux distributions;
currently it requires a 5.13 or newer kernel. Additionally, because of
bugs in early versions of the `nft` command line tool, you should not
run kube-proxy in nftables mode on nodes that have an old (earlier
than 1.0.0) version of `nft` in the host filesystem; otherwise
kube-proxy's use of nftables may interfere with other uses of nftables
on the system.
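
A quick way to check both of those requirements on a given node is to
look at the kernel version and the `nft` version directly:

```
# Kernel version; nftables mode needs 5.13 or newer
uname -r

# Version of the nft command-line tool in the host filesystem;
# it should be 1.0.0 or newer
nft --version
```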

Third, you may have other networking components in your cluster, such
as the pod network or NetworkPolicy implementation, that do not yet
support kube-proxy in nftables mode. You should consult the
documentation (or forums, bug tracker, etc.) for any such components
to see if they have problems with nftables mode. (In many cases they
will not; as long as they don't try to directly interact with or
override kube-proxy's iptables rules, they shouldn't care whether
kube-proxy is using iptables or nftables.) Additionally, observability
and monitoring tools that have not been updated may report less data
for kube-proxy in nftables mode than they do for kube-proxy in
iptables mode.

Finally, kube-proxy in nftables mode is intentionally not 100%
compatible with kube-proxy in iptables mode. There are a few old
kube-proxy features whose default behaviors are less secure, less
performant, or less intuitive than we'd like, but where we felt that
changing the default would be a compatibility break. Since the
nftables mode is opt-in, this gave us a chance to fix those bad
defaults without breaking users who weren't expecting changes. (In
particular, with nftables mode, NodePort Services are now only
reachable on their nodes' default IPs, as opposed to being reachable
on all IPs, including `127.0.0.1`, with iptables mode.) The
[kube-proxy documentation] has more information about this, including
metrics you can look at to determine whether you are relying on any of
the changed functionality, and configuration options that are
available to get more backward-compatible behavior.

[kube-proxy documentation]: https://kubernetes.io/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables

## Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to
pass `--proxy-mode nftables` to kube-proxy (or set `mode: nftables` in
your kube-proxy config file).
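
If you use a config file, the relevant field is `mode`; a minimal
sketch of the corresponding `KubeProxyConfiguration` fragment looks
something like this:

```
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Select the nftables backend instead of the default iptables backend.
mode: "nftables"
```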

If you are using kubeadm to set up your cluster, the kubeadm
documentation explains [how to pass a `KubeProxyConfiguration` to
`kubeadm init`]. You can also [deploy nftables-based clusters with
`kind`].
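
(With `kind`, for example, the proxy mode is selected in the cluster
configuration; at the time of writing that looks roughly like the
following, but see the linked kind documentation for the authoritative
syntax.)

```
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # Tell kind to configure kube-proxy in nftables mode.
  kubeProxyMode: "nftables"
```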

You can also convert existing clusters from iptables (or ipvs) mode to
nftables by updating the kube-proxy configuration and restarting the
kube-proxy pods. (You do not need to reboot the nodes: when restarting
in nftables mode, kube-proxy will delete any existing iptables or ipvs
rules, and likewise, if you later revert to iptables or ipvs mode, it
will delete any existing nftables rules.)

[how to pass a `KubeProxyConfiguration` to `kubeadm init`]: /docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[deploy nftables-based clusters with `kind`]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode

## Future plans

As mentioned above, while nftables is now the _best_ kube-proxy mode,
it is not the _default_, and we do not yet have a plan for changing
that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main
advantage over iptables was that it was faster, but certain aspects of
the IPVS architecture and APIs were awkward for kube-proxy's purposes
(for example, the fact that the `kube-ipvs0` device needs to have
_every_ Service IP address assigned to it), and some parts of
Kubernetes Service proxying semantics were difficult to implement
using IPVS (particularly the fact that some Services had to have
different endpoints depending on whether you connected to them from a
local or remote client). And now, the nftables mode has the same
performance as IPVS mode (actually, slightly better), without any of
the downsides:

{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

(In theory the IPVS mode also has the advantage of being able to use
various other IPVS functionality, like alternative "schedulers" for
balancing endpoints. In practice, this ended up not being very useful,
because kube-proxy runs independently on every node, and the IPVS
schedulers on each node had no way of sharing their state with the
proxies on other nodes, thus thwarting the effort to balance traffic
more cleverly.)

While the Kubernetes project does not have an immediate plan to drop
the IPVS backend, it is probably doomed in the long run, and people
who are currently using IPVS mode should try out the nftables mode
instead (and file bugs if they find missing functionality in nftables
mode that they can't work around).

## Learn more

- "[KEP-3866: Add an nftables-based kube-proxy backend]" has the
  history of the new feature.

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]",
  from KubeCon/CloudNativeCon North America 2024, talks about porting
  kube-proxy and Calico from iptables to nftables.

- "[From Observability to Performance]", from KubeCon/CloudNativeCon
  North America 2024. (This is where the kube-proxy latency data came
  from; the [raw data for the charts] is also available.)

[KEP-3866: Add an nftables-based kube-proxy backend]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md
[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]: https://youtu.be/yOGHb2HjslY?si=6O4PVJu7fGpReo1U
[From Observability to Performance]: https://youtu.be/uYo2O3jbJLk?si=py2AXzMJZ4PuhxNg
[raw data for the charts]: https://docs.google.com/spreadsheets/d/1-ryDNc6gZocnMHEXC7mNtqknKSOv5uhXFKDx8Hu3AYA/edit