
Commit f0544bf

Merge pull request #49393 from danwinship/nftables-blog
nftables kube-proxy blog post for 1.33
2 parents b18fcfb + a446075 commit f0544bf

File tree: 5 files changed, +257 -0 lines changed

Lines changed: 253 additions & 0 deletions
---
layout: blog
title: "NFTables mode for kube-proxy"
date: 2025-02-28
draft: true
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
---

A new nftables mode for kube-proxy was introduced as an alpha feature
in Kubernetes 1.29. Currently in beta, it is expected to be GA as of
1.33. The new mode fixes long-standing performance problems with the
iptables mode, and all users running on systems with reasonably recent
kernels are encouraged to try it out. (For compatibility reasons, even
once nftables becomes GA, iptables will still be the _default_.)

## Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and
has problems scaling up to support Service proxying in a large
Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a
number of iptables rules proportional to the sum of the number of
Services and the total number of endpoints. In particular, at the top
level of the ruleset, there is one rule to test each possible Service
IP (and port) that a packet might be addressed to:

```
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```

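Each of those `KUBE-SVC-*` chains, in turn, contains roughly one rule
per endpoint of its Service, which is where the "total number of
endpoints" part of the ruleset size comes from. As a rough sketch (the
chain suffixes, endpoint IPs, and ports below are illustrative, not
copied from a real cluster), the per-Service processing looks
something like:

```
# Pick one of the Service's endpoints at random...
-A KUBE-SVC-XPGD46QRK7WJZT7O -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EXAMPLE1
-A KUBE-SVC-XPGD46QRK7WJZT7O -j KUBE-SEP-EXAMPLE2

# ...and then DNAT to the chosen endpoint
-A KUBE-SEP-EXAMPLE1 -m tcp -p tcp -j DNAT --to-destination 10.180.0.4:8080
-A KUBE-SEP-EXAMPLE2 -m tcp -p tcp -j DNAT --to-destination 10.180.0.5:8080
```
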
This means that when a packet comes in, the time it takes the kernel
to check it against all of the Service rules is **O(n)** in the number
of Services. As the number of Services increases, both the average and
the worst-case latency for the first packet of a new connection
increase (with the difference between best-case, average, and
worst-case being mostly determined by whether a given Service IP
address appears earlier or later in the `KUBE-SERVICES` chain).

{{< figure src="iptables-only.svg" alt="kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes" >}}

By contrast, with nftables, the normal way to write a ruleset like
this is to have a _single_ rule, using a "verdict map" to do the
dispatch:

```
table ip kube-proxy {

    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in 'service-ips'; and if there's a match, execute the
    # associated verdict".)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }

    ...
}
```

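(The ruleset above is abridged. If you want to poke at the real thing,
you can dump kube-proxy's table on a node running in nftables mode; as
shown above, the IPv4 table is `table ip kube-proxy`, and dual-stack
clusters have a corresponding IPv6 table.)

```
# On a node running kube-proxy in nftables mode (requires root):
nft list table ip kube-proxy
nft list table ip6 kube-proxy   # IPv6 family, if the cluster is dual-stack
```
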
Since there's only a single rule, with a roughly **O(1)** map lookup,
packet processing time is more or less constant regardless of cluster
size, and the best/average/worst cases are very similar:

{{< figure src="nftables-only.svg" alt="kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

But note the huge difference in the vertical scale between the
iptables and nftables graphs! In the clusters with 5000 and 10,000
Services, the p50 (median) latency for nftables is about the same as
the p01 (approximately best-case) latency for iptables. In the 30,000
Service cluster, the p99 (approximately worst-case) latency for
nftables manages to beat out the p01 latency for iptables by a few
microseconds! Here are both sets of data together, though you may have
to squint to see the nftables results:

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

## Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are
great, there's another problem with iptables kube-proxy that often
keeps users from even being able to grow their clusters to that size:
the time it takes kube-proxy to program new iptables rules when
Services and their endpoints change.

With both iptables and nftables, the total size of the ruleset as a
whole (actual rules, plus associated data) is **O(n)** in the combined
number of Services and their endpoints. Originally, the iptables
backend would rewrite every rule on every update, and with tens of
thousands of Services, this could grow to be hundreds of thousands of
iptables rules. Starting in Kubernetes 1.26, we began improving
kube-proxy so that it could skip updating _most_ of the unchanged
rules in each update, but the limitations of `iptables-restore` as an
API meant that it was still always necessary to send an update that's
**O(n)** in the number of Services (though with a noticeably smaller
constant than it used to be). Even with those optimizations, it can
still be necessary to make use of kube-proxy's `minSyncPeriod` config
option to ensure that it doesn't spend every waking second trying to
push iptables updates.

The nftables APIs allow for much more incremental updates, and when
kube-proxy in nftables mode does an update, the size of the update is
only **O(n)** in the number of Services and endpoints that have
changed since the last sync, regardless of the total number of
Services and endpoints. The fact that the nftables API allows each
nftables-using component to have its own private table also means that
there is no global lock contention between components like with
iptables. As a result, kube-proxy's nftables updates can be done much
more efficiently than with iptables.

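To give a feel for what "incremental" means here, the sketch below
uses the `nft` command line tool to add and remove a single Service's
dispatch entry in the table shown earlier. (kube-proxy batches its
changes into transactions rather than running individual commands, and
the names and addresses here are illustrative, but the shape of the
update is the same: it only mentions the things that changed.)

```
# Adding a new Service touches only its own chain and one element in the
# service-ips verdict map; no other rules or elements are resent.
nft add chain ip kube-proxy service-EXAMPLE
nft add element ip kube-proxy service-ips '{ 172.30.0.44 . tcp . 80 : goto service-EXAMPLE }'

# Removing it again is likewise a constant-size update.
nft delete element ip kube-proxy service-ips '{ 172.30.0.44 . tcp . 80 }'
nft delete chain ip kube-proxy service-EXAMPLE
```
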
(Unfortunately I don't have cool graphs for this part.)

## Why _not_ nftables? {#why-not-nftables}

All that said, there are a few reasons why you might not want to jump
right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit
tests, performs correctly in our CI system, and has now been used in
the real world by multiple users, it has not seen anything close to as
much real-world usage as the iptables backend has, so we can't promise
that it is as stable and bug-free.

Second, the nftables mode will not work on older Linux distributions;
currently it requires a 5.13 or newer kernel. Additionally, because of
bugs in early versions of the `nft` command line tool, you should not
run kube-proxy in nftables mode on nodes that have an old (earlier
than 1.0.0) version of `nft` in the host filesystem (or else
kube-proxy's use of nftables may interfere with other uses of nftables
on the system).

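If you're not sure whether a particular node qualifies, checking the
kernel version and the version of any `nft` binary in the host
filesystem is enough:

```
# Run on the node: the kernel must be 5.13 or newer, and any host nft binary
# should be 1.0.0 or newer.
uname -r
nft --version
```
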
Third, you may have other networking components in your cluster, such
as the pod network or NetworkPolicy implementation, that do not yet
support kube-proxy in nftables mode. You should consult the
documentation (or forums, bug tracker, etc.) for any such components
to see if they have problems with nftables mode. (In many cases they
will not; as long as they don't try to directly interact with or
override kube-proxy's iptables rules, they shouldn't care whether
kube-proxy is using iptables or nftables.) Additionally, observability
and monitoring tools that have not been updated may report less data
for kube-proxy in nftables mode than they do for kube-proxy in
iptables mode.

Finally, kube-proxy in nftables mode is intentionally not 100%
compatible with kube-proxy in iptables mode. There are a few old
kube-proxy features whose default behaviors are less secure, less
performant, or less intuitive than we'd like, but where we felt that
changing the default would be a compatibility break. Since the
nftables mode is opt-in, this gave us a chance to fix those bad
defaults without breaking users who weren't expecting changes. (In
particular, with nftables mode, NodePort Services are now only
reachable on their nodes' default IPs, as opposed to being reachable
on all IPs, including `127.0.0.1`, with iptables mode.) The
[kube-proxy documentation] has more information about this, including
metrics you can look at to determine if you are relying on any of the
changed functionality, and what configuration options are available to
get more backward-compatible behavior.

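(As one example of such an option, the `nodePortAddresses` field of
`KubeProxyConfiguration` controls which node IP ranges NodePort
traffic is accepted on; the snippet below is only an illustration, and
the linked documentation is the authoritative reference for the
nftables-mode defaults and semantics.)

```
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables
# Illustrative: accept NodePort traffic on node IPs within this range, rather
# than only on the node's default IP.
nodePortAddresses:
  - 192.0.2.0/24
```
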
[kube-proxy documentation]: https://kubernetes.io/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables

## Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to
pass `--proxy-mode nftables` to kube-proxy (or set `mode: nftables` in
your kube-proxy config file).

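For reference, the config-file equivalent of `--proxy-mode nftables`
is the `mode` field of `KubeProxyConfiguration`:

```
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables
```
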
If you are using kubeadm to set up your cluster, the kubeadm
documentation explains [how to pass a `KubeProxyConfiguration` to
`kubeadm init`]. You can also [deploy nftables-based clusters with
`kind`].

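As a rough sketch: for kubeadm, you append a `KubeProxyConfiguration`
document like the one above (separated by `---`) to the configuration
file you pass to `kubeadm init --config`; for kind, the proxy mode is
a field in the cluster configuration, per the kind documentation
linked above:

```
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: "nftables"
```
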
You can also convert existing clusters from iptables (or ipvs) mode to
nftables by updating the kube-proxy configuration and restarting the
kube-proxy pods. (You do not need to reboot the nodes: when restarting
in nftables mode, kube-proxy will delete any existing iptables or ipvs
rules, and likewise, if you later revert back to iptables or ipvs
mode, it will delete any existing nftables rules.)

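For example, in a kubeadm-managed cluster, where kube-proxy runs as a
DaemonSet configured through the `kube-proxy` ConfigMap, the
conversion might look like this (adjust for however your cluster
manages kube-proxy):

```
# Set "mode: nftables" in the KubeProxyConfiguration embedded in the ConfigMap,
# then restart the kube-proxy pods so they pick up the new mode.
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system rollout restart daemonset kube-proxy
```
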
[how to pass a `KubeProxyConfiguration` to `kubeadm init`]: /docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[deploy nftables-based clusters with `kind`]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode

## Future plans

As mentioned above, while nftables is now the _best_ kube-proxy mode,
it is not the _default_, and we do not yet have a plan for changing
that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main
advantage over iptables was that it was faster, but certain aspects of
the IPVS architecture and APIs were awkward for kube-proxy's purposes
(for example, the fact that the `kube-ipvs0` device needs to have
_every_ Service IP address assigned to it), and some parts of
Kubernetes Service proxying semantics were difficult to implement
using IPVS (particularly the fact that some Services had to have
different endpoints depending on whether you connected to them from a
local or remote client). And now, the nftables mode has the same
performance as IPVS mode (actually, slightly better), without any of
the downsides:

{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

(In theory the IPVS mode also has the advantage of being able to use
various other IPVS functionality, like alternative "schedulers" for
balancing endpoints. In practice, this ended up not being very useful,
because kube-proxy runs independently on every node, and the IPVS
schedulers on each node had no way of sharing their state with the
proxies on other nodes, thus thwarting the effort to balance traffic
more cleverly.)

While the Kubernetes project does not have an immediate plan to drop
the IPVS backend, it is probably doomed in the long run, and people
who are currently using IPVS mode should try out the nftables mode
instead (and file bugs if they find missing functionality in nftables
mode that they can't work around).

## Learn more

- "[KEP-3866: Add an nftables-based kube-proxy backend]" has the
  history of the new feature.

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]",
  from KubeCon/CloudNativeCon North America 2024, talks about porting
  kube-proxy and Calico from iptables to nftables.

- "[From Observability to Performance]", from KubeCon/CloudNativeCon
  North America 2024. (This is where the kube-proxy latency data came
  from; the [raw data for the charts] is also available.)

[KEP-3866: Add an nftables-based kube-proxy backend]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md
[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]: https://youtu.be/yOGHb2HjslY?si=6O4PVJu7fGpReo1U
[From Observability to Performance]: https://youtu.be/uYo2O3jbJLk?si=py2AXzMJZ4PuhxNg
[raw data for the charts]: https://docs.google.com/spreadsheets/d/1-ryDNc6gZocnMHEXC7mNtqknKSOv5uhXFKDx8Hu3AYA/edit

content/en/blog/_posts/2025-02-28-nftables-kube-proxy/iptables-only.svg

Lines changed: 1 addition & 0 deletions

content/en/blog/_posts/2025-02-28-nftables-kube-proxy/iptables-vs-nftables.svg

Lines changed: 1 addition & 0 deletions

content/en/blog/_posts/2025-02-28-nftables-kube-proxy/ipvs-vs-nftables.svg

Lines changed: 1 addition & 0 deletions

content/en/blog/_posts/2025-02-28-nftables-kube-proxy/nftables-only.svg

Lines changed: 1 addition & 0 deletions
