Commit 297ac47

Merge pull request #34478 from danwinship/iptables-chains
Add blog post about KEP-3178 iptables cleanup

---
layout: blog
title: "Kubernetes’s IPTables Chains Are Not API"
date: 2022-09-07
slug: iptables-chains-not-api
---

**Author:** Dan Winship (Red Hat)
Some Kubernetes components (such as kubelet and kube-proxy) create
iptables chains and rules as part of their operation. These chains
were never intended to be part of any Kubernetes API/ABI guarantees,
but some external components nonetheless make use of some of them (in
particular, using `KUBE-MARK-MASQ` to mark packets as needing to be
masqueraded).

As part of the v1.25 release, SIG Network made this policy explicit:
with one exception, the iptables chains that Kubernetes creates are
intended only for Kubernetes’s own internal use. Third-party
components should not assume that Kubernetes will create any specific
iptables chains, or that those chains will contain any specific rules
if they do exist.

Then, in future releases, as part of [KEP-3178], we will begin phasing
out certain chains that Kubernetes itself no longer needs. Components
outside of Kubernetes that make use of `KUBE-MARK-MASQ`,
`KUBE-MARK-DROP`, or other Kubernetes-generated iptables chains should
start migrating away from them now.

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178
## Background

In addition to various service-specific iptables chains, kube-proxy
creates certain general-purpose iptables chains that it uses as part
of service proxying. In the past, kubelet also used iptables for a few
features (such as setting up `hostPort` mapping for pods) and so it
also redundantly created some of the same chains.

However, with [the removal of dockershim] in Kubernetes 1.24, kubelet
no longer uses any iptables rules for its own purposes; the things it
used to use iptables for are now always the responsibility of the
container runtime or the network plugin, and there is no reason for
kubelet to create any iptables rules.

Meanwhile, although `iptables` is still the default kube-proxy backend
on Linux, it is unlikely to remain the default forever, since the
associated command-line tools and kernel APIs are essentially
deprecated and no longer receiving improvements. (RHEL 9
[logs a warning] if you use the iptables API, even via
`iptables-nft`.)

Although as of Kubernetes 1.25 iptables kube-proxy remains popular,
and kubelet continues to create the iptables rules that it
historically created (despite no longer _using_ them), third-party
software cannot assume that core Kubernetes components will keep
creating these rules in the future.

[the removal of dockershim]: https://kubernetes.io/blog/2022/02/17/dockershim-faq/
[logs a warning]: https://access.redhat.com/solutions/6739041
## Upcoming changes

Starting a few releases from now, kubelet will no longer create the
following iptables chains in the `nat` table:

- `KUBE-MARK-DROP`
- `KUBE-MARK-MASQ`
- `KUBE-POSTROUTING`

Additionally, the `KUBE-FIREWALL` chain in the `filter` table will no
longer have the functionality currently associated with
`KUBE-MARK-DROP` (and it may eventually go away entirely).

This change will be phased in via the `IPTablesOwnershipCleanup`
feature gate. That feature gate is available and can be manually
enabled for testing in Kubernetes 1.25. The current plan is that it
will become enabled-by-default in Kubernetes 1.27, though this may be
delayed to a later release. (It will not happen sooner than Kubernetes
1.27.)
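If you want to test the new behavior before it becomes the default, you can turn the gate on manually. A minimal sketch, assuming you can pass extra command-line flags to kubelet (how component flags are wired up varies by distribution and installer, and all other flags are omitted here):

```shell
# Hypothetical invocation: enable the gate on a Kubernetes 1.25+
# kubelet so it stops creating the deprecated iptables chains.
kubelet --feature-gates=IPTablesOwnershipCleanup=true
```

This is useful mainly for verifying that nothing else on your nodes silently depends on the chains listed above before the new behavior becomes the default.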
## What to do if you use Kubernetes’s iptables chains

(Although the discussion below focuses on short-term fixes that are
still based on iptables, you should probably also start thinking about
eventually migrating to nftables or another API.)
### If you use `KUBE-MARK-MASQ`... {#use-case-kube-mark-masq}

If you are making use of the `KUBE-MARK-MASQ` chain to cause packets
to be masqueraded, you have two options: (1) rewrite your rules to use
`-j MASQUERADE` directly, or (2) create your own alternative “mark for
masquerade” chain.

The reason kube-proxy uses `KUBE-MARK-MASQ` is that there are lots
of cases where it needs to call both `-j DNAT` and `-j MASQUERADE` on
a packet, but it’s not possible to do both of those at the same time
in iptables; `DNAT` must be called from the `PREROUTING` (or `OUTPUT`)
chain (because it potentially changes where the packet will be routed
to) while `MASQUERADE` must be called from `POSTROUTING` (because the
masqueraded source IP that it picks depends on what the final routing
decision was).

In theory, kube-proxy could have one set of rules to match packets in
`PREROUTING`/`OUTPUT` and call `-j DNAT`, and then a second set of
rules to match the same packets in `POSTROUTING` and call `-j
MASQUERADE`. But instead, for efficiency, it only matches them once,
during `PREROUTING`/`OUTPUT`, at which point it calls `-j DNAT` and
then calls `-j KUBE-MARK-MASQ` to set a bit on the kernel packet mark
as a reminder to itself. Then later, during `POSTROUTING`, a single
rule matches all previously-marked packets and calls `-j MASQUERADE`
on them.

If you have _a lot_ of rules where you need to apply both DNAT and
masquerading to the same packets, as kube-proxy does, then you may
want a similar arrangement. But in many cases, components that use
`KUBE-MARK-MASQ` are only doing so because they copied kube-proxy’s
behavior without understanding why kube-proxy was doing it that way.
Many of these components could easily be rewritten to use separate
DNAT and masquerade rules. (In cases where no DNAT is occurring there
is even less point to using `KUBE-MARK-MASQ`; just move your rules
from `PREROUTING` to `POSTROUTING` and call `-j MASQUERADE` directly.)
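To make this concrete, here is a sketch of both options as raw iptables commands. All of the specifics are made up for illustration: the `203.0.113.10` and `10.0.0.5` addresses, the `MY-MASQ-MARK` chain name, and the `0x4000` mark bit (if you use a mark bit, pick one that doesn’t collide with other users of the packet mark, including kube-proxy itself). These commands require root and modify live firewall state.

```shell
# Option 1: no DNAT involved, so no mark is needed; masquerade
# directly from POSTROUTING.
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE

# Option 2: your own "mark for masquerade" chain, mimicking
# kube-proxy's arrangement. Mark and DNAT during PREROUTING...
iptables -t nat -N MY-MASQ-MARK
iptables -t nat -A MY-MASQ-MARK -j MARK --set-xmark 0x4000/0x4000
iptables -t nat -A PREROUTING -p tcp -d 203.0.113.10 --dport 80 -j MY-MASQ-MARK
iptables -t nat -A PREROUTING -p tcp -d 203.0.113.10 --dport 80 -j DNAT --to-destination 10.0.0.5:8080
# ...then masquerade every marked packet during POSTROUTING.
iptables -t nat -A POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
```

Note that in option 2 the `MY-MASQ-MARK` rule has to come before the `DNAT` rule, because `DNAT` is a terminating target: once it matches, no later `PREROUTING` rules are consulted for that packet.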
### If you use `KUBE-MARK-DROP`... {#use-case-kube-mark-drop}

The rationale for `KUBE-MARK-DROP` is similar to the rationale for
`KUBE-MARK-MASQ`: kube-proxy wanted to make packet-dropping decisions
alongside other decisions in the `nat` `KUBE-SERVICES` chain, but you
can only call `-j DROP` from the `filter` table. So instead, it uses
`KUBE-MARK-DROP` to mark packets to be dropped later on.

In general, the approach for removing a dependency on `KUBE-MARK-DROP`
is the same as for removing a dependency on `KUBE-MARK-MASQ`. In
kube-proxy’s case, it is actually quite easy to replace the usage of
`KUBE-MARK-DROP` in the `nat` table with direct calls to `-j DROP` in
the `filter` table, because there are no complicated interactions
between DNAT rules and drop rules, and so the drop rules can simply be
moved from `nat` to `filter`.

In more complicated cases, it might be necessary to “re-match” the
same packets in both `nat` and `filter`.
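For example, a rule that previously marked traffic for dropping from the `nat` table can usually just become a direct `-j DROP` rule in the `filter` table. The source address here, and the choice of the `FORWARD` chain, are purely illustrative:

```shell
# Before (sketch): mark in nat, relying on Kubernetes to drop later.
#   iptables -t nat -A PREROUTING -s 198.51.100.0/24 -j KUBE-MARK-DROP
# After: drop directly in the filter table instead.
iptables -t filter -A FORWARD -s 198.51.100.0/24 -j DROP
```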
### If you use kubelet’s iptables rules to figure out `iptables-legacy` vs `iptables-nft`... {#use-case-iptables-mode}

Components that manipulate host-network-namespace iptables rules from
inside a container need some way to figure out whether the host is
using the old `iptables-legacy` binaries or the newer `iptables-nft`
binaries (which talk to a different kernel API underneath).

The [`iptables-wrappers`] module provides a way for such components to
autodetect the system iptables mode, but in the past it did this by
assuming that kubelet will have created “a bunch” of iptables rules
before any containers start, and guessing which mode the iptables
binaries in the host filesystem are using by seeing which mode has
more rules defined.

In future releases, kubelet will no longer create many iptables rules,
so heuristics based on counting the number of rules present may fail.

However, as of Kubernetes 1.24, kubelet always creates a chain named
`KUBE-IPTABLES-HINT` in the `mangle` table of whichever iptables
subsystem it is using. Components can now look for this specific chain
to know which iptables subsystem kubelet (and thus, presumably, the
rest of the system) is using.
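For instance, a container entrypoint script could probe for the hint chain in each mode, falling back to the older canary chain mentioned below for pre-1.24 nodes. This is just a sketch, assuming the container image ships both `iptables-legacy` and `iptables-nft` binaries and runs with enough privilege to read the host’s rules; it relies on the fact that listing a nonexistent chain exits with a nonzero status:

```shell
#!/bin/sh
# Probe kubelet's hint chains in the mangle table of each subsystem.
if iptables-nft -t mangle -nL KUBE-IPTABLES-HINT >/dev/null 2>&1 || \
   iptables-nft -t mangle -nL KUBE-KUBELET-CANARY >/dev/null 2>&1; then
    mode=nft
elif iptables-legacy -t mangle -nL KUBE-IPTABLES-HINT >/dev/null 2>&1 || \
     iptables-legacy -t mangle -nL KUBE-KUBELET-CANARY >/dev/null 2>&1; then
    mode=legacy
else
    echo "could not determine host iptables mode" >&2
    exit 1
fi
echo "host iptables mode: ${mode}"
```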
(Additionally, since Kubernetes 1.17, kubelet has created a chain
called `KUBE-KUBELET-CANARY` in the `mangle` table. While this chain
may go away in the future, it will of course still be there in older
releases, so in any recent version of Kubernetes, at least one of
`KUBE-IPTABLES-HINT` or `KUBE-KUBELET-CANARY` will be present.)

The `iptables-wrappers` package has [already been updated] with this
new heuristic, so if you were previously using it, you can rebuild
your container images with an updated version.

[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/
[already been updated]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3
## Further reading

The project to clean up iptables chain ownership and deprecate the old
chains is tracked by [KEP-3178].

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178
