---
layout: blog
title: "Kubernetes 的 iptables 链不是 API"
date: 2022-09-07
slug: iptables-chains-not-api
---

<!--
layout: blog
title: "Kubernetes’s IPTables Chains Are Not API"
date: 2022-09-07
slug: iptables-chains-not-api
-->

<!--
**Author:** Dan Winship (Red Hat)
-->
**作者:** Dan Winship (Red Hat)

**译者:** Xin Li (DaoCloud)

<!--
Some Kubernetes components (such as kubelet and kube-proxy) create
iptables chains and rules as part of their operation. These chains
were never intended to be part of any Kubernetes API/ABI guarantees,
but some external components nonetheless make use of some of them (in
particular, using `KUBE-MARK-MASQ` to mark packets as needing to be
masqueraded).
-->
一些 Kubernetes 组件(例如 kubelet 和 kube-proxy)在执行操作时,会创建特定的 iptables 链和规则。
这些链从未打算成为任何 Kubernetes API/ABI 保证的一部分,
但一些外部组件仍然使用其中的一些链(特别是使用 `KUBE-MARK-MASQ` 将数据包标记为需要伪装)。

<!--
As a part of the v1.25 release, SIG Network made this declaration
explicit: that (with one exception), the iptables chains that
Kubernetes creates are intended only for Kubernetes’s own internal
use, and third-party components should not assume that Kubernetes will
create any specific iptables chains, or that those chains will contain
any specific rules if they do exist.
-->
作为 v1.25 版本的一部分,SIG Network 明确做出声明:
(除一个例外)Kubernetes 创建的 iptables 链仅供 Kubernetes 内部使用,
第三方组件不应假定 Kubernetes 会创建任何特定的 iptables 链,
也不应假定这些链(即使确实存在)会包含任何特定的规则。

<!--
Then, in future releases, as part of [KEP-3178], we will begin phasing
out certain chains that Kubernetes itself no longer needs. Components
outside of Kubernetes itself that make use of `KUBE-MARK-MASQ`,
`KUBE-MARK-DROP`, or other Kubernetes-generated iptables chains should
start migrating away from them now.
-->
然后,在未来的版本中,作为 [KEP-3178] 的一部分,我们将开始逐步淘汰 Kubernetes
本身不再需要的某些链。Kubernetes 自身之外、使用了 `KUBE-MARK-MASQ`、`KUBE-MARK-DROP`
或其他 Kubernetes 所生成的 iptables 链的组件,现在就应当开始放弃使用这些链。

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178

<!--
## Background

In addition to various service-specific iptables chains, kube-proxy
creates certain general-purpose iptables chains that it uses as part
of service proxying. In the past, kubelet also used iptables for a few
features (such as setting up `hostPort` mapping for pods) and so it
also redundantly created some of the same chains.
-->
## 背景 {#background}

除了各种为 Service 创建的 iptables 链之外,kube-proxy 还创建了某些通用的 iptables 链,
用作服务代理的一部分。过去,kubelet 还使用 iptables
来实现一些功能(例如为 Pod 设置 `hostPort` 映射),因此它也冗余地创建了一些相同的链。

<!--
However, with [the removal of dockershim] in Kubernetes in 1.24,
kubelet now no longer ever uses any iptables rules for its own
purposes; the things that it used to use iptables for are now always
the responsibility of the container runtime or the network plugin, and
there is no reason for kubelet to be creating any iptables rules.

Meanwhile, although `iptables` is still the default kube-proxy backend
on Linux, it is unlikely to remain the default forever, since the
associated command-line tools and kernel APIs are essentially
deprecated, and no longer receiving improvements. (RHEL 9
[logs a warning] if you use the iptables API, even via
`iptables-nft`.)
-->
然而,随着 Kubernetes 1.24 中 [dockershim 的移除],
kubelet 现在不再为其自身的目的使用任何 iptables 规则;
过去它使用 iptables 来完成的事情,现在总是由容器运行时或网络插件负责,
kubelet 已没有理由创建任何 iptables 规则。

同时,虽然 iptables 仍然是 Linux 上默认的 kube-proxy 后端,
但它不太可能永远保持默认,因为相关的命令行工具和内核 API 基本上已被弃用,
并且不再得到改进。(如果你使用 iptables API,即使是通过 `iptables-nft`,RHEL 9 也会[记录警告]。)

<!--
Although as of Kubernetes 1.25 iptables kube-proxy remains popular,
and kubelet continues to create the iptables rules that it
historically created (despite no longer _using_ them), third party
software cannot assume that core Kubernetes components will keep
creating these rules in the future.

[the removal of dockershim]: https://kubernetes.io/blog/2022/02/17/dockershim-faq/
[logs a warning]: https://access.redhat.com/solutions/6739041
-->
尽管在 Kubernetes 1.25 中,iptables 模式的 kube-proxy 仍然很流行,
并且 kubelet 继续创建它过去创建的那些 iptables 规则(尽管不再**使用**它们),
但第三方软件不能假设核心 Kubernetes 组件将来会继续创建这些规则。

[dockershim 的移除]: https://kubernetes.io/zh-cn/blog/2022/02/17/dockershim-faq/
[记录警告]: https://access.redhat.com/solutions/6739041

<!--
## Upcoming changes

Starting a few releases from now, kubelet will no longer create the
following iptables chains in the `nat` table:

- `KUBE-MARK-DROP`
- `KUBE-MARK-MASQ`
- `KUBE-POSTROUTING`

Additionally, the `KUBE-FIREWALL` chain in the `filter` table will no
longer have the functionality currently associated with
`KUBE-MARK-DROP` (and it may eventually go away entirely).
-->
## 即将发生的变化

从现在起的几个版本之后,kubelet 将不再在 `nat` 表中创建以下 iptables 链:

- `KUBE-MARK-DROP`
- `KUBE-MARK-MASQ`
- `KUBE-POSTROUTING`

此外,`filter` 表中的 `KUBE-FIREWALL` 链将不再具有当前与
`KUBE-MARK-DROP` 关联的功能(并且它最终可能会完全消失)。

<!--
This change will be phased in via the `IPTablesOwnershipCleanup`
feature gate. That feature gate is available and can be manually
enabled for testing in Kubernetes 1.25. The current plan is that it
will become enabled-by-default in Kubernetes 1.27, though this may be
delayed to a later release. (It will not happen sooner than Kubernetes
1.27.)
-->
此更改将通过 `IPTablesOwnershipCleanup` 特性门控逐步实施。
该特性门控在 Kubernetes 1.25 中已经可用,可以手动启用以进行测试。
目前的计划是在 Kubernetes 1.27 中将其默认启用,
尽管这可能会推迟到更晚的版本。(但不会早于 Kubernetes 1.27。)
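
<!--
If you want to experiment with this ahead of time, the feature gate can
be enabled in the usual way; a sketch (the flag shown is the standard
Kubernetes feature-gate syntax, and the rest of each command line is
elided):
-->
如果你想提前试验这一变化,可以按常规方式启用该特性门控;
下面是一个示意(所示标志是标准的 Kubernetes 特性门控语法,命令行的其余部分从略):

```
# 在 kubelet 和 kube-proxy 上启用清理行为以进行测试
# (Kubernetes 1.25+,该版本中默认关闭)
kubelet --feature-gates=IPTablesOwnershipCleanup=true ...
kube-proxy --feature-gates=IPTablesOwnershipCleanup=true ...
```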

<!--
## What to do if you use Kubernetes’s iptables chains

(Although the discussion below focuses on short-term fixes that are
still based on iptables, you should probably also start thinking about
eventually migrating to nftables or another API).
-->
## 如果你使用 Kubernetes 的 iptables 链怎么办

(尽管下面的讨论侧重于仍然基于 iptables 的短期修复,
但你可能也应该开始考虑最终迁移到 nftables 或其他 API。)

<!--
### If you use `KUBE-MARK-MASQ`... {#use-case-kube-mark-masq}

If you are making use of the `KUBE-MARK-MASQ` chain to cause packets
to be masqueraded, you have two options: (1) rewrite your rules to use
`-j MASQUERADE` directly, (2) create your own alternative “mark for
masquerade” chain.
-->
### 如果你使用 `KUBE-MARK-MASQ` 链... {#use-case-kube-mark-masq}

如果你正在使用 `KUBE-MARK-MASQ` 链来伪装数据包,
你有两个选择:(1)重写你的规则以直接使用 `-j MASQUERADE`;
(2)创建你自己的替代链,完成“为伪装而设标记”的任务。

<!--
The reason kube-proxy uses `KUBE-MARK-MASQ` is because there are lots
of cases where it needs to call both `-j DNAT` and `-j MASQUERADE` on
a packet, but it’s not possible to do both of those at the same time
in iptables; `DNAT` must be called from the `PREROUTING` (or `OUTPUT`)
chain (because it potentially changes where the packet will be routed
to) while `MASQUERADE` must be called from `POSTROUTING` (because the
masqueraded source IP that it picks depends on what the final routing
decision was).
-->
kube-proxy 之所以使用 `KUBE-MARK-MASQ`,是因为在很多情况下它需要对同一个数据包同时调用
`-j DNAT` 和 `-j MASQUERADE`,但在 iptables 中不可能同时完成这两种操作;
`DNAT` 必须从 `PREROUTING`(或 `OUTPUT`)链中调用(因为它可能会改变数据包将被路由到的位置),而
`MASQUERADE` 必须从 `POSTROUTING` 中调用(因为它所选择的伪装源 IP 取决于最终的路由决策)。

<!--
In theory, kube-proxy could have one set of rules to match packets in
`PREROUTING`/`OUTPUT` and call `-j DNAT`, and then have a second set
of rules to match the same packets in `POSTROUTING` and call `-j
MASQUERADE`. But instead, for efficiency, it only matches them once,
during `PREROUTING`/`OUTPUT`, at which point it calls `-j DNAT` and
then calls `-j KUBE-MARK-MASQ` to set a bit on the kernel packet mark
as a reminder to itself. Then later, during `POSTROUTING`, it has a
single rule that matches all previously-marked packets, and calls `-j
MASQUERADE` on them.
-->
理论上,kube-proxy 可以用一组规则在 `PREROUTING`/`OUTPUT`
中匹配数据包并调用 `-j DNAT`,然后用第二组规则在 `POSTROUTING`
中匹配相同的数据包并调用 `-j MASQUERADE`。
但为了提高效率,kube-proxy 只匹配一次:在 `PREROUTING`/`OUTPUT` 期间调用 `-j DNAT`,
然后调用 `-j KUBE-MARK-MASQ` 在内核数据包标记上设置一个比特位,作为对自身的提醒。
之后,在 `POSTROUTING` 期间,它用一条规则匹配所有先前标记过的数据包,并对它们调用 `-j MASQUERADE`。

<!--
If you have _a lot_ of rules where you need to apply both DNAT and
masquerading to the same packets like kube-proxy does, then you may
want a similar arrangement. But in many cases, components that use
`KUBE-MARK-MASQ` are only doing it because they copied kube-proxy’s
behavior without understanding why kube-proxy was doing it that way.
Many of these components could easily be rewritten to just use
separate DNAT and masquerade rules. (In cases where no DNAT is
occurring then there is even less point to using `KUBE-MARK-MASQ`;
just move your rules from `PREROUTING` to `POSTROUTING` and call `-j
MASQUERADE` directly.)
-->
如果你像 kube-proxy 一样,有**很多**规则需要对同一个数据包同时执行 DNAT 和伪装操作,
那么你可能需要类似的安排。但在许多情况下,使用 `KUBE-MARK-MASQ` 的组件之所以这样做,
只是因为它们复制了 kube-proxy 的行为,而并不理解 kube-proxy 为何这样做。
这些组件中有许多可以很容易地改写为只使用各自独立的 DNAT 和伪装规则。
(在没有发生 DNAT 的情况下,使用 `KUBE-MARK-MASQ` 的意义就更小了;
只需将你的规则从 `PREROUTING` 移至 `POSTROUTING` 并直接调用 `-j MASQUERADE`。)
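
<!--
For example, a component whose only reason for jumping to
`KUBE-MARK-MASQ` is to masquerade some traffic (no DNAT involved) could
instead carry one direct rule; the address below is hypothetical:
-->
例如,某个组件跳转到 `KUBE-MARK-MASQ` 的唯一目的只是伪装某些流量(不涉及 DNAT),
那么它可以改为携带一条直接的规则;下面的地址纯属假设:

```
# 原来:在 PREROUTING 可达的链中 -j KUBE-MARK-MASQ,
#   依赖 Kubernetes 的 KUBE-POSTROUTING 规则完成伪装
# 现在:在 POSTROUTING 中匹配同样的流量并直接伪装
iptables -t nat -A POSTROUTING -s 192.168.99.0/24 ! -d 192.168.99.0/24 -j MASQUERADE
```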

<!--
### If you use `KUBE-MARK-DROP`... {#use-case-kube-mark-drop}

The rationale for `KUBE-MARK-DROP` is similar to the rationale for
`KUBE-MARK-MASQ`: kube-proxy wanted to make packet-dropping decisions
alongside other decisions in the `nat` `KUBE-SERVICES` chain, but you
can only call `-j DROP` from the `filter` table. So instead, it uses
`KUBE-MARK-DROP` to mark packets to be dropped later on.
-->
### 如果你使用 `KUBE-MARK-DROP`... {#use-case-kube-mark-drop}

`KUBE-MARK-DROP` 的设计原理与 `KUBE-MARK-MASQ` 类似:
kube-proxy 想在 `nat` 表的 `KUBE-SERVICES` 链中与其他决策一起做出丢包决策,
但只能从 `filter` 表中调用 `-j DROP`。
因此,它改用 `KUBE-MARK-DROP` 来标记数据包,以便稍后将其丢弃。

<!--
In general, the approach for removing a dependency on `KUBE-MARK-DROP`
is the same as for removing a dependency on `KUBE-MARK-MASQ`. In
kube-proxy’s case, it is actually quite easy to replace the usage of
`KUBE-MARK-DROP` in the `nat` table with direct calls to `DROP` in the
`filter` table, because there are no complicated interactions between
DNAT rules and drop rules, and so the drop rules can simply be moved
from `nat` to `filter`.

In more complicated cases, it might be necessary to “re-match” the
same packets in both `nat` and `filter`.
-->
通常,去除对 `KUBE-MARK-DROP` 的依赖的方法与去除对 `KUBE-MARK-MASQ` 的依赖的方法相同。
在 kube-proxy 的场景中,实际上很容易将 `nat` 表中对 `KUBE-MARK-DROP`
的使用替换为在 `filter` 表中直接调用 `DROP`,因为 DNAT 规则和丢包规则之间没有复杂的交互,
所以丢包规则可以简单地从 `nat` 移到 `filter`。

在更复杂的场景中,可能需要在 `nat` 和 `filter` 表中“重新匹配”相同的数据包。
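
<!--
A minimal sketch of that move (the service address and port here are
hypothetical): instead of marking the packet in `nat` and relying on
`KUBE-MARK-DROP` plus the `KUBE-FIREWALL` chain to drop it later,
re-match the same packets in `filter` and drop them directly:
-->
下面是这种迁移的一个极简示意(其中的服务地址和端口纯属假设):
不再在 `nat` 表中标记数据包、依赖 `KUBE-MARK-DROP` 加 `KUBE-FIREWALL`
链稍后丢弃,而是在 `filter` 表中重新匹配相同的数据包并直接丢弃:

```
# 原来:iptables -t nat -A MY-SERVICES -d 10.0.0.1 -p tcp --dport 80 -j KUBE-MARK-DROP
# 现在:iptables -t filter -A FORWARD -d 10.0.0.1 -p tcp --dport 80 -j DROP
```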

<!--
### If you use Kubelet’s iptables rules to figure out `iptables-legacy` vs `iptables-nft`... {#use-case-iptables-mode}

Components that manipulate host-network-namespace iptables rules from
inside a container need some way to figure out whether the host is
using the old `iptables-legacy` binaries or the newer `iptables-nft`
binaries (which talk to a different kernel API underneath).
-->
### 如果你使用 kubelet 的 iptables 规则来判断 `iptables-legacy` 与 `iptables-nft`... {#use-case-iptables-mode}

从容器内部操纵主机网络命名空间 iptables 规则的组件,需要某种方法来判断主机使用的是旧的
`iptables-legacy` 二进制文件,还是较新的 `iptables-nft` 二进制文件(二者底层与不同的内核 API 对话)。

<!--
The [`iptables-wrappers`] module provides a way for such components to
autodetect the system iptables mode, but in the past it did this by
assuming that Kubelet will have created “a bunch” of iptables rules
before any containers start, and so it can guess which mode the
iptables binaries in the host filesystem are using by seeing which
mode has more rules defined.

In future releases, Kubelet will no longer create many iptables rules,
so heuristics based on counting the number of rules present may fail.
-->
[`iptables-wrappers`] 模块为此类组件提供了一种自动检测系统 iptables 模式的方法,
但在过去,它的做法是假设 kubelet 会在任何容器启动之前创建“一堆”iptables 规则,
因此可以通过查看哪种模式下定义的规则更多,
来猜测主机文件系统中的 iptables 二进制文件使用的是哪种模式。

在未来的版本中,kubelet 将不再创建很多 iptables 规则,
因此基于统计现有规则数量的启发式方法可能会失效。

<!--
However, as of 1.24, Kubelet always creates a chain named
`KUBE-IPTABLES-HINT` in the `mangle` table of whichever iptables
subsystem it is using. Components can now look for this specific chain
to know which iptables subsystem Kubelet (and thus, presumably, the
rest of the system) is using.

(Additionally, since Kubernetes 1.17, kubelet has created a chain
called `KUBE-KUBELET-CANARY` in the `mangle` table. While this chain
may go away in the future, it will of course still be there in older
releases, so in any recent version of Kubernetes, at least one of
`KUBE-IPTABLES-HINT` or `KUBE-KUBELET-CANARY` will be present.)
-->
不过,从 1.24 开始,无论 kubelet 使用的是哪个 iptables 子系统,它总会在其
`mangle` 表中创建一个名为 `KUBE-IPTABLES-HINT` 的链。
组件现在可以查找这个特定的链,以了解 kubelet(进而推测系统的其余部分)正在使用哪个 iptables 子系统。

(此外,从 Kubernetes 1.17 开始,kubelet 就会在 `mangle` 表中创建一个名为
`KUBE-KUBELET-CANARY` 的链。虽然这条链将来可能会消失,但它在旧版本中当然仍然存在,
因此在任何较新版本的 Kubernetes 中,`KUBE-IPTABLES-HINT` 和
`KUBE-KUBELET-CANARY` 这两条链至少会存在其中之一。)
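
<!--
A sketch of that detection as a shell function (this is not the
official `iptables-wrappers` code; the `ipt` parameter is an
indirection so the probe can be exercised without real iptables
binaries, and defaulting to nft is our own guess for modern hosts):
-->
这种检测逻辑可以用一个 shell 函数来勾画(这并不是 `iptables-wrappers`
的官方实现;参数 `ipt` 只是一层间接封装,以便在没有真实 iptables
二进制文件时也能演练该探测;“默认猜测 nft”也只是针对现代主机的假设):

```shell
# 如果 kubelet 的提示链(或较老的金丝雀链)存在于
# 通过 "$1" 调用的 iptables 后端的 mangle 表中,则返回成功。
kubelet_chains_present() {
    ipt="$1"   # 例如 "iptables-nft" 或 "iptables-legacy"
    "$ipt" -t mangle -nL KUBE-IPTABLES-HINT >/dev/null 2>&1 && return 0
    "$ipt" -t mangle -nL KUBE-KUBELET-CANARY >/dev/null 2>&1
}

# 优先选择 kubelet 实际在用的后端;
# 如果两个后端都看不到 kubelet 的链,则默认猜测 nft(假设)。
detect_iptables_mode() {
    if kubelet_chains_present iptables-nft; then
        echo nft
    elif kubelet_chains_present iptables-legacy; then
        echo legacy
    else
        echo nft
    fi
}
```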

<!--
The `iptables-wrappers` package has [already been updated] with this new
heuristic, so if you were previously using that, you can rebuild your
container images with an updated version of that.

[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/
[already been updated]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3
-->
`iptables-wrappers` 包[已经被更新],加入了这个新的启发式逻辑,
所以如果你以前使用过它,可以用其更新后的版本重新构建你的容器镜像。

[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/
[已经被更新]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3

<!--
## Further reading

The project to clean up iptables chain ownership and deprecate the old
chains is tracked by [KEP-3178].

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178
-->
## 延伸阅读

[KEP-3178] 跟踪了清理 iptables 链所有权和弃用旧链的项目。

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178
