---
layout: blog
title: "Kubernetes 的 iptables 链不是 API"
date: 2022-09-07
slug: iptables-chains-not-api
---

<!--
layout: blog
title: "Kubernetes’s IPTables Chains Are Not API"
date: 2022-09-07
slug: iptables-chains-not-api
-->

<!--
**Author:** Dan Winship (Red Hat)
-->
**作者:** Dan Winship (Red Hat)

**译者:** Xin Li (DaoCloud)

<!--
Some Kubernetes components (such as kubelet and kube-proxy) create
iptables chains and rules as part of their operation. These chains
were never intended to be part of any Kubernetes API/ABI guarantees,
but some external components nonetheless make use of some of them (in
particular, using `KUBE-MARK-MASQ` to mark packets as needing to be
masqueraded).
-->
一些 Kubernetes 组件(例如 kubelet 和 kube-proxy)在运行过程中会创建特定的 iptables 链和规则。
这些链从未打算成为任何 Kubernetes API/ABI 保证的一部分,
但一些外部组件仍然使用了其中的某些链(特别是使用 `KUBE-MARK-MASQ` 将数据包标记为需要伪装)。

<!--
As a part of the v1.25 release, SIG Network made this declaration
explicit: that (with one exception), the iptables chains that
Kubernetes creates are intended only for Kubernetes’s own internal
use, and third-party components should not assume that Kubernetes will
create any specific iptables chains, or that those chains will contain
any specific rules if they do exist.
-->
作为 v1.25 版本的一部分,SIG Network 明确做出如下声明:
(除一个例外情况之外)Kubernetes 创建的 iptables 链仅供 Kubernetes 内部使用,
第三方组件不应假定 Kubernetes 会创建任何特定的 iptables 链,
即使这些链确实存在,也不应假定其中会包含任何特定的规则。

<!--
Then, in future releases, as part of [KEP-3178], we will begin phasing
out certain chains that Kubernetes itself no longer needs. Components
outside of Kubernetes itself that make use of `KUBE-MARK-MASQ`,
`KUBE-MARK-DROP`, or other Kubernetes-generated iptables chains should
start migrating away from them now.
-->
随后,在未来的版本中,作为 [KEP-3178] 的一部分,我们将开始逐步淘汰 Kubernetes
本身不再需要的某些链。Kubernetes 之外、使用了 `KUBE-MARK-MASQ`、`KUBE-MARK-DROP`
或其他由 Kubernetes 生成的 iptables 链的组件,应当从现在开始迁移,不再依赖这些链。

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178

<!--
## Background

In addition to various service-specific iptables chains, kube-proxy
creates certain general-purpose iptables chains that it uses as part
of service proxying. In the past, kubelet also used iptables for a few
features (such as setting up `hostPort` mapping for pods) and so it
also redundantly created some of the same chains.
-->
## 背景 {#background}

除了各种特定于 Service 的 iptables 链之外,kube-proxy 还会创建某些通用的 iptables 链,
用作服务代理的一部分。过去,kubelet 也使用 iptables
来实现一些功能(例如为 Pod 设置 `hostPort` 映射),因此它也冗余地创建了一些相同的链。

<!--
However, with [the removal of dockershim] in Kubernetes in 1.24,
kubelet now no longer ever uses any iptables rules for its own
purposes; the things that it used to use iptables for are now always
the responsibility of the container runtime or the network plugin, and
there is no reason for kubelet to be creating any iptables rules.

Meanwhile, although `iptables` is still the default kube-proxy backend
on Linux, it is unlikely to remain the default forever, since the
associated command-line tools and kernel APIs are essentially
deprecated, and no longer receiving improvements. (RHEL 9
[logs a warning] if you use the iptables API, even via
`iptables-nft`.)
-->
然而,随着 Kubernetes 1.24 中 [dockershim 的移除],
kubelet 现在完全不再出于自身目的使用任何 iptables 规则;
过去它使用 iptables 来完成的事情,如今总是由容器运行时或网络插件负责,
因此 kubelet 已没有理由创建任何 iptables 规则。

同时,虽然 `iptables` 仍然是 Linux 上默认的 kube-proxy 后端,
但它不太可能永远是默认选项,因为相关的命令行工具和内核 API 基本上已被弃用,
并且不再得到改进。(如果你使用 iptables API,即使是通过 `iptables-nft`,RHEL 9 也会[记录警告]。)

<!--
Although as of Kubernetes 1.25 iptables kube-proxy remains popular,
and kubelet continues to create the iptables rules that it
historically created (despite no longer _using_ them), third party
software cannot assume that core Kubernetes components will keep
creating these rules in the future.

[the removal of dockershim]: https://kubernetes.io/blog/2022/02/17/dockershim-faq/
[logs a warning]: https://access.redhat.com/solutions/6739041
-->
尽管在 Kubernetes 1.25 中,iptables 模式的 kube-proxy 仍然很流行,
并且 kubelet 仍会继续创建它历史上创建的那些 iptables 规则(尽管已不再**使用**它们),
但第三方软件不能假设核心 Kubernetes 组件将来会继续创建这些规则。

[dockershim 的移除]: https://kubernetes.io/zh-cn/blog/2022/02/17/dockershim-faq/
[记录警告]: https://access.redhat.com/solutions/6739041

<!--
## Upcoming changes

Starting a few releases from now, kubelet will no longer create the
following iptables chains in the `nat` table:

 - `KUBE-MARK-DROP`
 - `KUBE-MARK-MASQ`
 - `KUBE-POSTROUTING`

Additionally, the `KUBE-FIREWALL` chain in the `filter` table will no
longer have the functionality currently associated with
`KUBE-MARK-DROP` (and it may eventually go away entirely).
-->
## 即将发生的变化

从几个版本之后开始,kubelet 将不再在 `nat` 表中创建以下 iptables 链:

 - `KUBE-MARK-DROP`
 - `KUBE-MARK-MASQ`
 - `KUBE-POSTROUTING`

此外,`filter` 表中的 `KUBE-FIREWALL` 链将不再具有当前与
`KUBE-MARK-DROP` 关联的功能(并且它最终可能会完全消失)。
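
在这些链被移除之前,你可以先在节点上检查是否有自己的组件仍在引用它们。
下面是一个假设性的检查示意,只使用标准的 iptables 命令;
注意输出中也会包含 kube-proxy 自身的规则,需要迁移的是由你的组件创建的那部分:

```bash
# 列出 nat 表中所有跳转到 KUBE-MARK-MASQ / KUBE-MARK-DROP 的规则,
# 以便确认是否有 kube-proxy 之外的组件依赖这些链。
iptables-save -t nat | grep -E -- '-j KUBE-MARK-(MASQ|DROP)'

# filter 表中对 KUBE-FIREWALL 的引用也可以用同样的方式检查。
iptables-save -t filter | grep -- 'KUBE-FIREWALL'
```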

<!--
This change will be phased in via the `IPTablesOwnershipCleanup`
feature gate. That feature gate is available and can be manually
enabled for testing in Kubernetes 1.25. The current plan is that it
will become enabled-by-default in Kubernetes 1.27, though this may be
delayed to a later release. (It will not happen sooner than Kubernetes
1.27.)
-->
此更改将通过 `IPTablesOwnershipCleanup` 特性门控逐步实施。
该特性门控在 Kubernetes 1.25 中已经可用,可以手动启用以进行测试。
目前的计划是在 Kubernetes 1.27 中将其默认启用,
尽管这可能会推迟到以后的版本。(但无论如何不会早于 Kubernetes 1.27。)
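
下面是一个在测试环境中手动启用该特性门控的示意,基于通用的 `--feature-gates`
命令行参数;实际应如何传递取决于你的集群部署方式(systemd 单元、静态 Pod、DaemonSet 等),
也可以改为写入 kubelet 或 kube-proxy 的配置文件:

```bash
# 仅用于测试:通过 --feature-gates 标志启用 IPTablesOwnershipCleanup。
# 下面只展示新增的标志,其余 kubelet / kube-proxy 参数保持不变。
kubelet --feature-gates=IPTablesOwnershipCleanup=true
kube-proxy --feature-gates=IPTablesOwnershipCleanup=true
```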

<!--
## What to do if you use Kubernetes’s iptables chains

(Although the discussion below focuses on short-term fixes that are
still based on iptables, you should probably also start thinking about
eventually migrating to nftables or another API).
-->
## 如果你使用 Kubernetes 的 iptables 链怎么办

(尽管下面的讨论侧重于仍然基于 iptables 的短期修复,
但你可能也应该开始考虑最终迁移到 nftables 或其他 API。)

<!--
### If you use `KUBE-MARK-MASQ`... {#use-case-kube-mark-masq}

If you are making use of the `KUBE-MARK-MASQ` chain to cause packets
to be masqueraded, you have two options: (1) rewrite your rules to use
`-j MASQUERADE` directly, (2) create your own alternative “mark for
masquerade” chain.
-->
### 如果你使用 `KUBE-MARK-MASQ` 链... {#use-case-kube-mark-masq}

如果你正在使用 `KUBE-MARK-MASQ` 链来伪装数据包,
你有两个选择:(1)重写你的规则以直接使用 `-j MASQUERADE`,
(2)创建你自己的替代性“为伪装而打标记”链。

<!--
The reason kube-proxy uses `KUBE-MARK-MASQ` is because there are lots
of cases where it needs to call both `-j DNAT` and `-j MASQUERADE` on
a packet, but it’s not possible to do both of those at the same time
in iptables; `DNAT` must be called from the `PREROUTING` (or `OUTPUT`)
chain (because it potentially changes where the packet will be routed
to) while `MASQUERADE` must be called from `POSTROUTING` (because the
masqueraded source IP that it picks depends on what the final routing
decision was).
-->
kube-proxy 之所以使用 `KUBE-MARK-MASQ`,是因为在很多情况下它需要对同一个数据包同时调用
`-j DNAT` 和 `-j MASQUERADE`,但在 iptables 中无法在同一个位置同时完成这两件事;
`DNAT` 必须从 `PREROUTING`(或 `OUTPUT`)链中调用(因为它可能会改变数据包的路由去向),而
`MASQUERADE` 必须从 `POSTROUTING` 链中调用(因为它所选择的伪装源 IP 取决于最终的路由决策)。

<!--
In theory, kube-proxy could have one set of rules to match packets in
`PREROUTING`/`OUTPUT` and call `-j DNAT`, and then have a second set
of rules to match the same packets in `POSTROUTING` and call `-j
MASQUERADE`. But instead, for efficiency, it only matches them once,
during `PREROUTING`/`OUTPUT`, at which point it calls `-j DNAT` and
then calls `-j KUBE-MARK-MASQ` to set a bit on the kernel packet mark
as a reminder to itself. Then later, during `POSTROUTING`, it has a
single rule that matches all previously-marked packets, and calls `-j
MASQUERADE` on them.
-->
理论上,kube-proxy 可以用一组规则来匹配 `PREROUTING`/`OUTPUT`
中的数据包并调用 `-j DNAT`,然后再用第二组规则在 `POSTROUTING`
中匹配相同的数据包并调用 `-j MASQUERADE`。
但为了提高效率,kube-proxy 只在 `PREROUTING`/`OUTPUT` 期间匹配一次数据包,
此时它先调用 `-j DNAT`,再调用 `-j KUBE-MARK-MASQ` 在内核数据包标记(packet mark)上设置一个比特位,
作为对自身的提醒。之后,在 `POSTROUTING` 期间,只需一条规则匹配所有先前被标记的数据包,并对它们调用 `-j MASQUERADE`。
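
如果你确实需要类似的安排(即上文的选项 2),下面是一个构建自有“为伪装而打标记”链的最小示意。
其中的链名 `MY-MARK-MASQ`、标记位 `0x8000` 以及各个 IP 地址和端口都是为说明而假设的,并非 Kubernetes 的一部分;
标记位尤其需要确认不会与节点上其他组件(包括 kube-proxy)使用的比特位冲突:

```bash
# 创建自己的“为伪装而打标记”链(假设使用 0x8000 这一位作为标记)。
iptables -t nat -N MY-MARK-MASQ
iptables -t nat -A MY-MARK-MASQ -j MARK --set-xmark 0x8000/0x8000

# 在 PREROUTING 中先打标记、再做 DNAT(注意打标记的规则必须排在 DNAT 规则之前,
# 因为 DNAT 是终结性目标;示例中的地址和端口均为假设值)。
iptables -t nat -A PREROUTING -d 192.0.2.10 -p tcp --dport 80 -j MY-MARK-MASQ
iptables -t nat -A PREROUTING -d 192.0.2.10 -p tcp --dport 80 \
    -j DNAT --to-destination 10.0.0.5:8080

# 在 POSTROUTING 中用一条规则对所有带标记的数据包执行伪装。
iptables -t nat -A POSTROUTING -m mark --mark 0x8000/0x8000 -j MASQUERADE
```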

<!--
If you have _a lot_ of rules where you need to apply both DNAT and
masquerading to the same packets like kube-proxy does, then you may
want a similar arrangement. But in many cases, components that use
`KUBE-MARK-MASQ` are only doing it because they copied kube-proxy’s
behavior without understanding why kube-proxy was doing it that way.
Many of these components could easily be rewritten to just use
separate DNAT and masquerade rules. (In cases where no DNAT is
occurring then there is even less point to using `KUBE-MARK-MASQ`;
just move your rules from `PREROUTING` to `POSTROUTING` and call `-j
MASQUERADE` directly.)
-->
如果你像 kube-proxy 一样,有**很多**规则需要对同一批数据包同时执行 DNAT 和伪装操作,
那么你可能需要类似的安排。但在许多情况下,使用 `KUBE-MARK-MASQ` 的组件之所以这样做,
只是因为它们复制了 kube-proxy 的行为,而并不理解 kube-proxy 为何这样做。
其中许多组件可以很容易地重写为使用彼此独立的 DNAT 规则和伪装规则。
(在根本没有发生 DNAT 的情况下,使用 `KUBE-MARK-MASQ` 的意义就更小了;
只需将你的规则从 `PREROUTING` 移至 `POSTROUTING` 并直接调用 `-j MASQUERADE` 即可。)
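
作为对照,下面是选项 1(不使用标记链)的最小示意。
其中的 IP 地址、端口以及 `10.244.0.0/16` 网段都是为说明而假设的;
要点在于 `POSTROUTING` 中的规则需要按照 DNAT 之后的地址重新匹配同一批流量:

```bash
# 不打标记:分别在 PREROUTING 做 DNAT、在 POSTROUTING 直接伪装(地址均为假设值)。
iptables -t nat -A PREROUTING -d 192.0.2.10 -p tcp --dport 80 \
    -j DNAT --to-destination 10.0.0.5:8080

# POSTROUTING 在路由之后执行,因此这里按 DNAT 之后的目的地址重新匹配同样的流量。
iptables -t nat -A POSTROUTING -d 10.0.0.5 -p tcp --dport 8080 -j MASQUERADE

# 如果根本没有 DNAT,把规则直接放在 POSTROUTING 中即可(网段为假设的示例值)。
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
```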

<!--
### If you use `KUBE-MARK-DROP`... {#use-case-kube-mark-drop}

The rationale for `KUBE-MARK-DROP` is similar to the rationale for
`KUBE-MARK-MASQ`: kube-proxy wanted to make packet-dropping decisions
alongside other decisions in the `nat` `KUBE-SERVICES` chain, but you
can only call `-j DROP` from the `filter` table. So instead, it uses
`KUBE-MARK-DROP` to mark packets to be dropped later on.
-->
### 如果你使用 `KUBE-MARK-DROP`... {#use-case-kube-mark-drop}

`KUBE-MARK-DROP` 的设计原理与 `KUBE-MARK-MASQ` 类似:
kube-proxy 想要在 `nat` 表的 `KUBE-SERVICES` 链中,与其他决策一起做出丢弃数据包的决策,
但 `-j DROP` 只能从 `filter` 表中调用。
因此,它转而使用 `KUBE-MARK-DROP` 来标记数据包,以便稍后将其丢弃。

<!--
In general, the approach for removing a dependency on `KUBE-MARK-DROP`
is the same as for removing a dependency on `KUBE-MARK-MASQ`. In
kube-proxy’s case, it is actually quite easy to replace the usage of
`KUBE-MARK-DROP` in the `nat` table with direct calls to `DROP` in the
`filter` table, because there are no complicated interactions between
DNAT rules and drop rules, and so the drop rules can simply be moved
from `nat` to `filter`.

In more complicated cases, it might be necessary to “re-match” the
same packets in both `nat` and `filter`.
-->
通常,去除对 `KUBE-MARK-DROP` 的依赖的方法与去除对 `KUBE-MARK-MASQ` 的依赖的方法相同。
就 kube-proxy 而言,实际上很容易将 `nat` 表中对 `KUBE-MARK-DROP`
的使用替换为在 `filter` 表中直接调用 `DROP`,因为 DNAT 规则和丢弃规则之间没有复杂的交互关系,
因此丢弃规则可以简单地从 `nat` 移到 `filter`。

在更复杂的场景中,可能需要在 `nat` 和 `filter` 两个表中“重新匹配”相同的数据包。
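
下面是一个将“打标记后丢弃”改为直接丢弃的最小示意。
其中的地址和端口是为说明而假设的,且假设这部分流量没有经过 DNAT;
如果与 DNAT 规则配合使用,`filter` 表中的匹配条件需要按照 DNAT 之后的地址重新编写:

```bash
# 原先依赖 Kubernetes 链的做法(匹配条件为假设的示例值):
#   iptables -t nat -A PREROUTING -d 192.0.2.20 -p tcp --dport 443 -j KUBE-MARK-DROP
#
# 替代做法:直接在 filter 表中丢弃同样的流量,
# 同时覆盖发往本机(INPUT)和被转发(FORWARD)两种路径。
iptables -t filter -A INPUT -d 192.0.2.20 -p tcp --dport 443 -j DROP
iptables -t filter -A FORWARD -d 192.0.2.20 -p tcp --dport 443 -j DROP
```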

<!--
### If you use Kubelet’s iptables rules to figure out `iptables-legacy` vs `iptables-nft`... {#use-case-iptables-mode}

Components that manipulate host-network-namespace iptables rules from
inside a container need some way to figure out whether the host is
using the old `iptables-legacy` binaries or the newer `iptables-nft`
binaries (which talk to a different kernel API underneath).
-->
### 如果你使用 Kubelet 的 iptables 规则来确定 `iptables-legacy` 与 `iptables-nft`... {#use-case-iptables-mode}

对于从容器内部操纵主机网络命名空间中 iptables 规则的组件而言,需要某种方法来确定主机使用的是旧的
`iptables-legacy` 二进制文件,还是较新的 `iptables-nft` 二进制文件(后者在底层与不同的内核 API 通信)。

<!--
The [`iptables-wrappers`] module provides a way for such components to
autodetect the system iptables mode, but in the past it did this by
assuming that Kubelet will have created “a bunch” of iptables rules
before any containers start, and so it can guess which mode the
iptables binaries in the host filesystem are using by seeing which
mode has more rules defined.

In future releases, Kubelet will no longer create many iptables rules,
so heuristics based on counting the number of rules present may fail.
-->
[`iptables-wrappers`] 模块为此类组件提供了一种自动检测系统 iptables 模式的方法。
但在过去,它的实现方式是假设 kubelet 会在任何容器启动之前创建“一堆”iptables 规则,
因此可以通过比较哪种模式下定义的规则更多,
来猜测主机文件系统中的 iptables 二进制文件正在使用哪种模式。

在未来的版本中,kubelet 将不再创建很多 iptables 规则,
因此基于统计现有规则数量的启发式方法可能会失效。

<!--
However, as of 1.24, Kubelet always creates a chain named
`KUBE-IPTABLES-HINT` in the `mangle` table of whichever iptables
subsystem it is using. Components can now look for this specific chain
to know which iptables subsystem Kubelet (and thus, presumably, the
rest of the system) is using.

(Additionally, since Kubernetes 1.17, kubelet has created a chain
called `KUBE-KUBELET-CANARY` in the `mangle` table. While this chain
may go away in the future, it will of course still be there in older
releases, so in any recent version of Kubernetes, at least one of
`KUBE-IPTABLES-HINT` or `KUBE-KUBELET-CANARY` will be present.)
-->
不过,从 1.24 开始,kubelet 总是会在它所使用的 iptables 子系统的
`mangle` 表中创建一个名为 `KUBE-IPTABLES-HINT` 的链。
组件现在可以查找这个特定的链,来了解 kubelet(进而推测系统的其余部分)正在使用哪个 iptables 子系统。

(此外,从 Kubernetes 1.17 开始,kubelet 会在 `mangle` 表中创建一个名为 `KUBE-KUBELET-CANARY` 的链。
虽然这条链将来可能会消失,但它当然仍然存在于旧版本中,因此在任何较新版本的 Kubernetes 中,
`KUBE-IPTABLES-HINT` 和 `KUBE-KUBELET-CANARY` 这两条链中至少会存在一条。)
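
下面是一个最简化的检测示意;真实环境中建议直接使用已更新的 `iptables-wrappers`。
示例假设容器运行在主机网络命名空间中、拥有所需权限,并且镜像里同时提供
`iptables-nft` 和 `iptables-legacy` 两套二进制文件:

```bash
# 分别用两种后端检查 mangle 表中是否存在提示链,从而判断主机实际使用的 iptables 模式。
if iptables-nft -t mangle -L KUBE-IPTABLES-HINT >/dev/null 2>&1 || \
   iptables-nft -t mangle -L KUBE-KUBELET-CANARY >/dev/null 2>&1; then
    mode=nft
elif iptables-legacy -t mangle -L KUBE-IPTABLES-HINT >/dev/null 2>&1 || \
     iptables-legacy -t mangle -L KUBE-KUBELET-CANARY >/dev/null 2>&1; then
    mode=legacy
else
    echo "无法判断 iptables 模式" >&2
    exit 1
fi
echo "检测到 iptables 模式: ${mode}"
```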

<!--
The `iptables-wrappers` package has [already been updated] with this new
heuristic, so if you were previously using that, you can rebuild your
container images with an updated version of that.

[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/
[already been updated]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3
-->
`iptables-wrappers` 包[已经被更新],采用了这个新的启发式逻辑,
所以如果你以前使用过它,可以用其更新版本重新构建你的容器镜像。

[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/
[已经被更新]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3

<!--
## Further reading

The project to clean up iptables chain ownership and deprecate the old
chains is tracked by [KEP-3178].

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178
-->
## 延伸阅读

[KEP-3178] 跟踪了清理 iptables 链所有权和弃用旧链的项目。

[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178