| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Kubernetes’s IPTables Chains Are Not API" |
| 4 | +date: 2022-09-07 |
| 5 | +slug: iptables-chains-not-api |
| 6 | +--- |
| 7 | + |
| 8 | +**Author:** Dan Winship (Red Hat) |
| 9 | + |
| 10 | +Some Kubernetes components (such as kubelet and kube-proxy) create |
| 11 | +iptables chains and rules as part of their operation. These chains |
| 12 | +were never intended to be part of any Kubernetes API/ABI guarantees, |
| 13 | +but some external components nonetheless make use of some of them (in |
| 14 | +particular, using `KUBE-MARK-MASQ` to mark packets as needing to be |
| 15 | +masqueraded). |
| 16 | + |
| 17 | +As a part of the v1.25 release, SIG Network made this declaration |
| 18 | +explicit: with one exception, the iptables chains that |
| 19 | +Kubernetes creates are intended only for Kubernetes’s own internal |
| 20 | +use, and third-party components should not assume that Kubernetes will |
| 21 | +create any specific iptables chains, or that those chains will contain |
| 22 | +any specific rules if they do exist. |
| 23 | + |
| 24 | +Then, in future releases, as part of [KEP-3178], we will begin phasing |
| 25 | +out certain chains that Kubernetes itself no longer needs. Components |
| 26 | +outside of Kubernetes itself that make use of `KUBE-MARK-MASQ`, |
| 27 | +`KUBE-MARK-DROP`, or other Kubernetes-generated iptables chains should |
| 28 | +start migrating away from them now. |
| 29 | + |
| 30 | +[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178 |
| 31 | + |
| 32 | +## Background |
| 33 | + |
| 34 | +In addition to various service-specific iptables chains, kube-proxy |
| 35 | +creates certain general-purpose iptables chains that it uses as part |
| 36 | +of service proxying. In the past, kubelet also used iptables for a few |
| 37 | +features (such as setting up `hostPort` mapping for pods) and so it |
| 38 | +also redundantly created some of the same chains. |
| 39 | + |
| 40 | +However, with [the removal of dockershim] in Kubernetes 1.24, |
| 41 | +kubelet now no longer ever uses any iptables rules for its own |
| 42 | +purposes; the things that it used to use iptables for are now always |
| 43 | +the responsibility of the container runtime or the network plugin, and |
| 44 | +there is no reason for kubelet to be creating any iptables rules. |
| 45 | + |
| 46 | +Meanwhile, although `iptables` is still the default kube-proxy backend |
| 47 | +on Linux, it is unlikely to remain the default forever, since the |
| 48 | +associated command-line tools and kernel APIs are essentially |
| 49 | +deprecated, and no longer receiving improvements. (RHEL 9 |
| 50 | +[logs a warning] if you use the iptables API, even via |
| 51 | +`iptables-nft`.) |
| 52 | + |
| 53 | +Although as of Kubernetes 1.25 iptables kube-proxy remains popular, |
| 54 | +and kubelet continues to create the iptables rules that it |
| 55 | +historically created (despite no longer _using_ them), third-party |
| 56 | +software cannot assume that core Kubernetes components will keep |
| 57 | +creating these rules in the future. |
| 58 | + |
| 59 | +[the removal of dockershim]: https://kubernetes.io/blog/2022/02/17/dockershim-faq/ |
| 60 | +[logs a warning]: https://access.redhat.com/solutions/6739041 |
| 61 | + |
| 62 | +## Upcoming changes |
| 63 | + |
| 64 | +Starting a few releases from now, kubelet will no longer create the |
| 65 | +following iptables chains in the `nat` table: |
| 66 | + |
| 67 | + - `KUBE-MARK-DROP` |
| 68 | + - `KUBE-MARK-MASQ` |
| 69 | + - `KUBE-POSTROUTING` |
| 70 | + |
| 71 | +Additionally, the `KUBE-FIREWALL` chain in the `filter` table will no |
| 72 | +longer have the functionality currently associated with |
| 73 | +`KUBE-MARK-DROP` (and it may eventually go away entirely). |
| 74 | + |
| 75 | +This change will be phased in via the `IPTablesOwnershipCleanup` |
| 76 | +feature gate. That feature gate is available and can be manually |
| 77 | +enabled for testing in Kubernetes 1.25. The current plan is that it |
| 78 | +will become enabled-by-default in Kubernetes 1.27, though this may be |
| 79 | +delayed to a later release. (It will not happen sooner than Kubernetes |
| 80 | +1.27.) |
| 81 | + |
| 82 | +## What to do if you use Kubernetes’s iptables chains |
| 83 | + |
| 84 | +(Although the discussion below focuses on short-term fixes that are |
| 85 | +still based on iptables, you should probably also start thinking about |
| 86 | +eventually migrating to nftables or another API.) |
| 87 | + |
| 88 | +### If you use `KUBE-MARK-MASQ`... {#use-case-kube-mark-masq} |
| 89 | + |
| 90 | +If you are making use of the `KUBE-MARK-MASQ` chain to cause packets |
| 91 | +to be masqueraded, you have two options: (1) rewrite your rules to use |
| 92 | +`-j MASQUERADE` directly, or (2) create your own alternative “mark for |
| 93 | +masquerade” chain. |
| 94 | + |
| 95 | +The reason kube-proxy uses `KUBE-MARK-MASQ` is that there are lots |
| 96 | +of cases where it needs to call both `-j DNAT` and `-j MASQUERADE` on |
| 97 | +a packet, but it’s not possible to do both of those at the same time |
| 98 | +in iptables; `DNAT` must be called from the `PREROUTING` (or `OUTPUT`) |
| 99 | +chain (because it potentially changes where the packet will be routed |
| 100 | +to) while `MASQUERADE` must be called from `POSTROUTING` (because the |
| 101 | +masqueraded source IP that it picks depends on what the final routing |
| 102 | +decision was). |
| 103 | + |
| 104 | +In theory, kube-proxy could have one set of rules to match packets in |
| 105 | +`PREROUTING`/`OUTPUT` and call `-j DNAT`, and then have a second set |
| 106 | +of rules to match the same packets in `POSTROUTING` and call `-j |
| 107 | +MASQUERADE`. But instead, for efficiency, it only matches them once, |
| 108 | +during `PREROUTING`/`OUTPUT`, at which point it calls `-j DNAT` and |
| 109 | +then calls `-j KUBE-MARK-MASQ` to set a bit on the kernel packet mark |
| 110 | +as a reminder to itself. Then later, during `POSTROUTING`, it has a |
| 111 | +single rule that matches all previously-marked packets, and calls `-j |
| 112 | +MASQUERADE` on them. |
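
The mark-then-masquerade arrangement described above can be sketched as a small generator of `iptables-restore` input. This is only an illustration under stated assumptions: the chain names `EXAMPLE-MARK-MASQ` and `EXAMPLE-POSTROUTING` and the mark bit `0x2000` are hypothetical choices, not values that Kubernetes uses or reserves, and the sketch does not reproduce kube-proxy's actual rules.

```python
# Hypothetical names and mark bit, chosen for illustration only.
MARK_CHAIN = "EXAMPLE-MARK-MASQ"
POSTROUTING_CHAIN = "EXAMPLE-POSTROUTING"
MASQ_MARK = 0x2000  # assumption: a bit that nothing else on the node uses


def mark_masq_rules(mark: int = MASQ_MARK) -> str:
    """Return iptables-restore input for a private mark-then-masquerade setup."""
    bit = f"{mark:#x}"
    lines = [
        "*nat",
        f":{MARK_CHAIN} - [0:0]",
        f":{POSTROUTING_CHAIN} - [0:0]",
        # Jumping here from a PREROUTING/OUTPUT rule sets the reminder bit.
        f"-A {MARK_CHAIN} -j MARK --or-mark {bit}",
        # A single POSTROUTING rule masquerades everything marked earlier.
        f"-A POSTROUTING -j {POSTROUTING_CHAIN}",
        f"-A {POSTROUTING_CHAIN} -m mark --mark {bit}/{bit} -j MASQUERADE",
        "COMMIT",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(mark_masq_rules(), end="")
```

The output would typically be fed to `iptables-restore --noflush` so that unrelated rules in the `nat` table are left alone. A real component would also need to avoid appending the `POSTROUTING` jump more than once (for example by testing for it with `iptables -C` first) and to pick a mark bit that does not collide with other software on the node.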
| 113 | + |
| 114 | +If you have _a lot_ of rules where you need to apply both DNAT and |
| 115 | +masquerading to the same packets like kube-proxy does, then you may |
| 116 | +want a similar arrangement. But in many cases, components that use |
| 117 | +`KUBE-MARK-MASQ` are only doing it because they copied kube-proxy’s |
| 118 | +behavior without understanding why kube-proxy was doing it that way. |
| 119 | +Many of these components could easily be rewritten to just use |
| 120 | +separate DNAT and masquerade rules. (In cases where no DNAT is |
| 121 | +occurring, there is even less point in using `KUBE-MARK-MASQ`; |
| 122 | +just move your rules from `PREROUTING` to `POSTROUTING` and call `-j |
| 123 | +MASQUERADE` directly.) |
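
For a component with only a handful of rules, the "separate DNAT and masquerade rules" option might look like the following sketch. It assumes IPv4, TCP or UDP, and a single virtual IP forwarded to a single backend; the function name and parameters are hypothetical. Because the destination has already been rewritten by the time a packet reaches `POSTROUTING`, the sketch re-matches on the connection's original destination using the `conntrack` match, which is one possible way to identify "the same packets" and not necessarily the right one for every component.

```python
def dnat_and_masquerade_rules(vip: str, vip_port: int,
                              backend: str, backend_port: int,
                              proto: str = "tcp") -> str:
    """Return iptables-restore input with separate DNAT and MASQUERADE rules."""
    # Match on the virtual IP before routing, where DNAT is allowed.
    dnat_match = f"-d {vip}/32 -p {proto} -m {proto} --dport {vip_port}"
    # After DNAT the packet's destination is the backend, so match on the
    # original destination recorded by connection tracking instead.
    masq_match = (
        f"-p {proto} -m conntrack "
        f"--ctorigdst {vip} --ctorigdstport {vip_port}"
    )
    target = f"{backend}:{backend_port}"
    lines = [
        "*nat",
        f"-A PREROUTING {dnat_match} -j DNAT --to-destination {target}",
        f"-A OUTPUT {dnat_match} -j DNAT --to-destination {target}",
        f"-A POSTROUTING {masq_match} -j MASQUERADE",
        "COMMIT",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(dnat_and_masquerade_rules("192.0.2.10", 80, "10.0.0.5", 8080), end="")
```

Note that this masquerades every connection to the virtual IP; a component that only wants to masquerade some of that traffic would add its own conditions to the `POSTROUTING` rule.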
| 124 | + |
| 125 | +### If you use `KUBE-MARK-DROP`... {#use-case-kube-mark-drop} |
| 126 | + |
| 127 | +The rationale for `KUBE-MARK-DROP` is similar to the rationale for |
| 128 | +`KUBE-MARK-MASQ`: kube-proxy wanted to make packet-dropping decisions |
| 129 | +alongside other decisions in the `nat` `KUBE-SERVICES` chain, but you |
| 130 | +can only call `-j DROP` from the `filter` table. So instead, it uses |
| 131 | +`KUBE-MARK-DROP` to mark packets to be dropped later on. |
| 132 | + |
| 133 | +In general, the approach for removing a dependency on `KUBE-MARK-DROP` |
| 134 | +is the same as for removing a dependency on `KUBE-MARK-MASQ`. In |
| 135 | +kube-proxy’s case, it is actually quite easy to replace the usage of |
| 136 | +`KUBE-MARK-DROP` in the `nat` table with direct calls to `DROP` in the |
| 137 | +`filter` table, because there are no complicated interactions between |
| 138 | +DNAT rules and drop rules, and so the drop rules can simply be moved |
| 139 | +from `nat` to `filter`. |
| 140 | + |
| 141 | +In more complicated cases, it might be necessary to “re-match” the |
| 142 | +same packets in both `nat` and `filter`. |
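
As a rough sketch of such re-matching, the generator below emits `filter`-table rules that drop new connections to a virtual IP unless they come from an allowed source range. The chain name `EXAMPLE-FIREWALL`, the function name, and the choice of hooking `INPUT` and `FORWARD` are all assumptions made for illustration; this is not how kube-proxy structures its own rules. Since DNAT in the `nat` table has already rewritten the destination by the time the `filter` table sees the packet, the rules match on the original destination via `conntrack`, and they are limited to `--ctstate NEW` so that reply packets of permitted connections are not dropped.

```python
FIREWALL_CHAIN = "EXAMPLE-FIREWALL"  # hypothetical chain name


def drop_rules(vip: str, port: int, allowed_cidrs: list, proto: str = "tcp") -> str:
    """Return iptables-restore input that drops disallowed new connections."""
    # Re-match the packets that the nat table DNAT'ed, by original destination.
    match = (
        f"-p {proto} -m conntrack --ctstate NEW "
        f"--ctorigdst {vip} --ctorigdstport {port}"
    )
    lines = ["*filter", f":{FIREWALL_CHAIN} - [0:0]"]
    for hook in ("INPUT", "FORWARD"):
        lines.append(f"-A {hook} -j {FIREWALL_CHAIN}")
    # Allowed sources leave the chain before the final DROP.
    for cidr in allowed_cidrs:
        lines.append(f"-A {FIREWALL_CHAIN} {match} -s {cidr} -j RETURN")
    lines.append(f"-A {FIREWALL_CHAIN} {match} -j DROP")
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(drop_rules("192.0.2.10", 443, ["198.51.100.0/24"]), end="")
```

Connections originating on the node itself pass through `OUTPUT` and are not covered by this sketch; whether they need to be depends on the component.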
| 143 | + |
| 144 | +### If you use Kubelet’s iptables rules to figure out `iptables-legacy` vs `iptables-nft`... {#use-case-iptables-mode} |
| 145 | + |
| 146 | +Components that manipulate host-network-namespace iptables rules from |
| 147 | +inside a container need some way to figure out whether the host is |
| 148 | +using the old `iptables-legacy` binaries or the newer `iptables-nft` |
| 149 | +binaries (which talk to a different kernel API underneath). |
| 150 | + |
| 151 | +The [`iptables-wrappers`] module provides a way for such components to |
| 152 | +autodetect the system iptables mode, but in the past it did this by |
| 153 | +assuming that kubelet will have created “a bunch” of iptables rules |
| 154 | +before any containers start, and so it can guess which mode the |
| 155 | +iptables binaries in the host filesystem are using by seeing which |
| 156 | +mode has more rules defined. |
| 157 | + |
| 158 | +In future releases, kubelet will no longer create many iptables rules, |
| 159 | +so heuristics based on counting the number of rules present may fail. |
| 160 | + |
| 161 | +However, as of Kubernetes 1.24, kubelet always creates a chain named |
| 162 | +`KUBE-IPTABLES-HINT` in the `mangle` table of whichever iptables |
| 163 | +subsystem it is using. Components can now look for this specific chain |
| 164 | +to know which iptables subsystem kubelet (and thus, presumably, the |
| 165 | +rest of the system) is using. |
| 166 | + |
| 167 | +(Additionally, since Kubernetes 1.17, kubelet has created a chain |
| 168 | +called `KUBE-KUBELET-CANARY` in the `mangle` table. While this chain |
| 169 | +may go away in the future, it will of course still be there in older |
| 170 | +releases, so in any recent version of Kubernetes, at least one of |
| 171 | +`KUBE-IPTABLES-HINT` or `KUBE-KUBELET-CANARY` will be present.) |
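
For components that do their own detection, the chain-based heuristic could be sketched roughly as follows. This is not the code used by `iptables-wrappers`; the function names are hypothetical, the order in which the two modes are checked is an arbitrary choice, and the sketch assumes that both `iptables-legacy-save` and `iptables-nft-save` are installed in the container image.

```python
import subprocess
from typing import Optional

HINT_CHAINS = ("KUBE-IPTABLES-HINT", "KUBE-KUBELET-CANARY")


def has_hint_chain(save_output: str) -> bool:
    """Check iptables-save output for one of kubelet's hint chains."""
    for line in save_output.splitlines():
        # Chain declarations look like ":NAME POLICY [packets:bytes]".
        if line.startswith(":") and line[1:].split(" ", 1)[0] in HINT_CHAINS:
            return True
    return False


def detect_mode(legacy_mangle: str, nft_mangle: str) -> Optional[str]:
    """Return "nft", "legacy", or None if neither output has a hint chain."""
    if has_hint_chain(nft_mangle):
        return "nft"
    if has_hint_chain(legacy_mangle):
        return "legacy"
    return None


def save_mangle(binary: str) -> str:
    """Run e.g. `iptables-nft-save -t mangle`; return "" if it cannot run."""
    try:
        result = subprocess.run([binary, "-t", "mangle"],
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    print(detect_mode(save_mangle("iptables-legacy-save"),
                      save_mangle("iptables-nft-save")))
```

When neither chain is found (for example on a node where kubelet has not started yet), the sketch returns `None` and leaves the fallback policy to the caller.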
| 172 | + |
| 173 | +The `iptables-wrappers` package has [already been updated] with this new |
| 174 | +heuristic, so if you were previously using it, you can rebuild your |
| 175 | +container images with an updated version. |
| 176 | + |
| 177 | +[`iptables-wrappers`]: https://github.com/kubernetes-sigs/iptables-wrappers/ |
| 178 | +[already been updated]: https://github.com/kubernetes-sigs/iptables-wrappers/pull/3 |
| 179 | + |
| 180 | +## Further reading |
| 181 | + |
| 182 | +The project to clean up iptables chain ownership and deprecate the old |
| 183 | +chains is tracked by [KEP-3178]. |
| 184 | + |
| 185 | +[KEP-3178]: https://github.com/kubernetes/enhancements/issues/3178 |