---
layout: blog
title: "Kubernetes v1.26: Advancements in Kubernetes Traffic Engineering"
date: 2022-12-30
slug: advancements-in-kubernetes-traffic-engineering
---

**Authors:** Andrew Sy Kim (Google)

Kubernetes v1.26 includes significant advancements in network traffic engineering with the graduation of
two features (Service internal traffic policy support, and EndpointSlice terminating conditions) to GA,
and a third feature (Proxy terminating endpoints) to beta. The combination of these enhancements aims
to address shortcomings in traffic engineering that people face today, and unlock new capabilities for the future.

## Traffic Loss from Load Balancers During Rolling Updates

Prior to Kubernetes v1.26, clusters could experience [loss of traffic](https://github.com/kubernetes/kubernetes/issues/85643)
from Service load balancers during rolling updates when setting the `externalTrafficPolicy` field to `Local`.
There are a lot of moving parts at play here, so a quick overview of how Kubernetes manages load balancers might help!

In Kubernetes, you can create a Service with `type: LoadBalancer` to expose an application externally with a load balancer.
The load balancer implementation varies between clusters and platforms, but the Service provides a generic abstraction
representing the load balancer that is consistent across all Kubernetes installations.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  type: LoadBalancer
```

Under the hood, Kubernetes allocates a NodePort for the Service, which is then used by kube-proxy to provide a
network data path from the NodePort to the Pod. A controller will then add all available Nodes in the cluster
to the load balancer’s backend pool, using the designated NodePort for the Service as the backend target port.

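To make this concrete, here is a hedged sketch of how the Service above might look after creation: Kubernetes populates `nodePort` for each port with a value from the cluster's NodePort range (30000-32767 by default). The specific port number below is illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
    nodePort: 31234  # allocated by Kubernetes, not typically set by the user
  type: LoadBalancer
```
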
{{< figure src="traffic-engineering-service-load-balancer.png" caption="Figure 1: Overview of Service load balancers" >}}

Oftentimes it is beneficial to set `externalTrafficPolicy: Local` for Services, to avoid extra hops between
Nodes that are not running healthy Pods backing that Service. When using `externalTrafficPolicy: Local`,
an additional NodePort is allocated for health checking purposes, such that Nodes that do not contain healthy
Pods are excluded from the backend pool for a load balancer.

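For reference, a minimal sketch of a Service using this policy; when `externalTrafficPolicy: Local` is set on a `type: LoadBalancer` Service, Kubernetes also allocates the health check port (`spec.healthCheckNodePort`) automatically.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  type: LoadBalancer
  # Load balancer health checks target the auto-allocated spec.healthCheckNodePort
  externalTrafficPolicy: Local
```
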
{{< figure src="traffic-engineering-lb-healthy.png" caption="Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local" >}}

One such scenario where traffic can be lost is when a Node loses all Pods for a Service,
but the external load balancer has not probed the health check NodePort yet. The likelihood of this situation
is largely dependent on the health checking interval configured on the load balancer. The larger the interval,
the more likely this will happen, since the load balancer will continue to send traffic to a node
even after kube-proxy has removed forwarding rules for that Service. This also occurs when Pods start terminating
during rolling updates. Since Kubernetes does not consider terminating Pods as “Ready”, traffic can be lost
when there are only terminating Pods on any given Node during a rolling update.

{{< figure src="traffic-engineering-lb-without-proxy-terminating-endpoints.png" caption="Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local" >}}

Starting in Kubernetes v1.26, kube-proxy enables the `ProxyTerminatingEndpoints` feature by default, which
adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise
be dropped. More specifically, when there is a rolling update and a Node only contains terminating Pods,
kube-proxy will route traffic to the terminating Pods based on their readiness. In addition, kube-proxy will
actively fail the health check NodePort if there are only terminating Pods available. By doing so,
kube-proxy alerts the external load balancer that new connections should not be sent to that Node, while
requests for existing connections continue to be handled gracefully.

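Note that on clusters older than v1.26, where `ProxyTerminatingEndpoints` is not on by default, the feature gate can be enabled through the kube-proxy configuration (or kube-proxy's `--feature-gates` command-line flag). A minimal sketch of the relevant configuration:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  # Opt in to routing to terminating endpoints on releases where this
  # feature gate is not enabled by default.
  ProxyTerminatingEndpoints: true
```
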
{{< figure src="traffic-engineering-lb-with-proxy-terminating-endpoints.png" caption="Figure 4: Load balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local" >}}

### EndpointSlice Conditions

In order to support this new capability in kube-proxy, the EndpointSlice API introduced new conditions for endpoints:
`serving` and `terminating`.

{{< figure src="endpointslice-overview.png" caption="Figure 5: Overview of EndpointSlice conditions" >}}

The `serving` condition is semantically identical to `ready`, except that it can be `true` or `false`
while a Pod is terminating, unlike `ready`, which will always be `false` for terminating Pods for compatibility reasons.
The `terminating` condition is `true` for Pods undergoing termination (non-empty `deletionTimestamp`), and `false` otherwise.

The addition of these two conditions enables consumers of this API to track Pod states that previously could not be represented.
For example, we can now track "ready" and "not ready" Pods that are also terminating.

{{< figure src="endpointslice-with-terminating-pod.png" caption="Figure 6: EndpointSlice conditions with a terminating Pod" >}}

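To make the conditions concrete, here is a hedged sketch of an EndpointSlice entry for the Pod shown in Figure 6: still passing its readiness probe, but terminating. The object name and address are illustrative.

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-abc12  # illustrative; generated names vary
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 9376
endpoints:
- addresses:
  - "10.1.2.3"           # illustrative Pod IP
  conditions:
    ready: false         # always false once the Pod is terminating
    serving: true        # the Pod still passes its readiness probe
    terminating: true    # the Pod has a non-empty deletionTimestamp
```
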
Consumers of the EndpointSlice API, such as kube-proxy and ingress controllers, can now use these conditions to coordinate connection draining
events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints.

## Optimizing Internal Node-Local Traffic

Similar to how Services can set `externalTrafficPolicy: Local` to avoid extra hops for externally sourced traffic, Kubernetes
now supports `internalTrafficPolicy: Local` to enable the same optimization for traffic originating within the cluster, specifically
for traffic using the Service Cluster IP as the destination address. This feature graduated to Beta in Kubernetes v1.24 and is graduating to GA in v1.26.

Services default the `internalTrafficPolicy` field to `Cluster`, where traffic is randomly distributed to all endpoints.

{{< figure src="service-internal-traffic-policy-cluster.png" caption="Figure 7: Service routing when internalTrafficPolicy is Cluster" >}}

When `internalTrafficPolicy` is set to `Local`, kube-proxy will forward internal traffic for a Service only if there is an available endpoint
that is local to the same Node.

{{< figure src="service-internal-traffic-policy-local.png" caption="Figure 8: Service routing when internalTrafficPolicy is Local" >}}

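A minimal sketch of a Service that opts in to this behavior:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  internalTrafficPolicy: Local
```
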
{{< caution >}}
When using `internalTrafficPolicy: Local`, traffic will be dropped by kube-proxy when no local endpoints are available.
{{< /caution >}}

## Getting Involved

If you're interested in future discussions on Kubernetes traffic engineering, you can get involved in SIG Network in the following ways:
* Slack: [#sig-network](https://kubernetes.slack.com/messages/sig-network)
* [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-network)
* [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnetwork)
* [Biweekly meetings](https://github.com/kubernetes/community/tree/master/sig-network#meetings)