@@ -14,10 +14,10 @@ The `kube-proxy` component is responsible for implementing a _virtual IP_
mechanism for {{< glossary_tooltip term_id="service" text="Services">}}
of `type` other than
[`ExternalName`](/docs/concepts/services-networking/service/#externalname).
- Each instance of kube-proxy watches the Kubernetes {{< glossary_tooltip
- term_id="control-plane" text="control plane" >}} for the addition and
- removal of Service and EndpointSlice {{< glossary_tooltip
- term_id="object" text="objects" >}}. For each Service, kube-proxy
+ Each instance of kube-proxy watches the Kubernetes
+ {{< glossary_tooltip term_id="control-plane" text="control plane" >}}
+ for the addition and removal of Service and EndpointSlice
+ {{< glossary_tooltip term_id="object" text="objects" >}}. For each Service, kube-proxy
calls appropriate APIs (depending on the kube-proxy mode) to configure
the node to capture traffic to the Service's `clusterIP` and `port`,
and redirect that traffic to one of the Service's endpoints
@@ -45,9 +45,9 @@ There are a few reasons for using proxying for Services:
Later in this page you can read about how various kube-proxy implementations work.
Overall, you should note that, when running `kube-proxy`, kernel level rules may be modified
(for example, iptables rules might get created), which won't get cleaned up, in some
- cases until you reboot. Thus, running kube-proxy is something that should only be done
- by an administrator which understands the consequences of having a low level, privileged
- network proxying service on a computer. Although the `kube-proxy` executable supports a
+ cases until you reboot. Thus, running kube-proxy is something that should only be done
+ by an administrator who understands the consequences of having a low level, privileged
+ network proxying service on a computer. Although the `kube-proxy` executable supports a
`cleanup` function, this function is not an official feature and thus is only available
to use as-is.

@@ -56,7 +56,7 @@ Some of the details in this reference refer to an example: the backend
{{< glossary_tooltip term_id="pod" text="Pods" >}} for a stateless
image-processing workload, running with
three replicas. Those replicas are
- fungible&mdash;frontends do not care which backend they use. While the actual Pods that
+ fungible&mdash;frontends do not care which backend they use. While the actual Pods that
compose the backend set may change, the frontend clients should not need to be aware of that,
nor should they need to keep track of the set of backends themselves.

@@ -96,7 +96,7 @@ random.
As an example, consider the image processing application described [earlier](#example)
in the page.
When the backend Service is created, the Kubernetes control plane assigns a virtual
- IP address, for example 10.0.0.1. For this example, assume that the
+ IP address, for example 10.0.0.1. For this example, assume that the
Service port is 1234.
All of the kube-proxy instances in the cluster observe the creation of the new
Service.
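To make the example concrete, a Service along these lines could produce that virtual IP and
port. This is a minimal sketch: the name, selector, and `targetPort` are hypothetical, and
you would normally omit `clusterIP` and let the control plane assign one.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend            # hypothetical name for the image-processing backend
spec:
  selector:
    app: image-processor   # hypothetical label on the backend Pods
  clusterIP: 10.0.0.1      # usually omitted; shown here only to match the example
  ports:
    - port: 1234           # the Service port used in the example
      targetPort: 8080     # hypothetical container port on the backend Pods
```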
@@ -110,7 +110,7 @@ When a client connects to the Service's virtual IP address the iptables rule kic
A backend is chosen (either based on session affinity or randomly) and packets are
redirected to the backend without rewriting the client IP address.

- This same basic flow executes when traffic comes in through a node-port or
+ This same basic flow executes when traffic comes in through a `type: NodePort` Service, or
through a load-balancer, though in those cases the client IP address does get altered.

#### Optimizing iptables mode performance
@@ -120,9 +120,9 @@ Service, and a few iptables rules for each endpoint IP address. In
clusters with tens of thousands of Pods and Services, this means tens
of thousands of iptables rules, and kube-proxy may take a long time to update the rules
in the kernel when Services (or their EndpointSlices) change. You can adjust the syncing
- behavior of kube-proxy via options in the [`iptables` section](/docs/reference/config-api/kube-proxy-config.v1alpha1/#kubeproxy-config-k8s-io-v1alpha1-KubeProxyIPTablesConfiguration)
- of the
- kube-proxy [configuration file](/docs/reference/config-api/kube-proxy-config.v1alpha1/)
+ behavior of kube-proxy via options in the
+ [`iptables` section](/docs/reference/config-api/kube-proxy-config.v1alpha1/#kubeproxy-config-k8s-io-v1alpha1-KubeProxyIPTablesConfiguration)
+ of the kube-proxy [configuration file](/docs/reference/config-api/kube-proxy-config.v1alpha1/)
(which you specify via `kube-proxy --config <path>`):

```yaml
@@ -145,9 +145,8 @@ Service backed by a {{< glossary_tooltip term_id="deployment" text="Deployment"
with 100 pods, and you delete the
Deployment, then with `minSyncPeriod: 0s`, kube-proxy would end up
removing the Service's endpoints from the iptables rules one by one,
- for a total of 100 updates. With a larger `minSyncPeriod`, multiple
- Pod deletion events would get aggregated
- together, so kube-proxy might
+ resulting in a total of 100 updates. With a larger `minSyncPeriod`, multiple
+ Pod deletion events would get aggregated together, so kube-proxy might
instead end up making, say, 5 updates, each removing 20 endpoints,
which will be much more efficient in terms of CPU, and result in the
full set of changes being synchronized faster.
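As a rough sketch of where that tuning lives in the kube-proxy configuration file, the
fragment below sets `minSyncPeriod` in the `iptables` section; the values are purely
illustrative and should be tuned to your cluster's size and rate of change.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Illustrative values only; tune for your cluster.
iptables:
  minSyncPeriod: 1s   # batch rule updates at most once per second
  syncPeriod: 30s     # full periodic resync interval
```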
@@ -230,9 +229,9 @@ these are:

* `lblcr` (Locality Based Least Connection with Replication): Traffic for the same IP
address is sent to the server with least connections. If all the backing servers are
- overloaded, it picks up one with fewer connections and add it to the target set.
- If the target set has not changed for the specified time, the most loaded server
- is removed from the set, in order to avoid high degree of replication.
+ overloaded, it picks up one with fewer connections and adds it to the target set.
+ If the target set has not changed for the specified time, the server with the highest load
+ is removed from the set, in order to avoid a high degree of replication.

* `sh` (Source Hashing): Traffic is sent to a backing server by looking up a statically
assigned hash table based on the source IP addresses.
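For orientation, selecting one of these algorithms happens in the `ipvs` section of the
kube-proxy configuration. The fragment below is a minimal sketch that picks `sh` purely as
an example; any supported scheduler name can be used.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "sh"   # example choice: source hashing
```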
@@ -301,48 +300,48 @@ Users who want to switch from the default `iptables` mode to the
`nftables` mode should be aware that some features work slightly
differently in the `nftables` mode:

- - **NodePort interfaces**: In `iptables` mode, by default,
- [NodePort services](/docs/concepts/services-networking/service/#type-nodeport)
- are reachable on all local IP addresses. This is usually not what
- users want, so the `nftables` mode defaults to
- `--nodeport-addresses primary`, meaning NodePort services are only
- reachable on the node's primary IPv4 and/or IPv6 addresses. You can
- override this by specifying an explicit value for that option:
- e.g., `--nodeport-addresses 0.0.0.0/0` to listen on all (local)
- IPv4 IPs.
-
- - **NodePort services on `127.0.0.1`**: In `iptables` mode, if the
- `--nodeport-addresses` range includes `127.0.0.1` (and the option
- `--iptables-localhost-nodeports false` option is not passed), then
- NodePort services are reachable even on "localhost" (`127.0.0.1`).
- In `nftables` mode (and `ipvs` mode), this will not work. If you
- are not sure if you are depending on this functionality, you can
- check kube-proxy's
- `iptables_localhost_nodeports_accepted_packets_total` metric; if it
- is non-0, that means that some client has connected to a NodePort
- service via `127.0.0.1`.
-
- - **NodePort interaction with firewalls**: The `iptables` mode of
- kube-proxy tries to be compatible with overly-agressive firewalls;
- for each NodePort service, it will add rules to accept inbound
- traffic on that port, in case that traffic would otherwise be
- blocked by a firewall. This approach will not work with firewalls
- based on nftables, so kube-proxy's `nftables` mode does not do
- anything here; if you have a local firewall, you must ensure that
- it is properly configured to allow Kubernetes traffic through
- (e.g., by allowing inbound traffic on the entire NodePort range).
-
- - **Conntrack bug workarounds**: Linux kernels prior to 6.1 have a
- bug that can result in long-lived TCP connections to service IPs
- being closed with the error "Connection reset by peer". The
- `iptables` mode of kube-proxy installs a workaround for this bug,
- but this workaround was later found to cause other problems in some
- clusters. The `nftables` mode does not install any workaround by
- default, but you can check kube-proxy's
- `iptables_ct_state_invalid_dropped_packets_total` metric to see if
- your cluster is depending on the workaround, and if so, you can run
- kube-proxy with the option `--conntrack-tcp-be-liberal` to work
- around the problem in `nftables` mode.
+ - **NodePort interfaces**: In `iptables` mode, by default,
+ [NodePort services](/docs/concepts/services-networking/service/#type-nodeport)
+ are reachable on all local IP addresses. This is usually not what
+ users want, so the `nftables` mode defaults to
+ `--nodeport-addresses primary`, meaning Services using `type: NodePort` are only
+ reachable on the node's primary IPv4 and/or IPv6 addresses. You can
+ override this by specifying an explicit value for that option:
+ e.g., `--nodeport-addresses 0.0.0.0/0` to listen on all (local)
+ IPv4 IPs.
+
+ - `type: NodePort` **Services on `127.0.0.1`**: In `iptables` mode, if the
+ `--nodeport-addresses` range includes `127.0.0.1` (and the option
+ `--iptables-localhost-nodeports false` option is not passed), then
+ Services of `type: NodePort` are reachable even on "localhost" (`127.0.0.1`).
+ In `nftables` mode (and `ipvs` mode), this will not work. If you
+ are not sure if you are depending on this functionality, you can
+ check kube-proxy's
+ `iptables_localhost_nodeports_accepted_packets_total` metric; if it
+ is non-0, that means that some client has connected to a `type: NodePort`
+ Service via localhost/loopback.
+
+ - **NodePort interaction with firewalls**: The `iptables` mode of
+ kube-proxy tries to be compatible with overly-aggressive firewalls;
+ for each `type: NodePort` Service, it will add rules to accept inbound
+ traffic on that port, in case that traffic would otherwise be
+ blocked by a firewall. This approach will not work with firewalls
+ based on nftables, so kube-proxy's `nftables` mode does not do
+ anything here; if you have a local firewall, you must ensure that
+ it is properly configured to allow Kubernetes traffic through
+ (e.g., by allowing inbound traffic on the entire NodePort range).
+
+ - **Conntrack bug workarounds**: Linux kernels prior to 6.1 have a
+ bug that can result in long-lived TCP connections to service IPs
+ being closed with the error "Connection reset by peer". The
+ `iptables` mode of kube-proxy installs a workaround for this bug,
+ but this workaround was later found to cause other problems in some
+ clusters. The `nftables` mode does not install any workaround by
+ default, but you can check kube-proxy's
+ `iptables_ct_state_invalid_dropped_packets_total` metric to see if
+ your cluster is depending on the workaround, and if so, you can run
+ kube-proxy with the option `--conntrack-tcp-be-liberal` to work
+ around the problem in `nftables` mode.
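If you try the `nftables` mode, a minimal configuration sketch might look like the fragment
below. The `tcpBeLiberal` setting is an assumption about your needs: it only matters if the
conntrack metric mentioned above shows that your cluster relies on the old workaround, and
on older kube-proxy versions you may need the `--conntrack-tcp-be-liberal` flag instead.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"
conntrack:
  tcpBeLiberal: true   # optional; only if your cluster depends on the old conntrack workaround
```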

### `kernelspace` proxy mode {#proxy-mode-kernelspace}

@@ -399,7 +398,7 @@ On Windows, setting the maximum session sticky time for Services is not supporte
## IP address assignment to Services

Unlike Pod IP addresses, which actually route to a fixed destination,
- Service IPs are not actually answered by a single host. Instead, kube-proxy
+ Service IPs are not actually answered by a single host. Instead, kube-proxy
uses packet processing logic (such as Linux iptables) to define _virtual_ IP
addresses which are transparently redirected as needed.

@@ -413,7 +412,7 @@ One of the primary philosophies of Kubernetes is that you should not be
exposed to situations that could cause your actions to fail through no fault
of your own. For the design of the Service resource, this means not making
you choose your own IP address if that choice might collide with
- someone else's choice. That is an isolation failure.
+ someone else's choice. That is an isolation failure.

In order to allow you to choose an IP address for your Services, we must
ensure that no two Services can collide. Kubernetes does that by allocating each
@@ -463,13 +462,16 @@ Here is a brief example of a user querying for IP addresses:
```shell
kubectl get services
```
+
```
NAME         TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   2001:db8:1:2::1   <none>        443/TCP   3d1h
```
+
```shell
kubectl get ipaddresses
```
+
```
NAME              PARENTREF
2001:db8:1:2::1   services/default/kubernetes
@@ -483,6 +485,7 @@ from the value of the `--service-cluster-ip-range` command line argument to kube
```shell
kubectl get servicecidrs
```
+
```
NAME         CIDRS          AGE
kubernetes   10.96.0.0/28   17m
@@ -501,13 +504,15 @@ spec:
  - 10.96.0.0/24
EOF
```
+
```
servicecidr.networking.k8s.io/newcidr1 created
```

```shell
kubectl get servicecidrs
```
+
```
NAME         CIDRS          AGE
kubernetes   10.96.0.0/28   17m
@@ -560,7 +565,7 @@ spec:

Kubernetes divides the `ClusterIP` range into two bands, based on
the size of the configured `service-cluster-ip-range` by using the following formula
- `min(max(16, cidrSize / 16), 256)`. That formula paraphrases as _never less than 16 or
+ `min(max(16, cidrSize / 16), 256)`. That formula means the result is _never less than 16 or
more than 256, with a graduated step function between them_. For example, a `/24` range
(256 addresses) produces a band of 16 addresses, while a `/20` or larger range produces
the maximum band size of 256.

Kubernetes prefers to allocate dynamic IP addresses to Services by choosing from the upper band,
@@ -588,34 +593,36 @@ node-local endpoints, traffic is dropped by kube-proxy.
You can set the `.spec.externalTrafficPolicy` field to control how traffic from
external sources is routed. Valid values are `Cluster` and `Local`. Set the field
to `Cluster` to route external traffic to all ready endpoints and `Local` to only
- route to ready node-local endpoints. If the traffic policy is `Local` and there are
+ route to ready node-local endpoints. If the traffic policy is `Local` and there
are no node-local endpoints, the kube-proxy does not forward any traffic for the
relevant Service.

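For orientation, a Service requesting node-local routing of external traffic might look like
the sketch below; the name, selector, and ports are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-frontend              # hypothetical
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # only route external traffic to ready node-local endpoints
  selector:
    app: my-frontend             # hypothetical
  ports:
    - port: 80
      targetPort: 8080
```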
- If `Cluster` is specified all nodes are eligible load balancing targets _as long as_
- the node is not being deleted and kube-proxy is healthy. In this mode: load balancer
+ If `Cluster` is specified, all nodes are eligible load balancing targets _as long as_
+ the node is not being deleted and kube-proxy is healthy. In this mode: load balancer
health checks are configured to target the service proxy's readiness port and path.
In the case of kube-proxy this evaluates to: `${NODE_IP}:10256/healthz`. kube-proxy
will return either an HTTP code 200 or 503. kube-proxy's load balancer health check
endpoint returns 200 if:

1. kube-proxy is healthy, meaning:
- - it's able to progress programming the network and isn't timing out while doing
- so (the timeout is defined to be: **2 × `iptables.syncPeriod`**); and
- 2. the node is not being deleted (there is no deletion timestamp set for the Node).

- The reason why kube-proxy returns 503 and marks the node as not
- eligible when it's being deleted, is because kube-proxy supports connection
+ it's able to progress programming the network and isn't timing out while doing
+ so (the timeout is defined to be: **2 × `iptables.syncPeriod`**); and
+
+ 1. the node is not being deleted (there is no deletion timestamp set for the Node).
+
+ kube-proxy returns 503 and marks the node as not
+ eligible when it's being deleted because it supports connection
draining for terminating nodes. A couple of important things occur from the point
of view of a Kubernetes-managed load balancer when a node _is being_/_is_ deleted.

While deleting:

* kube-proxy will start failing its readiness probe and essentially mark the
- node as not eligible for load balancer traffic. The load balancer health
- check failing causes load balancers which support connection draining to
- allow existing connections to terminate, and block new connections from
- establishing.
+ node as not eligible for load balancer traffic. The load balancer health
+ check failing causes load balancers which support connection draining to
+ allow existing connections to terminate, and block new connections from
+ establishing.

When deleted:

@@ -640,7 +647,7 @@ metrics publish two series, one with the 200 label and one with the 503 one.
For `Local` Services: kube-proxy will return 200 if

1. kube-proxy is healthy/ready, and
- 2. has a local endpoint on the node in question.
+ 1. has a local endpoint on the node in question.

Node deletion does **not** have an impact on kube-proxy's return
code for what concerns load balancer health checks. The reason for this is:
@@ -667,13 +674,13 @@ If there are local endpoints and **all** of them are terminating, then kube-prox
will forward traffic to those terminating endpoints. Otherwise, kube-proxy will always
prefer forwarding traffic to endpoints that are not terminating.

- This forwarding behavior for terminating endpoints exist to allow `NodePort` and `LoadBalancer`
+ This forwarding behavior for terminating endpoints exists to allow `NodePort` and `LoadBalancer`
Services to gracefully drain connections when using `externalTrafficPolicy: Local`.

As a deployment goes through a rolling update, nodes backing a load balancer may transition from
N to 0 replicas of that deployment. In some cases, external load balancers can send traffic to
a node with 0 replicas in between health check probes. Routing traffic to terminating endpoints
- ensures that Node's that are scaling down Pods can gracefully receive and drain traffic to
+ ensures that Nodes that are scaling down Pods can gracefully receive and drain traffic to
those terminating Pods. By the time the Pod completes termination, the external load balancer
should have seen the node's health check failing and fully removed the node from the backend
pool.
@@ -738,8 +745,8 @@ difference in their approaches:
overload](#considerations-for-using-traffic-distribution-control).

If the `service.kubernetes.io/topology-mode` annotation is set to `Auto`, it
- will take precedence over `trafficDistribution`. (The annotation may be deprecated
- in the future in favour of the `trafficDistribution` field).
+ will take precedence over `trafficDistribution`. The annotation may be deprecated
+ in the future in favor of the `trafficDistribution` field.
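To make the field concrete, the sketch below shows a Service opting in via
`trafficDistribution`; the name, selector, and ports are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-backend          # hypothetical
spec:
  selector:
    app: my-backend         # hypothetical
  ports:
    - port: 80
      targetPort: 8080
  trafficDistribution: PreferClose   # prefer topologically closer endpoints
```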

### Interaction with Traffic Policies

@@ -770,8 +777,7 @@ node", etc.) as the clients, then endpoints may become overloaded. This is
especially likely if incoming traffic is not proportionally distributed across
the topology. To mitigate this, consider the following strategies:

- * [Pod Topology Spread
- Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/):
+ * [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/):
Use Pod Topology Spread Constraints to distribute your pods evenly
across zones or nodes.

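As a sketch of that first strategy, a workload's Pod template could spread replicas across
zones as shown below; the label and skew values are hypothetical and should be tuned to your
topology.

```yaml
# Pod template fragment (for example, inside a Deployment's spec.template.spec)
topologySpreadConstraints:
  - maxSkew: 1                          # at most 1 more Pod in any one zone than another
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # prefer, but do not require, an even spread
    labelSelector:
      matchLabels:
        app: my-backend                 # hypothetical label
```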
@@ -793,4 +799,3 @@ You can also:
* Read about [Services](/docs/concepts/services-networking/service/) as a concept
* Read about [Ingresses](/docs/concepts/services-networking/ingress/) as a concept
* Read the [API reference](/docs/reference/kubernetes-api/service-resources/service-v1/) for the Service API
-