keps/sig-network/1880-multiple-service-cidrs/README.md
Lines changed: 66 additions & 14 deletions
@@ -742,18 +742,38 @@ A rollout failure can impact the new apiserver to not be able to start or the cl
742
742
743
743
###### What specific metrics should inform a rollback?
744
744
745
-
If the apiserver is not able to start, can be detected during the rollout phase, not metrics needed.
746
-
Users not be able to create Services can be monitored via two sets of metrics:
747
-
1. IP allocator pkg/registry/core/service/ipallocator/metrics.go
748
-
2. IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go
745
+
The feature impacts the apiserver bootstrap process, especially the kubernetes.default IP address assignment;
746
+
if the apiserver is not able to start after enabling the feature, it is a strong indication that a rollback is required.
747
+
748
+
Other metrics that can indicate the need for a rollback are `clusterip_allocator.allocation_errors_total`, `clusterip_repair.ip_errors_total` and `clusterip_repair.reconcile_errors_total`; their definitions can be found in:
749
+
- IP allocator pkg/registry/core/service/ipallocator/metrics.go
750
+
- IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go
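As a rough illustration of how these counters could be checked, the sketch below scans a metrics payload in Prometheus text format and reports any of the three error counters with a value above zero. It assumes the subsystem and metric name are joined with an underscore in the exposed series name and that no timestamp follows the value; the `kube_apiserver_` prefix, the labels and the sample payload are hypothetical and should be checked against the metric definitions listed above.

```python
# Hypothetical helper, not part of Kubernetes: flag non-zero error counters.
# Assumption: "subsystem.name" is exposed as "..._subsystem_name".
ROLLBACK_SUFFIXES = (
    "clusterip_allocator_allocation_errors_total",
    "clusterip_repair_ip_errors_total",
    "clusterip_repair_reconcile_errors_total",
)


def nonzero_error_counters(metrics_text):
    """Return {series: value} for watched counters whose value is above zero."""
    hits = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name.endswith(ROLLBACK_SUFFIXES) and float(value) > 0:
            hits[series] = float(value)
    return hits


# Made-up payload for illustration only; labels and prefix are assumptions.
SAMPLE = """\
# TYPE kube_apiserver_clusterip_allocator_allocation_errors_total counter
kube_apiserver_clusterip_allocator_allocation_errors_total{cidr="10.96.0.0/28",scope="dynamic"} 3
kube_apiserver_clusterip_repair_ip_errors_total{type="leak"} 0
kube_apiserver_clusterip_allocator_allocated_ips{cidr="10.96.0.0/28"} 5
"""

suspicious = nonzero_error_counters(SAMPLE)
```

A non-empty result only says that errors occurred at some point; whether that justifies a rollback still depends on how fast the counters grow after the feature is enabled.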
749
751
750
752
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
751
753
752
-
<!--
753
-
Describe manual testing that was done and the outcomes.
754
-
Longer term, we may want to require automated upgrade/rollback tests, but we
755
-
are missing a bunch of machinery and tooling and can't do that now.
5. create new services and assert previous and new ones are correct
774
+
6. shutdown apiserver
775
+
7. start new apiserver with feature disabled
776
+
8. create new services and assert previous and new ones are correct
757
777
758
778
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
759
779
@@ -764,12 +784,38 @@ No
764
784
765
785
###### How can an operator determine if the feature is in use by workloads?
766
786
767
-
A new group of metrics are added for the new Cluster IP allocators
768
-
1. IP allocator pkg/registry/core/service/ipallocator/metrics.go
769
-
2. IP repair loop pkg/registry/core/service/ipallocator/controller/metrics.go
787
+
A group of metrics is added for each new ClusterIP allocator, labeled with the
788
+
corresponding ServiceCIDR associated with the allocator:
789
+
790
+
`clusterip_allocator.allocated_ips`
791
+
`clusterip_allocator.available_ips`
792
+
`clusterip_allocator.allocation_total`
793
+
794
+
See IP allocator pkg/registry/core/service/ipallocator/metrics.go for definitions.
770
795
771
796
Users can also obtain the `ServiceCIDR` and `IPAddress` objects that are only available
772
-
if the feature is enabled.
797
+
if the feature is enabled. For each Service with an associated ClusterIP there must be a
798
+
corresponding IPAddress object. For example:
799
+
800
+
```
801
+
$ kubectl get service kubernetes
802
+
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
803
+
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   17d
804
+
```
805
+
806
+
```
807
+
$ kubectl get ipaddress 10.96.0.1
808
+
NAME        PARENTREF
808
+
10.96.0.1   services/default/kubernetes
810
+
```
811
+
812
+
All configured ServiceCIDR ranges must be present, including the ones created from the
813
+
apiserver flags to initialize the cluster, with the special name `kubernetes`:
814
+
```
815
+
$ kubectl get servicecidr
816
+
NAME         CIDRS          AGE
816
+
kubernetes   10.96.0.0/28   17d
818
+
```
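The relationship shown in the outputs above (every ClusterIP falls inside one of the configured ServiceCIDR ranges) can be sanity-checked offline. The following is a minimal sketch using only the Python standard library; the function name `covering_cidrs` is hypothetical, and the inputs are assumed to be copied by hand from the `kubectl` output rather than fetched from the API.

```python
import ipaddress


def covering_cidrs(cluster_ip, service_cidrs):
    """Return the CIDR strings from service_cidrs that contain cluster_ip."""
    ip = ipaddress.ip_address(cluster_ip)
    # Membership is False when the IP families differ (IPv4 vs IPv6).
    return [c for c in service_cidrs if ip in ipaddress.ip_network(c)]


# Values taken from the example outputs above.
matches = covering_cidrs("10.96.0.1", ["10.96.0.0/28"])
```

An empty list for a Service's ClusterIP would suggest that no configured ServiceCIDR covers that address.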
773
819
774
820
###### How can someone using this feature know that it is working for their instance?
775
821
@@ -800,6 +846,8 @@ Recall that end users cannot usually observe component logs or access metrics.
800
846
801
847
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
802
848
849
+
- 99.9% of ClusterIP allocations per day take less than 500 ms.
850
+
- 100% of ClusterIP allocations succeed.
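To make the two objectives concrete, here is a small sketch that evaluates them over one day of allocation samples. The sample format (a list of `(duration_seconds, succeeded)` tuples) and the function name `meets_slo` are assumptions for illustration; the source does not say how these measurements are collected.

```python
from fractions import Fraction

LATENCY_BUDGET_S = 0.5                # 500 ms
LATENCY_TARGET = Fraction(999, 1000)  # 99.9% of allocations


def meets_slo(samples):
    """samples: list of (duration_seconds, succeeded) tuples for one day."""
    if not samples:
        return True  # nothing was allocated, so nothing violated the SLO
    fast = sum(1 for duration, _ in samples if duration < LATENCY_BUDGET_S)
    all_succeeded = all(ok for _, ok in samples)
    # Fraction avoids floating-point surprises exactly at the 99.9% boundary.
    return all_succeeded and Fraction(fast, len(samples)) >= LATENCY_TARGET
```

For example, 999 fast allocations plus one slow one still meet the latency objective, while a single failed allocation breaks the 100% success objective regardless of latency.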
803
851
804
852
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
805
853
@@ -810,6 +858,10 @@ Recall that end users cannot usually observe component logs or access metrics.
810
858
811
859
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
812
860
861
+
Ideally we should have metrics to detect overlaps or IP conflicts with the Pod and Node networks, but this was
862
+
[heavily discussed in the SIG](https://docs.google.com/document/d/1Dx7Qu5rHGaqoWue-JmlwYO9g_kgOaQzwaeggUsLooKo/edit#heading=h.rkh0f6t1c3vc) and we concluded that it is not possible to get the Pod and Node network information reliably,
863
+
so any metric of this kind would be misleading.
864
+
813
865
### Dependencies
814
866
815
867
###### Does this feature depend on any specific services running in the cluster?
@@ -825,7 +877,7 @@ See Drawbacks section
825
877
826
878
When creating a Service this will require to create an IPAddress object,
827
879
previously we updated a bitmap on etcd, so we keep the number of request
828
-
but the size of the objects stored is reduced considerable.
880
+
but the size of the objects stored is reduced considerably.
829
881
830
882
###### Will enabling / using this feature result in introducing new API types?