You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: keps/sig-node/1797-configure-fqdn-as-hostname-for-pods/README.md
+22-21Lines changed: 22 additions & 21 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -505,12 +505,12 @@ in back-to-back releases.
505
505
506
506
#### Alpha -> Beta Graduation
507
507
508
-
- Gather feedback from developers
509
-
- Tests are in Testgrid and linked in KEP
508
+
- Gather feedback from users
509
+
- Ensure e2e tests are running in Testgrid and they are stable
510
510
511
511
#### Beta -> GA Graduation
512
512
513
-
- Allowing time for feedback
513
+
- Allowing time for feedback from production users
514
514
515
515
516
516
### Upgrade / Downgrade Strategy
@@ -527,7 +527,7 @@ enhancement:
527
527
cluster required to make on upgrade in order to make use of the enhancement?
528
528
-->
529
529
530
-
We will gate off this feature for one release (1.19), then we enable it by default as Beta in the next release (1.20), then GA in release 1.21
530
+
We will gate off this feature for one release (1.19), then we enable it by default as Beta in the next release (1.20), then GA in release 1.22
531
531
532
532
### Version Skew Strategy
533
533
@@ -579,7 +579,7 @@ _This section must be completed when targeting alpha to a release._
579
579
* **How can this feature be enabled / disabled in a live cluster?**
580
580
- [X] Feature gate (also fill in values in `kep.yaml`)
581
581
- Feature gate name: setHostnameAsFQDN
582
-
- Components depending on the feature gate: Kubelet
582
+
- Components depending on the feature gate: kube-apiserver and kubelet
583
583
- [ ] Other
584
584
- Describe the mechanism:
585
585
- Will enabling / disabling the feature require downtime of the control
@@ -601,23 +601,22 @@ _This section must be completed when targeting alpha to a release._
601
601
hostname field of kernel.
602
602
603
603
* **Are there any tests for feature enablement/disablement?**
604
-
We will have unit tests and integration tests. Not sure if we need conversion tests.
604
+
No, only manual testing was performed.
605
605
606
606
### Rollout, Upgrade and Rollback Planning
607
607
608
608
_This section must be completed when targeting beta graduation to a release._
609
609
610
610
* **How can a rollout fail? Can it impact already running workloads?**
611
-
It is not clear that the rollout can fail due to this feature. The scope of this feature
612
-
is very limited and it is disabled by default.
611
+
No known failure modes.
613
612
614
613
* **What specific metrics should inform a rollback?**
615
-
Abnormal increase in `run_podsandbox_errors_total` count could be related to this feature. We should check if the feature gate is enabled and pods are using it.
614
+
Abnormal increase in `run_podsandbox_errors_total` count could be related to this feature. We should filter those pods having issues to create sandbox and check whether they are stuck due to the length of their FQDN, as described in the proposal.
616
615
617
616
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
618
-
We tested introducing and removing this feature. Running pods are not affected by
619
-
either introducing nor removing the feature. When disabling the feature, Pods
620
-
using this feature that are "stuck" due to having long FQDNs will go into
617
+
We tested enabling and disabling this feature. Running pods are not affected by
618
+
either enabling nor disabling this feature. When disabling the feature, Pods
619
+
using it that are "stuck" due to having long FQDNs will go into
621
620
running.
622
621
623
622
* **Is the rollout accompanied by any deprecations and/or removals of features,
@@ -634,15 +633,16 @@ _This section must be completed when targeting beta graduation to a release._
634
633
checking if there are objects with field X set) may be last resort. Avoid
635
634
logs or events for this purpose.
636
635
637
-
For a pod to be using this feature it needs to define the PodSpec fields
638
-
`subDomain`and `setHostnameAsFQDN`.
636
+
Listing pods in the cluster and checking if any has both
637
+
`subDomain`and `setHostnameAsFQDN` fields set.
639
638
640
639
* **What are the SLIs (Service Level Indicators) an operator can use to
641
640
determine the health of the service?**
642
641
- [X] Metrics
643
-
- Metric name: Abnormal increase in `run_podsandbox_errors_total` might be related to this feature. We should check if the feature gate is enabled and pods are using it.
642
+
- Metric name: `run_podsandbox_errors_total`
643
+
- Comment: Abnormal increase in `run_podsandbox_errors_total` might be related to this feature. We should check if the feature gate is enabled and pods are using it.
644
644
- [Optional] Aggregation method:
645
-
- Components exposing the metric:
645
+
- Components exposing the metric:Kubelet
646
646
- [ ] Other (treat as last resort)
647
647
- Details:
648
648
@@ -660,7 +660,7 @@ N/A
660
660
observability if this feature?**
661
661
Describe the metrics themselves and the reason they weren't added (e.g. cost,
662
662
implementation difficulties, etc.).
663
-
N/A
663
+
664
664
665
665
### Dependencies
666
666
@@ -692,7 +692,7 @@ previous answers based on experience in the field._
692
692
693
693
* **Will enabling / using this feature result in increasing size or count
694
694
of the existing API objects?**
695
-
No
695
+
Pods using this feature are required to set a new field, which increases the size of their objects by a couple of bytes.
696
696
697
697
* **Will enabling / using this feature result in increasing time taken by any
698
698
operations covered by [existing SLIs/SLOs][]?**
@@ -723,14 +723,15 @@ _This section must be completed when targeting beta graduation to a release._
723
723
724
724
- Mitigations: What can be done to stop the bleeding, especially for already
725
725
running user workloads?
726
-
Pods should either set the PodSpec field `setHostnameAsFQDN`.
726
+
Pods having problems to start should unset the PodSpec field `setHostnameAsFQDN`.
727
727
728
728
- Diagnostics: What are the useful log messages and their required logging
729
729
levels that could help debugging the issue?
730
-
This issue will be logged in Error level log messages. The error will something like`GeneratePodSandboxConfig for pod foo failed: Failed to construct FQDN from pod hostname and cluster domain, FQDN <long-fqdn> is too long (64 characters is the max, 70 characters requested)`
730
+
This issue will be logged in Error level log messages and in the Events. The message will be something like`GeneratePodSandboxConfig for pod foo failed: Failed to construct FQDN from pod hostname and cluster domain, FQDN <long-fqdn> is too long (64 characters is the max, 70 characters requested)`
731
731
732
732
- Testing: Are there any tests for failure mode? If not describe why.
733
-
Unittests cover this failure scenario. The e2e tests also check for this error, but only as protection for the CI. A test that purposely hit this issue will be flaky as the pod will remain in Pending status logging errors until the test timesout.
733
+
Both unittests and e2e tests cover this failure scenario.
734
+
734
735
735
736
* **What steps should be taken if SLOs are not being met to determine the problem?**
0 commit comments