Skip to content

Commit c09cc13

Browse files
authored
Merge pull request kubernetes#2665 from ravisantoshgudimetla/promote-maxSurge-beta
Graduate DS maxSurge to beta
2 parents dc66028 + e749751 commit c09cc13

File tree

3 files changed

+46
-44
lines changed

3 files changed

+46
-44
lines changed
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 1591
22
alpha:
33
approver: "@deads2k"
4+
beta:
5+
approver: "@ehashman"

keps/sig-apps/1591-daemonset-surge/README.md

Lines changed: 38 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -213,46 +213,46 @@ you need any help or guidance.
213213
_This section must be completed when targeting beta graduation to a release._
214214

215215
* **How can a rollout fail? Can it impact already running workloads?**
216-
Try to be as paranoid as possible - e.g., what if some components will restart
217-
mid-rollout?
216+
It shouldn't impact already running workloads. This is an opt-in feature since users need to explicitly set the MaxSurge parameter in the DaemonSetSet spec's RollingUpdate i.e `.spec.strategy.rollingUpdate.maxSurge` field.
217+
if the feature is disabled the field is preserved if it was already set in the presisted DaemonSetSet object, otherwise it is silently dropped.
218218

219219
* **What specific metrics should inform a rollback?**
220+
MaxSurge in DaemonSet doesn't get respected and additional surge pods won't be
221+
created. We consider the feature to be failing if enabling the featuregate and giving
222+
appropriate value to MaxSurge doesn't cause additional surge pods to be created.
223+
220224

221225
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
222-
Describe manual testing that was done and the outcomes.
223-
Longer term, we may want to require automated upgrade/rollback tests, but we
224-
are missing a bunch of machinery and tooling and can't do that now.
226+
Manually tested. No issues were found when we enabled the feature gate -> disabled it ->
227+
re-enabled the feature gate. We still need to test upgrade -> downgrade -> upgrade scenario.
225228

226229
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
227230
fields of API types, flags, etc.?**
228-
Even if applying deprecation policies, they may still surprise some users.
231+
None
229232

230233
### Monitoring Requirements
231234

232235
_This section must be completed when targeting beta graduation to a release._
233236

234237
* **How can an operator determine if the feature is in use by workloads?**
235-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
236-
checking if there are objects with field X set) may be a last resort. Avoid
237-
logs or events for this purpose.
238+
By checking the DaemonSetSet's `.spec.strategy.rollingUpdate.maxSurge` field. The additional workload pods created should be respecting the value specified in the
239+
maxSurge field.
240+
238241

239242
* **What are the SLIs (Service Level Indicators) an operator can use to determine
240243
the health of the service?**
241244
- [ ] Metrics
242245
- Metric name:
243246
- [Optional] Aggregation method:
244247
- Components exposing the metric:
245-
- [ ] Other (treat as last resort)
246-
- Details:
248+
- [x] Other (treat as last resort)
249+
- Details: The number of pods that are created above the desired amount of pods during an update when this feature is enabled can be compared to maxSurge value available in
250+
the DaemonSetSet definition. This can be used to determine the health of this feature.
251+
The existing metrics like `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` can be used to track additional pods created
247252

248253
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
249-
At a high level, this usually will be in the form of "high percentile of SLI
250-
per day <= X". It's impossible to provide comprehensive guidance, but at the very
251-
high level (needs more precise definitions) those may be things like:
252-
- per-day percentage of API calls finishing with 5XX errors <= 1%
253-
- 99% percentile over day of absolute value from (job creation time minus expected
254-
job creation time) for cron job <= 10%
255-
- 99,9% of /health requests per day finish with 200 code
254+
All the surge pods created should be within the value(% or number) of maxSurge field provided 99.99% of the time. The additinal pods created should ensure that the workload
255+
service is available 99.99% of time during updates.
256256

257257
* **Are there any missing metrics that would be useful to have to improve observability
258258
of this feature?**
@@ -264,18 +264,7 @@ of this feature?**
264264
_This section must be completed when targeting beta graduation to a release._
265265

266266
* **Does this feature depend on any specific services running in the cluster?**
267-
Think about both cluster-level services (e.g. metrics-server) as well
268-
as node-level agents (e.g. specific version of CRI). Focus on external or
269-
optional services that are needed. For example, if this feature depends on
270-
a cloud provider API, or upon an external software-defined storage or network
271-
control plane.
272-
273-
For each of these, fill in the following—thinking about running existing user workloads
274-
and creating new ones, as well as about cluster-level services (e.g. DNS):
275-
- [Dependency name]
276-
- Usage description:
277-
- Impact of its outage on the feature:
278-
- Impact of its degraded performance or high-error rates on the feature:
267+
None. It is part of kube-controller-manager.
279268

280269

281270
### Scalability
@@ -328,18 +317,24 @@ details). For now, we leave it here.
328317
_This section must be completed when targeting beta graduation to a release._
329318

330319
* **How does this feature react if the API server and/or etcd is unavailable?**
320+
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events or updates for DaemonSetSets. If the API server and/or etcd is unavailable during the mid-rollout, the featuregate would not be enabled and controller-manager wouldn't start since it cannot communicate with the API server
331321

332322
* **What are other known failure modes?**
333-
For each of them, fill in the following information by copying the below template:
334-
- [Failure mode brief description]
335-
- Detection: How can it be detected via metrics? Stated another way:
336-
how can an operator troubleshoot without logging into a master or worker node?
337-
- Mitigations: What can be done to stop the bleeding, especially for already
338-
running user workloads?
339-
- Diagnostics: What are the useful log messages and their required logging
340-
levels that could help debug the issue?
341-
Not required until feature graduated to beta.
342-
- Testing: Are there any tests for failure mode? If not, describe why.
323+
- MaxSurge not respected and too many pods are created
324+
- Detection: Looking at `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` metrics.
325+
- Mitigations: Disable the `DaemonSetUpdateSurge` feature flag
326+
- Diagnostics: Controller-manager when starting at log-level 4 and above
327+
- Testing: Yes, e2e tests are already in place
328+
- MaxSurge not respected and very few pods are created. This causes the workloads to be not be available at 99.99%
329+
- Detection: Looking at `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` metrics.
330+
- Mitigations: Disable the `DaemonSetUpdateSurge` feature flag
331+
- Diagnostics: Controller-manager when starting at log-level 4 and above
332+
- Testing: Yes, e2e tests are already in place
333+
- maxUnavailable should be set to 0 even when maxSurge is configured
334+
- Detection: Looking at the `.spec.strategy.rollingUpdate.maxSurge` and `.spec.strategy.rollingUpdate.maxUnavailable`
335+
- Mitigations: Setting maxUnavailable to appropriate value
336+
- Diagnostics: Controller-manager when starting at log-level 4 and above
337+
- Testing: Yes, e2e tests are already in place
343338

344339
* **What steps should be taken if SLOs are not being met to determine the problem?**
345340

@@ -348,4 +343,6 @@ _This section must be completed when targeting beta graduation to a release._
348343

349344
## Implementation History
350345

351-
- Initial PR:
346+
- 2021-02-09: Initial KEP merged
347+
- 2021-03-05: Initial implementation merged
348+
- 2021-04-30: Graduate the feature to Beta proposed

keps/sig-apps/1591-daemonset-surge/kep.yaml

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -9,14 +9,17 @@ participating-sigs:
99
reviewers:
1010
- "@kow3ns"
1111
- "@janetkuo"
12+
- "@soltysh"
1213
approvers:
1314
- "@kow3ns"
1415
- "@janetkuo"
16+
- "@soltysh"
1517
prr-approvers:
1618
- "@deads2k"
19+
- "@ehashman"
1720
editor: TBD
1821
creation-date: 2020-03-02
19-
last-updated: 2020-03-02
22+
last-updated: 2020-05-06
2023
status: implementable
2124
replaces:
2225
superseded-by:
@@ -26,8 +29,8 @@ feature-gates:
2629
- kube-apiserver
2730
- kube-controller-manager
2831
disable-supported: true
29-
stage: "alpha"
30-
latest-milestone: "v1.21"
32+
stage: "beta"
33+
latest-milestone: "v1.22"
3134
milestone:
3235
alpha: "v1.21"
3336
beta: "v1.22"

0 commit comments

Comments
 (0)