`keps/sig-apps/1591-daemonset-surge/README.md`

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  It shouldn't impact already running workloads. This is an opt-in feature, since users need to explicitly set the maxSurge parameter in the DaemonSet spec's RollingUpdate configuration, i.e. the `.spec.updateStrategy.rollingUpdate.maxSurge` field.
  If the feature gate is disabled, the field is preserved if it was already set in the persisted DaemonSet object; otherwise it is silently dropped.
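  A minimal sketch of what opting in looks like (the DaemonSet name, labels, and image below are illustrative, not taken from this KEP):

  ```yaml
  # Hypothetical DaemonSet that opts in to surge-based rolling updates.
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: example-agent                    # illustrative name
  spec:
    selector:
      matchLabels:
        app: example-agent
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 10%                      # a percentage or an absolute number of extra pods
        maxUnavailable: 0                  # must be 0 while maxSurge is in use
    template:
      metadata:
        labels:
          app: example-agent
      spec:
        containers:
        - name: agent
          image: registry.k8s.io/pause:3.9 # placeholder image
  ```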
* **What specific metrics should inform a rollback?**
  maxSurge not being respected, i.e. additional surge pods not being created during an update, should inform a rollback. We consider the feature to be failing if enabling the feature gate and setting an appropriate maxSurge value does not cause additional surge pods to be created.
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
  Manually tested. No issues were found when we enabled the feature gate -> disabled it -> re-enabled the feature gate. We still need to test the upgrade -> downgrade -> upgrade scenario.
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**
  None.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._
* **How can an operator determine if the feature is in use by workloads?**
  By checking the DaemonSet's `.spec.updateStrategy.rollingUpdate.maxSurge` field. The additional workload pods created during an update should respect the value specified in the maxSurge field.
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [x] Other (treat as last resort)
    - Details: The number of pods created above the desired number of pods during an update when this feature is enabled can be compared to the maxSurge value in the DaemonSet definition; this can be used to determine the health of the feature. The existing metrics `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` can be used to track the additional pods created (see the rule sketch after this list).
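  One hedged way to turn these metrics into a concrete indicator is a Prometheus rule. The sketch below assumes kube-state-metrics is scraped, and the 10% threshold mirrors a hypothetical `maxSurge: 10%`; neither the rule name nor the threshold comes from this KEP:

  ```yaml
  groups:
  - name: daemonset-surge-health          # illustrative rule group
    rules:
    - alert: DaemonSetSurgeAboveMaxSurge
      # Fires when more extra pods are scheduled than a 10% maxSurge would allow.
      expr: |
        kube_daemonset_status_current_number_scheduled
          - kube_daemonset_status_desired_number_scheduled
          > ceil(0.10 * kube_daemonset_status_desired_number_scheduled)
      for: 15m
      annotations:
        summary: DaemonSet is running more surge pods than its maxSurge should allow.
  ```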
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  All surge pods created should be within the value (a percentage or an absolute number) of the provided maxSurge field 99.99% of the time. The additional pods created should ensure that the workload service is available 99.99% of the time during updates.
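  As an illustrative calculation (the numbers are hypothetical, not from this KEP): with `maxSurge: 10%` on a DaemonSet scheduled to 50 nodes, at most ceil(0.10 × 50) = 5 surge pods may exist at any point during the update, so observing more than 55 pods for that DaemonSet would violate the SLO above.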
* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

### Dependencies

_This section must be completed when targeting beta graduation to a release._
* **Does this feature depend on any specific services running in the cluster?**
  None. It is part of kube-controller-manager.

### Scalability
### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._
* **How does this feature react if the API server and/or etcd is unavailable?**
  This feature will not work if the API server or etcd is unavailable, since kube-controller-manager won't even be able to receive events or updates for DaemonSets. If the API server and/or etcd becomes unavailable mid-rollout, the feature gate cannot be enabled, because kube-controller-manager will not start while it cannot communicate with the API server.
* **What are other known failure modes?**
  - maxSurge is not respected and too many pods are created
    - Detection: Look at the `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` metrics.
    - Mitigations: Disable the `DaemonSetUpdateSurge` feature gate.
    - Diagnostics: Run kube-controller-manager at log level 4 or above and inspect its logs.
    - Testing: Yes, e2e tests are already in place.
  - maxSurge is not respected and too few pods are created, which causes the workload to miss 99.99% availability
    - Detection: Look at the `kube_daemonset_status_number_available` and `kube_daemonset_status_number_unavailable` metrics.
    - Mitigations: Disable the `DaemonSetUpdateSurge` feature gate.
    - Diagnostics: Run kube-controller-manager at log level 4 or above and inspect its logs.
    - Testing: Yes, e2e tests are already in place.
  - maxUnavailable is not set to 0 even though maxSurge is configured (it must be 0 for surge updates to take effect)
    - Detection: Look at the `.spec.updateStrategy.rollingUpdate.maxSurge` and `.spec.updateStrategy.rollingUpdate.maxUnavailable` fields.
    - Mitigations: Set maxUnavailable to an appropriate value; see the patch sketch after this list.
    - Diagnostics: Run kube-controller-manager at log level 4 or above and inspect its logs.
    - Testing: Yes, e2e tests are already in place.
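  For the last failure mode, a hedged sketch of the mitigation as a strategic-merge patch file (applied with e.g. `kubectl patch`; the 10% value is illustrative, not from this KEP):

  ```yaml
  # patch.yaml -- restore a surge-compatible rolling update configuration.
  spec:
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 10%        # illustrative value
        maxUnavailable: 0    # must be 0 while maxSurge is set
  ```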
* **What steps should be taken if SLOs are not being met to determine the problem?**

## Implementation History
- 2021-02-09: Initial KEP merged
- 2021-03-05: Initial implementation merged
- 2021-04-30: Graduation of the feature to Beta proposed