Skip to content

Commit 74bd630

Browse files
authored
Merge pull request kubernetes#3777 from atiratree/update-healthy-policy
KEP-3017: Promote unhealthyPodEvictionPolicy for PDBs to Beta
2 parents 4db3d8d + f2b84af commit 74bd630

File tree

3 files changed

+54
-66
lines changed

3 files changed

+54
-66
lines changed
Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 3017
22
alpha:
3-
approver: "@wojtek-t"
3+
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md

Lines changed: 49 additions & 62 deletions
Original file line numberDiff line numberDiff line change
@@ -277,6 +277,7 @@ https://storage.googleapis.com/k8s-triage/index.html
277277
- <test>: <link to test coverage>
278278
-->
279279

280+
[TestEvictionWithUnhealthyPodEvictionPolicy](https://github.com/kubernetes/kubernetes/blob/c8010537913422cc221cdd784936ff99817f621c/test/integration/evictions/evictions_test.go#L417): https://storage.googleapis.com/k8s-triage/index.html?test=UnhealthyPodEvictionPolicy
280281

281282
##### e2e tests
282283

@@ -306,13 +307,15 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
306307

307308
- Feature gate enabled by default.
308309
- Existing E2E and conformance tests passing.
309-
- Consider whether we want to keep the `spec.unhealthyPodEvictionPolicy` field null by default when not specified
310-
or default it to `IfHealthyBudget` value. Both of these should preserve original behavior and handle null
311-
values. This should be tested and documented.
310+
- We want to keep the `spec.unhealthyPodEvictionPolicy` field null by default when not specified.
311+
This should preserve the original behavior and behave the same as the `IfHealthyBudget` value.
312+
This should be tested and documented.
313+
- manual test for upgrade->downgrade->upgrade path will be performed once 1.27 is released
312314

313315
#### GA
314316

315317
- Every bug report is fixed.
318+
- Introduce E2E tests for this field and confirm their stability.
316319
- The eviction API ignores the feature gate.
317320

318321
#### Deprecation
@@ -359,22 +362,27 @@ The eviction API will again start using the `unhealthyPodEvictionPolicy` if prov
359362

360363
###### Are there any tests for feature enablement/disablement?
361364

362-
No, but they will be added for alpha.
365+
- [TestPodDisruptionBudgetStrategy](https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L96-L114)
363366

364367
### Rollout, Upgrade and Rollback Planning
365368

366369
###### How can a rollout or rollback fail? Can it impact already running workloads?
367370

368-
It does not change the default behavior. Users will have to specify a policy
369-
on the PDB for behavior to be affected.
371+
Bugs could affect `/evictions` endpoint which would return server error in that case.
372+
It cannot directly affect workloads, but could potentially cause node drain to stall,
373+
which would have an effect on the cluster during an upgrade.
374+
375+
When the rollback occurs, existing filled `.spec.unhealthyPodEvictionPolicy` fields will be ignored
376+
and the old eviction behavior will be enforced for these PDBs.
370377

371378
###### What specific metrics should inform a rollback?
372379

373-
Unexpected failing eviction requests.
380+
Failing eviction requests could be an indicator. `apiserver_request_total{resource = "pods", subresource = "eviction"}` metric
381+
can be observed to detect increased rate of failing evictions.
374382

375383
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
376384

377-
A manual test will be performed, as follows:
385+
A manual test was performed, as follows:
378386

379387
1. Create a cluster in 1.25.
380388
2. Upgrade to 1.26.
@@ -383,82 +391,60 @@ A manual test will be performed, as follows:
383391
5. Verify that the eviction continue to work without using the UnhealthyPodEvictionPolicy.
384392
6. Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
385393
7. Upgrade to 1.26.
386-
8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for
387-
StatefulSet B uses the default behavior.
394+
8. Verify that eviction of pods for Deployment A and StatefulSet B use the default behavior.
395+
Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again to a PDB of Deployment A and test the eviction behavior
396+
397+
TODO:
398+
A manual test will be performed, as follows:
399+
400+
1. Create a cluster in 1.26.
401+
2. Upgrade to 1.27.
402+
3. Create Deployment A and PDB A targeting the pods of Deployment A using the `AlwaysAllow` UnhealthyPodEvictionPolicy.
403+
4. Downgrade to 1.26.
404+
5. Verify that the eviction continue to work without using the UnhealthyPodEvictionPolicy (PDBUnhealthyPodEvictionPolicy feature gate disabled by default).
405+
6. Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
406+
7. Upgrade to 1.27.
407+
8. Verify that eviction of pods for Deployment A uses the `AlwaysAllow` UnhealthyPodEvictionPolicy and eviction of pods for
408+
StatefulSet B uses the default behavior.
388409

389410
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
390411

391412
N/A
392413

393414
### Monitoring Requirements
394415

395-
<!--
396-
This section must be completed when targeting beta to a release.
397-
-->
398-
399416
###### How can an operator determine if the feature is in use by workloads?
400417

401-
<!--
402-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
403-
checking if there are objects with field X set) may be a last resort. Avoid
404-
logs or events for this purpose.
405-
-->
418+
By checking `.spec.unhealthyPodEvictionPolicy` field of the PodDisruptionBudget.
419+
Pods belonging to this PDB should be evicted according to this policy.
406420

407421
###### How can someone using this feature know that it is working for their instance?
408422

409-
<!--
410-
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
411-
for each individual pod.
412-
Pick one more of these and delete the rest.
413-
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
414-
and operation of this feature.
415-
Recall that end users cannot usually observe component logs or access metrics.
416-
-->
417-
418-
- [ ] Events
419-
- Event Reason:
420-
- [ ] API .status
421-
- Condition name:
422-
- Other field:
423-
- [ ] Other (treat as last resort)
424-
- Details:
423+
- [x] Other (treat as last resort)
424+
- Details: kube-apiserver logs and audit logs that track eviction requests can be examined to see
425+
if the `UnhealthyPodEvictionPolicy` feature is working properly.
425426

426427
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
427428

428-
<!--
429-
This is your opportunity to define what "normal" quality of service looks like
430-
for a feature.
431-
432-
It's impossible to provide comprehensive guidance, but at the very
433-
high level (needs more precise definitions) those may be things like:
434-
- per-day percentage of API calls finishing with 5XX errors <= 1%
435-
- 99% percentile over day of absolute value from (job creation time minus expected
436-
job creation time) for cron job <= 10%
437-
- 99.9% of /health requests per day finish with 200 code
438-
439-
These goals will help you determine what you need to measure (SLIs) in the next
440-
question.
441-
-->
429+
This feature should not have an impact on the eviction request latency or availability.
430+
Eviction requests should follow the [existing latency SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos)
431+
for serving mutating or read-only API calls.
442432

443433
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
444434

445-
<!--
446-
Pick one more of these and delete the rest.
447-
-->
435+
The following indicators should conform to the existing kube-apiserver SLIs.
448436

449-
- [ ] Metrics
450-
- Metric name:
451-
- [Optional] Aggregation method:
452-
- Components exposing the metric:
453-
- [ ] Other (treat as last resort)
454-
- Details:
437+
- [x] Metrics
438+
- Metric name: apiserver_request_total
439+
- [Optional] Aggregation method: resource = "pods", subresource = "eviction"
440+
- Components exposing the metric: kube-apiserver
441+
- Metric name: apiserver_request_duration_seconds
442+
- [Optional] Aggregation method: resource = "pods", subresource = "eviction"
443+
- Components exposing the metric: kube-apiserver
455444

456445
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
457446

458-
<!--
459-
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
460-
implementation difficulties, etc.).
461-
-->
447+
No
462448

463449
### Dependencies
464450

@@ -530,6 +516,7 @@ For each of them, fill in the following information by copying the below templat
530516
- 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24.
531517
- 2022-11-11: Initial alpha implementation merged into 1.26
532518
- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy)
519+
- 2023-02-06: Update for beta promotion
533520

534521
## Drawbacks
535522

keps/sig-apps/3017-pod-healthy-policy-for-pdb/kep.yaml

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,9 @@ approvers:
2424
see-also:
2525
replaces:
2626

27-
stage: alpha
27+
stage: beta
2828

29-
latest-milestone: "v1.26"
29+
latest-milestone: "v1.27"
3030

3131
# The milestone at which this feature was, or is targeted to be, at each stage.
3232
milestone:
@@ -40,5 +40,4 @@ feature-gates:
4040
- name: PodDisruptionBudgetPodHealthyPolicy
4141
components:
4242
- kube-apiserver
43-
- kube-controller-manager
4443
disable-supported: true

0 commit comments

Comments
 (0)