 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-It does not change the default behavior. Users will have to specify a policy
-on the PDB for behavior to be affected.
+Bugs could affect the `/evictions` endpoint, which would return a server error in that case.
+It cannot directly affect workloads, but could potentially cause node drain to stall,
+which would have an effect on the cluster during an upgrade.
+
+When a rollback occurs, any `.spec.unhealthyPodEvictionPolicy` fields that are already set will be ignored
+and the old eviction behavior will be enforced for these PDBs.

 ###### What specific metrics should inform a rollback?

-Unexpected failing eviction requests.
+Failing eviction requests could be an indicator. The `apiserver_request_total{resource="pods", subresource="eviction"}` metric
+can be observed to detect an increased rate of failing evictions.
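
As a rough illustration of how that metric could be read, the sketch below computes the share of eviction requests that ended in a server error. It assumes the samples are per-series counter increases over some window, already scraped into `(labels, value)` pairs; the helper name `eviction_error_ratio` is hypothetical. Only 5xx codes are counted as failures here, because a 429 response is the normal outcome when a PDB blocks an eviction.

```python
from typing import Iterable, Mapping, Tuple

# One sample = (label set, counter increase over the observation window).
Sample = Tuple[Mapping[str, str], float]


def eviction_error_ratio(samples: Iterable[Sample]) -> float:
    """Return the fraction of pod eviction requests that ended with a 5xx code.

    Hypothetical helper: it only looks at series labelled
    resource="pods" and subresource="eviction".
    """
    total = 0.0
    errors = 0.0
    for labels, value in samples:
        if labels.get("resource") != "pods" or labels.get("subresource") != "eviction":
            continue  # not an eviction request series
        total += value
        if labels.get("code", "").startswith("5"):
            errors += value  # server-side failure
    return errors / total if total else 0.0
```

A sustained rise of this ratio after enabling the feature gate would be one possible rollback signal; the threshold is left to the operator.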

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-A manual test will be performed, as follows:
+A manual test was performed, as follows:

 1. Create a cluster in 1.25.
 2. Upgrade to 1.26.
@@ -383,82 +391,60 @@ A manual test will be performed, as follows:
 5. Verify that evictions continue to work without using the UnhealthyPodEvictionPolicy.
 6. Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
 7. Upgrade to 1.26.
-8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for
-StatefulSet B uses the default behavior.
+8. Verify that eviction of pods for Deployment A and StatefulSet B uses the default behavior.
+Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again on the PDB of Deployment A, and test the eviction behavior.
+
+TODO:
+A manual test will be performed, as follows:
+
+1. Create a cluster in 1.26.
+2. Upgrade to 1.27.
+3. Create Deployment A and PDB A targeting the pods of Deployment A using the `AlwaysAllow` UnhealthyPodEvictionPolicy.
+4. Downgrade to 1.26.
+5. Verify that evictions continue to work without using the UnhealthyPodEvictionPolicy (the `PDBUnhealthyPodEvictionPolicy` feature gate is disabled by default).
+6. Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
+7. Upgrade to 1.27.
+8. Verify that eviction of pods for Deployment A uses the `AlwaysAllow` UnhealthyPodEvictionPolicy and eviction of pods for
+StatefulSet B uses the default behavior.
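
For steps 3 and 6 above, the two PDBs could be generated along the following lines. This is only a sketch: the helper `pdb_manifest`, the object names, and the label selectors are hypothetical, and `minAvailable: 1` is an arbitrary choice. The field name `.spec.unhealthyPodEvictionPolicy` and the `AlwaysAllow` value come from the text; `IfHealthyBudget` is the other value accepted by the `policy/v1` API.

```python
import json
from typing import Mapping, Optional


def pdb_manifest(name: str, match_labels: Mapping[str, str], min_available: int,
                 unhealthy_policy: Optional[str] = None) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict (hypothetical helper)."""
    spec = {
        "minAvailable": min_available,
        "selector": {"matchLabels": dict(match_labels)},
    }
    if unhealthy_policy is not None:
        if unhealthy_policy not in ("IfHealthyBudget", "AlwaysAllow"):
            raise ValueError(f"unknown policy: {unhealthy_policy}")
        spec["unhealthyPodEvictionPolicy"] = unhealthy_policy
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": spec,
    }


# PDB A opts in to AlwaysAllow; PDB B leaves the field unset (default behavior).
pdb_a = pdb_manifest("pdb-a", {"app": "deployment-a"}, 1, "AlwaysAllow")
pdb_b = pdb_manifest("pdb-b", {"app": "statefulset-b"}, 1)
print(json.dumps(pdb_a, indent=2))
```

The printed JSON can be passed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.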

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 N/A

 ### Monitoring Requirements

-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### How can an operator determine if the feature is in use by workloads?

-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
--->
+By checking the `.spec.unhealthyPodEvictionPolicy` field of the PodDisruptionBudget.
+Pods belonging to this PDB should be evicted according to this policy.
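
One way to perform that check across a cluster is sketched below. It assumes the input is the parsed JSON list produced by something like `kubectl get pdb --all-namespaces -o json`; the function name `pdbs_with_policy` is hypothetical.

```python
from typing import List, Tuple


def pdbs_with_policy(pdb_list: dict) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, policy) for every PDB that sets
    .spec.unhealthyPodEvictionPolicy (hypothetical helper)."""
    found = []
    for item in pdb_list.get("items", []):
        policy = item.get("spec", {}).get("unhealthyPodEvictionPolicy")
        if policy:  # field unset means the PDB uses the default behavior
            meta = item.get("metadata", {})
            found.append((meta.get("namespace", ""), meta.get("name", ""), policy))
    return found
```

An empty result suggests that no workload in the cluster currently relies on the feature.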

 ###### How can someone using this feature know that it is working for their instance?

-<!--
-For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
-for each individual pod.
-Pick one more of these and delete the rest.
-Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
-and operation of this feature.
-Recall that end users cannot usually observe component logs or access metrics.
--->
-
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Other (treat as last resort)
+  - Details: kube-apiserver logs and audit logs that track eviction requests can be examined to see
+    if the `UnhealthyPodEvictionPolicy` feature is working properly.
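
As a hedged sketch of what examining audit logs could look like, the snippet below tallies eviction requests by response code. It assumes the audit backend writes one JSON `audit.k8s.io/v1` event per line and that the audit policy records pod eviction requests; the function name `eviction_outcomes` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def eviction_outcomes(audit_lines: Iterable[str]) -> Counter:
    """Count completed pod eviction requests by HTTP response code.

    Hypothetical helper over JSON-lines audit events.
    """
    outcomes: Counter = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods" or ref.get("subresource") != "eviction":
            continue  # not an eviction request
        if event.get("stage") != "ResponseComplete":
            continue  # count each request once, when its response is complete
        code = (event.get("responseStatus") or {}).get("code")
        outcomes[code] += 1
    return outcomes
```

Comparing the counts for pods behind a PDB with `AlwaysAllow` against the expected behavior (for example, unhealthy pods being evicted instead of rejected with 429) gives a manual indication that the policy is applied.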

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
--->
+This feature should not have an impact on the eviction request latency or availability.
+Eviction requests should follow the [existing latency SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos)
+for serving mutating or read-only API calls.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-<!--
-Pick one more of these and delete the rest.
--->
+The following indicators should conform to the existing kube-apiserver SLIs.