Commit 1f17939

Merge pull request kubernetes#4649 from MinpengJin/log-sacledown-featuregate-stable
Update random pod scaledown KEP for stable
2 parents 4243722 + 1c2d21c commit 1f17939

3 files changed, 52 additions and 65 deletions

keps/prod-readiness/sig-apps/2185.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
   approver: "@wojtek-t"
 beta:
   approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md

Lines changed: 46 additions & 61 deletions
@@ -208,8 +208,8 @@ Beta (v1.22):
 - Enable LogarithmicScaleDown feature gate by default
 - Enable `sorting_deletion_age_ratio` metric
 
-Stable (v1.23):
-- Remove LogarithmicScaleDown feature gate
+Stable (v1.31):
+- Lock LogarithmicScaleDown feature gate to true
 - Make this behavior standard
 
 ### Upgrade / Downgrade Strategy
@@ -230,9 +230,7 @@ behavior reduces the risk that it is an expectation from other components.
 
 ### Feature Enablement and Rollback
 
-_This section must be completed when targeting alpha to a release._
-
-* **How can this feature be enabled / disabled in a live cluster?**
+###### How can this feature be enabled / disabled in a live cluster?
 - [x] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: LogarithmicScaleDown
 - Components depending on the feature gate: kube-controller-manager
@@ -243,53 +241,58 @@ _This section must be completed when targeting alpha to a release._
 - Will enabling / disabling the feature require downtime or reprovisioning
 of a node?
 
-* **Does enabling the feature change any default behavior?**
+###### Does enabling the feature change any default behavior?
 Yes, this changes the default assumption that the youngest pod in a replica set
 will always be the one evicted. However, it still groups pods by their age and picks
 from the youngest group.
 
-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
-the enablement)?**
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 Yes. Existing workloads should see no change when disabling this feature.
 
-* **What happens if we reenable the feature if it was previously rolled back?**
+###### What happens if we reenable the feature if it was previously rolled back?
 Assumptions that the newest pod will be deleted first may break.
 
-* **Are there any tests for feature enablement/disablement?**
+###### Are there any tests for feature enablement/disablement?
 Tests for feature disablement shouldn't be necessary, as this is already an assumed
 (but not documented) controller behavior.
 
 ### Rollout, Upgrade and Rollback Planning
 
-_This section must be completed when targeting beta graduation to a release._
-
-* **How can a rollout fail? Can it impact already running workloads?**
+###### How can a rollout or rollback fail? Can it impact already running workloads?
 This should not affect running workloads, though there is the possibility that the logic
 panics which would cause kube-controller-manager to crash
 
-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?
 Increased pod deletions could indicate runaway/hot-loop failures in the scaledown logic.
 Availability of applications may also be affected. Though the intent of this is to provide
 better available through more distributed victim selection, in cases of desired binpacking
 pods may remain running on undesired nodes.
 
-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-This will be manually tested before the graduation to beta
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+This is purely in-memory change for the controller, so upgrade/downgrade doesn't really change anything.
 
-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-fields of API types, flags, etc.?**
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 No
 
 ### Monitoring Requirements
 
-_This section must be completed when targeting beta graduation to a release._
-
-* **How can an operator determine if the feature is in use by workloads?**
-The scaledown behavior of all replicasets will be affected by this featuregate being
-enabled, so somehow monitoring them will be necessary to determine it
-
-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
+###### How can an operator determine if the feature is in use by workloads?
+The feature is global, so it's always going to be used on any downscale.
+
+###### How can someone using this feature know that it is working for their instance?
+- [ ] Events
+- Event Reason:
+- [ ] API .status
+- Condition name:
+- Other field:
+- [x] Other (treat as last resort)
+- Details:
+A ReplicaSet with two ready pods whose Pod Cost annotation is not set,
+if the logarithmic values of the pod ready times are identical,
+the pod with the smaller UID will be downscaled first rather than
+the latest ready one
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 - [x] Metrics
 - Metric name: sorting_deletion_age_ratio
 - [Optional] Aggregation method:
@@ -302,71 +305,52 @@ algorithm falls back to age. (Pod age is the final criteria in the sorting algor
 want to measure this ratio for deletions which don't use this feature, as those may validly fall
 outside the desired range).
 
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 There should be no values `>2` in the above metric when the Pod Cost annotation is unset
 (see https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost) and
 the pod's deletion was based on a timestamp comparison (rather than, for example, pod state).
 
-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+No, we didn't find any other gaps that could be covered by metrics.
 
 ### Dependencies
 
-_This section must be completed when targeting beta graduation to a release._
-
-* **Does this feature depend on any specific services running in the cluster?**
+###### Does this feature depend on any specific services running in the cluster?
 No, it is part of the controller-manager
 
 ### Scalability
 
-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
-
-_For beta, this section is required: reviewers must answer these questions._
-
-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
-
-* **Will enabling / using this feature result in any new API calls?**
+###### Will enabling / using this feature result in any new API calls?
 No
 
-* **Will enabling / using this feature result in introducing new API types?**
+###### Will enabling / using this feature result in introducing new API types?
 No
 
-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+###### Will enabling / using this feature result in any new calls to the cloud provider?
 No
 
-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 No
 
-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 No
 
-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 No, perhaps minimal increase in calculating the buckets for pod age
 
-### Troubleshooting
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+No
 
-_This section must be completed when targeting beta graduation to a release._
+### Troubleshooting
 
-* **How does this feature react if the API server and/or etcd is unavailable?**
+###### How does this feature react if the API server and/or etcd is unavailable?
 N/a - this is not a feature of running workloads. The main controller will not work and
 be unable to scale up or down if API or etcd are unavailable.
 
-* **What are other known failure modes?**
+###### What are other known failure modes?
 n/a
 
-* **What steps should be taken if SLOs are not being met to determine the problem?**
+###### What steps should be taken if SLOs are not being met to determine the problem?
 n/a
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
@@ -376,6 +360,7 @@ n/a
 
 - 2021-01-06: Initial KEP submitted
 - 2021-05-07: Updated KEP for graduation to beta
+- 2024-05-21:Updated KEP for graduation to GA
 
 ## Drawbacks
 

keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -20,12 +20,12 @@ see-also:
   - "/keps/sig-apps/1828-delete-priority-annotations"
 replaces:
 
-stage: beta
-latest-milestone: "v1.22"
+stage: stable
+latest-milestone: "v1.31"
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
-  stable: "v1.23"
+  stable: "v1.31"
 
 feature-gates:
   - name: LogarithmicScaleDown
@@ -35,4 +35,4 @@ disable-supported: true
 
 # The following PRR answers are required at beta release
 metrics:
-  - TBD
+  - sorting_deletion_age_ratio
