We expect no non-infra-related flakes in the last month as a GA graduation criterion.
264
+
-->
265
+
266
+
N/A
267
+
268
+
--
269
+
270
+
This feature doesn't introduce any new API endpoints and doesn't interact with other components.
271
+
So E2E tests don't add extra value beyond the integration tests.
272
+
210
273
### Graduation Criteria
211
274
212
275
#### Alpha (v1.24):
@@ -292,13 +355,23 @@ rollout. Similarly, consider large clusters and how enablement/disablement
292
355
will rollout across nodes.
293
356
-->
294
357
358
+
It shouldn't impact already running workloads. It's an opt-in feature,
359
+
and users need to set the `pod.spec.topologySpreadConstraints.minDomains` field to use this feature.
360
+
361
+
When this feature is disabled via the feature gate, the `pod.spec.topologySpreadConstraints.minDomains` field of already created Pods is preserved,
362
+
but the `pod.spec.topologySpreadConstraints.minDomains` field of newly created Pods is silently dropped.
363
+
364
+
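The preserve-versus-drop behavior described above can be sketched as follows. This is an illustrative model only, not the apiserver's code: the function name `drop_disabled_min_domains`, the dict-shaped Pods, and the `gate_enabled` parameter are assumptions made for the example, and treating an update of a stored Pod as the "already created" case is also an assumption.

```python
import copy


def _uses_min_domains(pod):
    """Return True if any topology spread constraint of the Pod sets minDomains."""
    constraints = pod.get("spec", {}).get("topologySpreadConstraints") or []
    return any("minDomains" in c for c in constraints)


def drop_disabled_min_domains(new_pod, old_pod=None, gate_enabled=False):
    """Hypothetical model of how the field is handled when a Pod is written.

    - Gate enabled: the Pod is returned unchanged.
    - Gate disabled, and the stored Pod (old_pod) already uses minDomains:
      the field is preserved.
    - Gate disabled, new Pod: minDomains is silently removed from a copy.
    """
    if gate_enabled:
        return new_pod
    if old_pod is not None and _uses_min_domains(old_pod):
        return new_pod
    result = copy.deepcopy(new_pod)
    for constraint in result.get("spec", {}).get("topologySpreadConstraints") or []:
        constraint.pop("minDomains", None)
    return result


pod = {"spec": {"topologySpreadConstraints": [{"maxSkew": 1, "minDomains": 4}]}}
created = drop_disabled_min_domains(pod, gate_enabled=False)
print(created["spec"]["topologySpreadConstraints"][0])  # minDomains is gone
```

The input Pod is never mutated, so the caller can still compare what was submitted with what would be stored.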
295
365
###### What specific metrics should inform a rollback?
296
366
297
367
<!--
298
368
What signals should users be paying attention to when the feature is young
299
369
that might indicate a serious problem?
300
370
-->
301
371
372
+
- A spike in the metric `schedule_attempts_total{result="error|unschedulable"}` when pods using this feature are added.
373
+
- A spike in the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` or `scheduling_algorithm_duration_seconds` when pods using this feature are added.
374
+
302
375
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
303
376
304
377
<!--
@@ -307,12 +380,35 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
307
380
are missing a bunch of machinery and tooling and can't do that now.
308
381
-->
309
382
383
+
Yes. The behavior changed as expected.
384
+
385
+
Test scenario:
386
+
1. start kube-apiserver v1.24 where `MinDomains` feature is disabled.
387
+
2. create three nodes and pods spread across nodes as 2/2/1
388
+
3. create a new Pod that has a topology spread constraint with maxSkew 1, topologyKey `kubernetes.io/hostname`, and minDomains 4 (larger than the number of domains, which is 3).
389
+
4. the Pod created in (3) is scheduled because `MinDomains` is disabled.
390
+
5. delete the Pod created in (3).
391
+
6. recreate kube-apiserver v1.25 where `MinDomains` feature is enabled.
392
+
7. create the same Pod as (3).
393
+
8. the Pod created in (7) isn't scheduled because `MinDomains` is enabled and minDomains is larger than the number of domains (= 3).
394
+
9. delete the Pod created in (7).
395
+
10. recreate kube-apiserver v1.24 where `MinDomains` feature is disabled.
396
+
11. create the same Pod as (3).
397
+
12. the Pod created in (11) is scheduled because `MinDomains` is disabled.
398
+
13. delete the Pod created in (11).
399
+
14. recreate kube-apiserver v1.25 where `MinDomains` feature is enabled.
400
+
15. create the same Pod as (3).
401
+
16. the Pod created in (15) isn't scheduled because `MinDomains` is enabled and minDomains is larger than the number of domains (= 3).
402
+
17. delete the Pod created in (15).
403
+
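The scheduling outcomes in the scenario above can be reproduced with a small model of the skew check. This is a simplified sketch, not the scheduler's implementation: it assumes the rule that the global minimum of matching pods is treated as 0 when the number of domains is smaller than minDomains, and it assumes the constraint must be satisfied for the Pod to schedule. The names `fits` and `schedulable` and the node names are hypothetical.

```python
def fits(counts, node, max_skew, min_domains=None, feature_enabled=True):
    """Return True if adding one pod to `node` keeps the skew within max_skew.

    counts maps each domain (here: node name) to its number of matching pods.
    """
    global_min = min(counts.values())
    # Assumed minDomains rule: with too few domains, the global minimum
    # is treated as 0, which makes the skew check stricter.
    if feature_enabled and min_domains is not None and len(counts) < min_domains:
        global_min = 0
    return counts[node] + 1 - global_min <= max_skew


def schedulable(counts, max_skew, min_domains=None, feature_enabled=True):
    """Return True if at least one domain can accept the new pod."""
    return any(
        fits(counts, node, max_skew, min_domains, feature_enabled)
        for node in counts
    )


# Three nodes with pods spread as 2/2/1, as in step 2 of the scenario.
counts = {"node-a": 2, "node-b": 2, "node-c": 1}

print(schedulable(counts, max_skew=1, min_domains=4, feature_enabled=False))  # True
print(schedulable(counts, max_skew=1, min_domains=4, feature_enabled=True))   # False
```

With the feature disabled, the node holding one pod yields a skew of 1 and the Pod schedules. With it enabled, every node yields a skew of at least 2, so the Pod stays pending.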
310
404
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
311
405
312
406
<!--
313
407
Even if applying deprecation policies, they may still surprise some users.
314
408
-->
315
409
410
+
No.
411
+
316
412
### Monitoring Requirements
317
413
318
414
<!--
@@ -327,6 +423,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
327
423
logs or events for this purpose.
328
424
-->
329
425
426
+
The operator can query pods that have the `pod.spec.topologySpreadConstraints.minDomains` field set.
427
+
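One way to run such a query is to filter a Pod list client-side. The sketch below assumes the input is shaped like the JSON printed by `kubectl get pods --all-namespaces -o json` (a list object with an `items` array). The function name `pods_using_min_domains` and the sample data are hypothetical.

```python
import json


def pods_using_min_domains(pod_list):
    """Return "namespace/name" for every Pod that sets minDomains."""
    matches = []
    for pod in pod_list.get("items", []):
        constraints = pod.get("spec", {}).get("topologySpreadConstraints") or []
        if any(c.get("minDomains") is not None for c in constraints):
            meta = pod.get("metadata", {})
            matches.append(f'{meta.get("namespace", "")}/{meta.get("name", "")}')
    return matches


# Hypothetical sample input; real data would come from the API server.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "default", "name": "web-0"},
     "spec": {"topologySpreadConstraints": [
       {"maxSkew": 1, "topologyKey": "kubernetes.io/hostname", "minDomains": 4}
     ]}},
    {"metadata": {"namespace": "default", "name": "web-1"},
     "spec": {"topologySpreadConstraints": [
       {"maxSkew": 1, "topologyKey": "kubernetes.io/hostname"}
     ]}},
    {"metadata": {"namespace": "kube-system", "name": "dns"}, "spec": {}}
  ]
}
""")

print(pods_using_min_domains(sample))  # ['default/web-0']
```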
330
428
###### How can someone using this feature know that it is working for their instance?
331
429
332
430
<!--
@@ -338,13 +436,13 @@ and operation of this feature.
338
436
Recall that end users cannot usually observe component logs or access metrics.
339
437
-->
340
438
341
-
- [ ] Events
342
-
- Event Reason:
343
-
- [ ] API .status
344
-
- Condition name:
345
-
- Other field:
346
-
- [ ] Other (treat as last resort)
347
-
- Details:
439
+
- [x] Other (treat as last resort)
440
+
- Details:
441
+
The MinDomains feature in the Pod Topology Spread plugin doesn't produce any logs, events, or pod status updates.
442
+
If a Pod using `pod.spec.topologySpreadConstraints.minDomains` was successfully assigned a Node,
443
+
nodeName will be updated.
444
+
If not, the `PodScheduled` condition will be false and an event will be recorded with a detailed message
445
+
describing the reason including the failed filters. (Pod Topology Spread plugin could be one of them.)
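The check described above can be expressed as a small helper that inspects a Pod object. This is a sketch under assumptions: the Pod is a dict shaped like the API's JSON output, the function name `scheduling_outcome` is hypothetical, and the condition message in the example is illustrative rather than copied from a real cluster.

```python
def scheduling_outcome(pod):
    """Summarize whether a Pod was assigned a Node, based on its own fields."""
    node = pod.get("spec", {}).get("nodeName")
    if node:
        return f"scheduled on {node}"
    for cond in pod.get("status", {}).get("conditions") or []:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return f'unschedulable: {cond.get("message", "")}'
    return "pending"


scheduled = {"spec": {"nodeName": "node-c"}}
blocked = {
    "spec": {},
    "status": {
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                # Illustrative message only.
                "message": "node(s) didn't match pod topology spread constraints",
            }
        ]
    },
}

print(scheduling_outcome(scheduled))  # scheduled on node-c
print(scheduling_outcome(blocked))
```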
348
446
349
447
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
350
448
@@ -363,18 +461,18 @@ These goals will help you determine what you need to measure (SLIs) in the next
363
461
question.
364
462
-->
365
463
464
+
- Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` <= 100ms at the 90th percentile.
465
+
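As a rough illustration of what this SLO means, the sketch below checks a 90th-percentile threshold over raw duration samples. This is a simplification: the real metric is a histogram, so an operator would normally estimate the quantile from its buckets (for example with PromQL's `histogram_quantile`). The function names and sample values are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(samples, threshold_seconds=0.1, pct=90):
    """Return True if the pct-th percentile is at or below the threshold."""
    return percentile_nearest_rank(samples, pct) <= threshold_seconds


# Hypothetical plugin execution durations in seconds.
healthy = [0.01] * 9 + [0.5]      # one slow outlier in ten
degraded = [0.01] * 8 + [0.5] * 2  # two slow outliers in ten

print(meets_slo(healthy))   # True
print(meets_slo(degraded))  # False
```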
366
466
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?