23
23
- [ Handling Node Updates] ( #handling-node-updates )
24
24
- [ Future Expansion] ( #future-expansion )
25
25
- [ Test Plan] ( #test-plan )
26
- - [ Controller Unit Tests] ( #controller-unit-tests )
27
- - [ Kube-Proxy Unit Tests] ( #kube-proxy-unit-tests )
28
- - [ e2e Tests] ( #e2e-tests )
26
+ - [ Unit tests] ( #unit-tests )
27
+ - [ Controller Unit Tests] ( #controller-unit-tests )
28
+ - [ Kube-Proxy Unit Tests] ( #kube-proxy-unit-tests )
29
+ - [ Integration tests] ( #integration-tests )
30
+ - [ e2e tests] ( #e2e-tests )
29
31
- [ Observability] ( #observability )
32
+ - [ Events] ( #events )
33
+ - [ Logic] ( #logic )
34
+ - [ Sample Events] ( #sample-events )
35
+ - [ Documentation] ( #documentation )
36
+ - [ Limitations] ( #limitations )
30
37
- [ Graduation Criteria] ( #graduation-criteria )
31
38
- [ Version Skew Strategy] ( #version-skew-strategy )
32
39
- [ Production Readiness Review Questionnaire] ( #production-readiness-review-questionnaire )
@@ -368,12 +375,18 @@ In the future we may expand this functionality if needed. This could include:
368
375
369
376
- A new `RequireZone` algorithm that would keep endpoints in EndpointSlices for
370
377
the same zone they are in.
371
- - A new option to specify a minimum threshold for the `PreferZone` approach.
378
+ - A new option to specify a minimum threshold for the `Auto` (PreferZone)
379
+ approach.
372
380
- Support for region based hints.
373
381
374
382
# ## Test Plan
375
383
376
- # ### Controller Unit Tests
384
+ # ### Unit tests
385
+
386
+ - `k8s.io/pkg/controller/endpointslice` : ` 2022-10-05` - `73.1`
387
+ - `k8s.io/pkg/controller/endpointslice/topologycache` : ` 2022-10-05` - `75.4`
388
+
389
+ # #### Controller Unit Tests
377
390
| Test Description | Expected Result |
378
391
| :--- | :--- |
379
392
| Feature On, 2+ zones | Hints set |
@@ -393,31 +406,29 @@ In the future we may expand this functionality if needed. This could include:
393
406
| Endpoint additions that require redistribution | Hints updated |
394
407
| Endpoint removals that require redistribution | Hints updated |
395
408
396
- # ### Kube-Proxy Unit Tests
409
+ # #### Kube-Proxy Unit Tests
397
410
| Test Description | Expected Result |
398
411
| :--- | :--- |
399
412
| Feature On, hints matching zone | Endpoints filtered |
400
413
| Feature On, ExternalTrafficPolicy == 'Local', hints matching zone | Endpoints not filtered |
401
414
| Feature Off, hints matching zone | Endpoints not filtered |
402
415
| Feature On, no hints matching zone | Endpoints not filtered |
403
416
404
- # ## e2e Tests
405
- This represents the largest and most uncertain part of the testing effort. We
406
- need to find a way to run e2e tests on multizone clusters. To limit flakiness,
407
- those clusters likely need to have a consistent distribution of nodes across
408
- zones. This will enable us to write predictable tests for topology aware
409
- routing.
417
+ # ### Integration tests
418
+
419
+ N/A
410
420
411
- At a minimum, we likely want the following test :
421
+ # ### e2e tests
412
422
413
- - 3 zone cluster, with 1 equivalent node per zone
414
- - Deploy a single pod to each node with a daemonset
415
- - Create a Service that targets that daemonset
416
- - Make requests from each zone and ensure that the request is routed to a pod in
417
- the same zone
423
+ This feature has e2e test coverage with the ["Topology Hints"
424
+ test](https://github.com/kubernetes/kubernetes/blob/fbb6ccc0c62d2431dc3a18a4130d7fbae5c05acd/test/e2e/network/topology_hints.go#L43).
425
+ This is currently limited to a periodic run due to the nature of requiring a
426
+ multizone cluster to run. It has been [remarkably
427
+ stable](https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20multizone)
428
+ with 100% green runs.
418
429
419
- We'll likely need more tests to properly vet this feature, but this one should
420
- be straightforward to write and unlikely to be flaky .
430
+ As a prerequisite for GA, we will ensure that this test runs as a presubmit
431
+ if any code changes in kube-proxy or the EndpointSlice controller .
421
432
422
433
# ## Observability
423
434
We can reuse some of the metrics of EndpointSlice Controller that we already
@@ -454,13 +465,74 @@ EndpointSliceSyncs = metrics.NewCounterVec(
454
465
455
466
` ` `
456
467
468
+ # ## Events
469
+ A common point of frustration among initial users of this feature was how
470
+ difficult it was to tell if this feature was enabled and working as intended.
471
+ Due to the nature of this design, even when a user opts in to the `Auto` mode
472
+ that is no guarantee that the controller logic will determine that there are
473
+ a sufficient number of endpoints to allocate them proportionally to each zone
474
+ in the cluster.
475
+
476
+ To make this feature easier to understand and use, the EndpointSlice controller
477
+ will publish events for a Service to describe if the feature has been enabled,
478
+ and if not, why not.
479
+
480
+ # ### Logic
481
+
482
+ The EndpointSlice controller will track the known state of this feature for
483
+ each Service. When that state or the reason for it changes, the EndpointSlice
484
+ controller will publish a new Event to reflect the updated status of this
485
+ feature.
486
+
487
+ # ### Sample Events
488
+
489
+ | Type | Reason | Message |
490
+ |-|-|-|
491
+ | Normal | TopologyAwareRoutingEnabled | Topology Aware Routing has been enabled |
492
+ | Normal | TopologyAwareRoutingDisabled | Topology Aware Routing configuration was removed |
493
+ | Warning | TopologyAwareRoutingDisabled | Insufficient number of Endpoints (n), impossible to safely allocate proportionally |
494
+ | Warning | TopologyAwareRoutingDisabled | 1 or more Endpoints do not have a Zone specified |
495
+ | Warning | TopologyAwareRoutingDisabled | 1 or more Nodes do not have allocatable CPU specified |
496
+ | Warning | TopologyAwareRoutingDisabled | Nodes only ready in 1 zone |
497
+
498
+ # ### Documentation
499
+
500
+ The Topology Aware Hints documentation will be updated to describe the reason
501
+ each of these events may have been triggered, along with steps that can be taken
502
+ to recover from that state.
503
+
504
+ # ### Limitations
505
+
506
+ Although the events described above should dramatically simplify the use of this
507
+ feature, there is a tiny edge case that will not be covered. If any
508
+ EndpointSlices for a Service do not include Hints, Kube-Proxy will not implement
509
+ this feature. This would happen if a user created custom EndpointSlices _and_
510
+ enabled Topology Aware Hints _and_ failed to set Hints on their custom
511
+ EndpointSlices. This seems very unlikely, but is mentioned here for the sake of
512
+ completeness.
513
+
457
514
# ## Graduation Criteria
458
515
**Alpha:**
459
516
- Basic functionality covered with unit tests described above.
460
517
461
518
**Beta:**
462
519
- Tests expanded to include e2e coverage described above.
463
520
521
+ **GA:**
522
+ - Feedback from real world usage shows that feature is working as intended
523
+ - Events are triggered on each Service to provide users with clear information
524
+ on when the feature transitioned between enabled and disabled states.
525
+ - Test coverage in EndpointSlice strategy to ensure that the Hints field is
526
+ dropped when the feature gate is not enabled.
527
+ - Test coverage in EndpointSlice controller for the transition from enabled to
528
+ disabled.
529
+ - Ensure that existing Topology Hints e2e test runs as a presubmit if any code
530
+ changes in kube-proxy or the EndpointSlice controller.
531
+ - Topology Hints e2e tests will graduate to conformance tests.
532
+ - Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling
533
+ (and scheduling) that allows users to proportionally distribute endpoints
534
+ across zones.
535
+
464
536
# ## Version Skew Strategy
465
537
This KEP requires updates to both the EndpointSlice Controller and kube-proxy.
466
538
Thus there could be two potential version skew scenarios :
@@ -499,8 +571,13 @@ enabled even if the annotation has been set on the Service.
499
571
EndpointSlices for Services that have this feature enabled.
500
572
501
573
* **Are there any tests for feature enablement/disablement?**
502
- This feature is not yet implemented but per-Service enablement/disablement is
503
- covered in depth as part of the test plan.
574
+ Per Service enablement and disablement is covered in depth by unit tests. As a
575
+ prerequisite for graduation to GA, we will also add the following :
576
+
577
+ - Test coverage in EndpointSlice strategy to ensure that the Hints field is
578
+ dropped when the feature gate is not enabled.
579
+ - Test coverage in EndpointSlice controller for the transition from enabled to
580
+ disabled.
504
581
505
582
# ## Rollout, Upgrade and Rollback Planning
506
583
@@ -517,8 +594,10 @@ enabled even if the annotation has been set on the Service.
517
594
with before the feature was enabled.
518
595
519
596
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
520
- This feature is not yet implemented but per-Service enablement/disablement is
521
- covered in depth as part of the test plan.
597
+ Per-Service enablement/disablement is covered in depth and feature gate
598
+ enablement and disablement will be covered before the feature graduates to GA.
599
+ In addition, manual testing covering combinations of
600
+ upgrade->downgrade->upgrade cycles will be completed prior to GA graduation.
522
601
523
602
* **Is the rollout accompanied by any deprecations and/or removals of features,
524
603
APIs, fields of API types, flags, etc.?**
@@ -567,6 +646,10 @@ enabled even if the annotation has been set on the Service.
567
646
additional calls to the EndpointSlice API, but expect the increase to be
568
647
minimal.
569
648
649
+ The EndpointSlice controller will begin publishing Events for each Service
650
+ that has opted in to this feature when this transitions between enablement
651
+ states.
652
+
570
653
* **Will enabling / using this feature result in introducing new API types?**
571
654
No.
572
655
@@ -612,6 +695,13 @@ enabled even if the annotation has been set on the Service.
612
695
613
696
- KEP Merged : February 2021
614
697
- Alpha release : Kubernetes 1.21
698
+ - Beta Release : Kubernetes 1.23[^1]
699
+ - Feature Gate on-by default, feature available by default : 1.24
700
+
701
+ [^1] : This was intended to also flip the feature gate to enabled by default, but
702
+ unfortunately that part was missed in 1.23. See
703
+ [#108747](https://github.com/kubernetes/kubernetes/pull/108747) for more
704
+ information.
615
705
616
706
# # Drawbacks
617
707
1. Increased complexity in EndpointSlice controller
0 commit comments