Skip to content

Commit 1264d3d

Browse files
authored
Merge pull request kubernetes#3572 from robscott/topology-ga-update
Updating Topology Aware Hints KEP with GA Graduation Criteria
2 parents bf5c3c8 + 1cd2e5e commit 1264d3d

File tree

3 files changed

+119
-27
lines changed

3 files changed

+119
-27
lines changed

keps/prod-readiness/sig-network/2433.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,3 +3,5 @@ alpha:
33
approver: "@wojtek-t"
44
beta:
55
approver: "@wojtek-t"
6+
stable:
7+
approver: "@wojtek-t"

keps/sig-network/2433-topology-aware-hints/README.md

Lines changed: 114 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,17 @@
2323
- [Handling Node Updates](#handling-node-updates)
2424
- [Future Expansion](#future-expansion)
2525
- [Test Plan](#test-plan)
26-
- [Controller Unit Tests](#controller-unit-tests)
27-
- [Kube-Proxy Unit Tests](#kube-proxy-unit-tests)
28-
- [e2e Tests](#e2e-tests)
26+
- [Unit tests](#unit-tests)
27+
- [Controller Unit Tests](#controller-unit-tests)
28+
- [Kube-Proxy Unit Tests](#kube-proxy-unit-tests)
29+
- [Integration tests](#integration-tests)
30+
- [e2e tests](#e2e-tests)
2931
- [Observability](#observability)
32+
- [Events](#events)
33+
- [Logic](#logic)
34+
- [Sample Events](#sample-events)
35+
- [Documentation](#documentation)
36+
- [Limitations](#limitations)
3037
- [Graduation Criteria](#graduation-criteria)
3138
- [Version Skew Strategy](#version-skew-strategy)
3239
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -368,12 +375,18 @@ In the future we may expand this functionality if needed. This could include:
368375

369376
- A new `RequireZone` algorithm that would keep endpoints in EndpointSlices for
370377
the same zone they are in.
371-
- A new option to specify a minimum threshold for the `PreferZone` approach.
378+
- A new option to specify a minimum threshold for the `Auto` (PreferZone)
379+
approach.
372380
- Support for region based hints.
373381

374382
### Test Plan
375383

376-
#### Controller Unit Tests
384+
#### Unit tests
385+
386+
- `k8s.io/pkg/controller/endpointslice`: `2022-10-05` - `73.1`
387+
- `k8s.io/pkg/controller/endpointslice/topologycache`: `2022-10-05` - `75.4`
388+
389+
##### Controller Unit Tests
377390
| Test Description | Expected Result |
378391
| :--- | :--- |
379392
| Feature On, 2+ zones | Hints set |
@@ -393,31 +406,29 @@ In the future we may expand this functionality if needed. This could include:
393406
| Endpoint additions that require redistribution | Hints updated |
394407
| Endpoint removals that require redistribution | Hints updated |
395408

396-
#### Kube-Proxy Unit Tests
409+
##### Kube-Proxy Unit Tests
397410
| Test Description | Expected Result |
398411
| :--- | :--- |
399412
| Feature On, hints matching zone | Endpoints filtered |
400413
| Feature On, ExternalTrafficPolicy == 'Local', hints matching zone | Endpoints not filtered |
401414
| Feature Off, hints matching zone | Endpoints not filtered |
402415
| Feature On, no hints matching zone | Endpoints not filtered |
403416

404-
### e2e Tests
405-
This represents the largest and most uncertain part of the testing effort. We
406-
need to find a way to run e2e tests on multizone clusters. To limit flakiness,
407-
those clusters likely need to have a consistent distribution of nodes across
408-
zones. This will enable us to write predictable tests for topology aware
409-
routing.
417+
#### Integration tests
418+
419+
N/A
410420

411-
At a minimum, we likely want the following test:
421+
#### e2e tests
412422

413-
- 3 zone cluster, with 1 equivalent node per zone
414-
- Deploy a single pod to each node with a daemonset
415-
- Create a Service that targets that daemonset
416-
- Make requests from each zone and ensure that the request is routed to a pod in
417-
the same zone
423+
This feature has e2e test coverage with the ["Topology Hints"
424+
test](https://github.com/kubernetes/kubernetes/blob/fbb6ccc0c62d2431dc3a18a4130d7fbae5c05acd/test/e2e/network/topology_hints.go#L43).
425+
This is currently limited to a periodic run due to the nature of requiring a
426+
multizone cluster to run. It has been [remarkably
427+
stable](https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20multizone)
428+
with 100% green runs.
418429

419-
We'll likely need more tests to properly vet this feature, but this one should
420-
be straightforward to write and unlikely to be flaky.
430+
As a prerequisite for GA, we will ensure that this test runs as a presubmit
431+
if any code changes in kube-proxy or the EndpointSlice controller.
421432

422433
### Observability
423434
We can reuse some of the metrics of EndpointSlice Controller that we already
@@ -454,13 +465,74 @@ EndpointSliceSyncs = metrics.NewCounterVec(
454465
455466
```
456467

468+
### Events
469+
A common point of frustration among initial users of this feature was how
470+
difficult it was to tell if this feature was enabled and working as intended.
471+
Due to the nature of this design, even when a user opts in to the `Auto` mode
472+
that is no guarantee that the controller logic will determine that there are
473+
a sufficient number of endpoints to allocate them proportionally to each zone
474+
in the cluster.
475+
476+
To make this feature easier to understand and use, the EndpointSlice controller
477+
will publish events for a Service to describe if the feature has been enabled,
478+
and if not, why not.
479+
480+
#### Logic
481+
482+
The EndpointSlice controller will track the known state of this feature for
483+
each Service. When that state or the reason for it changes, the EndpointSlice
484+
controller will publish a new Event to reflect the updated status of this
485+
feature.
486+
487+
#### Sample Events
488+
489+
| Type | Reason | Message |
490+
|-|-|-|
491+
| Normal | TopologyAwareRoutingEnabled | Topology Aware Routing has been enabled |
492+
| Normal | TopologyAwareRoutingDisabled | Topology Aware Routing configuration was removed |
493+
| Warning | TopologyAwareRoutingDisabled | Insufficient number of Endpoints (n), impossible to safely allocate proportionally |
494+
| Warning | TopologyAwareRoutingDisabled | 1 or more Endpoints do not have a Zone specified |
495+
| Warning | TopologyAwareRoutingDisabled | 1 or more Nodes do not have allocatable CPU specified |
496+
| Warning | TopologyAwareRoutingDisabled | Nodes only ready in 1 zone |
497+
498+
#### Documentation
499+
500+
The Topology Aware Hints documentation will be updated to describe the reason
501+
each of these events may have been triggered, along with steps that can be taken
502+
to recover from that state.
503+
504+
#### Limitations
505+
506+
Although the events described above should dramatically simplify the use of this
507+
feature, there is a tiny edge case that will not be covered. If any
508+
EndpointSlices for a Service do not include Hints, Kube-Proxy will not implement
509+
this feature. This would happen if a user created custom EndpointSlices _and_
510+
enabled Topology Aware Hints _and_ failed to set Hints on their custom
511+
EndpointSlices. This seems very unlikely, but is mentioned here for the sake of
512+
completeness.
513+
457514
### Graduation Criteria
458515
**Alpha:**
459516
- Basic functionality covered with unit tests described above.
460517

461518
**Beta:**
462519
- Tests expanded to include e2e coverage described above.
463520

521+
**GA:**
522+
- Feedback from real world usage shows that feature is working as intended
523+
- Events are triggered on each Service to provide users with clear information
524+
on when the feature transitioned between enabled and disabled states.
525+
- Test coverage in EndpointSlice strategy to ensure that the Hints field is
526+
dropped when the feature gate is not enabled.
527+
- Test coverage in EndpointSlice controller for the transition from enabled to
528+
disabled.
529+
- Ensure that existing Topology Hints e2e test runs as a presubmit if any code
530+
changes in kube-proxy or the EndpointSlice controller.
531+
- Topology Hints e2e tests will graduate to conformance tests.
532+
- Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling
533+
(and scheduling) that allows users to proportionally distribute endpoints
534+
across zones.
535+
464536
### Version Skew Strategy
465537
This KEP requires updates to both the EndpointSlice Controller and kube-proxy.
466538
Thus there could be two potential version skew scenarios:
@@ -499,8 +571,13 @@ enabled even if the annotation has been set on the Service.
499571
EndpointSlices for Services that have this feature enabled.
500572

501573
* **Are there any tests for feature enablement/disablement?**
502-
This feature is not yet implemented but per-Service enablement/disablement is
503-
covered in depth as part of the test plan.
574+
Per Service enablement and disablement is covered in depth by unit tests. As a
575+
prerequisite for graduation to GA, we will also add the following:
576+
577+
- Test coverage in EndpointSlice strategy to ensure that the Hints field is
578+
dropped when the feature gate is not enabled.
579+
- Test coverage in EndpointSlice controller for the transition from enabled to
580+
disabled.
504581

505582
### Rollout, Upgrade and Rollback Planning
506583

@@ -517,8 +594,10 @@ enabled even if the annotation has been set on the Service.
517594
with before the feature was enabled.
518595

519596
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
520-
This feature is not yet implemented but per-Service enablement/disablement is
521-
covered in depth as part of the test plan.
597+
Per-Service enablement/disablement is covered in depth and feature gate
598+
enablement and disablement will be covered before the feature graduates to GA.
599+
In addition, manual testing covering combinations of
600+
upgrade->downgrade->upgrade cycles will be completed prior to GA graduation.
522601

523602
* **Is the rollout accompanied by any deprecations and/or removals of features,
524603
APIs, fields of API types, flags, etc.?**
@@ -567,6 +646,10 @@ enabled even if the annotation has been set on the Service.
567646
additional calls to the EndpointSlice API, but expect the increase to be
568647
minimal.
569648

649+
The EndpointSlice controller will begin publishing Events for each Service
650+
that has opted in to this feature when this transitions between enablement
651+
states.
652+
570653
* **Will enabling / using this feature result in introducing new API types?**
571654
No.
572655

@@ -612,6 +695,13 @@ enabled even if the annotation has been set on the Service.
612695

613696
- KEP Merged: February 2021
614697
- Alpha release: Kubernetes 1.21
698+
- Beta Release: Kubernetes 1.23[^1]
699+
- Feature Gate on-by default, feature available by default: 1.24
700+
701+
[^1]: This was intended to also flip the feature gate to enabled by default, but
702+
unfortunately that part was missed in 1.23. See
703+
[#108747](https://github.com/kubernetes/kubernetes/pull/108747) for more
704+
information.
615705

616706
## Drawbacks
617707
1. Increased complexity in EndpointSlice controller

keps/sig-network/2433-topology-aware-hints/kep.yaml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -23,18 +23,18 @@ replaces:
2323
- "github.com/kubernetes/enhancements/tree/master/keps/sig-network/536-topology-aware-routing"
2424

2525
# The target maturity stage in the current dev cycle for this KEP.
26-
stage: beta
26+
stage: stable
2727

2828
# The most recent milestone for which work toward delivery of this KEP has been
2929
# done. This can be the current (upcoming) milestone, if it is being actively
3030
# worked on.
31-
latest-milestone: "v1.23"
31+
latest-milestone: "v1.26"
3232

3333
# The milestone at which this feature was, or is targeted to be, at each stage.
3434
milestone:
3535
alpha: "v1.21"
3636
beta: "v1.23"
37-
stable: "v1.25"
37+
stable: "v1.26"
3838

3939
# The following PRR answers are required at alpha release
4040
# List the feature gate name and the components for which it must be enabled

0 commit comments

Comments
 (0)