docs(MADR): add MADR-083 for Envoy Zone-Aware Routing #14532
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # Envoy Zone Aware Routing Integration | ||
|
|
||
| **Status:** accepted | ||
|
|
||
| ## Context and Problem Statement | ||
|
|
||
| Kuma’s existing **locality-aware load balancing** uses Envoy’s priority-based routing via `MeshLoadBalancingStrategy` to favor dataplanes sharing the same `kuma.io/zone` tag. However, it does **not** account for the relative number of instances per availability zone when distributing traffic. In environments where tasks are unevenly distributed across zones, this can overload individual instances and degrade performance. | ||
|
|
||
| **Example scenario:** | ||
| - **Source service:** 4 tasks in AZ-A, 1 in AZ-B | ||
| - **Destination service:** 1 task in AZ-A, 4 in AZ-B | ||
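| With priority-based routing, the four source tasks in AZ-A would still send their traffic to the single destination instance in AZ-A, so that instance carries far more load than each of the four instances in AZ-B. Envoy’s **zone aware routing** keeps as much traffic as possible in the local zone while balancing load evenly across all endpoints. | ||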
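The per-instance load in this scenario can be worked out with a short sketch. It is a simplified model of the zone aware routing behavior described in Envoy's documentation, not Envoy's actual code: it assumes every source task sends one unit of traffic, all hosts are healthy, and `min_cluster_size` is satisfied. The function names `priority_load` and `zone_aware_load` are hypothetical.

```python
from fractions import Fraction


def priority_load(src, dst):
    """Per-instance load when every source keeps all traffic in its own zone."""
    return {zone: Fraction(src.get(zone, 0), dst[zone]) for zone in dst}


def zone_aware_load(src, dst):
    """Per-instance load under a simplified zone-aware model.

    A zone whose share of sources exceeds its share of destinations keeps
    only a proportional part of its traffic local and spills the rest to
    zones that have residual capacity.
    """
    total_src = sum(src.values())
    total_dst = sum(dst.values())
    received = dict.fromkeys(dst, Fraction(0))
    residual = dict.fromkeys(dst, Fraction(0))
    spill = Fraction(0)

    for zone in dst:
        sent = src.get(zone, 0)
        src_share = Fraction(sent, total_src)
        dst_share = Fraction(dst[zone], total_dst)
        if src_share <= dst_share:
            # The local zone can absorb all local traffic.
            received[zone] += sent
            residual[zone] = dst_share - src_share
        else:
            # Only dst_share / src_share of the local traffic stays local.
            local = sent * dst_share / src_share
            received[zone] += local
            spill += sent - local

    total_residual = sum(residual.values())
    if total_residual > 0:
        for zone in dst:
            # Spilled traffic is split in proportion to residual capacity.
            received[zone] += spill * residual[zone] / total_residual

    return {zone: received[zone] / dst[zone] for zone in dst}


if __name__ == "__main__":
    sources = {"AZ-A": 4, "AZ-B": 1}
    destinations = {"AZ-A": 1, "AZ-B": 4}
    print("priority  :", priority_load(sources, destinations))
    print("zone-aware:", zone_aware_load(sources, destinations))
```

Under these assumptions the strictly local strategy puts 4 units on the single AZ-A instance and 1/4 unit on each AZ-B instance, while the zone-aware model gives every destination instance 1 unit.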
|
|
||
| ## Decision Drivers | ||
|
|
||
| - **True zone-awareness:** factor zone-level instance counts into routing decisions. | ||
| - **Minimal infrastructure changes:** no additional proxies or sidecars. | ||
| - **Leverage Envoy’s native implementation:** avoid reimplementing complex load-balancing logic. | ||
| - **Compatibility:** support both Kubernetes and universal modes. | ||
| - **Maintain existing `kuma.io/zone` tags:** avoid forcing users to reconfigure zones. | ||
|
|
||
| ## Considered Options | ||
|
|
||
| 1. **Enhance priority-based strategy** | ||
| Extend `MeshLoadBalancingStrategy` to calculate custom weights mimicking zone-aware behavior. | ||
| *Pros:* No external flags. *Cons:* Reinventing Envoy’s algorithm, maintenance burden. | ||
|
|
||
| 2. **Expose Envoy’s `zone_aware_lb_config` directly** | ||
| Add a new `type: ZoneAware` mode in `MeshLoadBalancingStrategy` that sets `common_lb_config.zone_aware_lb_config`. | ||
| *Pros:* Minimal CP logic; leverages Envoy. *Cons:* Requires documentation and policy extension. | ||
|
|
||
| ## Decision Outcome | ||
|
|
||
| Chosen option: **Expose Envoy’s `zone_aware_lb_config` directly** as a new `ZoneAware` mode in `MeshLoadBalancingStrategy`. | ||
|
|
||
| ### Implementation Details | ||
|
|
||
| - Extend `MeshLoadBalancingStrategy.spec.default.localityAwareness.localZone` to accept: | ||
|   - `type: ZoneAware` | ||
|   - `minClusterSize`: mapped to Envoy’s `zone_aware_lb_config.min_cluster_size` | ||
|   - `zoneIdentifier`: tag key for zone (e.g., `topology/availability-zone`) | ||
| - Kuma CP will inject into each dataplane’s bootstrap: | ||
|
||
| ```yaml | ||
| cluster_manager: | ||
|   local_cluster_name: {{ service_name }} | ||
| node: | ||
|   locality: | ||
|     zone: {{ zone_tag }} | ||
|
||
| ``` | ||
| - Kuma CP will inject into each relevant cluster: | ||
| ```yaml | ||
| common_lb_config: | ||
|   zone_aware_lb_config: | ||
|     min_cluster_size: {{ minClusterSize }} | ||
| ``` | ||
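To make the mapping from policy fields to Envoy configuration concrete, here is a minimal sketch that builds the two fragments above as plain dictionaries. The helper name `zone_aware_fragments` and its parameters are hypothetical and only illustrate the data flow; the Envoy field names are the ones shown in the snippets above. Kuma's control plane would do this in its own code, not in Python.

```python
import json


def zone_aware_fragments(service_name, zone_tag, min_cluster_size):
    """Return (bootstrap_fragment, cluster_fragment) for zone aware routing.

    Hypothetical helper: mirrors the YAML fragments in the MADR, nothing more.
    """
    if min_cluster_size < 0:
        raise ValueError("min_cluster_size must not be negative")

    # Injected once per dataplane: tells Envoy which cluster is "local"
    # and which zone this proxy runs in.
    bootstrap = {
        "cluster_manager": {"local_cluster_name": service_name},
        "node": {"locality": {"zone": zone_tag}},
    }
    # Injected into each cluster the policy applies to.
    cluster = {
        "common_lb_config": {
            "zone_aware_lb_config": {"min_cluster_size": min_cluster_size}
        }
    }
    return bootstrap, cluster


if __name__ == "__main__":
    bootstrap, cluster = zone_aware_fragments("frontend", "az-a", 1)
    print(json.dumps(bootstrap, indent=2))
    print(json.dumps(cluster, indent=2))
```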
|
|
||
| ### Example Configuration | ||
|
|
||
| ```yaml | ||
| type: MeshLoadBalancingStrategy | ||
| name: zone-aware | ||
| mesh: default | ||
| spec: | ||
|   targetRef: | ||
|     kind: Dataplane | ||
|     labels: | ||
|       app: frontend | ||
|   to: | ||
|     - targetRef: | ||
|         kind: MeshService | ||
|         name: backend | ||
|         sectionName: http | ||
|       default: | ||
|         loadBalancer: | ||
|
Contributor
IIUC

    type: MeshLoadBalancingStrategy
    name: local-zone-affinity-backend
    mesh: default
    spec:
      to:
        - targetRef:
            kind: MeshService
            name: backend
          default:
            localityAwareness:
              localZone:
                affinityTags:
                  - key: kubernetes.io/hostname
                    weight: 9000
                  - key: topology.kubernetes.io/zone
                    weight: 9

Your suggested API change doesn't reflect the fact that the two approaches are mutually exclusive, i.e. we'd have to add extra validation to reject policies like this (because it's not possible to use zone_aware_lb and locality_weighted_lb at the same time):

    type: MeshLoadBalancingStrategy
    name: local-zone-affinity-backend
    mesh: default
    spec:
      to:
        - targetRef:
            kind: MeshService
            name: backend
          default:
            loadBalancer:
              type: ZoneAware
              zoneAware:
                # Minimum total endpoints before zone-aware logic activates
                minClusterSize: 1
                # Tag used on dataplanes to identify their zone
                zoneIdentifier: topology/availability-zone
            localityAwareness:
              localZone:
                affinityTags:
                  - key: kubernetes.io/hostname
                    weight: 9000
                  - key: topology.kubernetes.io/zone
                    weight: 9

I think MeshLoadBalancingStrategy policy API should make it clear to users that

Contributor, Author
@lobkovilya I was also concerned about this issue, but since the existing policy structure does not have generic keys like 'type' that take enum-like values, I'm not able to solve this issue.

Contributor
@lobkovilya any suggestions here?

Contributor, Author
@lobkovilya any update here? |
||
|           type: ZoneAware | ||
|           zoneAware: | ||
|             # Minimum total endpoints before zone-aware logic activates | ||
|             minClusterSize: 1 | ||
|             # Tag used on dataplanes to identify their zone | ||
|             zoneIdentifier: topology/availability-zone | ||
| ``` | ||
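The review thread on this example points out that zone-aware load balancing and `localityAwareness.localZone.affinityTags` cannot be combined, so the policy would need extra validation. A minimal sketch of such a check follows, operating on the `default` section of a `to` entry parsed into a dictionary. The function name and error wording are hypothetical, and this is an illustration of the rule only, not Kuma's validator.

```python
def validate_default_conf(default):
    """Return a list of validation errors for one `to[].default` section.

    Hypothetical check: rejects a ZoneAware load balancer combined with
    local-zone affinity tags, which the reviewers call mutually exclusive.
    """
    errors = []
    load_balancer = default.get("loadBalancer") or {}
    locality = default.get("localityAwareness") or {}
    affinity_tags = (locality.get("localZone") or {}).get("affinityTags") or []

    if load_balancer.get("type") == "ZoneAware" and affinity_tags:
        errors.append(
            "loadBalancer.type ZoneAware cannot be combined with "
            "localityAwareness.localZone.affinityTags"
        )
    return errors


if __name__ == "__main__":
    conflicting = {
        "loadBalancer": {
            "type": "ZoneAware",
            "zoneAware": {
                "minClusterSize": 1,
                "zoneIdentifier": "topology/availability-zone",
            },
        },
        "localityAwareness": {
            "localZone": {
                "affinityTags": [
                    {"key": "kubernetes.io/hostname", "weight": 9000},
                    {"key": "topology.kubernetes.io/zone", "weight": 9},
                ]
            }
        },
    }
    for error in validate_default_conf(conflicting):
        print(error)
```

A check like this only reports the conflict after the fact; it does not address the reviewer's broader point that the API shape itself should make the exclusivity visible.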
|
|
||
| ## Consequences | ||
|
|
||
| - **Backward compatibility:** Existing weighted strategies remain unchanged. | ||
| - **Policy changes:** Users opt in via `type: ZoneAware`. | ||
| - **Documentation:** Need examples, guides, and upgrade notes. | ||
what about
?

also do you know what @lukidzi wanted to achieve with this `subZoneIdentifier` mentioned in the issue?

@slonka 'upstream.zone_routing.force_local_zone' is a runtime config, and as far as I understand, runtime config does not support cluster-level configuration.
Is that correct, or am I missing something?

I'm also not sure why @lukidzi suggested subzoneid

@lukidzi - can you weigh in here?

@lukidzi any update here?