Change NodeSelectionDelegate signature to allow more flexible strategies#13
Merged
Change NodeSelectionDelegate signature to allow more flexible strategies#13
Conversation
Signed-off-by: Marco Pracucci <marco@pracucci.com>
1 task
colega
reviewed
Nov 26, 2025
colega
approved these changes
Nov 26, 2025
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
pracucci
added a commit
to grafana/dskit
that referenced
this pull request
Nov 26, 2025
…ers but no alive bridges (#800) **What this PR does**: When zone-aware routing is enabled, if there is a zone with members but no alive bridges we end up in a network partitioning state. I think we can make the routing logic more resilient by checking for this specific condition, and internally disable the zone-aware routing if we detect a zone without alive bridges. The new logic is definitely slower than the previous one, but in absolute terms I don't think it's prohibitive. `SelectNodes()` gets called once every gossip interval (200ms) and takes 0.17ms with 10K nodes and 2 zones (10K nodes is a pretty large scale): ``` goos: darwin goarch: arm64 pkg: github.com/grafana/dskit/kv/memberlist cpu: Apple M3 Pro BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11 7033 170166 ns/op 81922 B/op 1 allocs/op BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11 6940 170834 ns/op 81922 B/op 1 allocs/op BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11 6925 171649 ns/op 81922 B/op 1 allocs/op BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11 6906 213222 ns/op 81922 B/op 1 allocs/op BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11 6957 172906 ns/op 81922 B/op 1 allocs/op PASS ok github.com/grafana/dskit/kv/memberlist 9.339s ``` This PR is based on grafana/memberlist#13. **Which issue(s) this PR fixes**: N/A **Checklist** - [x] Tests updated --------- Signed-off-by: Marco Pracucci <marco@pracucci.com>
4 tasks
pracucci
added a commit
to grafana/mimir
that referenced
this pull request
Nov 28, 2025
…t no alive bridges (#13664) #### What this PR does Update memberlist and dskit to get: - grafana/memberlist#13 - grafana/dskit#800 - grafana/dskit#811 I've updated `MimirMemberlistBridgeZoneUnavailable` alert `for` duration because after this PR having no running bridges in a zone will be less severe. #### Which issue(s) this PR fixes or relates to N/A #### Checklist - [ ] Tests updated. - [ ] Documentation added. - [x] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry is not needed, please add the `changelog-not-needed` label to the PR. - [ ] [`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md) updated with experimental features. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Update memberlist/dskit and etcd stack, adopt new memberlist node-selection API, improve networking/transport, and relax the bridge-unavailable alert timing. > > - **Memberlist / Gossip**: > - Export `NodeState` and replace `NodeSelectionDelegate.SelectNode` with `SelectNodes(selected, preferred)`; update k-random node selection accordingly. > - **Etcd (vendor → v3.6.x) / Client**: > - Upgrade APIs/protos (new fields like `hash_revision`, snapshot `version`, status `storageVersion/dbSizeQuota/downgradeInfo`). > - Add `Maintenance.Downgrade` and `SnapshotWithVersion`; refine retry/backoff, auth token refresh, error handling (`ContextError`). > - TLS/transport tweaks (default min TLS 1.2, local addr support). > - **Networking/Transport**: > - Update `go-proxyproto` (new `ConnPolicyFunc`, improved Accept/reader, write path) and keep-alive handling with OS-specific fixes; socket option handling for non-Linux. > - **Alerts**: > - Relax `MimirMemberlistBridgeZoneUnavailable` alert `for` duration. > - **Misc**: > - Vendor bumps (dskit, gomemcache) and related code adjustments. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 7568146. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Marco Pracucci <marco@pracucci.com> Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
I want to improve the zone-aware routing implemented in dskit but to do so I need to have the full view over the available nodes when selecting the ones to gossip to. For this reason, in this PR I'm changing
NodeSelectionDelegatesignature to take in input the entire list of nodes, and then filter them, allowing to select 1 (optional) preferred node.To avoid introducing a performance regression when there's no delegate, I decided to expose
NodeState(here I attempted another approach without exposing it, but it was introducing a perf regression inkRandomNodes()when there was no delegate or the delegate was a no-op).In this PR,
kRandomNodes()is still blazing fast:Look at grafana/dskit#800 to see why I need this change.