Skip to content

Change NodeSelectionDelegate signature to allow more flexible strategies#13

Merged
pracucci merged 2 commits intomasterfrom
more-flexible-node-selection-v2
Nov 26, 2025
Merged

Change NodeSelectionDelegate signature to allow more flexible strategies#13
pracucci merged 2 commits intomasterfrom
more-flexible-node-selection-v2

Conversation

@pracucci
Copy link

@pracucci pracucci commented Nov 26, 2025

Description

I want to improve the zone-aware routing implemented in dskit but to do so I need to have the full view over the available nodes when selecting the ones to gossip to. For this reason, in this PR I'm changing NodeSelectionDelegate signature to take in input the entire list of nodes, and then filter them, allowing to select 1 (optional) preferred node.

To avoid introducing a performance regression when there's no delegate, I decided to expose NodeState (here I attempted another approach without exposing it, but it was introducing a perf regression in kRandomNodes() when there was no delegate or the delegate was a no-op).

In this PR, kRandomNodes() is still blazing fast:

goos: darwin
goarch: arm64
pkg: github.com/hashicorp/memberlist
cpu: Apple M3 Pro
BenchmarkKRandomNodes/without_delegate-11         	 9570109	       109.4 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/without_delegate-11         	10710229	       109.7 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/without_delegate-11         	10996206	       109.7 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/without_delegate-11         	10831244	       110.0 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/without_delegate-11         	10894020	       108.6 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/without_delegate-11         	11064045	       108.4 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	12300616	        97.49 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	12155103	        97.97 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	11920869	        99.25 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	12310774	        99.01 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	12070784	        98.64 ns/op	     288 B/op	       1 allocs/op
BenchmarkKRandomNodes/with_delegate-11            	11979469	       152.2 ns/op	     288 B/op	       1 allocs/op
PASS
ok  	github.com/hashicorp/memberlist	16.493s
PASS
ok  	github.com/hashicorp/memberlist/internal/retry	0.217s

Look at grafana/dskit#800 to see why I need this change.

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
@pracucci pracucci merged commit 6f9f62a into master Nov 26, 2025
11 of 13 checks passed
pracucci added a commit to grafana/dskit that referenced this pull request Nov 26, 2025
…ers but no alive bridges (#800)

**What this PR does**:

When zone-aware routing is enabled, if there is a zone with members but
no alive bridges we end up in a network partitioning state. I think we
can make the routing logic more resilient by checking for this specific
condition, and internally disable the zone-aware routing if we detect a
zone without alive bridges.

The new logic is definitely slower than the previous one, but in
absolute terms I don't think it's prohibitive. `SelectNodes()` gets
called once every gossip interval (200ms) and takes 0.17ms with 10K
nodes and 2 zones (10K nodes is a pretty large scale):

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/dskit/kv/memberlist
cpu: Apple M3 Pro
BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11    	    7033	    170166 ns/op	   81922 B/op	       1 allocs/op
BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11    	    6940	    170834 ns/op	   81922 B/op	       1 allocs/op
BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11    	    6925	    171649 ns/op	   81922 B/op	       1 allocs/op
BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11    	    6906	    213222 ns/op	   81922 B/op	       1 allocs/op
BenchmarkZoneAwareNodeSelectionDelegate_SelectNodes-11    	    6957	    172906 ns/op	   81922 B/op	       1 allocs/op
PASS
ok  	github.com/grafana/dskit/kv/memberlist	9.339s
```

This PR is based on grafana/memberlist#13.

**Which issue(s) this PR fixes**:

N/A

**Checklist**
- [x] Tests updated

---------

Signed-off-by: Marco Pracucci <marco@pracucci.com>
pracucci added a commit to grafana/mimir that referenced this pull request Nov 28, 2025
…t no alive bridges (#13664)

#### What this PR does

Update memberlist and dskit to get:

- grafana/memberlist#13
- grafana/dskit#800
- grafana/dskit#811

I've updated `MimirMemberlistBridgeZoneUnavailable` alert `for` duration
because after this PR having no running bridges in a zone will be less
severe.

#### Which issue(s) this PR fixes or relates to

N/A

#### Checklist

- [ ] Tests updated.
- [ ] Documentation added.
- [x] `CHANGELOG.md` updated - the order of entries should be
`[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry
is not needed, please add the `changelog-not-needed` label to the PR.
- [ ]
[`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md)
updated with experimental features.


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Update memberlist/dskit and etcd stack, adopt new memberlist
node-selection API, improve networking/transport, and relax the
bridge-unavailable alert timing.
> 
> - **Memberlist / Gossip**:
> - Export `NodeState` and replace `NodeSelectionDelegate.SelectNode`
with `SelectNodes(selected, preferred)`; update k-random node selection
accordingly.
> - **Etcd (vendor → v3.6.x) / Client**:
> - Upgrade APIs/protos (new fields like `hash_revision`, snapshot
`version`, status `storageVersion/dbSizeQuota/downgradeInfo`).
> - Add `Maintenance.Downgrade` and `SnapshotWithVersion`; refine
retry/backoff, auth token refresh, error handling (`ContextError`).
>   - TLS/transport tweaks (default min TLS 1.2, local addr support).
> - **Networking/Transport**:
> - Update `go-proxyproto` (new `ConnPolicyFunc`, improved
Accept/reader, write path) and keep-alive handling with OS-specific
fixes; socket option handling for non-Linux.
> - **Alerts**:
>   - Relax `MimirMemberlistBridgeZoneUnavailable` alert `for` duration.
> - **Misc**:
>   - Vendor bumps (dskit, gomemcache) and related code adjustments.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
7568146. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants