Skip to content

Conversation

@mhamza15
Copy link
Collaborator

@mhamza15 mhamza15 commented Dec 9, 2025

Description

This PR is a continuation of #18552. It's been slightly refactored to match the changes introduced in #18787.

This PR introduces a "session" tablet balancer, which will direct a session to the same tablet for the duration of the session. This protects against cases where a session reads a value from one tablet, and then gets routed to a different, higher-lag tablet and reads an older value.

The session balancer initially used consistent hashing, but was later switched to use rendezvous hashing for simplicity. The tradeoff there is that rendezvous hashing has an O(n) runtime, n being the number of tablets per target (keyspace + shard + tablet type + cell). As the number of tablets per target is expected to be small, the performance should be reasonable.

In order for the tablet balancer to have access to the session UUID, the QueryService interface was updated to include a new Session interface, which provides the session UUID and the execute options. The old options parameter was removed and is now pulled from the session. As a result, you'll see a lot of changes to all the implementations of QueryService to fit the new methods, and to wrap execute options in sessions in cases where a session isn't already provided (mostly tests).

Benchmark using sysbench oltp_read_only (8 threads, 3 replicas per cell):

Balancer mode p50 latency (ms) p95 latency (ms) qps
Cell 48.34 121.08 2011.13
Session 51.02 (+5.5%) 118.92 (-1.8%) 2093.08 (+4.1%)
Random 56.84 (+17.6%) 132.49 (+9.4%) 1966.61 (-2.2%)

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Add `PickOpts` which allow balancers to accept options specific to their
implementation. This allows the `Pick` signature not to get overly long
as implementations and their options are added.

Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
This reverts commit bd6a1b792d69b238df90b5bf979e0911087ada16.

Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Update the signature of the wrapper func to accept a new `WrapOpts`
struct, which currently contains `ExecuteOptions`, which now contains
the session UUID so that it can be passed into `Pick` for the balancer.

Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Comment on lines +98 to +99
return false
}, 10*time.Millisecond, 1*time.Millisecond, "Connection should switch to a different tablet")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't break the serializable guarantees of the new mode?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. This test case mimics the case a tablet goes down/goes out of serving, ensuring that the connections move over accordingly. The alternative is to error until the tablet is back, but that is likely less than ideal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we shouldn't return an error forever, we should only do so once. The client should know that their session — which provided serializable consistency — has been terminated and that they need to reconnect in order to get a new view with the same serializable property. We can't silently move their session to another replica that could be minutes behind the previous one IMO. Doesn't that defeat the purpose here, since you can never really count on the core property holding true and have to maintain checks and handling for going back in time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can ensure that the newly selected replica is at least at the same GTID snapshot/position that the old one was, then it would be fine (it's OK to go forward in time).

Or do you think I'm missing or misunderstanding something here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope you're not missing anything, those are all fair points. I suppose it comes down to how much of a guarantee we'd like to make to clients. In my head at least, I was envisioning it being a large step up from the normal selection logic of a random tablet in the local cell, but not quite a 100% guarantee of consistency across the board. Even if we do stick with the approach I've already taken, we'd definitely need to clarify in the docs that in the case of tablets being added or removed, there is a possibility to break the consistency guarantee.

Do you feel strongly about "going all the way" in terms of consistency? I'll give it some deeper thought on my end as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that this feature came from your own experience? If so, what would you prefer?

As long as we clearly document it, I think it would be OK. It would also be something we could change later on.

Also may be worth bringing up with the community in the Vitess Slack to see what others think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a desire for this behavior at GitHub from Arthur and I, but I wouldn't quite say there was an explicit need for this, more of a nice to have. I'd imagine we would have initially sacrificed the additional consistency for the ease of adoption/less disruption that the "automatic connection shifting" would've afforded us. I think this can be a start, with eyes on guaranteeing more consistency if the need arises in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the docs: vitessio/website@f207553

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At GitHub (and I'm sure at many other places), queries made outside of transactions are automatically retried a few times in case of a failure at the connection level.

This means that the application is actually unaware that a query failed, and due to the auto-reconnect that's happening, the consistency guarantee is already sacrificed for replica connections in these cases for better availability / transparent error handling.

A host going down for maintenance or unexpectedly becoming unavailable was a pretty rare occurrence when compared to the frequency of queries being executed.

Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 requested a review from mattlord December 16, 2025 14:59
Copy link
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can change the options as noted, that will get rid of a lot more code changes here. Let me know what you think. It's looking good so far!

// by a weighted random sample so that over time the system will achieve the
// desired balanced allocation.
func (b *flowBalancer) Pick(target *querypb.Target, tablets []*discovery.TabletHealth) *discovery.TabletHealth {
func (b *flowBalancer) Pick(target *querypb.Target, tablets []*discovery.TabletHealth, _ PickOpts) *discovery.TabletHealth {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this pattern for these options?

vitess/go/vt/topo/locks.go

Lines 297 to 344 in 749fe54

// lockOptions configure a Lock call. lockOptions are set by the LockOption
// values passed to the lock functions.
type lockOptions struct {
lockType LockType
ttl time.Duration
}
// LockOption configures how we perform the locking operation.
type LockOption interface {
apply(*lockOptions)
}
// funcLockOption wraps a function that modifies lockOptions into an
// implementation of the LockOption interface.
type funcLockOption struct {
f func(*lockOptions)
}
func (flo *funcLockOption) apply(lo *lockOptions) {
flo.f(lo)
}
func newFuncLockOption(f func(*lockOptions)) *funcLockOption {
return &funcLockOption{
f: f,
}
}
// WithTTL allows you to specify how long the underlying topo server
// implementation should hold the lock before releasing it — even if the caller
// has not explicitly released it. This provides a way to override the global
// ttl values that are set via --topo-consul-lock-session-ttl and
// --topo-etcd-lease-ttl.
// Note: This option is ignored by the ZooKeeper implementation as it does not
// support TTLs.
func WithTTL(ttl time.Duration) LockOption {
return newFuncLockOption(func(o *lockOptions) {
o.ttl = ttl
})
}
// WithType determines the type of lock we take. The options are defined
// by the LockType type.
func WithType(lt LockType) LockOption {
return newFuncLockOption(func(o *lockOptions) {
o.lockType = lt
})
}

Then these functions can take a variadic number of options and those that don't use any do not need to be modified. This is always preferable when passing options is not the norm.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I like the functional options pattern

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f63dcd6

mhamza15 and others added 2 commits December 16, 2025 10:28
Co-authored-by: Matt Lord <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
@@ -0,0 +1,62 @@
/*
Copyright 2024 The Vitess Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New files should have the current year in the Copyright.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn Claude 👊 . Fixed!

Copy link
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work on this, @mhamza15 ! ❤️

Comment on lines +183 to +186
// Pick should prefer local cell
picked1 := b.Pick(target, tablets, WithSessionUUID("a"))
require.NotNil(t, picked1)
require.Equal(t, "local", picked1.Target.Cell)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth doing this in a loop 10 times or something, no? To better test the preference.

Copy link
Collaborator Author

@mhamza15 mhamza15 Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep good call, I did it in the end to end tests but it looks like I missed it in the unit tests.

@mhamza15
Copy link
Collaborator Author

Website docs: vitessio/website#2043

Signed-off-by: Mohamed Hamza <[email protected]>
@arthurschreiber
Copy link
Member

@mhamza15 Does this PR need to be backported to 22.0 or was that label added by mistake? 😅

@mhamza15
Copy link
Collaborator Author

mhamza15 commented Dec 18, 2025

@mhamza15 Does this PR need to be backported to 22.0 or was that label added by mistake? 😅

That might've been a mistake, @mattlord can confirm. Entirely new feature so no need to backport IMO

@mattlord mattlord removed the Backport to: release-22.0 Needs to be backport to release-22.0 label Dec 18, 2025
@mattlord
Copy link
Member

@mhamza15 Does this PR need to be backported to 22.0 or was that label added by mistake? 😅

That might've been a mistake, @mattlord can confirm. Entirely new feature so no need to backport IMO

Yeah, that was a mistake. I didn't even notice 🤦‍♂️

@mhamza15 mhamza15 removed the NeedsWebsiteDocsUpdate What it says label Dec 18, 2025
@mhamza15 mhamza15 self-assigned this Dec 18, 2025
@mhamza15 mhamza15 merged commit 965822a into vitessio:main Dec 18, 2025
103 of 109 checks passed
@mhamza15 mhamza15 deleted the session-balancer branch December 18, 2025 16:45
@promptless
Copy link
Contributor

promptless bot commented Dec 18, 2025

📝 Documentation updates detected!

Updated existing suggestion: Add changelog entry for session balancer mode

mhamza15 added a commit to mhamza15/vitess that referenced this pull request Dec 23, 2025
vitessio#19007 updated the `QueryService`
interface with a new `Session` interface that contains the session UUID
and the existing `ExecuteOptions`. Since the `options` parameter was now
redundant and can be pulled from the `Session` directly, it was removed.
This caused a lot of unnecessary changes, but also introduced a race
condition in `scatter_conn.go`:

```diff
	allErrors := stc.multiGoTransaction(
		// ...
		func(rs *srvtopo.ResolvedShard, i int, info *shardActionInfo) (*shardActionInfo, error) {
			// ...

			// Goroutine A reading session
			if session != nil && session.Session != nil {
				opts = session.Options
			}

			// Goroutine B writing session
			if opts == nil && fetchLastInsertID {
				opts = &querypb.ExecuteOptions{FetchLastInsertId: fetchLastInsertID}
+				session = econtext.NewSafeSession(&vtgatepb.Session{Options: opts})
			}

			// ...
		}
	)
```

This PR removes `GetOptions` from the new `Session` interface and adds
back the `options` parameter to the `QueryService` interface. This
allows each goroutine in that `scatter_conn.go` logic to keep its own
local `opts` and not overwrite the shared `session`, reverting the logic
to how it was before.

Signed-off-by: Mohamed Hamza <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Random Tablet Selection causes "time inconsistent" data reads

4 participants