vttablet: add proxy support to FullStatus RPC in tabletmanager #19058

base: main
Conversation
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified
- Backward compatibility
Signed-off-by: Tim Vaillancourt <[email protected]>

Force-pushed from 6fb2987 to 6da5c3e
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #19058      +/-   ##
==========================================
- Coverage   69.89%   69.88%   -0.02%
==========================================
  Files        1612     1613       +1
  Lines      215826   216058     +232
==========================================
+ Hits       150857   150997     +140
- Misses      64969    65061      +92
```

☔ View full report in Codecov by Sentry.
```go
func (s *server) proxyFullStatus(ctx context.Context, request *tabletmanagerdatapb.FullStatusRequest) (*replicationdatapb.FullStatus, error) {
	if s.tmc == nil {
		return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "no proxy tabletmanger client")
```
The init() func of this package creates the *server with a tmclient so this situation is kind of impossible
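To make the handler under review easier to follow, here is a rough sketch of what the forwarding half of a proxy path like this could look like. It is not the PR's actual code: `request.TabletAlias` (the proxy target) and `s.ts` (a topo server handle on the gRPC tabletmanager server) are assumed names; only `s.tmc` and `ProxyTimeoutMs` appear in the diff.

```go
// Hypothetical sketch of the forwarding path; not the PR's actual code.
// request.TabletAlias and s.ts are assumed names used for illustration.
ti, err := s.ts.GetTablet(ctx, request.TabletAlias) // resolve the target tablet from the topo
if err != nil {
	return nil, err
}
// Bound the proxied call by the caller-supplied timeout from the request.
ctx, cancel := context.WithTimeout(ctx, time.Duration(request.ProxyTimeoutMs)*time.Millisecond)
defer cancel()
// Forward the FullStatus call to the target tablet via the tabletmanager client.
return s.tmc.FullStatus(ctx, ti.Tablet)
```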
```proto
// proxy_timeout_ms specifies the maximum number of milliseconds to wait for a
// proxied request to complete. Must be less than topo.RemoteOperationTimeout
// on the tablet proxying the request.
uint64 proxy_timeout_ms = 2;
```
Why would this not be a vttime.Duration?
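For comparison, a sketch of what the handler-side conversion might look like if the field were a `vttime.Duration`. This is an assumption for illustration only; the PR as written uses a raw `uint64` of milliseconds.

```go
// Hypothetical alternative: `vttime.Duration proxy_timeout = 2;` instead of
// uint64 proxy_timeout_ms. vttime.Duration carries seconds + nanos.
proxyTimeout := time.Duration(request.ProxyTimeout.Seconds)*time.Second +
	time.Duration(request.ProxyTimeout.Nanos)*time.Nanosecond
if proxyTimeout > topo.RemoteOperationTimeout {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "proxy timeout %s exceeds %s", proxyTimeout, topo.RemoteOperationTimeout)
}
```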
```go
// disallow timeouts larger than the local remote operation timeout
if request.ProxyTimeoutMs > uint64(topo.RemoteOperationTimeout.Milliseconds()) {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "cannot set a proxy timeout ms greater than %d", topo.RemoteOperationTimeout.Milliseconds())
}
```
I'm unsure of the value of allowing the caller to specify a timeout rather than using ~ the topo.RemoteOperationTimeout / 2?
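As a sketch of that suggestion (an assumption, not the PR's code), the proxying side could derive the deadline locally instead of trusting a caller-supplied value:

```go
// Derive the proxy deadline on the proxying tablet so it always fits inside
// that tablet's own topo.RemoteOperationTimeout, with headroom for the reply.
proxyTimeout := topo.RemoteOperationTimeout / 2
ctx, cancel := context.WithTimeout(ctx, proxyTimeout)
defer cancel()
```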
This seems OK to me, but in general I do not like intentionally adding dead/unused code. We are supposed to remove dead/unused code. 🙂
I would like to see you use this new work, in vtorc, in the same PR in which you create the RPC changes. This does the following:
- Avoids new dead code
- Ensures that the underlying building blocks (RPC work in this case) actually work as needed in order to meet the larger feature/product goal. Without that we should have no confidence that this is what we need, and having to refactor it later will make things harder. Or maybe we don't even end up using it for a variety of reasons (see point 1).
Can you please implement the actual feature that you wish to add to Vitess in this PR along with the lower level RPC changes and anything else that is needed in order to implement the feature request? Then we can understand the problem, the reasons why this is an optimal solution, and confirm/demonstrate that it actually works as desired and solves the problem.
Thanks!
P.S. I think the general issue that we are trying to address here (which I have not seen laid out so I'm not sure) would typically be resolved with something like a gossip protocol. The caller having to decide if and when to proxy a request, and where to proxy it to, doesn't immediately feel to me like the best solution. But then again, I also don't see where we've clearly described the problem we want to solve and why we think this is the best solution. 🙂
@mattlord I think the issue summary breaks down what we're trying to solve here; it's not very complex, or at least I would struggle to explain it in more detail. The TL;DR is that VTOrc sits in a single location and cannot be certain that a connection failure means something is actually down or unreachable. The driver for this problem is explained in a planetscale-internal issue I cannot link; I will copy it into an issue if that helps review the change.

Sure, I'd prefer not to create mega/monolith PRs, and I've also seen others take this approach to implementing a feature, so I thought it was acceptable. I can give this a shot in one huge PR 👍

Yes, for quite some time I've been planning to implement a gossip protocol within shards to solve many problems at once (not just VTOrc). This wasn't a popular idea when it was originally proposed, but I'm glad to hear it come up from another maintainer. I have started this in a few POC branches, some of which I'm struggling to find now (on a 🚋 to FOSDEM!).

At least in a transition phase to that model, the complexity of what is being solved here is no different to me: if we cannot reach a tablet, we need to ask another whether it's really dead. We can do that using a proxied call or by asking the gossip state of another tablet, but either way we need to ask "someone else". We could move to relying purely on the gossip protocol and ask only a single tablet from each shard for the state of the world, but that would be a massive change I don't think is realistic to do in a big-bang approach.

The reason I approached this issue as a simple proxy request, for now, is that it felt like a simpler, iterative approach vs. tying the gossip system I want to implement to this problem. But I would be on board with tackling it all at once.

The benefit this PR's approach has over a gossip protocol, however, is that a proxied request should be more accurate. A gossip system is asynchronous, meaning if you ask "is tablet X alive?", the state answering that question could be as stale as the last poll/push/update (even if only by milliseconds). In the synchronous proxy approach this PR partially implements, you get the exact current state without any drift, which is how VTOrc currently makes decisions. I'm not saying the asynchronous nature of a gossip system is a dealbreaker, but it IS technically less accurate; its state of other tablets is eventually consistent.

An RFC is on the way 👍
Description

Add support for proxying `FullStatus` RPCs, in order for VTOrc to gain the ability to validate certain problems using many cells (locations) and detect network partitions. This will happen in future PRs (possibly v24); for now, having this RPC in v24 will be helpful.

Support for this is intentionally not added to `vtctldclient GetFullStatus`, because it's intended for internal VTOrc usage and I can't think of a good use case for someone at a CLI to need this.

To add some safety, a request can only be proxied once (the `ProxiedBy` flag is added by the proxying tmserver) and you cannot proxy to "yourself", which would cause an infinite loop. Finally, a proxy timeout greater than the remote operation timeout of the proxying server returns an error. e2e tests are added to confirm these safety nets and that the proxying succeeds (see the sketch below).

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure
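As a rough illustration of the safety nets described in the Description above (proxy-once, no self-proxying, and the timeout cap), here is a hedged sketch. It is not the PR's exact code; the field and receiver names flagged in the comments are assumptions.

```go
// Sketch of the safety nets described above. Assumptions that may differ from
// the real PR: ProxiedBy records the proxying tablet's alias (it may simply be
// a flag), request.TabletAlias names the proxy target, and s.tabletAlias is
// the local tablet's alias. The timeout check mirrors the diff shown earlier.
if request.ProxiedBy != nil {
	// A request may only be proxied a single time.
	return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "request was already proxied")
}
if topoproto.TabletAliasEqual(request.TabletAlias, s.tabletAlias) {
	// Proxying to ourselves would cause an infinite loop.
	return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "cannot proxy a request to self")
}
if request.ProxyTimeoutMs > uint64(topo.RemoteOperationTimeout.Milliseconds()) {
	// The proxied call must fit inside the local remote-operation timeout.
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "cannot set a proxy timeout ms greater than %d", topo.RemoteOperationTimeout.Milliseconds())
}
```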