vttablet: add proxy support to FullStatus RPC in tabletmanager #19058

base: main
Conversation
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified
- Backward compatibility
Signed-off-by: Tim Vaillancourt <[email protected]>

Force-pushed from 6fb2987 to 6da5c3e
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #19058      +/-   ##
==========================================
- Coverage   69.89%   69.88%   -0.02%
==========================================
  Files        1612     1613       +1
  Lines      215826   216058     +232
==========================================
+ Hits       150857   150997     +140
- Misses      64969    65061      +92
```

☔ View full report in Codecov by Sentry.
```go
func (s *server) proxyFullStatus(ctx context.Context, request *tabletmanagerdatapb.FullStatusRequest) (*replicationdatapb.FullStatus, error) {
	if s.tmc == nil {
		return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "no proxy tabletmanger client")
```
The init() func of this package creates the *server with a tmclient so this situation is kind of impossible
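To make the handler under review easier to follow, here is a rough sketch of what the forwarding half of a proxy path like this could look like. It is not the PR's actual code: `request.TabletAlias` (the proxy target) and `s.ts` (a topo server handle on the gRPC tabletmanager server) are assumed names; only `s.tmc` and `ProxyTimeoutMs` appear in the diff.

```go
// Hypothetical sketch of the forwarding path; not the PR's actual code.
// request.TabletAlias and s.ts are assumed names used for illustration.
ti, err := s.ts.GetTablet(ctx, request.TabletAlias) // resolve the target tablet from the topo
if err != nil {
	return nil, err
}
// Bound the proxied call by the caller-supplied timeout from the request.
ctx, cancel := context.WithTimeout(ctx, time.Duration(request.ProxyTimeoutMs)*time.Millisecond)
defer cancel()
// Forward the FullStatus call to the target tablet via the tabletmanager client.
return s.tmc.FullStatus(ctx, ti.Tablet)
```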
```proto
// proxy_timeout_ms specifies the maximum number of milliseconds to wait for a
// proxied request to complete. Must be less than topo.RemoteOperationTimeout
// on the tablet proxying the request.
uint64 proxy_timeout_ms = 2;
```
Why would this not be a vttime.Duration?
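For comparison, a sketch of what the handler-side conversion might look like if the field were a `vttime.Duration`. This is an assumption for illustration only; the PR as written uses a raw `uint64` of milliseconds.

```go
// Hypothetical alternative: `vttime.Duration proxy_timeout = 2;` instead of
// uint64 proxy_timeout_ms. vttime.Duration carries seconds + nanos.
proxyTimeout := time.Duration(request.ProxyTimeout.Seconds)*time.Second +
	time.Duration(request.ProxyTimeout.Nanos)*time.Nanosecond
if proxyTimeout > topo.RemoteOperationTimeout {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "proxy timeout %s exceeds %s", proxyTimeout, topo.RemoteOperationTimeout)
}
```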
```go
// disallow timeouts larger than the local remote operation timeout
if request.ProxyTimeoutMs > uint64(topo.RemoteOperationTimeout.Milliseconds()) {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "cannot set a proxy timeout ms greater than %d", topo.RemoteOperationTimeout.Milliseconds())
}
```
I'm unsure of the value of allowing the caller to specify a timeout rather than using ~ the topo.RemoteOperationTimeout / 2?
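As a sketch of that suggestion (an assumption, not the PR's code), the proxying side could derive the deadline locally instead of trusting a caller-supplied value:

```go
// Derive the proxy deadline on the proxying tablet so it always fits inside
// that tablet's own topo.RemoteOperationTimeout, with headroom for the reply.
proxyTimeout := topo.RemoteOperationTimeout / 2
ctx, cancel := context.WithTimeout(ctx, proxyTimeout)
defer cancel()
```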
This seems OK to me, but in general I do not like intentionally adding dead/unused code. We are supposed to remove dead/unused code. 🙂
I would like to see you use this new work, in vtorc, in the same PR in which you create the RPC changes. This does the following:
- Avoids new dead code
- Ensures that the underlying building blocks (RPC work in this case) actually work as needed in order to meet the larger feature/product goal. Without that we should have no confidence that this is what we need, and having to refactor it later will make things harder. Or maybe we don't even end up using it for a variety of reasons (see point 1).
Can you please implement the actual feature that you wish to add to Vitess in this PR along with the lower level RPC changes and anything else that is needed in order to implement the feature request? Then we can understand the problem, the reasons why this is an optimal solution, and confirm/demonstrate that it actually works as desired and solves the problem.
Thanks!
P.S. I think the general issue that we are trying to address here (which I have not seen laid out so I'm not sure) would typically be resolved with something like a gossip protocol. The caller having to decide if and when to proxy a request, and where to proxy it to, doesn't immediately feel to me like the best solution. But then again, I also don't see where we've clearly described the problem we want to solve and why we think this is the best solution. 🙂
@mattlord I think the issue summary breaks down what we're trying to solve here; it's not very complex, or at least I would struggle to explain it in more detail. The TL;DR is that VTOrc sits in a single location and cannot be certain that a connection failure means something is actually down or unreachable. The driver for this problem is explained in a planetscale-internal issue I cannot link; I will copy it into an issue if that helps review the change.

Sure, I'd prefer not to create mega/monolith PRs, and I've also seen others take this approach to implementing a feature, so I thought it was acceptable. I can give this a shot in one huge PR 👍

Yes, for quite some time I've been planning to implement a gossip protocol within shards to solve many problems at once (not just VTOrc). This wasn't a popular idea when it was originally proposed, but I'm glad to hear it come up from another maintainer. I have started this in a few POC branches, some of which I'm struggling to find now (on a 🚋 to FOSDEM!).

At least in a transition phase to that model, the complexity of what is being solved here is no different to me: if we cannot reach a tablet, we need to ask another whether it's really dead. We can do that using a proxied call or by asking the gossip state of another tablet, but either way we need to ask "someone else". We could move to relying purely on the gossip protocol and ask only a single tablet from each shard for the state of the world, but that would be a massive change I don't think is realistic to do in a big-bang approach.

The reason I approached this issue as a simple proxy request, for now, is that it felt like a simpler, iterative approach vs. tying the gossip system I want to implement to this problem. But I would be on board with tackling it all at once.

The benefit this PR's approach has over a gossip protocol, however, is that a proxied request should be more accurate. A gossip system is asynchronous, meaning if you ask "is tablet X alive?", the state answering that question could be as stale as the last poll/push/update (even if only by milliseconds). In the synchronous proxy approach this PR partially implements, you get the exact current state without any drift, which is how VTOrc currently makes decisions. I'm not saying the asynchronous nature of a gossip system is a dealbreaker, but it IS technically less accurate; its state of other tablets is eventually consistent.

An RFC is on the way 👍
Description

Add support for proxying `FullStatus` RPCs, in order for VTOrc to gain the ability to validate certain problems using many cells (locations) and detect network partitions. This will happen in future PRs (possibly v24); for now, having this RPC in v24 will be helpful.

Support for this is intentionally not added to `vtctldclient GetFullStatus`, because it's intended for internal VTOrc usage and I can't think of a good use case for someone at a CLI to need this.

To add some safety, a request can only be proxied once (the `ProxiedBy` flag is added by the proxying tmserver) and you cannot proxy to "yourself", which would cause an infinite loop. Finally, a proxy timeout greater than the remote operation timeout of the proxying server returns an error. e2e tests are added to confirm these safety nets and that the proxying succeeds (see the sketch below).

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure
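As a rough illustration of the safety nets described in the Description above (proxy-once, no self-proxying, and the timeout cap), here is a hedged sketch. It is not the PR's exact code; the field and receiver names flagged in the comments are assumptions.

```go
// Sketch of the safety nets described above. Assumptions that may differ from
// the real PR: ProxiedBy records the proxying tablet's alias (it may simply be
// a flag), request.TabletAlias names the proxy target, and s.tabletAlias is
// the local tablet's alias. The timeout check mirrors the diff shown earlier.
if request.ProxiedBy != nil {
	// A request may only be proxied a single time.
	return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "request was already proxied")
}
if topoproto.TabletAliasEqual(request.TabletAlias, s.tabletAlias) {
	// Proxying to ourselves would cause an infinite loop.
	return nil, vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "cannot proxy a request to self")
}
if request.ProxyTimeoutMs > uint64(topo.RemoteOperationTimeout.Milliseconds()) {
	// The proxied call must fit inside the local remote-operation timeout.
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "cannot set a proxy timeout ms greater than %d", topo.RemoteOperationTimeout.Milliseconds())
}
```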