EmergencyReparentShard: require stop replication error to be from PRIMARY#19515
EmergencyReparentShard: require stop replication error to be from PRIMARY#19515timvaillancourt wants to merge 14 commits intovitessio:mainfrom
EmergencyReparentShard: require stop replication error to be from PRIMARY#19515Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19515 +/- ##
===========================================
+ Coverage 69.67% 90.98% +21.30%
===========================================
Files 1614 9 -1605
Lines 216793 1253 -215540
===========================================
- Hits 151044 1140 -149904
+ Misses 65749 113 -65636
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
Pull request overview
Updates EmergencyReparentShard’s stop-replication phase to only tolerate a single stop-replication error when it originates from the shard PRIMARY, tightening safety around partial failures during ERS.
Changes:
- Wrap stop-replication errors with tablet metadata and only ignore a lone error if it came from a
PRIMARY. - Adjust ERS-related tests to reflect the stricter behavior (including marking certain scenarios as failures).
- Fix up test fixtures to correctly mark the failed previous primary as
TabletType_PRIMARY.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| go/vt/vtctl/reparentutil/replication.go | Adds tablet-associated error wrapping and enforces “only PRIMARY may be the single tolerated failure” logic. |
| go/vt/vtctl/reparentutil/replication_test.go | Updates test case names/fixtures and expectations for the new PRIMARY-only tolerance. |
| go/vt/vtctl/reparentutil/emergency_reparenter_test.go | Ensures “failed previous primary” tablets are explicitly typed as PRIMARY in fixtures. |
| go/vt/vtctl/grpcvtctldserver/server_test.go | Updates ERS RPC test expectation to error under the new stricter stop-replication rules. |
| go/vt/vtctl/grpcvtctldserver/server_slow_test.go | Updates slow ERS RPC test expectation to error under the new stricter stop-replication rules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
| Uid: 101, | ||
| }, | ||
| }}, | ||
| shouldErr: false, |
There was a problem hiding this comment.
Technically this always should have errored
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // If there are recorded errors, confirm there is a single error from the PRIMARY, | ||
| // as ERS currently only supports the PRIMARY tablet being down. This logic can be | ||
| // extended when more partial-failure cases are supportable. | ||
| if primaryAlias != nil && len(errRecorder.Errors) == 1 { | ||
| var tabletErr *tabletAliasError | ||
| if errors.As(errRecorder.Errors[0], &tabletErr) { | ||
| // Failure to reach the PRIMARY tablet is expected, return early. | ||
| if topoproto.TabletAliasEqual(primaryAlias, tabletErr.GetAlias()) { | ||
| return res, nil | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The new early-return path treats any single error from primaryAlias as acceptable and skips the haveRevoked safety check. This will also return success for cases where the PRIMARY is reachable but we failed to demote/stop it (e.g. StopReplicationAndGetStatus returns ERNotReplica and DemotePrimary fails), which can leave a writable primary-eligible tablet unrevoked. Consider narrowing the early-return to errors that indicate the PRIMARY is actually unreachable (timeouts/transport errors), or otherwise ensure we still fail (or verify revocation) when the PRIMARY error is from a failed demote/stop operation.
There was a problem hiding this comment.
That's a valid point, but really only these cases are possible:
- The RPC is processed by VTTablet and returns
ERNotReplica - The RPC fails from the client-side - could be due to a number of failure scenarios
In all the scenarios I can think of, we won't respond differently. All we care about is the PRIMARY returned nil, ERNotReplica or a long list of client-side errors, but the details aren't important - the goal is to get rid of the PRIMARY that returned the error
Description
In the
stopReplicationAndBuildStatusMapshelper-funcofEmergencyReparentShard, a single tablet failing to run theStopReplicationAndGetStatusis permittedLogically and in code-comments the code expects this single-failure to be the
PRIMARY, because ERS is currently designed to recover a failed-primary scenario. But it never actually validates that the single-failure is thePRIMARY, it checks for# errors - 1This PR updates the error checking in
stopReplicationAndBuildStatusMapsto actually-check the assumption that the single error is thePRIMARY. There are plans for ERS to support more failure cases in this area, but they'll be tackled in other PRs I already have underwayI can't think of a scenario (Claude either) where ERS would be called on a shard with a healthy tablet, but a single broken replica, but this hole in the logic may have allowed that scenario to cause a reparent that may ignore a tablet that may be most-advanced. That's a pretty specific case, but the rest of the ERS code wouldn't let this sort of thing slide.. we always err on the side of being certain we have the most advanced tablet
Related Issue(s)
Resolves #19521
Checklist
Deployment Notes
AI Disclosure