EmergencyReparentShard: require stop replication error to be from PRIMARY #19515

Open
timvaillancourt wants to merge 14 commits into vitessio:main from timvaillancourt:ers-explicit-tabletType-in-err

@timvaillancourt timvaillancourt commented Feb 27, 2026

Description

In the stopReplicationAndBuildStatusMaps helper func of EmergencyReparentShard (ERS), a single tablet is permitted to fail the StopReplicationAndGetStatus RPC.

Both the logic and the code comments expect this single failure to be the PRIMARY, because ERS is currently designed to recover from a failed-primary scenario. But the code never validates that the failing tablet is the PRIMARY; it only checks the number of errors (n-1 tablets must respond):

// In general we want to wait for n-1 tablets to respond, since we know the primary tablet is down.

This PR updates the error checking in stopReplicationAndBuildStatusMaps to actually verify the assumption that the single error came from the PRIMARY. There are plans for ERS to support more failure cases in this area, but those will be tackled in other PRs I already have underway.

I can't think of a scenario (and neither can Claude) where ERS would be called on a shard with a healthy PRIMARY but a single broken replica. Still, this hole in the logic may have allowed that scenario to trigger a reparent that ignores a tablet that may be the most advanced. That's a pretty specific case, but the rest of the ERS code wouldn't let this sort of thing slide: we always err on the side of being certain we have the most advanced tablet.
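
The difference between the count-based check and the PRIMARY-verified check can be sketched as follows. This is a minimal, self-contained illustration and not the Vitess code: `alias`, `countOnlyOK`, and `primaryOnlyOK` are hypothetical names, and a tablet alias is simplified to a cell and a uid.

```go
package main

import "fmt"

// alias is a simplified, hypothetical stand-in for a tablet alias.
type alias struct {
	Cell string
	Uid  uint32
}

// countOnlyOK models the count-based behavior described above: at most one
// failure is tolerated, regardless of which tablet failed.
func countOnlyOK(failed []alias) bool {
	return len(failed) <= 1
}

// primaryOnlyOK models the stricter behavior: no failures is fine, and a
// single failure is tolerated only when it came from the known PRIMARY.
func primaryOnlyOK(failed []alias, primary *alias) bool {
	switch len(failed) {
	case 0:
		return true
	case 1:
		return primary != nil && failed[0] == *primary
	default:
		return false
	}
}

func main() {
	primary := alias{Cell: "zone1", Uid: 100}
	replica := alias{Cell: "zone1", Uid: 101}

	// A lone replica failure passes the count-based check...
	fmt.Println(countOnlyOK([]alias{replica})) // true
	// ...but is rejected once the failing tablet must be the PRIMARY.
	fmt.Println(primaryOnlyOK([]alias{replica}, &primary)) // false
	fmt.Println(primaryOnlyOK([]alias{primary}, &primary)) // true
}
```

The second call is the case this PR is concerned with: one broken replica no longer counts as the expected failure.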

Related Issue(s)

Resolves #19521

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@github-actions github-actions bot added this to the v24.0.0 milestone Feb 27, 2026
@vitess-bot vitess-bot bot added the NeedsWebsiteDocsUpdate, NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels Feb 27, 2026

vitess-bot bot commented Feb 27, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt added the Type: Bug, Type: Enhancement, and Component: vtctl labels and removed the NeedsWebsiteDocsUpdate and NeedsBackportReason labels Feb 27, 2026
@timvaillancourt timvaillancourt self-assigned this Feb 27, 2026

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.98%. Comparing base (70c7a72) to head (8395231).
⚠️ Report is 37 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vtctl/reparentutil/replication.go 83.33% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19515       +/-   ##
===========================================
+ Coverage   69.67%   90.98%   +21.30%     
===========================================
  Files        1614        9     -1605     
  Lines      216793     1253   -215540     
===========================================
- Hits       151044     1140   -149904     
+ Misses      65749      113    -65636     
Flag Coverage Δ
partial 90.98% <84.21%> (?)

Copilot AI left a comment

Pull request overview

Updates EmergencyReparentShard’s stop-replication phase to only tolerate a single stop-replication error when it originates from the shard PRIMARY, tightening safety around partial failures during ERS.

Changes:

  • Wrap stop-replication errors with tablet metadata and only ignore a lone error if it came from a PRIMARY.
  • Adjust ERS-related tests to reflect the stricter behavior (including marking certain scenarios as failures).
  • Fix up test fixtures to correctly mark the failed previous primary as TabletType_PRIMARY.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file

File | Description
go/vt/vtctl/reparentutil/replication.go | Adds tablet-associated error wrapping and enforces “only PRIMARY may be the single tolerated failure” logic.
go/vt/vtctl/reparentutil/replication_test.go | Updates test case names/fixtures and expectations for the new PRIMARY-only tolerance.
go/vt/vtctl/reparentutil/emergency_reparenter_test.go | Ensures “failed previous primary” tablets are explicitly typed as PRIMARY in fixtures.
go/vt/vtctl/grpcvtctldserver/server_test.go | Updates ERS RPC test expectation to error under the new stricter stop-replication rules.
go/vt/vtctl/grpcvtctldserver/server_slow_test.go | Updates slow ERS RPC test expectation to error under the new stricter stop-replication rules.

@timvaillancourt timvaillancourt added the Backport to: release-22.0 and Backport to: release-23.0 labels and removed the NeedsDescriptionUpdate label Feb 27, 2026
Uid: 101,
},
}},
shouldErr: false,
Contributor Author

Technically this always should have errored

@timvaillancourt timvaillancourt removed the NeedsIssue label Feb 27, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


Comment on lines +375 to +386
	// If there are recorded errors, confirm there is a single error from the PRIMARY,
	// as ERS currently only supports the PRIMARY tablet being down. This logic can be
	// extended when more partial-failure cases are supportable.
	if primaryAlias != nil && len(errRecorder.Errors) == 1 {
		var tabletErr *tabletAliasError
		if errors.As(errRecorder.Errors[0], &tabletErr) {
			// Failure to reach the PRIMARY tablet is expected, return early.
			if topoproto.TabletAliasEqual(primaryAlias, tabletErr.GetAlias()) {
				return res, nil
			}
		}
	}

Copilot AI Feb 28, 2026

The new early-return path treats any single error from primaryAlias as acceptable and skips the haveRevoked safety check. This will also return success for cases where the PRIMARY is reachable but we failed to demote/stop it (e.g. StopReplicationAndGetStatus returns ERNotReplica and DemotePrimary fails), which can leave a writable primary-eligible tablet unrevoked. Consider narrowing the early-return to errors that indicate the PRIMARY is actually unreachable (timeouts/transport errors), or otherwise ensure we still fail (or verify revocation) when the PRIMARY error is from a failed demote/stop operation.

@timvaillancourt timvaillancourt Feb 28, 2026

That's a valid point, but really only these cases are possible:

  1. The RPC is processed by VTTablet and returns ERNotReplica.
  2. The RPC fails on the client side, which could be due to a number of failure scenarios.

In all the scenarios I can think of, we wouldn't respond differently. All we care about is whether the PRIMARY returned nil, ERNotReplica, or one of a long list of client-side errors. The details aren't important: the goal is to get rid of the PRIMARY that returned the error.

Labels

Backport to: release-22.0, Backport to: release-23.0, Component: vtctl, Type: Bug, Type: Enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: stopReplicationAndBuildStatusMaps in ERS has weak PRIMARY check

2 participants