Bug Report: EmergencyReparentShard fails in GTID shard with any file-based replicas #19261

@timvaillancourt

Overview of the Issue

If a shard with GTID-based tablets also contains one or more file-based tablets, EmergencyReparentShard fails with "encountered mix of GTID-based and non GTID-based relay logs", so an unplanned failure of the PRIMARY is left unactioned. This can lead to an indefinite outage that a human (or clever external automation) must respond to manually 👎

Desired outcome: if we are certain the file-based replica is NOT a semi-sync acker (it shouldn't be, but it is theoretically possible), we should prioritize promoting a GTID-based replica instead of failing outright.

As a consequence of this proposed change, in some probably-rare scenarios the file-based replica may have an errant transaction post-reparent, as file-based tablets are not considered as promotion candidates. This errant transaction is likely to occur regardless of whether a human responds to fix this scenario, unless the user DID intend to promote the file-based replica, but that's not a good idea.

cc @dbussink seeing we recently discussed this

Reproduction Steps

  1. Create a typical GTID-based shard
  2. Add a file-based replica tablet
  3. Make the PRIMARY unavailable somehow
  4. Observe that EmergencyReparentShard (manual or from VTOrc) fails on: "encountered mix of GTID-based and non GTID-based relay logs"

Binary Version

main/v23+

Operating System and Environment details

Not applicable

Log Fragments

`"encountered mix of GTID-based and non GTID-based relay logs"`
