Description
Overview of the Issue
If a shard with GTID-based tablets also contains one or more file-based tablets, EmergencyReparentShard fails with: "encountered mix of GTID-based and non GTID-based relay logs", so an unplanned failure of the PRIMARY goes unhandled. This can lead to an indefinite outage that a human (or clever external automation) must respond to manually 👎
Desired outcome: if we are certain the file-based replica is NOT a semi-sync acker (it shouldn't be, but it is theoretically possible), we should prioritize promoting a GTID-based replica rather than failing outright.
As a consequence of this proposed change, in some probably-rare scenarios the file-based replica may have an errant transaction post-reparent, because file-based tablets are not considered as promotion candidates. This errant transaction is likely to occur even if a human responds to fix this scenario, unless the user DID intend to promote the file-based replica, but that's not a good idea.
cc @dbussink seeing we recently discussed this
Reproduction Steps
- Create a typical GTID-based shard
- Add a file-based replica tablet
- Make the `PRIMARY` unavailable somehow
- Observe that `EmergencyReparentShard` (manual or from VTOrc) fails on: `"encountered mix of GTID-based and non GTID-based relay logs"`
Binary Version
main/v23+
Operating System and Environment details
Not applicable
Log Fragments
`"encountered mix of GTID-based and non GTID-based relay logs"`