Make asynchronous replica re-initialization reliable #8324
Merged
Currently, when a physical backup is performed, the journal segment is switched from N to N+1 at backup start, so the backup file is guaranteed to contain only data up to sequence N (inclusive). However, a long-running writeable transaction may already have some of its changes stored in segments <= N while its commit event lands in a later segment. After re-initialization, the replica continues with segment N+1, so (a) those older changes are lost and (b) the error "Transaction X is not found" usually follows. This means the replica is inconsistent and must be re-initialized again; if the primary is under high load, this can happen over and over.
The solution is to not delete segments <= N immediately. Instead, scan them to find the transactions still active at the end of segment N, calculate the new replication OAT (oldest active transaction), delete everything below the OAT, replay the journal (active transactions only) starting with the OAT, and then proceed normally with N+1 and beyond.
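A minimal, self-contained sketch of that two-pass replay, assuming an in-memory journal and hypothetical names (`Record`, `OpType`, `replayAfterReinit` and the record layout are illustrative, not the actual Firebird replication code), and treating the OAT boundary as the oldest segment in which any still-active transaction started:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <vector>

using TraNumber = std::uint64_t;
using SegNumber = std::uint64_t;

enum class OpType { START, CHANGE, COMMIT, ROLLBACK };

struct Record
{
    TraNumber tra;
    OpType op;
};

// Hypothetical in-memory journal: segment number -> ordered records.
std::map<SegNumber, std::vector<Record>> journal;

// Hypothetical stand-in for applying one journal record to the replica.
void applyRecord(SegNumber seq, const Record& rec)
{
    std::printf("apply: segment %llu, transaction %llu, op %d\n",
        (unsigned long long) seq, (unsigned long long) rec.tra, (int) rec.op);
}

// Re-initialization: the backup guarantees data only up to segment N.
void replayAfterReinit(SegNumber n)
{
    // Pass 1: scan segments <= N to find transactions still active at
    // the end of segment N, remembering the segment each one started in.
    std::map<TraNumber, SegNumber> startSeg;
    std::set<TraNumber> active;

    for (const auto& [seq, records] : journal)
    {
        if (seq > n)
            break;

        for (const Record& rec : records)
        {
            if (rec.op == OpType::START)
            {
                startSeg[rec.tra] = seq;
                active.insert(rec.tra);
            }
            else if (rec.op == OpType::COMMIT || rec.op == OpType::ROLLBACK)
                active.erase(rec.tra);
        }
    }

    // New replication OAT boundary: the oldest segment where any
    // still-active transaction started. Everything below it is deleted.
    SegNumber oat = n + 1;
    for (TraNumber tra : active)
        oat = std::min(oat, startSeg[tra]);

    journal.erase(journal.begin(), journal.lower_bound(oat));

    // Pass 2: replay segments OAT..N, applying only records belonging
    // to the transactions that were active at the end of segment N.
    for (const auto& [seq, records] : journal)
    {
        if (seq > n)
            break;

        for (const Record& rec : records)
        {
            if (active.count(rec.tra))
                applyRecord(seq, rec);
        }
    }

    // From here on, the replica proceeds normally with segment N+1.
}

int main()
{
    // The failure scenario from the description: transaction 10 writes
    // changes in segments 3..5 but commits only in segment 6, i.e. after
    // the segment switch (N = 5) performed at backup start.
    journal[3] = { {10, OpType::START}, {10, OpType::CHANGE} };
    journal[4] = { {10, OpType::CHANGE}, {11, OpType::START}, {11, OpType::COMMIT} };
    journal[5] = { {10, OpType::CHANGE} };
    journal[6] = { {10, OpType::COMMIT} };

    // Replays transaction 10 from its start (OAT boundary = 3);
    // transaction 11 committed before the end of segment 5, so its
    // changes are already in the backup and are skipped.
    replayAfterReinit(5);
    return 0;
}
```

In this sketch the commit of transaction 10 in segment 6 is intentionally not replayed here; it is picked up during normal processing of segments N+1 and beyond.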