Skip to content

Commit 2823adf

Browse files
authored
Make sure CHECKPOINT is execute after promote (patroni#3368)
It was possible that `Rewind._checkpoint_task` wasn't reset on demote if CHECKPOINT wasn't yet finished, what resulted in using stale `result` when the next promote is triggered. It is not easy to reproduce, but steps are the following: 1. failover to node1. 2. while CHECKPOINT after promote is still running, switchover to node2. 3. next failover/switchover to node1 results in missing CHECKPOINT. To solve the problem we take following measures: 1. call `Rewind.reset_state()` before promote. 2. reset `Rewind._checkpoint_task` from trigger_check_diverged_lsn(). Besides that, we didn't check that CHECKPOINT during 2 actually finished successfully. If check implemented correctly chances to hit the problem would have been much smaller. However, there was still race condition, if switchover was triggered right after CHECKPOINT task finished. Close patroni#3367
1 parent 30915f3 commit 2823adf

File tree

2 files changed

+12
-3
lines changed

2 files changed

+12
-3
lines changed

patroni/ha.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1121,6 +1121,7 @@ def enforce_primary_role(self, message: str, promote_message: str) -> str:
11211121
self._failsafe.set_is_active(0)
11221122

11231123
def before_promote():
1124+
self._rewind.reset_state() # make sure we will trigger checkpoint after promote
11241125
self.notify_mpp_coordinator('before_promote')
11251126

11261127
with self._async_response:

patroni/postgresql/rewind.py

Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,8 @@ def can_rewind_or_reinitialize_allowed(self) -> bool:
7373
def trigger_check_diverged_lsn(self) -> None:
7474
if self.can_rewind_or_reinitialize_allowed and self._state != REWIND_STATUS.NEED:
7575
self._state = REWIND_STATUS.CHECK
76+
with self._checkpoint_task_lock:
77+
self._checkpoint_task = None
7678

7779
@staticmethod
7880
def check_leader_is_not_in_recovery(conn_kwargs: Dict[str, Any]) -> Optional[bool]:
@@ -306,10 +308,16 @@ def ensure_checkpoint_after_promote(self, wakeup: Callable[..., Any]) -> None:
306308
if self._state != REWIND_STATUS.CHECKPOINT and self._postgresql.is_primary():
307309
with self._checkpoint_task_lock:
308310
if self._checkpoint_task:
311+
result = None
312+
309313
with self._checkpoint_task:
310-
if self._checkpoint_task.result is not None:
311-
self._state = REWIND_STATUS.CHECKPOINT
312-
self._checkpoint_task = None
314+
result = self._checkpoint_task.result
315+
316+
if result is True:
317+
self._state = REWIND_STATUS.CHECKPOINT
318+
319+
if result is not None:
320+
self._checkpoint_task = None
313321
elif self._postgresql.get_primary_timeline() == self._postgresql.pg_control_timeline():
314322
self._state = REWIND_STATUS.CHECKPOINT
315323
else:

0 commit comments

Comments
 (0)