
Commit 7d21b2b

Merge remote-tracking branch 'zalando/master' into multisite
2 parents: 917b086 + f2ebf7b

24 files changed: +399 -139 lines changed

.github/workflows/release.yaml

Lines changed: 2 additions & 2 deletions
@@ -34,10 +34,10 @@ jobs:

       - name: Publish distribution to Test PyPI
         if: github.event_name == 'push'
-        uses: pypa/gh-action-pypi-publish@v1.9.0
+        uses: pypa/gh-action-pypi-publish@v1.12.4
         with:
           repository_url: https://test.pypi.org/legacy/

       - name: Publish distribution to PyPI
         if: github.event_name == 'release'
-        uses: pypa/gh-action-pypi-publish@v1.9.0
+        uses: pypa/gh-action-pypi-publish@v1.12.4

.github/workflows/tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -198,7 +198,7 @@ jobs:

       - uses: jakebailey/pyright-action@v2
         with:
-          version: 1.1.394
+          version: 1.1.401

   ydiff:
     name: Test compatibility with the latest version of ydiff

docs/patroni_configuration.rst

Lines changed: 11 additions & 11 deletions
@@ -38,21 +38,21 @@ Important rules
 PostgreSQL parameters controlled by Patroni
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Some of the PostgreSQL parameters **must hold the same values on the primary and the replicas**. For those, **values set either in the local patroni configuration files or via the environment variables take no effect**. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default values:
+Some of the PostgreSQL parameters **must hold the same values on the primary and the replicas**. For those, **values set either in the local patroni configuration files or via the environment variables take no effect**. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default and minimal values:

-- **max_connections**: 100
-- **max_locks_per_transaction**: 64
-- **max_worker_processes**: 8
-- **max_prepared_transactions**: 0
-- **wal_level**: hot_standby
-- **track_commit_timestamp**: off
+- **max_connections**: default value 100, minimal value 25
+- **max_locks_per_transaction**: default value 64, minimal value 32
+- **max_worker_processes**: default value 8, minimal value 2
+- **max_prepared_transactions**: default value 0, minimal value 0
+- **wal_level**: default value hot_standby, accepted values: hot_standby, replica, logical
+- **track_commit_timestamp**: default value off

 For the parameters below, PostgreSQL does not require equal values among the primary and all the replicas. However, considering the possibility of a replica to become the primary at any time, it doesn't really make sense to set them differently; therefore, **Patroni restricts setting their values to the** :ref:`dynamic configuration <dynamic_configuration>`.

-- **max_wal_senders**: 10
-- **max_replication_slots**: 10
-- **wal_keep_segments**: 8
-- **wal_keep_size**: 128MB
+- **max_wal_senders**: default value 10, minimal value 3
+- **max_replication_slots**: default value 10, minimal value 4
+- **wal_keep_segments**: default value 8, minimal value 1
+- **wal_keep_size**: default value 128MB, minimal value 16MB
 - **wal_log_hints**: on

 These parameters are validated to ensure they are sane, or meet a minimum value.
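
The validation mentioned in the last context line can be pictured as a simple clamp against the documented minimums. The sketch below is illustrative only: it is not Patroni's actual validator, and the exact clamping behaviour is an assumption based on the documentation text above.

```python
# Illustrative sketch only -- not Patroni's validator. It assumes the documented
# minimums above and that a value below the minimum is raised to the minimum.
MINIMUMS = {
    'max_connections': 25,
    'max_locks_per_transaction': 32,
    'max_worker_processes': 2,
    'max_prepared_transactions': 0,
    'max_wal_senders': 3,
    'max_replication_slots': 4,
    'wal_keep_segments': 1,
}


def sanitize(name: str, requested: int) -> int:
    """Return the requested value, but never less than the documented minimum."""
    return max(requested, MINIMUMS.get(name, 0))


assert sanitize('max_connections', 10) == 25   # below the minimum: raised to 25
assert sanitize('max_wal_senders', 10) == 10   # sane value: kept as requested
```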

docs/releases.rst

Lines changed: 60 additions & 0 deletions
@@ -3,6 +3,66 @@
 Release notes
 =============

+Version 4.0.6
+-------------
+
+Released 2025-06-06
+
+**Bugfixes**
+
+- Fix bug in failover from a leader with a higher priority (Alexander Kukushkin)
+
+  Make sure Patroni ignores the former leader with higher priority when it reports the same ``LSN`` as the current node.
+
+- Fix permissions for the ``postgresql.conf`` file created outside of ``PGDATA`` (Michael Banck)
+
+  Respect the system-wide umask value when creating the ``postgresql.conf`` file outside of the ``PGDATA`` directory.
+
+- Fix bug with switchover in ``synchronous_mode=quorum`` (Alexander Kukushkin)
+
+  Do not check quorum requirements when a candidate is specified.
+
+- Ignore stale Etcd nodes by comparing cluster term (Alexander Kukushkin)
+
+  Memorize the last known "raft_term" of the Etcd cluster, and when executing client requests, compare it with the "raft_term" reported by an Etcd node.
+
+- Update PostgreSQL configuration files on ``SIGHUP`` (Alexander Kukushkin)
+
+  Previously, Patroni was only replacing PostgreSQL configuration files if a change in global or local configuration was detected.
+
+- Properly handle ``Unavailable`` exception raised by ``etcd3`` (Alexander Kukushkin)
+
+  Patroni used to retry such requests on the same ``etcd3`` node, while switching to another node is a better strategy.
+
+- Improve ``etcd3`` lease handling (Alexander Kukushkin)
+
+  Make sure Patroni refreshes the ``etcd3`` lease at least once per HA loop.
+
+- Recheck annotations on 409 status code when attempting to acquire leader lock (Alexander Kukushkin)
+
+  Implement the same behavior as was done for the leader object read in Patroni version 4.0.3.
+
+- Consider ``replay_lsn`` when advancing slots (Polina Bungina)
+
+  Do not try to advance slots on replicas past the ``replay_lsn``. Additionally, advance the slot to the ``replay_lsn`` position if it is already past the ``confirmed_flush_lsn`` of this slot on the replica but the replica has still not replayed the actual ``LSN`` at which this slot is on the primary.
+
+- Make sure ``CHECKPOINT`` is executed after promote (Alexander Kukushkin)
+
+  It was possible that the checkpoint task wasn't reset on demote because ``CHECKPOINT`` wasn't yet finished. This resulted in using a stale ``result`` when the next promote is triggered.
+
+- Avoid running "offline" demotion concurrently (Alexander Kukushkin)
+
+  In case of a slow shutdown, it might happen that the next heartbeat loop hits the DCS error handling method again, resulting in the ``AsyncExecutor is busy, demoting from the main thread`` warning and starting offline demotion again.
+
+- Normalize the ``data_dir`` value before renaming the data directory on initialization failure (Waynerv)
+
+  Prevent a trailing slash in the ``data_dir`` parameter value from breaking the renaming process after an initialization failure.
+
+- Check that ``synchronous_standby_names`` contains the expected value (Alexander Kukushkin)
+
+  Previously, the mechanism implementing the state machine for non-quorum synchronous replication didn't check the actual value of ``synchronous_standby_names``, which resulted in a stale value of ``synchronous_standby_names`` being used when ``pg_stat_replication`` is a subset of ``synchronous_standby_names``.
+
 Version 4.0.5
 -------------

docs/rest_api.rst

Lines changed: 4 additions & 2 deletions
@@ -63,9 +63,11 @@ For all health check ``GET`` requests Patroni returns a JSON document with the s

 - ``GET /liveness``: returns HTTP status code **200** if Patroni heartbeat loop is properly running and **503** if the last run was more than ``ttl`` seconds ago on the primary or ``2*ttl`` on the replica. Could be used for ``livenessProbe``.

-- ``GET /readiness``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+- ``GET /readiness?lag=<max-lag>&mode=apply|write``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up, replicating and not too far behind the leader. The ``lag`` parameter sets how far a standby is allowed to be behind and defaults to ``maximum_lag_on_failover``. Lag can be specified in bytes or in human-readable values, e.g. 16kB, 64MB, 1GB. ``mode`` sets whether the WAL needs to be replayed (apply) or just received (write); the default is apply.

-Both, ``readiness`` and ``liveness`` endpoints are very light-weight and not executing any SQL. Probes should be configured in such a way that they start failing about time when the leader key is expiring. With the default value of ``ttl``, which is ``30s`` example probes would look like:
+When used as a Kubernetes ``readinessProbe``, it makes sure that freshly started pods only become ready once they have caught up with the leader. Combined with a PodDisruptionBudget, this protects against the leader being terminated too early during a rolling restart of nodes. It also makes sure that replicas that cannot keep up with replication do not serve read-only traffic. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+
+The ``liveness`` endpoint is very light-weight and does not execute any SQL. Probes should be configured in such a way that they start failing about the time when the leader key is expiring. With the default value of ``ttl``, which is ``30s``, example probes would look like:

 .. code-block:: yaml

patroni/api.py

Lines changed: 56 additions & 10 deletions
@@ -448,27 +448,73 @@ def do_GET_liveness(self) -> None:
         status_code = 200 if patroni.ha.is_paused() or patroni.next_run + liveness_threshold > time.time() else 503
         self._write_status_code_only(status_code)

+    def _readiness(self) -> Optional[str]:
+        """Check if readiness conditions are met.
+
+        :returns: None if node can be considered ready or diagnostic message if not."""
+
+        patroni = self.server.patroni
+        if patroni.ha.is_leader():
+            # We only become leader after bootstrap or once up as a standby, so we are definitely ready.
+            return
+
+        # When postgres is not running we are not ready.
+        if patroni.postgresql.state != PostgresqlState.RUNNING:
+            return 'PostgreSQL is not running'
+
+        postgres = self.get_postgresql_status(True)
+        latest_end_lsn = postgres.get('latest_end_lsn', 0)
+
+        if postgres.get('replication_state') != 'streaming':
+            return 'PostgreSQL replication state is not streaming'
+
+        cluster = patroni.dcs.cluster
+
+        if not cluster and not latest_end_lsn:
+            if patroni.ha.failsafe_is_active():
+                return
+            return 'DCS is not accessible'
+
+        leader_optime = max(cluster and cluster.status.last_lsn or 0, latest_end_lsn)
+
+        mode = 'write' if self.path_query.get('mode', [None])[0] == 'write' else 'apply'
+        location = 'received_location' if mode == 'write' else 'replayed_location'
+        lag = leader_optime - postgres.get('xlog', {}).get(location, 0)
+
+        max_replica_lag = parse_int(self.path_query.get('lag', [None])[0], 'B')
+        if max_replica_lag is None:
+            max_replica_lag = global_config.maximum_lag_on_failover
+
+        if lag > max_replica_lag:
+            return f'Replication {mode} lag {lag} exceeds maximum allowable {max_replica_lag}'
+
     def do_GET_readiness(self) -> None:
         """Handle a ``GET`` request to ``/readiness`` path.

+        * Query parameters:
+
+            * ``lag``: only accept replication lag up to ``lag``. Accepts either an :class:`int`, which
+              represents lag in bytes, or a :class:`str` representing lag in human-readable format (e.g.
+              ``10MB``).
+            * ``mode``: allowed values ``write``, ``apply``. Base replication lag off of received WAL or
+              replayed WAL. Defaults to ``apply``.
+
         Write a simple HTTP response which HTTP status can be:

         * ``200``:

-            * If this Patroni node holds the DCS leader lock; or
-            * If this PostgreSQL instance is up and running;
+            * If this Patroni node considers itself the leader; or
+            * If PostgreSQL is running, replicating and not lagging;

         * ``503``: if none of the previous conditions apply.

         """
-        patroni = self.server.patroni
-        if patroni.ha.is_leader():
-            status_code = 200
-        elif patroni.postgresql.state == PostgresqlState.RUNNING:
-            status_code = 200 if patroni.dcs.cluster else 503
-        else:
-            status_code = 503
-        self._write_status_code_only(status_code)
+        failure_reason = self._readiness()
+
+        if failure_reason:
+            logger.debug("Readiness check failure: %s", failure_reason)
+
+        self._write_status_code_only(200 if not failure_reason else 503)

     def do_GET_patroni(self) -> None:
         """Handle a ``GET`` request to ``/patroni`` path.

patroni/dcs/etcd.py

Lines changed: 34 additions & 22 deletions
@@ -101,40 +101,28 @@ def _do_resolve(host: str, port: int) -> List[_AddrInfo]:
         return []


-class AbstractEtcdClientWithFailover(abc.ABC, etcd.Client):
+class StaleEtcdNodeGuard(object):

-    ERROR_CLS: Type[Exception]
+    def __init__(self) -> None:
+        self._reset_cluster_raft_term()

-    def __init__(self, config: Dict[str, Any], dns_resolver: DnsCachingResolver, cache_ttl: int = 300) -> None:
+    def _reset_cluster_raft_term(self) -> None:
         self._cluster_id = None
         self._raft_term = 0
-        self._dns_resolver = dns_resolver
-        self.set_machines_cache_ttl(cache_ttl)
-        self._machines_cache_updated = 0
-        kwargs = {p: config.get(p) for p in ('host', 'port', 'protocol', 'use_proxies', 'version_prefix',
-                                             'username', 'password', 'cert', 'ca_cert') if config.get(p)}
-        super(AbstractEtcdClientWithFailover, self).__init__(read_timeout=config['retry_timeout'], **kwargs)
-        # For some reason python3-etcd on debian and ubuntu are not based on the latest version
-        # Workaround for the case when https://github.com/jplana/python-etcd/pull/196 is not applied
-        self.http.connection_pool_kw.pop('ssl_version', None)
-        self._config = config
-        self._load_machines_cache()
-        self._allow_reconnect = True
-        # allow passing retry argument to api_execute in params
-        self._comparison_conditions.add('retry')
-        self._read_options.add('retry')
-        self._del_conditions.add('retry')

     def _check_cluster_raft_term(self, cluster_id: Optional[str], value: Union[None, str, int]) -> None:
         """Check that observed Raft Term in Etcd cluster is increasing.

-        If we observe that the new value is smaller than the previously known one, it could be an
-        indicator that we connected to a stale node and should switch to some other node.
-        However, we need to reset the memorized value when we notice that Cluster ID changed.
+        :param cluster_id: last observed Etcd Cluster ID
+        :param raft_term: last observed Raft Term
+
+        :raises:
+            :exc:`StaleEtcdNode` if last observed *raft_term* is smaller than previously known *raft_term*.
         """
         if not (cluster_id and value):
             return

+        # We need to reset the memorized value when we notice that Cluster ID changed.
         if self._cluster_id and self._cluster_id != cluster_id:
             logger.warning('Etcd Cluster ID changed from %s to %s', self._cluster_id, cluster_id)
             self._raft_term = 0

@@ -151,6 +139,30 @@ def _check_cluster_raft_term(self, cluster_id: Optional[str], value: Union[None,
             raise StaleEtcdNode
         self._raft_term = raft_term

+
+class AbstractEtcdClientWithFailover(abc.ABC, etcd.Client, StaleEtcdNodeGuard):
+
+    ERROR_CLS: Type[Exception]
+
+    def __init__(self, config: Dict[str, Any], dns_resolver: DnsCachingResolver, cache_ttl: int = 300) -> None:
+        StaleEtcdNodeGuard.__init__(self)
+        self._dns_resolver = dns_resolver
+        self.set_machines_cache_ttl(cache_ttl)
+        self._machines_cache_updated = 0
+        kwargs = {p: config.get(p) for p in ('host', 'port', 'protocol', 'use_proxies', 'version_prefix',
+                                             'username', 'password', 'cert', 'ca_cert') if config.get(p)}
+        super(AbstractEtcdClientWithFailover, self).__init__(read_timeout=config['retry_timeout'], **kwargs)
+        # For some reason python3-etcd on debian and ubuntu are not based on the latest version
+        # Workaround for the case when https://github.com/jplana/python-etcd/pull/196 is not applied
+        self.http.connection_pool_kw.pop('ssl_version', None)
+        self._config = config
+        self._load_machines_cache()
+        self._allow_reconnect = True
+        # allow passing retry argument to api_execute in params
+        self._comparison_conditions.add('retry')
+        self._read_options.add('retry')
+        self._del_conditions.add('retry')
+
     def _calculate_timeouts(self, etcd_nodes: int, timeout: Optional[float] = None) -> Tuple[int, float, int]:
         """Calculate a request timeout and number of retries per single etcd node.
         In case if the timeout per node is too small (less than one second) we will reduce the number of nodes.
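
The refactoring above extracts the stale-node detection into a ``StaleEtcdNodeGuard`` mixin so the etcd3 ``KVCache`` can reuse it (see the next file). Below is a self-contained sketch of the same idea, simplified and independent of Patroni's classes, to make the behaviour easier to follow.

```python
# Simplified, standalone illustration of the raft-term check above; it is not
# the actual StaleEtcdNodeGuard class, just the same idea in miniature.
from typing import Optional, Union


class StaleNode(Exception):
    """Raised when a node reports an older Raft term than previously observed."""


class RaftTermTracker:
    def __init__(self) -> None:
        self.cluster_id: Optional[str] = None
        self.raft_term = 0

    def check(self, cluster_id: Optional[str], raft_term: Union[None, str, int]) -> None:
        if not (cluster_id and raft_term):
            return
        # A different Cluster ID means a different cluster: forget the old term.
        if self.cluster_id and self.cluster_id != cluster_id:
            self.raft_term = 0
        self.cluster_id = cluster_id
        term = int(raft_term)
        if term < self.raft_term:
            raise StaleNode  # this node is behind the cluster state we already saw
        self.raft_term = term


tracker = RaftTermTracker()
tracker.check('cluster-a', 7)      # remember term 7
tracker.check('cluster-a', 8)      # newer term: fine
try:
    tracker.check('cluster-a', 5)  # older term: stale node, switch to another one
except StaleNode:
    print('stale etcd node detected')
```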

patroni/dcs/etcd3.py

Lines changed: 18 additions & 4 deletions
@@ -24,7 +24,8 @@
 from ..utils import deep_compare, enable_keepalive, iter_response_objects, RetryFailedError, USER_AGENT
 from . import catch_return_false_exception, Cluster, ClusterConfig, \
     Failover, Leader, Member, Status, SyncState, TimelineHistory
-from .etcd import AbstractEtcd, AbstractEtcdClientWithFailover, catch_etcd_errors, DnsCachingResolver, Retry
+from .etcd import AbstractEtcd, AbstractEtcdClientWithFailover, catch_etcd_errors, \
+    DnsCachingResolver, Retry, StaleEtcdNode, StaleEtcdNodeGuard

 logger = logging.getLogger(__name__)


@@ -432,10 +433,11 @@ def watchprefix(self, key: str, start_revision: Optional[str] = None,
         return self.watchrange(key, prefix_range_end(key), start_revision, filters, read_timeout)


-class KVCache(Thread):
+class KVCache(StaleEtcdNodeGuard, Thread):

     def __init__(self, dcs: 'Etcd3', client: 'PatroniEtcd3Client') -> None:
-        super(KVCache, self).__init__()
+        Thread.__init__(self)
+        StaleEtcdNodeGuard.__init__(self)
         self.daemon = True
         self._dcs = dcs
         self._client = client

@@ -505,7 +507,10 @@ def _process_message(self, message: Dict[str, Any]) -> None:
         logger.debug('Received message: %s', message)
         if 'error' in message:
             raise _raise_for_data(message)
-        events: List[Dict[str, Any]] = message.get('result', {}).get('events', [])
+        result = message.get('result', EMPTY_DICT)
+        header = result.get('header', EMPTY_DICT)
+        self._check_cluster_raft_term(header.get('cluster_id'), header.get('raft_term'))
+        events: List[Dict[str, Any]] = result.get('events', [])
         for event in events:
             self._process_event(event)

@@ -539,8 +544,11 @@ def _do_watch(self, revision: str) -> None:

     def _build_cache(self) -> None:
         result = self._dcs.retry(self._client.prefix, self._dcs.cluster_prefix)
+        header = result.get('header', EMPTY_DICT)
         with self._object_cache_lock:
+            self._reset_cluster_raft_term()
             self._object_cache = {node['key']: node for node in result.get('kvs', [])}
+            self._check_cluster_raft_term(header.get('cluster_id'), header.get('raft_term'))
         with self.condition:
             self._is_ready = True
             self.condition.notify()

@@ -586,6 +594,12 @@ def kill_stream(self) -> None:

     def is_ready(self) -> bool:
         """Must be called only when holding the lock on `condition`"""
+        if self._is_ready:
+            try:
+                self._client._check_cluster_raft_term(self._cluster_id, self._raft_term)
+            except StaleEtcdNode:
+                self._is_ready = False
+                self.kill_stream()
         return self._is_ready