
Commit 7d21b2b

Merge remote-tracking branch 'zalando/master' into multisite
2 parents: 917b086 + f2ebf7b

24 files changed: +399 -139 lines changed

.github/workflows/release.yaml

Lines changed: 2 additions & 2 deletions
@@ -34,10 +34,10 @@ jobs:

       - name: Publish distribution to Test PyPI
         if: github.event_name == 'push'
-        uses: pypa/gh-action-pypi-publish@v1.9.0
+        uses: pypa/gh-action-pypi-publish@v1.12.4
         with:
           repository_url: https://test.pypi.org/legacy/

       - name: Publish distribution to PyPI
         if: github.event_name == 'release'
-        uses: pypa/gh-action-pypi-publish@v1.9.0
+        uses: pypa/gh-action-pypi-publish@v1.12.4

.github/workflows/tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -198,7 +198,7 @@ jobs:

       - uses: jakebailey/pyright-action@v2
         with:
-          version: 1.1.394
+          version: 1.1.401

   ydiff:
     name: Test compatibility with the latest version of ydiff

docs/patroni_configuration.rst

Lines changed: 11 additions & 11 deletions
@@ -38,21 +38,21 @@ Important rules
 PostgreSQL parameters controlled by Patroni
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Some of the PostgreSQL parameters **must hold the same values on the primary and the replicas**. For those, **values set either in the local patroni configuration files or via the environment variables take no effect**. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default values:
+Some of the PostgreSQL parameters **must hold the same values on the primary and the replicas**. For those, **values set either in the local patroni configuration files or via the environment variables take no effect**. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default and minimal values:

-- **max_connections**: 100
-- **max_locks_per_transaction**: 64
-- **max_worker_processes**: 8
-- **max_prepared_transactions**: 0
-- **wal_level**: hot_standby
-- **track_commit_timestamp**: off
+- **max_connections**: default value 100, minimal value 25
+- **max_locks_per_transaction**: default value 64, minimal value 32
+- **max_worker_processes**: default value 8, minimal value 2
+- **max_prepared_transactions**: default value 0, minimal value 0
+- **wal_level**: default value hot_standby, accepted values: hot_standby, replica, logical
+- **track_commit_timestamp**: default value off

 For the parameters below, PostgreSQL does not require equal values among the primary and all the replicas. However, considering the possibility of a replica to become the primary at any time, it doesn't really make sense to set them differently; therefore, **Patroni restricts setting their values to the** :ref:`dynamic configuration <dynamic_configuration>`.

-- **max_wal_senders**: 10
-- **max_replication_slots**: 10
-- **wal_keep_segments**: 8
-- **wal_keep_size**: 128MB
+- **max_wal_senders**: default value 10, minimal value 3
+- **max_replication_slots**: default value 10, minimal value 4
+- **wal_keep_segments**: default value 8, minimal value 1
+- **wal_keep_size**: default value 128MB, minimal value 16MB
 - **wal_log_hints**: on

 These parameters are validated to ensure they are sane, or meet a minimum value.
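
The validation mentioned in the last context line can be pictured as a simple clamp against the documented minimums. The sketch below is illustrative only: it is not Patroni's actual validator, and the exact clamping behaviour is an assumption based on the documentation text above.

```python
# Illustrative sketch only -- not Patroni's validator. It assumes the documented
# minimums above and that a value below the minimum is raised to the minimum.
MINIMUMS = {
    'max_connections': 25,
    'max_locks_per_transaction': 32,
    'max_worker_processes': 2,
    'max_prepared_transactions': 0,
    'max_wal_senders': 3,
    'max_replication_slots': 4,
    'wal_keep_segments': 1,
}


def sanitize(name: str, requested: int) -> int:
    """Return the requested value, but never less than the documented minimum."""
    return max(requested, MINIMUMS.get(name, 0))


assert sanitize('max_connections', 10) == 25   # below the minimum: raised to 25
assert sanitize('max_wal_senders', 10) == 10   # sane value: kept as requested
```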

docs/releases.rst

Lines changed: 60 additions & 0 deletions
@@ -3,6 +3,66 @@
 Release notes
 =============

+Version 4.0.6
+-------------
+
+Released 2025-06-06
+
+**Bugfixes**
+
+- Fix bug in failover from a leader with a higher priority (Alexander Kukushkin)
+
+  Make sure Patroni ignores the former leader with higher priority when it reports the same ``LSN`` as the current node.
+
+- Fix permissions for the ``postgresql.conf`` file created outside of ``PGDATA`` (Michael Banck)
+
+  Respect the system-wide umask value when creating the ``postgresql.conf`` file outside of the ``PGDATA`` directory.
+
+- Fix bug with switchover in ``synchronous_mode=quorum`` (Alexander Kukushkin)
+
+  Do not check quorum requirements when a candidate is specified.
+
+- Ignore stale Etcd nodes by comparing cluster term (Alexander Kukushkin)
+
+  Memorize the last known "raft_term" of the Etcd cluster, and when executing client requests, compare it with the "raft_term" reported by an Etcd node.
+
+- Update PostgreSQL configuration files on ``SIGHUP`` (Alexander Kukushkin)
+
+  Previously, Patroni was only replacing PostgreSQL configuration files if a change in global or local configuration was detected.
+
+- Properly handle ``Unavailable`` exception raised by ``etcd3`` (Alexander Kukushkin)
+
+  Patroni used to retry such requests on the same ``etcd3`` node, while switching to another node is a better strategy.
+
+- Improve ``etcd3`` lease handling (Alexander Kukushkin)
+
+  Make sure Patroni refreshes the ``etcd3`` lease at least once per HA loop.
+
+- Recheck annotations on 409 status code when attempting to acquire leader lock (Alexander Kukushkin)
+
+  Implement the same behavior as was done for the leader object read in Patroni version 4.0.3.
+
+- Consider ``replay_lsn`` when advancing slots (Polina Bungina)
+
+  Do not try to advance slots on replicas past the ``replay_lsn``. Additionally, advance the slot to the ``replay_lsn`` position if it is already past the ``confirmed_flush_lsn`` of this slot on the replica but the replica has still not replayed the actual ``LSN`` at which this slot is on the primary.
+
+- Make sure ``CHECKPOINT`` is executed after promote (Alexander Kukushkin)
+
+  It was possible that the checkpoint task wasn't reset on demote because ``CHECKPOINT`` wasn't yet finished. This resulted in using a stale ``result`` when the next promote is triggered.
+
+- Avoid running "offline" demotion concurrently (Alexander Kukushkin)
+
+  In case of a slow shutdown, it might happen that the next heartbeat loop hits the DCS error handling method again, resulting in the ``AsyncExecutor is busy, demoting from the main thread`` warning and starting offline demotion again.
+
+- Normalize the ``data_dir`` value before renaming the data directory on initialization failure (Waynerv)
+
+  Prevent a trailing slash in the ``data_dir`` parameter value from breaking the renaming process after an initialization failure.
+
+- Check that ``synchronous_standby_names`` contains the expected value (Alexander Kukushkin)
+
+  Previously, the mechanism implementing the state machine for non-quorum synchronous replication didn't check the actual value of ``synchronous_standby_names``, which resulted in a stale value of ``synchronous_standby_names`` being used when ``pg_stat_replication`` is a subset of ``synchronous_standby_names``.
+
 Version 4.0.5
 -------------

docs/rest_api.rst

Lines changed: 4 additions & 2 deletions
@@ -63,9 +63,11 @@ For all health check ``GET`` requests Patroni returns a JSON document with the s

 - ``GET /liveness``: returns HTTP status code **200** if Patroni heartbeat loop is properly running and **503** if the last run was more than ``ttl`` seconds ago on the primary or ``2*ttl`` on the replica. Could be used for ``livenessProbe``.

-- ``GET /readiness``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+- ``GET /readiness?lag=<max-lag>&mode=apply|write``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up, replicating and not too far behind the leader. The ``lag`` parameter sets how far a standby is allowed to be behind and defaults to ``maximum_lag_on_failover``. Lag can be specified in bytes or in human-readable values, e.g. 16kB, 64MB, 1GB. ``mode`` sets whether the WAL needs to be replayed (apply) or just received (write); the default is apply.

-Both, ``readiness`` and ``liveness`` endpoints are very light-weight and not executing any SQL. Probes should be configured in such a way that they start failing about time when the leader key is expiring. With the default value of ``ttl``, which is ``30s`` example probes would look like:
+When used as a Kubernetes ``readinessProbe``, it makes sure that freshly started pods only become ready once they have caught up with the leader. Combined with a PodDisruptionBudget, this protects against the leader being terminated too early during a rolling restart of nodes. It also makes sure that replicas that cannot keep up with replication do not serve read-only traffic. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+
+The ``liveness`` endpoint is very light-weight and does not execute any SQL. Probes should be configured in such a way that they start failing about the time when the leader key is expiring. With the default value of ``ttl``, which is ``30s``, example probes would look like:

 .. code-block:: yaml

patroni/api.py

Lines changed: 56 additions & 10 deletions
@@ -448,27 +448,73 @@ def do_GET_liveness(self) -> None:
         status_code = 200 if patroni.ha.is_paused() or patroni.next_run + liveness_threshold > time.time() else 503
         self._write_status_code_only(status_code)

+    def _readiness(self) -> Optional[str]:
+        """Check if readiness conditions are met.
+
+        :returns: None if node can be considered ready or diagnostic message if not."""
+
+        patroni = self.server.patroni
+        if patroni.ha.is_leader():
+            # We only become leader after bootstrap or once up as a standby, so we are definitely ready.
+            return
+
+        # When postgres is not running we are not ready.
+        if patroni.postgresql.state != PostgresqlState.RUNNING:
+            return 'PostgreSQL is not running'
+
+        postgres = self.get_postgresql_status(True)
+        latest_end_lsn = postgres.get('latest_end_lsn', 0)
+
+        if postgres.get('replication_state') != 'streaming':
+            return 'PostgreSQL replication state is not streaming'
+
+        cluster = patroni.dcs.cluster
+
+        if not cluster and not latest_end_lsn:
+            if patroni.ha.failsafe_is_active():
+                return
+            return 'DCS is not accessible'
+
+        leader_optime = max(cluster and cluster.status.last_lsn or 0, latest_end_lsn)
+
+        mode = 'write' if self.path_query.get('mode', [None])[0] == 'write' else 'apply'
+        location = 'received_location' if mode == 'write' else 'replayed_location'
+        lag = leader_optime - postgres.get('xlog', {}).get(location, 0)
+
+        max_replica_lag = parse_int(self.path_query.get('lag', [None])[0], 'B')
+        if max_replica_lag is None:
+            max_replica_lag = global_config.maximum_lag_on_failover
+
+        if lag > max_replica_lag:
+            return f'Replication {mode} lag {lag} exceeds maximum allowable {max_replica_lag}'
+
     def do_GET_readiness(self) -> None:
         """Handle a ``GET`` request to ``/readiness`` path.

+        * Query parameters:
+
+            * ``lag``: only accept replication lag up to ``lag``. Accepts either an :class:`int`, which
+              represents lag in bytes, or a :class:`str` representing lag in human-readable format (e.g.
+              ``10MB``).
+            * ``mode``: allowed values ``write``, ``apply``. Base replication lag off of received WAL or
+              replayed WAL. Defaults to ``apply``.
+
         Write a simple HTTP response which HTTP status can be:

         * ``200``:

-            * If this Patroni node holds the DCS leader lock; or
-            * If this PostgreSQL instance is up and running;
+            * If this Patroni node considers itself the leader; or
+            * If PostgreSQL is running, replicating and not lagging;

         * ``503``: if none of the previous conditions apply.

         """
-        patroni = self.server.patroni
-        if patroni.ha.is_leader():
-            status_code = 200
-        elif patroni.postgresql.state == PostgresqlState.RUNNING:
-            status_code = 200 if patroni.dcs.cluster else 503
-        else:
-            status_code = 503
-        self._write_status_code_only(status_code)
+        failure_reason = self._readiness()
+
+        if failure_reason:
+            logger.debug("Readiness check failure: %s", failure_reason)
+
+        self._write_status_code_only(200 if not failure_reason else 503)

     def do_GET_patroni(self) -> None:
         """Handle a ``GET`` request to ``/patroni`` path.

patroni/dcs/etcd.py

Lines changed: 34 additions & 22 deletions
@@ -101,40 +101,28 @@ def _do_resolve(host: str, port: int) -> List[_AddrInfo]:
         return []


-class AbstractEtcdClientWithFailover(abc.ABC, etcd.Client):
+class StaleEtcdNodeGuard(object):

-    ERROR_CLS: Type[Exception]
+    def __init__(self) -> None:
+        self._reset_cluster_raft_term()

-    def __init__(self, config: Dict[str, Any], dns_resolver: DnsCachingResolver, cache_ttl: int = 300) -> None:
+    def _reset_cluster_raft_term(self) -> None:
         self._cluster_id = None
         self._raft_term = 0
-        self._dns_resolver = dns_resolver
-        self.set_machines_cache_ttl(cache_ttl)
-        self._machines_cache_updated = 0
-        kwargs = {p: config.get(p) for p in ('host', 'port', 'protocol', 'use_proxies', 'version_prefix',
-                                             'username', 'password', 'cert', 'ca_cert') if config.get(p)}
-        super(AbstractEtcdClientWithFailover, self).__init__(read_timeout=config['retry_timeout'], **kwargs)
-        # For some reason python3-etcd on debian and ubuntu are not based on the latest version
-        # Workaround for the case when https://github.com/jplana/python-etcd/pull/196 is not applied
-        self.http.connection_pool_kw.pop('ssl_version', None)
-        self._config = config
-        self._load_machines_cache()
-        self._allow_reconnect = True
-        # allow passing retry argument to api_execute in params
-        self._comparison_conditions.add('retry')
-        self._read_options.add('retry')
-        self._del_conditions.add('retry')

     def _check_cluster_raft_term(self, cluster_id: Optional[str], value: Union[None, str, int]) -> None:
         """Check that observed Raft Term in Etcd cluster is increasing.

-        If we observe that the new value is smaller than the previously known one, it could be an
-        indicator that we connected to a stale node and should switch to some other node.
-        However, we need to reset the memorized value when we notice that Cluster ID changed.
+        :param cluster_id: last observed Etcd Cluster ID
+        :param raft_term: last observed Raft Term
+
+        :raises:
+            :exc:`StaleEtcdNode` if last observed *raft_term* is smaller than previously known *raft_term*.
         """
         if not (cluster_id and value):
             return

+        # We need to reset the memorized value when we notice that Cluster ID changed.
         if self._cluster_id and self._cluster_id != cluster_id:
             logger.warning('Etcd Cluster ID changed from %s to %s', self._cluster_id, cluster_id)
             self._raft_term = 0

@@ -151,6 +139,30 @@ def _check_cluster_raft_term(self, cluster_id: Optional[str], value: Union[None,
             raise StaleEtcdNode
         self._raft_term = raft_term

+
+class AbstractEtcdClientWithFailover(abc.ABC, etcd.Client, StaleEtcdNodeGuard):
+
+    ERROR_CLS: Type[Exception]
+
+    def __init__(self, config: Dict[str, Any], dns_resolver: DnsCachingResolver, cache_ttl: int = 300) -> None:
+        StaleEtcdNodeGuard.__init__(self)
+        self._dns_resolver = dns_resolver
+        self.set_machines_cache_ttl(cache_ttl)
+        self._machines_cache_updated = 0
+        kwargs = {p: config.get(p) for p in ('host', 'port', 'protocol', 'use_proxies', 'version_prefix',
+                                             'username', 'password', 'cert', 'ca_cert') if config.get(p)}
+        super(AbstractEtcdClientWithFailover, self).__init__(read_timeout=config['retry_timeout'], **kwargs)
+        # For some reason python3-etcd on debian and ubuntu are not based on the latest version
+        # Workaround for the case when https://github.com/jplana/python-etcd/pull/196 is not applied
+        self.http.connection_pool_kw.pop('ssl_version', None)
+        self._config = config
+        self._load_machines_cache()
+        self._allow_reconnect = True
+        # allow passing retry argument to api_execute in params
+        self._comparison_conditions.add('retry')
+        self._read_options.add('retry')
+        self._del_conditions.add('retry')
+
     def _calculate_timeouts(self, etcd_nodes: int, timeout: Optional[float] = None) -> Tuple[int, float, int]:
         """Calculate a request timeout and number of retries per single etcd node.
         In case if the timeout per node is too small (less than one second) we will reduce the number of nodes.
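
The refactoring above extracts the stale-node detection into a ``StaleEtcdNodeGuard`` mixin so the etcd3 ``KVCache`` can reuse it (see the next file). Below is a self-contained sketch of the same idea, simplified and independent of Patroni's classes, to make the behaviour easier to follow.

```python
# Simplified, standalone illustration of the raft-term check above; it is not
# the actual StaleEtcdNodeGuard class, just the same idea in miniature.
from typing import Optional, Union


class StaleNode(Exception):
    """Raised when a node reports an older Raft term than previously observed."""


class RaftTermTracker:
    def __init__(self) -> None:
        self.cluster_id: Optional[str] = None
        self.raft_term = 0

    def check(self, cluster_id: Optional[str], raft_term: Union[None, str, int]) -> None:
        if not (cluster_id and raft_term):
            return
        # A different Cluster ID means a different cluster: forget the old term.
        if self.cluster_id and self.cluster_id != cluster_id:
            self.raft_term = 0
        self.cluster_id = cluster_id
        term = int(raft_term)
        if term < self.raft_term:
            raise StaleNode  # this node is behind the cluster state we already saw
        self.raft_term = term


tracker = RaftTermTracker()
tracker.check('cluster-a', 7)      # remember term 7
tracker.check('cluster-a', 8)      # newer term: fine
try:
    tracker.check('cluster-a', 5)  # older term: stale node, switch to another one
except StaleNode:
    print('stale etcd node detected')
```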

patroni/dcs/etcd3.py

Lines changed: 18 additions & 4 deletions
@@ -24,7 +24,8 @@
 from ..utils import deep_compare, enable_keepalive, iter_response_objects, RetryFailedError, USER_AGENT
 from . import catch_return_false_exception, Cluster, ClusterConfig, \
     Failover, Leader, Member, Status, SyncState, TimelineHistory
-from .etcd import AbstractEtcd, AbstractEtcdClientWithFailover, catch_etcd_errors, DnsCachingResolver, Retry
+from .etcd import AbstractEtcd, AbstractEtcdClientWithFailover, catch_etcd_errors, \
+    DnsCachingResolver, Retry, StaleEtcdNode, StaleEtcdNodeGuard

 logger = logging.getLogger(__name__)


@@ -432,10 +433,11 @@ def watchprefix(self, key: str, start_revision: Optional[str] = None,
         return self.watchrange(key, prefix_range_end(key), start_revision, filters, read_timeout)


-class KVCache(Thread):
+class KVCache(StaleEtcdNodeGuard, Thread):

     def __init__(self, dcs: 'Etcd3', client: 'PatroniEtcd3Client') -> None:
-        super(KVCache, self).__init__()
+        Thread.__init__(self)
+        StaleEtcdNodeGuard.__init__(self)
         self.daemon = True
         self._dcs = dcs
         self._client = client

@@ -505,7 +507,10 @@ def _process_message(self, message: Dict[str, Any]) -> None:
         logger.debug('Received message: %s', message)
         if 'error' in message:
             raise _raise_for_data(message)
-        events: List[Dict[str, Any]] = message.get('result', {}).get('events', [])
+        result = message.get('result', EMPTY_DICT)
+        header = result.get('header', EMPTY_DICT)
+        self._check_cluster_raft_term(header.get('cluster_id'), header.get('raft_term'))
+        events: List[Dict[str, Any]] = result.get('events', [])
         for event in events:
             self._process_event(event)

@@ -539,8 +544,11 @@ def _do_watch(self, revision: str) -> None:

     def _build_cache(self) -> None:
         result = self._dcs.retry(self._client.prefix, self._dcs.cluster_prefix)
+        header = result.get('header', EMPTY_DICT)
         with self._object_cache_lock:
+            self._reset_cluster_raft_term()
             self._object_cache = {node['key']: node for node in result.get('kvs', [])}
+            self._check_cluster_raft_term(header.get('cluster_id'), header.get('raft_term'))
         with self.condition:
             self._is_ready = True
             self.condition.notify()

@@ -586,6 +594,12 @@ def kill_stream(self) -> None:

     def is_ready(self) -> bool:
         """Must be called only when holding the lock on `condition`"""
+        if self._is_ready:
+            try:
+                self._client._check_cluster_raft_term(self._cluster_id, self._raft_term)
+            except StaleEtcdNode:
+                self._is_ready = False
+                self.kill_stream()
         return self._is_ready