Commit 30915f3

Readiness API checks for replication on replicas (patroni#3250)
Readiness probes on Kubernetes serve several purposes. One is evaluating pod disruption budgets. The previous implementation considered a replica ready as soon as PostgreSQL was started. This could cause issues with asynchronous replication and rolling upgrades, where the primary would be shut down before a replica had even had time to start replicating. The second important use is determining which pods are included as endpoints in a service. In both cases the pod should only be considered ready when it is replicating and not too far behind.

This commit also changes readiness to consider replicas ready when failsafe is active. In that case lag is ignored.
1 parent 88b0010 commit 30915f3
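
The order of checks described in the commit message can be illustrated with a small standalone sketch. It is not Patroni's code: the function name `readiness_failure` and its flat arguments are hypothetical, and the DCS and failsafe handling from the actual change is left out for brevity.

```python
from typing import Optional


def readiness_failure(is_leader: bool, postgres_running: bool,
                      replication_state: Optional[str],
                      leader_lsn: int, replica_lsn: int,
                      max_lag: int) -> Optional[str]:
    """Return None when the node counts as ready, else a diagnostic message.

    Hypothetical simplification: all inputs are passed in directly instead
    of being read from Patroni, PostgreSQL or the DCS.
    """
    if is_leader:
        # A leader is always considered ready.
        return None
    if not postgres_running:
        return 'PostgreSQL is not running'
    if replication_state != 'streaming':
        return 'PostgreSQL replication state is not streaming'
    lag = leader_lsn - replica_lsn
    if lag > max_lag:
        return f'Replication lag {lag} exceeds maximum allowable {max_lag}'
    return None


def status_code(failure: Optional[str]) -> int:
    """Map the diagnostic message to the HTTP status the probe would see."""
    return 503 if failure else 200
```

With a leader position of 120, a received position of 115, a replayed position of 100 and an allowed lag of 10 (the same numbers the commit's test uses), the sketch reports ready when lag is measured against received WAL and not ready when measured against replayed WAL.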

3 files changed, +96 -15 lines changed

docs/rest_api.rst

Lines changed: 4 additions & 2 deletions
@@ -63,9 +63,11 @@ For all health check ``GET`` requests Patroni returns a JSON document with the s
 
 - ``GET /liveness``: returns HTTP status code **200** if Patroni heartbeat loop is properly running and **503** if the last run was more than ``ttl`` seconds ago on the primary or ``2*ttl`` on the replica. Could be used for ``livenessProbe``.
 
-- ``GET /readiness``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+- ``GET /readiness?lag=<max-lag>&mode=apply|write``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up, replicating and not too far behind the leader. The lag parameter sets how far a standby is allowed to be behind, it defaults to ``maximum_lag_on_failover``. Lag can be specified in bytes or in human readable values, for e.g. 16kB, 64MB, 1GB. Mode sets whether the WAL needs to be replayed (apply) or just received (write). The default is apply.
 
-Both, ``readiness`` and ``liveness`` endpoints are very light-weight and not executing any SQL. Probes should be configured in such a way that they start failing about time when the leader key is expiring. With the default value of ``ttl``, which is ``30s`` example probes would look like:
+When used as Kubernetes ``readinessProbe`` it will make sure freshly started pods only become ready when they have caught up to the leader. This combined with a PodDisruptionBudget will protect against leader being terminated too early during a rolling restart of nodes. It will also make sure that replicas that cannot keep up with replication do not service read-only traffic. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+
+The ``liveness`` endpoint is very light-weight and not executing any SQL. Probes should be configured in such a way that they start failing about time when the leader key is expiring. With the default value of ``ttl``, which is ``30s`` example probes would look like:
 
 .. code-block:: yaml
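
As an illustration of the documented query parameters, here is a minimal sketch of a client-side helper that builds the probe URL. The helper name `readiness_url` and the base URL `http://localhost:8008` are assumptions made for the example and are not part of this commit.

```python
from typing import Optional
from urllib.parse import urlencode


def readiness_url(base_url: str, lag: Optional[str] = None,
                  mode: Optional[str] = None) -> str:
    """Build a /readiness URL, adding only the query parameters that were given.

    Hypothetical helper: when a parameter is omitted, the server-side
    defaults described in the documentation apply.
    """
    if mode is not None and mode not in ('apply', 'write'):
        raise ValueError("mode must be 'apply' or 'write'")
    params = {}
    if lag is not None:
        params['lag'] = lag
    if mode is not None:
        params['mode'] = mode
    url = base_url.rstrip('/') + '/readiness'
    return f'{url}?{urlencode(params)}' if params else url


# The base URL below is an assumed local address, used only for illustration.
print(readiness_url('http://localhost:8008', lag='16MB', mode='write'))
```

A probe that requests this URL would treat status **200** as ready and **503** as not ready, as described in the documentation change above.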

patroni/api.py

Lines changed: 56 additions & 10 deletions
@@ -448,27 +448,73 @@ def do_GET_liveness(self) -> None:
         status_code = 200 if patroni.ha.is_paused() or patroni.next_run + liveness_threshold > time.time() else 503
         self._write_status_code_only(status_code)
 
+    def _readiness(self) -> Optional[str]:
+        """Check if readiness conditions are met.
+
+        :returns: None if node can be considered ready or diagnostic message if not."""
+
+        patroni = self.server.patroni
+        if patroni.ha.is_leader():
+            # We only become leader after bootstrap or once up as a standby, so we are definitely ready.
+            return
+
+        # When postgres is not running we are not ready.
+        if patroni.postgresql.state != PostgresqlState.RUNNING:
+            return 'PostgreSQL is not running'
+
+        postgres = self.get_postgresql_status(True)
+        latest_end_lsn = postgres.get('latest_end_lsn', 0)
+
+        if postgres.get('replication_state') != 'streaming':
+            return 'PostgreSQL replication state is not streaming'
+
+        cluster = patroni.dcs.cluster
+
+        if not cluster and not latest_end_lsn:
+            if patroni.ha.failsafe_is_active():
+                return
+            return 'DCS is not accessible'
+
+        leader_optime = max(cluster and cluster.status.last_lsn or 0, latest_end_lsn)
+
+        mode = 'write' if self.path_query.get('mode', [None])[0] == 'write' else 'apply'
+        location = 'received_location' if mode == 'write' else 'replayed_location'
+        lag = leader_optime - postgres.get('xlog', {}).get(location, 0)
+
+        max_replica_lag = parse_int(self.path_query.get('lag', [None])[0], 'B')
+        if max_replica_lag is None:
+            max_replica_lag = global_config.maximum_lag_on_failover
+
+        if lag > max_replica_lag:
+            return f'Replication {mode} lag {lag} exceeds maximum allowable {max_replica_lag}'
+
     def do_GET_readiness(self) -> None:
         """Handle a ``GET`` request to ``/readiness`` path.
 
+        * Query parameters:
+
+            * ``lag``: only accept replication lag up to ``lag``. Accepts either an :class:`int`, which
+              represents lag in bytes, or a :class:`str` representing lag in human-readable format (e.g.
+              ``10MB``).
+            * ``mode``: allowed values ``write``, ``apply``. Base replication lag off of received WAL or
+              replayed WAL. Defaults to ``apply``.
+
         Write a simple HTTP response which HTTP status can be:
 
         * ``200``:
 
-            * If this Patroni node holds the DCS leader lock; or
-            * If this PostgreSQL instance is up and running;
+            * If this Patroni node considers itself the leader; or
+            * If PostgreSQL is running, replicating and not lagging;
 
         * ``503``: if none of the previous conditions apply.
 
         """
-        patroni = self.server.patroni
-        if patroni.ha.is_leader():
-            status_code = 200
-        elif patroni.postgresql.state == PostgresqlState.RUNNING:
-            status_code = 200 if patroni.dcs.cluster else 503
-        else:
-            status_code = 503
-        self._write_status_code_only(status_code)
+        failure_reason = self._readiness()
+
+        if failure_reason:
+            logger.debug("Readiness check failure: %s", failure_reason)
+
+        self._write_status_code_only(200 if not failure_reason else 503)
 
     def do_GET_patroni(self) -> None:
         """Handle a ``GET`` request to ``/patroni`` path.

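The way `_readiness` reads `mode` and `lag` from the query string can be sketched in isolation. This is a rough approximation under stated assumptions: `path_query` is assumed to behave like the dictionary of lists returned by `urllib.parse.parse_qs`, and `parse_size` is a hypothetical, simplified stand-in for Patroni's `parse_int` that assumes 1024-based units.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Assumed unit table (1024-based); a simplification, not Patroni's own table.
_UNITS = {'B': 1, 'kB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3, 'TB': 1024 ** 4}


def parse_size(value: Optional[str]) -> Optional[int]:
    """Hypothetical stand-in for parse_int(value, 'B'): bytes, or None if unparsable."""
    if value is None:
        return None
    match = re.fullmatch(r'\s*(\d+)\s*([A-Za-z]*)\s*', value)
    if not match:
        return None
    number, unit = match.groups()
    if not unit:
        return int(number)
    multiplier = _UNITS.get(unit)
    return int(number) * multiplier if multiplier else None


def readiness_params(path: str, default_lag: int) -> Tuple[str, int]:
    """Extract (mode, max_lag_bytes) from a request path such as '/readiness?lag=10'."""
    query = parse_qs(urlparse(path).query)
    # Any value other than 'write' falls back to 'apply', as in the diff above.
    mode = 'write' if query.get('mode', [None])[0] == 'write' else 'apply'
    lag = parse_size(query.get('lag', [None])[0])
    return mode, default_lag if lag is None else lag
```

The `[None]` default keeps the `[0]` index safe when a parameter is absent, which is why a request without a query string ends up with mode `apply` and the configured default lag.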
tests/test_api.py

Lines changed: 36 additions & 3 deletions
@@ -350,11 +350,44 @@ def test_do_GET_liveness(self, mock_dcs):
         self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /liveness HTTP/1.0'))
 
     def test_do_GET_readiness(self):
-        self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+        MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
         with patch.object(MockHa, 'is_leader', Mock(return_value=True)):
-            self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
         with patch.object(MockPostgresql, 'state', PropertyMock(return_value=PostgresqlState.STOPPED)):
-            self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+
+        # Replica not streaming results in error
+        with patch.object(MockPostgresql, 'replication_state_from_parameters', Mock(return_value=None)), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+            response_mock.assert_called_with(503)
+
+        def patch_query(latest_lsn, received_location, replayed_location):
+            return patch.object(MockConnection, 'query', Mock(return_value=[
+                (postmaster_start_time, 0, '', replayed_location, '', False, postmaster_start_time, latest_lsn,
+                 None, None, received_location, '[]')]))
+
+        # Replica lagging on replay
+        with patch_query(latest_lsn=120, received_location=115, replayed_location=100), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            MockRestApiServer(RestApiHandler, 'GET /readiness?lag=10&mode=write HTTP/1.0')
+            response_mock.assert_called_with(200)
+            response_mock.reset_mock()
+            MockRestApiServer(RestApiHandler, 'GET /readiness?lag=10 HTTP/1.0')
+            response_mock.assert_called_with(503)
+
+        # DCS not available
+        MockPatroni.dcs.cluster = None
+        with patch_query(None, None, None), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            # Failsafe active
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+            response_mock.assert_called_with(200)
+            response_mock.reset_mock()
+            # Failsafe disabled:
+            with patch.object(MockHa, 'failsafe_is_active', Mock(return_value=False)):
+                MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+                response_mock.assert_called_with(503)
 
     @patch.object(MockPostgresql, 'state', PropertyMock(return_value=PostgresqlState.STOPPED))
     def test_do_GET_patroni(self):
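
The new tests assert on the status code by patching the method that writes the response. A minimal, self-contained sketch of that pattern follows; the `Handler` class and `observed_status` helper are hypothetical stand-ins and do not use Patroni's `RestApiHandler` or its mocks.

```python
from unittest.mock import patch


class Handler:
    """Hypothetical stand-in for a request handler; not Patroni's RestApiHandler."""

    def __init__(self, lag: int, max_lag: int) -> None:
        self.lag = lag
        self.max_lag = max_lag

    def _write_status_code_only(self, status_code: int) -> None:
        # The real method would write to a socket, so tests replace it with a mock.
        raise RuntimeError('would write to a socket')

    def do_GET_readiness(self) -> None:
        self._write_status_code_only(200 if self.lag <= self.max_lag else 503)


def observed_status(lag: int, max_lag: int) -> int:
    """Run the handler with the writer patched out and return the recorded status."""
    with patch.object(Handler, '_write_status_code_only') as response_mock:
        Handler(lag, max_lag).do_GET_readiness()
        # The mock records the call, so nothing is written and no error is raised.
        response_mock.assert_called_once()
        return response_mock.call_args[0][0]
```

Patching at the class level means every instance created inside the `with` block uses the mock, which is why the tests above can construct the server object and inspect `response_mock` afterwards.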
