Readiness API checks for replication on replicas (patroni#3250)

ants · web-flow · commit 30915f333095 · 2025-05-30T17:32:58.000+02:00
Readiness probes on Kubernetes serve several purposes. One is
evaluating pod disruption budgets. The previous implementation
considered a replica ready as soon as PostgreSQL had started. With
async replication and rolling upgrades this could cause problems, because
the primary could be shut down before a replica had even had time to
start replicating. A second important use is determining which pods are
included as endpoints in a service. In both cases we want the pod to be
considered ready only when it is replicating and not too far behind.

This change also makes readiness treat replicas as ready when failsafe
is active. In that case lag is ignored.
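
The order of checks described above can be sketched as a standalone function. This is a minimal illustration, not Patroni's code: the name `readiness_failure` and all of its parameters are hypothetical stand-ins for values the real handler reads from Patroni and PostgreSQL state, and an unreachable DCS is modelled here simply as `dcs_last_lsn=None`.

```python
from typing import Optional


def readiness_failure(*, is_leader: bool, postgres_running: bool,
                      replication_state: Optional[str],
                      dcs_last_lsn: Optional[int], latest_end_lsn: int,
                      received_lsn: int, replayed_lsn: int,
                      max_lag: int, mode: str = 'apply',
                      failsafe_active: bool = False) -> Optional[str]:
    """Return None when the node is ready, otherwise a diagnostic message.

    All parameters are hypothetical inputs chosen for this sketch.
    """
    # A leader is always ready.
    if is_leader:
        return None

    # A replica must be running and streaming before lag is even considered.
    if not postgres_running:
        return 'PostgreSQL is not running'
    if replication_state != 'streaming':
        return 'PostgreSQL replication state is not streaming'

    # Without the DCS and without a leader position reported over the
    # replication connection, there is nothing to compare against.
    # Failsafe mode then decides, and lag is ignored.
    if dcs_last_lsn is None and not latest_end_lsn:
        return None if failsafe_active else 'DCS is not accessible'

    # Use the freshest known leader position from either source.
    leader_optime = max(dcs_last_lsn or 0, latest_end_lsn)

    # 'write' measures received WAL, anything else falls back to 'apply'.
    mode = 'write' if mode == 'write' else 'apply'
    replica_lsn = received_lsn if mode == 'write' else replayed_lsn
    lag = leader_optime - replica_lsn

    if lag > max_lag:
        return f'Replication {mode} lag {lag} exceeds maximum allowable {max_lag}'
    return None
```

With the values used in the tests of this change (leader at 120, received 115, replayed 100, maximum lag 10), the sketch reports ready in `write` mode and not ready in `apply` mode, since replay is 20 bytes behind.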
diff --git a/docs/rest_api.rst b/docs/rest_api.rst
@@ -63,9 +63,11 @@ For all health check ``GET`` requests Patroni returns a JSON document with the s
 
 - ``GET /liveness``: returns HTTP status code **200** if Patroni heartbeat loop is properly running and **503** if the last run was more than ``ttl`` seconds ago on the primary or ``2*ttl`` on the replica. Could be used for ``livenessProbe``.
 
-- ``GET /readiness``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+- ``GET /readiness?lag=<max-lag>&mode=apply|write``: returns HTTP status code **200** when the Patroni node is running as the leader or when PostgreSQL is up, replicating and not too far behind the leader. The ``lag`` parameter sets how far a standby is allowed to be behind; it defaults to ``maximum_lag_on_failover``. Lag can be specified in bytes or in human-readable values, e.g. 16kB, 64MB, 1GB. The ``mode`` parameter sets whether the WAL needs to be replayed (``apply``) or just received (``write``). The default is ``apply``.
 
-Both, ``readiness`` and ``liveness`` endpoints are very light-weight and not executing any SQL. Probes should be configured in such a way that they start failing about time when the leader key is expiring. With the default value of ``ttl``, which is ``30s`` example probes would look like:
+  When used as a Kubernetes ``readinessProbe``, it makes sure freshly started pods only become ready once they have caught up with the leader. Combined with a PodDisruptionBudget, this protects against the leader being terminated too early during a rolling restart of nodes. It also makes sure that replicas that cannot keep up with replication do not serve read-only traffic. The endpoint could be used for ``readinessProbe`` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
+
+The ``liveness`` endpoint is very lightweight and does not execute any SQL. Probes should be configured so that they start failing at about the time the leader key expires. With the default ``ttl`` of ``30s``, example probes would look like:
 
 .. code-block:: yaml
 
diff --git a/patroni/api.py b/patroni/api.py
@@ -448,27 +448,73 @@ def do_GET_liveness(self) -> None:
         status_code = 200 if patroni.ha.is_paused() or patroni.next_run + liveness_threshold > time.time() else 503
         self._write_status_code_only(status_code)
 
+    def _readiness(self) -> Optional[str]:
+        """Check if readiness conditions are met.
+
+        :returns: None if node can be considered ready or diagnostic message if not."""
+
+        patroni = self.server.patroni
+        if patroni.ha.is_leader():
+            # We only become leader after bootstrap or once up as a standby, so we are definitely ready.
+            return
+
+        # When postgres is not running we are not ready.
+        if patroni.postgresql.state != PostgresqlState.RUNNING:
+            return 'PostgreSQL is not running'
+
+        postgres = self.get_postgresql_status(True)
+        latest_end_lsn = postgres.get('latest_end_lsn', 0)
+
+        if postgres.get('replication_state') != 'streaming':
+            return 'PostgreSQL replication state is not streaming'
+
+        cluster = patroni.dcs.cluster
+
+        if not cluster and not latest_end_lsn:
+            if patroni.ha.failsafe_is_active():
+                return
+            return 'DCS is not accessible'
+
+        leader_optime = max(cluster and cluster.status.last_lsn or 0, latest_end_lsn)
+
+        mode = 'write' if self.path_query.get('mode', [None])[0] == 'write' else 'apply'
+        location = 'received_location' if mode == 'write' else 'replayed_location'
+        lag = leader_optime - postgres.get('xlog', {}).get(location, 0)
+
+        max_replica_lag = parse_int(self.path_query.get('lag', [None])[0], 'B')
+        if max_replica_lag is None:
+            max_replica_lag = global_config.maximum_lag_on_failover
+
+        if lag > max_replica_lag:
+            return f'Replication {mode} lag {lag} exceeds maximum allowable {max_replica_lag}'
+
     def do_GET_readiness(self) -> None:
         """Handle a ``GET`` request to ``/readiness`` path.
 
+            * Query parameters:
+
+                * ``lag``: only accept replication lag up to ``lag``. Accepts either an :class:`int`, which
+                    represents lag in bytes, or a :class:`str` representing lag in human-readable format (e.g.
+                    ``10MB``).
+                * ``mode``: allowed values ``write``, ``apply``. Base replication lag off of received WAL or
+                    replayed WAL. Defaults to ``apply``.
+
         Write a simple HTTP response which HTTP status can be:
 
             * ``200``:
 
-                * If this Patroni node holds the DCS leader lock; or
-                * If this PostgreSQL instance is up and running;
+                * If this Patroni node considers itself the leader; or
+                * If PostgreSQL is running, replicating and not lagging;
 
             * ``503``: if none of the previous conditions apply.
 
         """
-        patroni = self.server.patroni
-        if patroni.ha.is_leader():
-            status_code = 200
-        elif patroni.postgresql.state == PostgresqlState.RUNNING:
-            status_code = 200 if patroni.dcs.cluster else 503
-        else:
-            status_code = 503
-        self._write_status_code_only(status_code)
+        failure_reason = self._readiness()
+
+        if failure_reason:
+            logger.debug("Readiness check failure: %s", failure_reason)
+
+        self._write_status_code_only(200 if not failure_reason else 503)
 
     def do_GET_patroni(self) -> None:
         """Handle a ``GET`` request to ``/patroni`` path.
diff --git a/tests/test_api.py b/tests/test_api.py
@@ -350,11 +350,44 @@ def test_do_GET_liveness(self, mock_dcs):
         self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /liveness HTTP/1.0'))
 
     def test_do_GET_readiness(self):
-        self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+        MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
         with patch.object(MockHa, 'is_leader', Mock(return_value=True)):
-            self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
         with patch.object(MockPostgresql, 'state', PropertyMock(return_value=PostgresqlState.STOPPED)):
-            self.assertIsNotNone(MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0'))
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+
+        # Replica not streaming results in error
+        with patch.object(MockPostgresql, 'replication_state_from_parameters', Mock(return_value=None)), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+            response_mock.assert_called_with(503)
+
+        def patch_query(latest_lsn, received_location, replayed_location):
+            return patch.object(MockConnection, 'query', Mock(return_value=[
+                (postmaster_start_time, 0, '', replayed_location, '', False, postmaster_start_time, latest_lsn,
+                 None, None, received_location, '[]')]))
+
+        # Replica lagging on replay
+        with patch_query(latest_lsn=120, received_location=115, replayed_location=100), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            MockRestApiServer(RestApiHandler, 'GET /readiness?lag=10&mode=write HTTP/1.0')
+            response_mock.assert_called_with(200)
+            response_mock.reset_mock()
+            MockRestApiServer(RestApiHandler, 'GET /readiness?lag=10 HTTP/1.0')
+            response_mock.assert_called_with(503)
+
+        # DCS not available
+        MockPatroni.dcs.cluster = None
+        with patch_query(None, None, None), \
+                patch.object(RestApiHandler, '_write_status_code_only') as response_mock:
+            # Failsafe active
+            MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+            response_mock.assert_called_with(200)
+            response_mock.reset_mock()
+            # Failsafe disabled:
+            with patch.object(MockHa, 'failsafe_is_active', Mock(return_value=False)):
+                MockRestApiServer(RestApiHandler, 'GET /readiness HTTP/1.0')
+                response_mock.assert_called_with(503)
 
     @patch.object(MockPostgresql, 'state', PropertyMock(return_value=PostgresqlState.STOPPED))
     def test_do_GET_patroni(self):