RS: Lag-aware DB availability API (#2058)

rrelledge · web-flow · commit 4a13d94b5bf3 · 2025-10-21T13:30:01.000-05:00
* DOC-5567 RS: Added lag-awareness to DB availability REST API references
* DOC-5567 RS: Added lag-aware checks to DB availability doc
* RS: Updated status code links in cluster actions REST API reference
* RS: Added new 406 status code and missing 404 code to initiate cluster-wide action REST API request reference
* DOC-4699 Added version change for change_master cluster action behavior to RS Gilboa release notes
* DOC-5567 Feedback update to remove override option from the adjust availability lag tolerance threshold section and change the section to focus on changing the default only
diff --git a/content/operate/rs/monitoring/db-availability.md b/content/operate/rs/monitoring/db-availability.md
@@ -45,6 +45,50 @@ Returns HTTP status code 200 OK if all primary (master) shards are reachable fro
 
 If the local database endpoint is unavailable, returns an error status code and a JSON object that contains [`error_code` and `description` fields]({{<relref "/operate/rs/references/rest-api/requests/bdbs/availability#get-endpoint-error-codes">}}).
 
+## Use lag-aware availability checks for disaster recovery {#lag-aware}
+
+The database availability API supports lag-aware availability checks, which also verify that replication lag is within a tolerance threshold. To reduce the risk of data inconsistencies during disaster recovery, incorporate lag-aware availability checks into your disaster recovery solution so that failover and failback flows occur only when databases are accessible and sufficiently synchronized.
+
+### Change default availability lag tolerance threshold
+
+The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust this threshold.
+
+To change the default threshold for the entire cluster, set `availability_lag_tolerance_ms` with an [update cluster]({{<relref "/operate/rs/references/rest-api/requests/cluster#put-cluster">}}) request:
+
+```sh
+PUT /v1/cluster
+{ "availability_lag_tolerance_ms": 100 }
+```
+
+### Lag-aware database availability checks
+
+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/<database_id>/availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/<database_id>/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+### Lag-aware endpoint availability checks
+
+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+
 ## Availability by database status
 
 The following table shows the relationship between a database's status and availability. For more details about the database status values, see [BDB status field]({{<relref "/operate/rs/references/rest-api/objects/bdb/status">}}).
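The four lag-aware request forms added to `db-availability.md` differ only in the path prefix and the optional `availability_lag_tolerance_ms` override. A minimal sketch of how a script might assemble them is shown here. The function name `availability_path` and its argument order are hypothetical and not part of the API; the host, port, and credentials in the commented `curl` call are placeholders.

```sh
# Hypothetical helper (not part of the REST API): prints the request path
# for a lag-aware availability check.
#   $1 = database ID
#   $2 = "endpoint" for the local endpoint check; any other value selects the database check
#   $3 = optional lag tolerance override in milliseconds
availability_path() {
    db_id="$1"
    scope="${2:-database}"
    tolerance_ms="${3:-}"

    if [ "$scope" = "endpoint" ]; then
        path="/v1/local/bdbs/${db_id}/endpoint/availability"
    else
        path="/v1/bdbs/${db_id}/availability"
    fi

    # Enable the lag-aware check.
    path="${path}?extend_check=lag"

    # Append the override only when one was given; otherwise the cluster default applies.
    if [ -n "$tolerance_ms" ]; then
        path="${path}&availability_lag_tolerance_ms=${tolerance_ms}"
    fi

    printf '%s\n' "$path"
}

# Example call (placeholder host, port, and credentials; adjust for your cluster):
# curl -s -o /dev/null -w '%{http_code}\n' -u "$API_USER:$API_PASS" \
#   "https://cluster.example.com:9443$(availability_path 1 database)"
```

Keeping the override optional mirrors the docs: omitting it falls back to the cluster-wide `availability_lag_tolerance_ms` value.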
diff --git a/content/operate/rs/references/rest-api/objects/cluster/_index.md b/content/operate/rs/references/rest-api/objects/cluster/_index.md
@@ -16,6 +16,7 @@ An API object that represents the cluster.
 | Name | Type/Value | Description |
 |------|------------|-------------|
 | alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
+| <span class="break-all">availability_lag_tolerance_ms</span> | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during [lag-aware database availability checks]({{<relref "/operate/rs/monitoring/db-availability#lag-aware">}}). |
 | bigstore_driver | "speedb"<br />"rocksdb" | Storage engine for [Auto Tiering]({{<relref "/operate/rs/databases/auto-tiering">}}) |
 | <span class="break-all">cluster_ssh_public_key</span> | string | Cluster's autogenerated SSH public key |
 | cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |
diff --git a/content/operate/rs/references/rest-api/requests/bdbs/availability.md b/content/operate/rs/references/rest-api/requests/bdbs/availability.md
@@ -32,12 +32,26 @@ Verifies the local database endpoint is available. This request does not redirec
 
 ### Request {#get-endpoint-request}
 
-#### Example HTTP request
+#### Example HTTP requests
+
+To check database endpoint availability without any additional checks:
 
 ```sh
 GET /v1/local/bdbs/1/endpoint/availability
 ```
 
+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers
 
 | Key | Value | Description |
@@ -51,6 +65,13 @@ GET /v1/local/bdbs/1/endpoint/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |
 
+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-endpoint-response}
 
 Returns the status code `200 OK` if the local database endpoint is available.
@@ -74,6 +95,8 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database endpoint is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database endpoint is unavailable. |
 
 
@@ -97,12 +120,27 @@ Gets the availability status of a database.
 
 ### Request {#get-db-request}
 
-#### Example HTTP request
+#### Example HTTP requests
+
+
+To check database availability without any additional checks:
 
 ```sh
 GET /v1/bdbs/1/availability
 ```
 
+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers
 
 | Key | Value | Description |
@@ -116,6 +154,13 @@ GET /v1/bdbs/1/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |
 
+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-db-response}
 
 Returns the status code `200 OK` if the database is available.
@@ -139,4 +184,6 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database is unavailable or doesn't have quorum. |
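The status codes in the tables above could drive a failover or failback gate along the following lines. This is a sketch only: both function names are hypothetical, and the diff does not say which status code a failed lag check returns, so the gate assumes that any response other than `200 OK` means the flow should not proceed.

```sh
# Hypothetical helper: maps the HTTP status code of an availability request
# to a short description, based on the status codes in the reference tables.
describe_availability_status() {
    case "$1" in
        200) echo "available" ;;
        400) echo "invalid schema" ;;
        404) echo "database not found" ;;
        503) echo "unavailable" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Hypothetical gate for a failover or failback script.
# Assumption: only 200 OK is treated as safe; every other code blocks the flow.
is_safe_to_proceed() {
    [ "$1" = "200" ]
}
```

A script would pass in the code captured from the request, for example `is_safe_to_proceed "$status" || exit 1`, where `status` holds the HTTP status code of the availability check.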