diff --git a/content/operate/rs/monitoring/db-availability.md b/content/operate/rs/monitoring/db-availability.md
index db3faea66..7d98bca9c 100644
--- a/content/operate/rs/monitoring/db-availability.md
+++ b/content/operate/rs/monitoring/db-availability.md
@@ -45,6 +45,50 @@ Returns HTTP status code 200 OK if all primary (master) shards are reachable fro
 
 If the local database endpoint is unavailable, returns an error status code and a JSON object that contains [`error_code` and `description` fields]({{}}).
 
+## Use lag-aware availability checks for disaster recovery {#lag-aware}
+
+The database availability API supports lag-aware availability checks that consider replication lag tolerance. You can reduce the risk of data inconsistencies during disaster recovery by incorporating lag-aware availability checks into your disaster recovery solution and ensuring failover-failback flows only occur when databases are accessible and sufficiently synchronized.
+
+### Change default availability lag tolerance threshold
+
+The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold.
+
+To change the default threshold for the entire cluster, set `availability_lag_tolerance_ms` with an [update cluster]({{}}) request:
+
+```sh
+PUT /v1/cluster
+{ "availability_lag_tolerance_ms": 100 }
+```
+
+### Lag-aware database availability checks
+
+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs//availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs//availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+### Lag-aware endpoint availability checks
+
+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs//endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs//endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+
 ## Availability by database status
 
 The following table shows the relationship between a database's status and availability. For more details about the database status values, see [BDB status field]({{}}).
diff --git a/content/operate/rs/references/rest-api/objects/cluster/_index.md b/content/operate/rs/references/rest-api/objects/cluster/_index.md
index 9b2b85ed7..453db7b75 100644
--- a/content/operate/rs/references/rest-api/objects/cluster/_index.md
+++ b/content/operate/rs/references/rest-api/objects/cluster/_index.md
@@ -16,6 +16,7 @@ An API object that represents the cluster.
 | Name | Type/Value | Description |
 |------|------------|-------------|
 | alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
+| availability_lag_tolerance_ms | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during [lag-aware database availability checks]({{}}). |
 | bigstore_driver | 'speedb'<br />'rocksdb' | Storage engine for [Auto Tiering]({{}}) |
 | cluster_ssh_public_key | string | Cluster's autogenerated SSH public key |
 | cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |
diff --git a/content/operate/rs/references/rest-api/requests/bdbs/availability.md b/content/operate/rs/references/rest-api/requests/bdbs/availability.md
index 9b39f8436..14e2ca569 100644
--- a/content/operate/rs/references/rest-api/requests/bdbs/availability.md
+++ b/content/operate/rs/references/rest-api/requests/bdbs/availability.md
@@ -32,12 +32,26 @@ Verifies the local database endpoint is available. This request does not redirec
 
 ### Request {#get-endpoint-request}
 
-#### Example HTTP request
+#### Example HTTP requests
+
+To check database endpoint availability without any additional checks:
 
 ```sh
 GET /v1/local/bdbs/1/endpoint/availability
 ```
 
+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers
 
 | Key | Value | Description |
@@ -51,6 +65,13 @@ GET /v1/local/bdbs/1/endpoint/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |
 
+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-endpoint-response}
 
 Returns the status code `200 OK` if the local database endpoint is available.
@@ -74,6 +95,8 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database endpoint is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database endpoint is unavailable. |
 
 
@@ -97,12 +120,27 @@ Gets the availability status of a database.
 
 ### Request {#get-db-request}
 
-#### Example HTTP request
+#### Example HTTP requests
+
+
+To check database availability without any additional checks:
 
 ```sh
 GET /v1/bdbs/1/availability
 ```
 
+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers
 
 | Key | Value | Description |
@@ -116,6 +154,13 @@ GET /v1/bdbs/1/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |
 
+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-db-response}
 
 Returns the status code `200 OK` if the database is available.
@@ -139,4 +184,6 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database is unavailable or doesn't have quorum. |
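
Illustration only, not part of this change: the request paths and query parameters documented in this diff could be assembled by a client roughly as follows. This is a minimal sketch that makes no network calls. The helper names `availability_path` and `is_available` are hypothetical, and rejecting a tolerance override without `extend_check=lag` is an assumption of the sketch, not documented API behavior.

```python
from typing import Optional
from urllib.parse import urlencode


def availability_path(uid: int, lag_aware: bool = False,
                      tolerance_ms: Optional[int] = None,
                      local_endpoint: bool = False) -> str:
    """Build the path for a database or local endpoint availability check.

    Hypothetical helper: only the paths and query parameter names come
    from the documentation in this diff.
    """
    if local_endpoint:
        path = f"/v1/local/bdbs/{uid}/endpoint/availability"
    else:
        path = f"/v1/bdbs/{uid}/availability"

    params = {}
    if lag_aware:
        params["extend_check"] = "lag"
        if tolerance_ms is not None:
            # Per-request override of the cluster's default threshold.
            params["availability_lag_tolerance_ms"] = tolerance_ms
    elif tolerance_ms is not None:
        # Assumption of this sketch: the override is only meaningful
        # together with extend_check=lag, so reject it otherwise.
        raise ValueError("tolerance_ms only applies when lag_aware=True")

    return f"{path}?{urlencode(params)}" if params else path


def is_available(status_code: int) -> bool:
    """Treat only 200 OK as available; 400, 404, and 503 are not."""
    return status_code == 200


print(availability_path(1))
print(availability_path(1, lag_aware=True))
print(availability_path(1, lag_aware=True, tolerance_ms=100, local_endpoint=True))
```

A failover script would pass the resulting path to whatever HTTP client it already uses and gate the failover or failback step on `is_available(response_status)`.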