From 7b1751dc2629daf138ffd30f7852a6c1e4be3ea8 Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 28 Aug 2025 11:43:47 -0500
Subject: [PATCH 1/6] DOC-5567 RS: Added lag-awareness to DB availability REST API references

---
 .../rest-api/objects/cluster/_index.md | 1 +
 .../rest-api/requests/bdbs/availability.md | 51 ++++++++++++++++++-
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/content/operate/rs/references/rest-api/objects/cluster/_index.md b/content/operate/rs/references/rest-api/objects/cluster/_index.md
index 38f488c667..70f208c06c 100644
--- a/content/operate/rs/references/rest-api/objects/cluster/_index.md
+++ b/content/operate/rs/references/rest-api/objects/cluster/_index.md
@@ -16,6 +16,7 @@ An API object that represents the cluster.
 | Name | Type/Value | Description |
 |------|------------|-------------|
 | alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
+| availability_lag_tolerance_ms | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during lag-aware [database availability checks]({{}}). |
 | bigstore_driver | 'speedb'<br />'rocksdb' | Storage engine for [Auto Tiering]({{}}) |
 | cluster_ssh_public_key | string | Cluster's autogenerated SSH public key |
 | cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |
diff --git a/content/operate/rs/references/rest-api/requests/bdbs/availability.md b/content/operate/rs/references/rest-api/requests/bdbs/availability.md
index 9b39f84360..14e2ca5696 100644
--- a/content/operate/rs/references/rest-api/requests/bdbs/availability.md
+++ b/content/operate/rs/references/rest-api/requests/bdbs/availability.md
@@ -32,12 +32,26 @@ Verifies the local database endpoint is available. This request does not redirec

 ### Request {#get-endpoint-request}

-#### Example HTTP request
+#### Example HTTP requests
+
+To check database endpoint availability without any additional checks:

 ```sh
 GET /v1/local/bdbs/1/endpoint/availability
 ```

+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers

 | Key | Value | Description |
@@ -51,6 +65,13 @@ GET /v1/local/bdbs/1/endpoint/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |

+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-endpoint-response}

 Returns the status code `200 OK` if the local database endpoint is available.
@@ -74,6 +95,8 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database endpoint is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database endpoint is unavailable. |


@@ -97,12 +120,27 @@ Gets the availability status of a database.

 ### Request {#get-db-request}

-#### Example HTTP request
+#### Example HTTP requests
+
+
+To check database availability without any additional checks:

 ```sh
 GET /v1/bdbs/1/availability
 ```

+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs/1/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
 #### Headers

 | Key | Value | Description |
@@ -116,6 +154,13 @@ GET /v1/bdbs/1/availability
 |-------|------|-------------|
 | uid | integer | The unique ID of the database. |

+#### Query parameters
+
+| Field | Type | Description |
+|-------|------|-------------|
+| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
+| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
+
 ### Response {#get-db-response}

 Returns the status code `200 OK` if the database is available.
@@ -139,4 +184,6 @@ The following are possible `error_code` values:
 | Code | Description |
 |------|-------------|
 | [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database is available. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
 | [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database is unavailable or doesn't have quorum. |

From 4b7e44f6088343311e21fb4aab2fa827e8574bd0 Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 28 Aug 2025 12:29:41 -0500
Subject: [PATCH 2/6] DOC-5567 RS: Added lag-aware checks to DB availability doc

---
 .../operate/rs/monitoring/db-availability.md | 50 +++++++++++++++++++
 .../rest-api/objects/cluster/_index.md | 2 +-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/content/operate/rs/monitoring/db-availability.md b/content/operate/rs/monitoring/db-availability.md
index db3faea66e..f8627d1102 100644
--- a/content/operate/rs/monitoring/db-availability.md
+++ b/content/operate/rs/monitoring/db-availability.md
@@ -45,6 +45,56 @@ Returns HTTP status code 200 OK if all primary (master) shards are reachable fro

 If the local database endpoint is unavailable, returns an error status code and a JSON object that contains [`error_code` and `description` fields]({{}}).
+## Use lag-aware availability checks for disaster recovery {#lag-aware}
+
+The database availability API supports lag-aware availability checks that consider replication lag tolerance. You can reduce the risk of data inconsistencies during disaster recovery by incorporating lag-aware availability checks into your disaster recovery solution and ensuring failover-failback flows only occur when databases are accessible and sufficiently synchronized.
+
+### Adjust availability lag tolerance threshold
+
+The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold using one of the following methods:
+
+- Change the default threshold for the entire cluster by setting `availability_lag_tolerance_ms` with an [update cluster]({{}}) request.
+
+  ```sh
+  PUT /v1/cluster
+  { "availability_lag_tolerance_ms": 100 }
+  ```
+
+- Override the default threshold by adding the `availability_lag_tolerance_ms` query parameter to specific lag-aware [availability checks]({{}}).
+
+  ```sh
+  GET /v1/bdbs//availability?extend_check=lag&availability_lag_tolerance_ms=100
+  ```
+
+### Lag-aware database availability checks
+
+To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs//availability?extend_check=lag
+```
+
+To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/bdbs//availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+### Lag-aware endpoint availability checks
+
+To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs//endpoint/availability?extend_check=lag
+```
+
+To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
+
+```sh
+GET /v1/local/bdbs//endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
+```
+
+
 ## Availability by database status

 The following table shows the relationship between a database's status and availability. For more details about the database status values, see [BDB status field]({{}}).
diff --git a/content/operate/rs/references/rest-api/objects/cluster/_index.md b/content/operate/rs/references/rest-api/objects/cluster/_index.md
index 70f208c06c..ba38826368 100644
--- a/content/operate/rs/references/rest-api/objects/cluster/_index.md
+++ b/content/operate/rs/references/rest-api/objects/cluster/_index.md
@@ -16,7 +16,7 @@ An API object that represents the cluster.
 | Name | Type/Value | Description |
 |------|------------|-------------|
 | alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
-| availability_lag_tolerance_ms | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during lag-aware [database availability checks]({{}}). |
+| availability_lag_tolerance_ms | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during [lag-aware database availability checks]({{}}). |
 | bigstore_driver | 'speedb'<br />'rocksdb' | Storage engine for [Auto Tiering]({{}}) |
 | cluster_ssh_public_key | string | Cluster's autogenerated SSH public key |
 | cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |

From 5eb55927551652874588c8c56483e45a73761c6b Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 28 Aug 2025 14:27:20 -0500
Subject: [PATCH 3/6] RS: Updated status code links in cluster actions REST API reference

---
 .../references/rest-api/requests/cluster/actions.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/content/operate/rs/references/rest-api/requests/cluster/actions.md b/content/operate/rs/references/rest-api/requests/cluster/actions.md
index c58f3ad998..156bb5b7c1 100644
--- a/content/operate/rs/references/rest-api/requests/cluster/actions.md
+++ b/content/operate/rs/references/rest-api/requests/cluster/actions.md
@@ -58,8 +58,8 @@ Returns a JSON array of [action objects]({{< relref "/operate/rs/references/rest

 | Code | Description |
 |------|-------------|
-| [200 OK](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1) | No error, response provides info about an ongoing action. |
-| [404 Not Found](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.5) | Action does not exist (i.e. not currently running and no available status of last run). |
+| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | No error, response provides info about an ongoing action. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Action does not exist (i.e. not currently running and no available status of last run). |


 ## Get cluster action {#get-cluster-action}
@@ -103,8 +103,8 @@ Returns an [action object]({{< relref "/operate/rs/references/rest-api/objects/a

 | Code | Description |
 |------|-------------|
-| [200 OK](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1) | No error, response provides info about an ongoing action. |
-| [404 Not Found](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.5) | Action does not exist (i.e. not currently running and no available status of last run). |
+| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | No error, response provides info about an ongoing action. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Action does not exist (i.e. not currently running and no available status of last run). |


 ## Initiate cluster-wide action {#post-cluster-action}
@@ -194,5 +194,5 @@ Returns a status code.

 | Code | Description |
 |------|-------------|
-| [200 OK](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1) | Action will be cancelled when possible. |
-| [404 Not Found](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.5) | Action unknown or not currently running. |
+| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Action will be cancelled when possible. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Action unknown or not currently running. |

From edae0b969d8d0ed8c58b6b616cc6a37be26d064d Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 28 Aug 2025 14:28:50 -0500
Subject: [PATCH 4/6] RS: Added new 406 status code and missing 404 code to initiate cluster-wide action REST API request reference

---
 .../rest-api/requests/cluster/actions.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/content/operate/rs/references/rest-api/requests/cluster/actions.md b/content/operate/rs/references/rest-api/requests/cluster/actions.md
index 156bb5b7c1..30a154d3c6 100644
--- a/content/operate/rs/references/rest-api/requests/cluster/actions.md
+++ b/content/operate/rs/references/rest-api/requests/cluster/actions.md
@@ -54,7 +54,7 @@ Returns a JSON array of [action objects]({{< relref "/operate/rs/references/rest
 }
 ```

-### Status codes {#get-all-status-codes}
+### Status codes {#get-all-status-codes}

 | Code | Description |
 |------|-------------|
@@ -99,7 +99,7 @@ Returns an [action object]({{< relref "/operate/rs/references/rest-api/objects/a
 }
 ```

-### Status codes {#get-status-codes}
+### Status codes {#get-status-codes}

 | Code | Description |
 |------|-------------|
@@ -153,13 +153,15 @@ Supported cluster actions:

 The body content may provide additional action details. Currently, it is not used.

-### Status codes {#post-status-codes}
+### Status codes {#post-status-codes}

 | Code | Description |
 |------|-------------|
-| [200 OK](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1) | No error, action was initiated. |
-| [400 Bad Request](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.1) | Bad action or content provided. |
-| [409 Conflict](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.10) | A conflicting action is already in progress. |
+| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | No error, action was initiated. |
+| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Bad action or content provided. |
+| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Node does not exist. |
+| [406 Not Acceptable](https://www.rfc-editor.org/rfc/rfc9110.html#name-406-not-acceptable) | Node not bootstrapped. |
+| [409 Conflict](https://www.rfc-editor.org/rfc/rfc9110.html#name-409-conflict) | A conflicting action is already in progress. |


 ## Cancel action {#delete-cluster-action}
@@ -190,7 +192,7 @@ a previously executed and completed action.

 Returns a status code.

-### Status codes {#delete-status-codes}
+### Status codes {#delete-status-codes}

 | Code | Description |
 |------|-------------|

From f1b3e83a7c052663a0c509e6f8e476a5a53966c4 Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 28 Aug 2025 14:43:56 -0500
Subject: [PATCH 5/6] DOC-4699 Added version change for change_master cluster action behavior to RS Gilboa release notes

---
 content/operate/rs/release-notes/rs-8-0-releases/_index.md | 2 +-
 content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/operate/rs/release-notes/rs-8-0-releases/_index.md b/content/operate/rs/release-notes/rs-8-0-releases/_index.md
index 4a18e1bfeb..ed7e0b241d 100644
--- a/content/operate/rs/release-notes/rs-8-0-releases/_index.md
+++ b/content/operate/rs/release-notes/rs-8-0-releases/_index.md
@@ -33,7 +33,7 @@ For more detailed release notes, select a build version from the following table

 ## Version changes

-- TBA
+- [`POST /v1/cluster/actions/change_master`]({{}}) REST API requests will no longer allow a node that exists but is not finished bootstrapping to become the primary node. Such requests will now return the status code `406 Not Acceptable`.

 ### Breaking changes

diff --git a/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md b/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md
index b528202148..542b11f4e6 100644
--- a/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md
+++ b/content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md
@@ -182,7 +182,7 @@ The following table shows which Redis modules are compatible with each Redis dat

 ## Version changes

-- TBA
+- [`POST /v1/cluster/actions/change_master`]({{}}) REST API requests will no longer allow a node that exists but is not finished bootstrapping to become the primary node. Such requests will now return the status code `406 Not Acceptable`.

 ### Breaking changes

From 38255071dc17cdbad33243930c1dc9f56e6d5327 Mon Sep 17 00:00:00 2001
From: Rachel Elledge
Date: Thu, 2 Oct 2025 10:55:39 -0500
Subject: [PATCH 6/6] DOC-5567 Feedback update to remove override option from the adjust availability lag tolerance threshold section and change the section to focus on changing the default only

---
 .../operate/rs/monitoring/db-availability.md | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/content/operate/rs/monitoring/db-availability.md b/content/operate/rs/monitoring/db-availability.md
index f8627d1102..7d98bca9cb 100644
--- a/content/operate/rs/monitoring/db-availability.md
+++ b/content/operate/rs/monitoring/db-availability.md
@@ -49,22 +49,16 @@ If the local database endpoint is unavailable, returns an error status code and

 The database availability API supports lag-aware availability checks that consider replication lag tolerance. You can reduce the risk of data inconsistencies during disaster recovery by incorporating lag-aware availability checks into your disaster recovery solution and ensuring failover-failback flows only occur when databases are accessible and sufficiently synchronized.
-### Adjust availability lag tolerance threshold
+### Change default availability lag tolerance threshold

-The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold using one of the following methods:
+The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold.

-- Change the default threshold for the entire cluster by setting `availability_lag_tolerance_ms` with an [update cluster]({{}}) request.
+To change the default threshold for the entire cluster, set `availability_lag_tolerance_ms` with an [update cluster]({{}}) request:

-  ```sh
-  PUT /v1/cluster
-  { "availability_lag_tolerance_ms": 100 }
-  ```
-
-- Override the default threshold by adding the `availability_lag_tolerance_ms` query parameter to specific lag-aware [availability checks]({{}}).
-
-  ```sh
-  GET /v1/bdbs//availability?extend_check=lag&availability_lag_tolerance_ms=100
-  ```
+```sh
+PUT /v1/cluster
+{ "availability_lag_tolerance_ms": 100 }
+```

 ### Lag-aware database availability checks
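
Reviewer note: the lag-aware checks documented in this series might be consumed by a disaster-recovery script roughly as sketched below. This is a minimal illustration, not part of the patch or of any Redis client library. The helper names `availability_path` and `safe_to_fail_over` are hypothetical; the paths, the `extend_check=lag` value, the `availability_lag_tolerance_ms` parameter, and the status codes are taken from the reference tables above. Treating every non-200 response as "do not proceed" is an assumption about how a caller would want to behave.

```python
from typing import Optional
from urllib.parse import urlencode


def availability_path(uid: int,
                      lag_aware: bool = False,
                      lag_tolerance_ms: Optional[int] = None,
                      local_endpoint: bool = False) -> str:
    """Build the request path for a database availability check (hypothetical helper)."""
    if local_endpoint:
        path = f"/v1/local/bdbs/{uid}/endpoint/availability"
    else:
        path = f"/v1/bdbs/{uid}/availability"

    params = {}
    if lag_aware:
        params["extend_check"] = "lag"
        # The override is only documented for use together with extend_check=lag,
        # so it is ignored here unless lag_aware is set.
        if lag_tolerance_ms is not None:
            params["availability_lag_tolerance_ms"] = lag_tolerance_ms

    return f"{path}?{urlencode(params)}" if params else path


def safe_to_fail_over(status_code: int) -> bool:
    """Assumption: only 200 OK counts as available; 400, 404, and 503 all block failover."""
    return status_code == 200
```

For example, `availability_path(1, lag_aware=True)` yields the same path as the second example request in the reference page, and a caller would pass the HTTP status of the response to `safe_to_fail_over` before starting a failover or failback flow.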