Commit 4a13d94

RS: Lag-aware DB availability API (#2058)
* DOC-5567 RS: Added lag-awareness to DB availability REST API references
* DOC-5567 RS: Added lag-aware checks to DB availability doc
* RS: Updated status code links in cluster actions REST API reference
* RS: Added new 406 status code and missing 404 code to initiate cluster-wide action REST API request reference
* DOC-4699 Added version change for change_master cluster action behavior to RS Gilboa release notes
* DOC-5567 Feedback update to remove override option from the adjust availability lag tolerance threshold section and change the section to focus on changing the default only
1 parent 446f07c commit 4a13d94

3 files changed: 94 additions, 2 deletions
content/operate/rs/monitoring/db-availability.md

Lines changed: 44 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -45,6 +45,50 @@ Returns HTTP status code 200 OK if all primary (master) shards are reachable fro
4545

4646
If the local database endpoint is unavailable, returns an error status code and a JSON object that contains [`error_code` and `description` fields]({{<relref "/operate/rs/references/rest-api/requests/bdbs/availability#get-endpoint-error-codes">}}).
4747

48+
## Use lag-aware availability checks for disaster recovery {#lag-aware}
49+
50+
The database availability API supports lag-aware availability checks that consider replication lag tolerance. You can reduce the risk of data inconsistencies during disaster recovery by incorporating lag-aware availability checks into your disaster recovery solution and ensuring failover-failback flows only occur when databases are accessible and sufficiently synchronized.
51+
52+
### Change default availability lag tolerance threshold
53+
54+
The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold.
55+
56+
To change the default threshold for the entire cluster, set `availability_lag_tolerance_ms` with an [update cluster]({{<relref "/operate/rs/references/rest-api/requests/cluster#put-cluster">}}) request:
57+
58+
```sh
59+
PUT /v1/cluster
60+
{ "availability_lag_tolerance_ms": 100 }
61+
```
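
As a rough illustration of the update cluster request above, the following Python sketch builds (but does not send) the equivalent HTTP request. The helper name `build_lag_tolerance_request`, the host name, and port 9443 are assumptions for illustration and are not part of this change. Authentication and TLS settings are omitted.

```python
import json
import urllib.request


def build_lag_tolerance_request(base_url: str, tolerance_ms: int) -> urllib.request.Request:
    """Build a PUT /v1/cluster request that sets availability_lag_tolerance_ms.

    Hypothetical helper: it only constructs the request object and never
    opens a network connection.
    """
    body = json.dumps({"availability_lag_tolerance_ms": tolerance_ms}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/cluster",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed cluster address; replace it with your own.
request = build_lag_tolerance_request("https://cluster.example.com:9443", 100)
print(request.get_method(), request.full_url)
```

To send the request, you could pass it to `urllib.request.urlopen` along with whatever credentials and certificate handling your cluster requires.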
62+
63+
### Lag-aware database availability checks
64+
65+
To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
66+
67+
```sh
68+
GET /v1/bdbs/<database_id>/availability?extend_check=lag
69+
```
70+
71+
To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
72+
73+
```sh
74+
GET /v1/bdbs/<database_id>/availability?extend_check=lag&availability_lag_tolerance_ms=100
75+
```
76+
77+
### Lag-aware endpoint availability checks
78+
79+
To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
80+
81+
```sh
82+
GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag
83+
```
84+
85+
To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
86+
87+
```sh
88+
GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
89+
```
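
The four request forms above differ only in the path and query string. A minimal Python sketch, assuming a hypothetical helper named `availability_path`, shows how a client might assemble them. Only the paths and the `extend_check` and `availability_lag_tolerance_ms` parameters come from the examples above; everything else is illustrative.

```python
from typing import Optional
from urllib.parse import urlencode


def availability_path(
    database_id: int,
    *,
    endpoint: bool = False,
    lag_aware: bool = False,
    tolerance_ms: Optional[int] = None,
) -> str:
    """Return the request path for a database or endpoint availability check.

    Hypothetical helper. The tolerance override is only added together with
    extend_check=lag, because it has no effect without the lag check.
    """
    if endpoint:
        path = f"/v1/local/bdbs/{database_id}/endpoint/availability"
    else:
        path = f"/v1/bdbs/{database_id}/availability"

    params = {}
    if lag_aware:
        params["extend_check"] = "lag"
        if tolerance_ms is not None:
            params["availability_lag_tolerance_ms"] = tolerance_ms

    return f"{path}?{urlencode(params)}" if params else path


print(availability_path(1, lag_aware=True))
print(availability_path(1, endpoint=True, lag_aware=True, tolerance_ms=100))
```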
90+
91+
4892
## Availability by database status
4993
5094
The following table shows the relationship between a database's status and availability. For more details about the database status values, see [BDB status field]({{<relref "/operate/rs/references/rest-api/objects/bdb/status">}}).

content/operate/rs/references/rest-api/objects/cluster/_index.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ An API object that represents the cluster.
1616
| Name | Type/Value | Description |
1717
|------|------------|-------------|
1818
| alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
19+
| <span class="break-all">availability_lag_tolerance_ms</span> | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during [lag-aware database availability checks]({{<relref "/operate/rs/monitoring/db-availability#lag-aware">}}). |
1920
| bigstore_driver | "speedb"<br />"rocksdb" | Storage engine for [Auto Tiering]({{<relref "/operate/rs/databases/auto-tiering">}}) |
2021
| <span class="break-all">cluster_ssh_public_key</span> | string | Cluster's autogenerated SSH public key |
2122
| cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |

content/operate/rs/references/rest-api/requests/bdbs/availability.md

Lines changed: 49 additions & 2 deletions
@@ -32,12 +32,26 @@ Verifies the local database endpoint is available. This request does not redirec
3232

3333
### Request {#get-endpoint-request}
3434

35-
#### Example HTTP request
35+
#### Example HTTP requests
36+
37+
To check database endpoint availability without any additional checks:
3638

3739
```sh
3840
GET /v1/local/bdbs/1/endpoint/availability
3941
```
4042

43+
To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:
44+
45+
```sh
46+
GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag
47+
```
48+
49+
To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:
50+
51+
```sh
52+
GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
53+
```
54+
4155
#### Headers
4256

4357
| Key | Value | Description |
@@ -51,6 +65,13 @@ GET /v1/local/bdbs/1/endpoint/availability
5165
|-------|------|-------------|
5266
| uid | integer | The unique ID of the database. |
5367

68+
#### Query parameters
69+
70+
| Field | Type | Description |
71+
|-------|------|-------------|
72+
| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
73+
| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
74+
5475
### Response {#get-endpoint-response}
5576

5677
Returns the status code `200 OK` if the local database endpoint is available.
@@ -74,6 +95,8 @@ The following are possible `error_code` values:
7495
| Code | Description |
7596
|------|-------------|
7697
| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database endpoint is available. |
98+
| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
99+
| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
77100
| [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database endpoint is unavailable. |
78101

79102

@@ -97,12 +120,27 @@ Gets the availability status of a database.
97120

98121
### Request {#get-db-request}
99122

100-
#### Example HTTP request
123+
#### Example HTTP requests
124+
125+
126+
To check database availability without any additional checks:
101127

102128
```sh
103129
GET /v1/bdbs/1/availability
104130
```
105131

132+
To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:
133+
134+
```sh
135+
GET /v1/bdbs/1/availability?extend_check=lag
136+
```
137+
138+
To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:
139+
140+
```sh
141+
GET /v1/bdbs/1/availability?extend_check=lag&availability_lag_tolerance_ms=100
142+
```
143+
106144
#### Headers
107145

108146
| Key | Value | Description |
@@ -116,6 +154,13 @@ GET /v1/bdbs/1/availability
116154
|-------|------|-------------|
117155
| uid | integer | The unique ID of the database. |
118156

157+
#### Query parameters
158+
159+
| Field | Type | Description |
160+
|-------|------|-------------|
161+
| extend_check | list of comma-separated strings | List of additional availability checks to perform (optional)<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines if a replica is sufficiently synced with the primary for failover/failback scenarios. |
162+
| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |
163+
119164
### Response {#get-db-response}
120165

121166
Returns the status code `200 OK` if the database is available.
@@ -139,4 +184,6 @@ The following are possible `error_code` values:
139184
| Code | Description |
140185
|------|-------------|
141186
| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database is available. |
187+
| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
188+
| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
142189
| [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database is unavailable or doesn't have quorum. |
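
As one possible way to act on the status codes listed for the database availability request, the Python sketch below maps each documented code to its description and treats only `200 OK` as available. The function names are hypothetical, and how a failover flow should react to each code is an assumption rather than something this change specifies.

```python
# Descriptions copied from the database availability status code table.
STATUS_DESCRIPTIONS = {
    200: "Database is available.",
    400: "Invalid schema.",
    404: "Database not found.",
    503: "Database is unavailable or doesn't have quorum.",
}


def describe_availability(status_code: int) -> str:
    """Return the documented description for a status code, if there is one."""
    return STATUS_DESCRIPTIONS.get(status_code, f"Undocumented status code: {status_code}")


def is_available(status_code: int) -> bool:
    """Treat only 200 OK as available (an assumption of this sketch)."""
    return status_code == 200


for code in (200, 404, 503):
    print(code, is_available(code), describe_availability(code))
```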
