50 changes: 50 additions & 0 deletions content/operate/rs/monitoring/db-availability.md
@@ -45,6 +45,56 @@ Returns HTTP status code 200 OK if all primary (master) shards are reachable fro

If the local database endpoint is unavailable, returns an error status code and a JSON object that contains [`error_code` and `description` fields]({{<relref "/operate/rs/references/rest-api/requests/bdbs/availability#get-endpoint-error-codes">}}).

## Use lag-aware availability checks for disaster recovery {#lag-aware}

The database availability API supports lag-aware availability checks, which also verify that replication lag is within a tolerance threshold. To reduce the risk of data inconsistencies during disaster recovery, incorporate lag-aware availability checks into your disaster recovery solution so that failover and failback flows occur only when databases are accessible and sufficiently synchronized.

### Adjust availability lag tolerance threshold

The lag tolerance threshold is 100 milliseconds by default. Depending on factors such as workload, network conditions, and throughput, you might want to adjust the lag tolerance threshold using one of the following methods:

- Change the default threshold for the entire cluster by setting `availability_lag_tolerance_ms` with an [update cluster]({{<relref "/operate/rs/references/rest-api/requests/cluster#put-cluster">}}) request.

```sh
PUT /v1/cluster
{ "availability_lag_tolerance_ms": 100 }
```

- Override the default threshold by adding the `availability_lag_tolerance_ms` query parameter to specific lag-aware [availability checks]({{<relref "/operate/rs/references/rest-api/requests/bdbs/availability">}}).

```sh
GET /v1/bdbs/<database_id>/availability?extend_check=lag&availability_lag_tolerance_ms=100
```
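The cluster-wide method above can be sketched as a small shell helper that builds the request body. This is only an illustration: the function name `lag_tolerance_payload` is hypothetical, and the host, port, and credentials in the commented `curl` call are placeholders rather than values from this page. Only the `/v1/cluster` path and the `availability_lag_tolerance_ms` field come from the request shown above.

```sh
# Hypothetical helper: print the JSON body for an update cluster request
# that sets the cluster-wide lag tolerance threshold.
# Usage: lag_tolerance_payload <milliseconds>
lag_tolerance_payload() {
  case "${1:-}" in
    ''|*[!0-9]*|0?*)
      # Reject empty input, non-digits, and leading zeros (invalid JSON numbers).
      echo "lag tolerance must be a non-negative integer without leading zeros" >&2
      return 1
      ;;
  esac
  printf '{ "availability_lag_tolerance_ms": %s }\n' "$1"
}

# Example: build the body that raises the cluster default to 250 ms.
lag_tolerance_payload 250

# To send it, something like the following might work. The host, port, and
# credentials are placeholders (assumptions), not values from this page:
#   curl -u "<user>:<password>" -X PUT \
#     -H "Content-Type: application/json" \
#     -d "$(lag_tolerance_payload 250)" \
#     "https://<cluster-host>:<port>/v1/cluster"
```

Validating the value before sending keeps a malformed threshold from reaching the cluster.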

### Lag-aware database availability checks

To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:

```sh
GET /v1/bdbs/<database_id>/availability?extend_check=lag
```

To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:

```sh
GET /v1/bdbs/<database_id>/availability?extend_check=lag&availability_lag_tolerance_ms=100
```
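A disaster recovery script could use the status code of such a check to gate failover or failback. The following is a minimal sketch under stated assumptions: the function names `is_ready_for_failover` and `lag_check_status` are hypothetical, the base URL and credentials are supplied by the caller, and any status other than `200` is conservatively treated as "not ready".

```sh
# Hypothetical helper: decide whether a database is a safe failover or
# failback target based on the HTTP status of a lag-aware availability check.
# Assumption: only 200 means "available and within the lag tolerance";
# every other status (400, 404, 503, or curl's 000 for a failed connection)
# is treated as not ready.
is_ready_for_failover() {
  [ "${1:-}" = "200" ]
}

# Hypothetical wrapper: run the lag-aware check and print only the status code.
# Usage: lag_check_status <base_url> <database_id> <user:password>
# -s silences progress output, -o /dev/null discards the response body,
# and -w '%{http_code}' prints the numeric HTTP status.
lag_check_status() {
  curl -s -o /dev/null -w '%{http_code}' -u "$3" \
    "$1/v1/bdbs/$2/availability?extend_check=lag"
}

# Example decision using a status code obtained elsewhere:
status=503
if is_ready_for_failover "$status"; then
  echo "proceed"
else
  echo "hold"
fi
```

Keeping the decision separate from the HTTP call makes the policy easy to test without a running cluster.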

### Lag-aware endpoint availability checks

To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:

```sh
GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag
```

To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:

```sh
GET /v1/local/bdbs/<database_id>/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
```
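The database and endpoint requests in this section differ only in their base path and the optional threshold override. As an illustration, a hypothetical helper named `availability_path` (not part of any Redis tooling) could assemble them from a scope, a database ID, and an optional tolerance value:

```sh
# Hypothetical helper: build the path for a lag-aware availability check.
# Usage: availability_path <db|endpoint> <database_id> [tolerance_ms]
# When tolerance_ms is omitted, the cluster's default threshold applies.
availability_path() {
  scope="${1:-}"
  db_id="${2:-}"
  tolerance_ms="${3:-}"

  case "$scope" in
    db)       path="/v1/bdbs/${db_id}/availability" ;;
    endpoint) path="/v1/local/bdbs/${db_id}/endpoint/availability" ;;
    *)
      echo "scope must be 'db' or 'endpoint'" >&2
      return 1
      ;;
  esac

  # Always request the lag-aware check; append the override only if given.
  path="${path}?extend_check=lag"
  if [ -n "$tolerance_ms" ]; then
    path="${path}&availability_lag_tolerance_ms=${tolerance_ms}"
  fi
  printf '%s\n' "$path"
}

availability_path db 1
availability_path endpoint 1 250
```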


## Availability by database status

The following table shows the relationship between a database's status and availability. For more details about the database status values, see [BDB status field]({{<relref "/operate/rs/references/rest-api/objects/bdb/status">}}).
@@ -16,6 +16,7 @@ An API object that represents the cluster.
| Name | Type/Value | Description |
|------|------------|-------------|
| alert_settings | [alert_settings]({{< relref "/operate/rs/references/rest-api/objects/cluster/alert_settings" >}}) object | Cluster and node alert settings |
| <span class="break-all">availability_lag_tolerance_ms</span> | integer (default: 100) | The maximum replication lag in milliseconds tolerated between source and replicas during [lag-aware database availability checks]({{<relref "/operate/rs/monitoring/db-availability#lag-aware">}}). |
| bigstore_driver | 'speedb'<br />'rocksdb' | Storage engine for [Auto Tiering]({{<relref "/operate/rs/databases/auto-tiering">}}) |
| <span class="break-all">cluster_ssh_public_key</span> | string | Cluster's autogenerated SSH public key |
| cm_port | integer, (range: 1024-65535) | UI HTTPS listening port |
@@ -32,12 +32,26 @@ Verifies the local database endpoint is available. This request does not redirec

### Request {#get-endpoint-request}

#### Example HTTP requests

To check database endpoint availability without any additional checks:

```sh
GET /v1/local/bdbs/1/endpoint/availability
```

To perform a lag-aware database endpoint availability check using the cluster's default lag tolerance threshold:

```sh
GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag
```

To perform a lag-aware database endpoint availability check and override the cluster's default lag tolerance threshold:

```sh
GET /v1/local/bdbs/1/endpoint/availability?extend_check=lag&availability_lag_tolerance_ms=100
```

#### Headers

| Key | Value | Description |
@@ -51,6 +65,13 @@ GET /v1/local/bdbs/1/endpoint/availability
|-------|------|-------------|
| uid | integer | The unique ID of the database. |

#### Query parameters

| Field | Type | Description |
|-------|------|-------------|
| extend_check | comma-separated list of strings | Optional list of additional availability checks to perform.<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines whether a replica is sufficiently synced with the primary for failover and failback scenarios. |
| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |

### Response {#get-endpoint-response}

Returns the status code `200 OK` if the local database endpoint is available.
@@ -74,6 +95,8 @@ The following are possible `error_code` values:
| Code | Description |
|------|-------------|
| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database endpoint is available. |
| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
| [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database endpoint is unavailable. |


@@ -97,12 +120,27 @@ Gets the availability status of a database.

### Request {#get-db-request}

#### Example HTTP requests

To check database availability without any additional checks:

```sh
GET /v1/bdbs/1/availability
```

To perform a lag-aware database availability check using the cluster's default lag tolerance threshold:

```sh
GET /v1/bdbs/1/availability?extend_check=lag
```

To perform a lag-aware database availability check and override the cluster's default lag tolerance threshold:

```sh
GET /v1/bdbs/1/availability?extend_check=lag&availability_lag_tolerance_ms=100
```

#### Headers

| Key | Value | Description |
@@ -116,6 +154,13 @@ GET /v1/bdbs/1/availability
|-------|------|-------------|
| uid | integer | The unique ID of the database. |

#### Query parameters

| Field | Type | Description |
|-------|------|-------------|
| extend_check | comma-separated list of strings | Optional list of additional availability checks to perform.<br />Values:<br />**lag**: Enables lag-aware checks to assess replication health. Determines whether a replica is sufficiently synced with the primary for failover and failback scenarios. |
| availability_lag_tolerance_ms | integer | Overrides the cluster's default lag tolerance threshold when using `extend_check=lag`. Recommended value: 100 milliseconds. |

### Response {#get-db-response}

Returns the status code `200 OK` if the database is available.
@@ -139,4 +184,6 @@ The following are possible `error_code` values:
| Code | Description |
|------|-------------|
| [200 OK](https://www.rfc-editor.org/rfc/rfc9110.html#name-200-ok) | Database is available. |
| [400 Bad Request](https://www.rfc-editor.org/rfc/rfc9110.html#name-400-bad-request) | Invalid schema. |
| [404 Not Found](https://www.rfc-editor.org/rfc/rfc9110.html#name-404-not-found) | Database not found. |
| [503 Service Unavailable](https://www.rfc-editor.org/rfc/rfc9110.html#name-503-service-unavailable) | Database is unavailable or doesn't have quorum. |