Commit 92737ea

WIP
1 parent 02691ca commit 92737ea

1 file changed, +17 -7 lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 17 additions & 7 deletions
@@ -5,10 +5,9 @@
 
 A database can become unavailable due to issues on different system levels.
 For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
-It is also possible for databases to become quarantined due to a critical failure in the system, which may lead to unavailability even without the loss of servers.
 
 This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes, while still may be able to serve reads.
-However, if a database is not performing as expected for other reasons, this section cannot help.
+However, if a database is _unavailable_ because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
 By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.
 
 [NOTE]
@@ -31,12 +30,18 @@ Consequently, in a disaster where multiple servers go down, some databases may k
 There are three main steps to recovering a cluster from a disaster.
 Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
 
+[NOTE]
+====
+Any potential quarantined databases need to be handled before executing this guide, see REF for more information.
+====
+
 . Ensure the `system` database is available in the cluster.
 The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
 
-. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements.
+. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements
+This process starts the managing of databases by default.
 
-. After the `system` database is available and the cluster's topology is satisfied, you can manage the databases.
+. After the `system` database is available, the cluster's topology is satisfied and the databases has been managed, continue managing databases and verify that they are available.
 
 The steps are described in detail in the following sections.
 
@@ -67,6 +72,7 @@ The server may have to be considered indefinitely lost.
 If the response contain a writer, the `system` database is write available and does not need to be recovered, skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
 ** Create a temporary user by running `CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'`.
 Check if the temporary user is created by running `SHOW USERS`. If it was created as expected, the `system` database is write available and does not need to be recovered, skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
+** Use rafted status check as described in REF.
 
 +
 . *Restore the `system` database.*
@@ -109,10 +115,13 @@ If *all* servers show health `AVAILABLE` and status `ENABLED` continue to xref:c
 . For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
 . For each `CORDONED` server, make sure a new unconstrained server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] to add additional servers.
 If no servers were added in xref:clustering/disaster-recovery.adoc#restore-the-system-database[Restore the system database], the amount of servers that needs to be added is equal to the number of `CORDONED` servers.
-. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succedded, skip to step 6.
+[NOTE]
+====
+It is not strictly necessary to add new servers in this step. However, not adding new servers might require the topology for a database to be altered via ALTER DATABASE to make deallocations possible or in the RECREATE command to make it possible.
+====
+. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeeded, skip to step 6.
 . Make sure deallocating the servers is possible by doing the following steps:
 .. Run `SHOW DATABASES`.
-.. Fix `QUARANTINED` databases.
 .. Try to start the offline databases allocated on any of the `CORDONED` servers by running `START DATABASE stopped-db WAIT`.
 +
 [NOTE]
@@ -137,7 +146,7 @@ Consider running SHOW SERVERS to determine what action is suitable to resolve th
 [[recover-databases]]
 === Verify recovery of databases
 
-Once the `system` database is verified available, and all servers are online, verify that all databases are in a desirable state.
+Once the `system` database is verified available, and all servers are online, manage and verify that all databases are in a desirable state.
 
 . Run `SHOW DATABASES`. If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
 +
@@ -153,6 +162,7 @@ Deallocating databases can take an unbounded amount of time since it involves co
 Therefore, an allocation in STORE_COPY state should reach the requestedStatus given some time.
 ====
 
+. For any databases in
 . For any recreated databases in `STARTING` state with one of the following messages displayed in the message field:
 ** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
 ** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
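
The completion check in "Verify recovery of databases" (`requestedStatus`=`currentStatus` for every database on every server) can be sketched as a small filter over the rows returned by `SHOW DATABASES`. This is only an illustration and not part of the commit: the helper name `databases_not_in_desired_state` and the sample rows are hypothetical, and the sketch assumes the rows have already been fetched as dictionaries keyed by the column names `name`, `address`, `requestedStatus`, and `currentStatus`.

```python
from typing import Dict, Iterable, List, Mapping


def databases_not_in_desired_state(rows: Iterable[Mapping[str, str]]) -> List[Dict[str, str]]:
    """Return the allocations whose currentStatus differs from requestedStatus.

    `rows` is assumed to be the output of `SHOW DATABASES`, one mapping per
    database allocation (one row per database per server).
    """
    mismatched = []
    for row in rows:
        if row["requestedStatus"] != row["currentStatus"]:
            mismatched.append(
                {
                    "name": row["name"],
                    "address": row["address"],
                    "requestedStatus": row["requestedStatus"],
                    "currentStatus": row["currentStatus"],
                }
            )
    return mismatched


# Hypothetical sample rows, for illustration only (not real cluster output).
sample_rows = [
    {"name": "system", "address": "server1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "db1", "address": "server2:7687",
     "requestedStatus": "online", "currentStatus": "offline"},
    {"name": "db2", "address": "server3:7687",
     "requestedStatus": "offline", "currentStatus": "offline"},
]

pending = databases_not_in_desired_state(sample_rows)
if not pending:
    print("All databases are in their requested state.")
for allocation in pending:
    print(
        f"{allocation['name']} on {allocation['address']}: "
        f"requested {allocation['requestedStatus']}, current {allocation['currentStatus']}"
    )
```

An empty result corresponds to the "disaster recovery is complete" condition in step 1; any remaining rows are the allocations that still need attention in the later steps.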
