modules/ROOT/pages/clustering/disaster-recovery.adoc
A database can become unavailable due to issues on different system levels.
For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes but may still be able to serve reads.
However, if a database is _unavailable_ because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.
[NOTE]
There are three main steps to recovering a cluster from a disaster.
Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
[NOTE]
====
Any quarantined databases need to be handled before executing this guide; see REF for more information.
====
. Ensure the `system` database is available in the cluster.
The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements.
+
By default, this process also starts the management of the databases.
. After the `system` database is available, the cluster's topology is satisfied, and the databases have been managed, continue managing the databases and verify that they are available.
The steps are described in detail in the following sections.
If the response contains a writer, the `system` database is write available and does not need to be recovered; skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
** Create a temporary user by running `CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'`.
Check if the temporary user is created by running `SHOW USERS`. If it was created as expected, the `system` database is write available and does not need to be recovered; skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
** Use the rafted status check as described in REF.
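As a sketch, the temporary-user check above can be run as a single sequence against the `system` database. The user name and password are illustrative placeholders, and the final `DROP USER` cleanup is an addition not stated in the steps above:

[source,cypher]
----
// Attempt a write against the `system` database.
CREATE USER temporaryUser SET PASSWORD 'temporaryPassword';

// Verify that the write was applied.
// If the user appears, the `system` database is write available.
SHOW USERS;

// Clean up the temporary user once write availability is confirmed.
DROP USER temporaryUser;
----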
. *Restore the `system` database.*
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
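For example, assuming `SHOW SERVERS` reports a server with id `unavailable-server-id` as unavailable, the cordoning call might look as follows (a sketch; the id is a placeholder):

[source,cypher]
----
// List servers and their health to find the unavailable ones.
SHOW SERVERS;

// Prevent new database allocations from being placed on the lost server.
CALL dbms.cluster.cordonServer("unavailable-server-id");
----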
. For each `CORDONED` server, make sure a new unconstrained server has been added to the cluster to take its place; see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] to add additional servers.
If no servers were added in xref:clustering/disaster-recovery.adoc#restore-the-system-database[Restore the system database], the number of servers that need to be added is equal to the number of `CORDONED` servers.
[NOTE]
====
It is not strictly necessary to add new servers in this step. However, without new servers, a database's topology may have to be altered, either via `ALTER DATABASE` to make deallocation possible, or as part of the `RECREATE` command.
====
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeeded, skip to step 6.
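A minimal sketch of the deallocation step, with `cordoned-server-id` standing in for an id reported by `SHOW SERVERS` (the exact quoting of the server id may vary):

[source,cypher]
----
// Move all database allocations off the cordoned server onto other servers.
DEALLOCATE DATABASES FROM SERVER "cordoned-server-id";

// Afterwards, confirm the reallocation with SHOW DATABASES.
----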
. Make sure deallocating the servers is possible by doing the following steps:
.. Run `SHOW DATABASES`.
.. Try to start the offline databases allocated on any of the `CORDONED` servers by running `START DATABASE stopped-db WAIT`.
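The substeps above might be run as follows, with `stopped-db` a placeholder for a database that `SHOW DATABASES` reports as offline on a cordoned server (a sketch, not a definitive sequence):

[source,cypher]
----
// Inspect database states and where they are allocated.
SHOW DATABASES;

// Attempt to start an offline database, waiting for the command to complete.
START DATABASE `stopped-db` WAIT;
----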
[NOTE]
[[recover-databases]]
=== Verify recovery of databases
Once the `system` database is verified to be available and all servers are online, manage the databases and verify that they are all in a desirable state.
. Run `SHOW DATABASES`. If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
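To surface only the allocations that have not yet reached their desired state, a filtered form of the check can be used (a sketch; column names such as `statusMessage` are assumed from the `SHOW DATABASES` output and may differ between versions):

[source,cypher]
----
// List only allocations whose current status differs from the requested one.
SHOW DATABASES YIELD name, requestedStatus, currentStatus, statusMessage
  WHERE requestedStatus <> currentStatus;
----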
Therefore, an allocation in the `STORE_COPY` state should reach the `requestedStatus` given some time.
====
. For any databases in
. For any recreated databases in `STARTING` state with one of the following messages displayed in the message field:
** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`