Commit f775014

WIP
1 parent 78f1701 commit f775014

File tree

1 file changed

+46
-38
lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 46 additions & 38 deletions
@@ -53,6 +53,7 @@ One way to remedy this is to connect directly to the server using `bolt` instead
5353
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
5454
====
5555

56+
[[restore-the-system-database]]
5657
=== Restore the `system` database
5758

5859
The first step of recovery is to ensure that the `system` database is available.
@@ -97,59 +98,66 @@ Use one of the following options:
9798

9899

99100
[[recover-servers]]
100-
=== Recover servers
101+
=== Recover servers and user databases
101102

102103
Once the `system` database is available, the cluster can be managed.
103104
Following the loss of one or more servers, the cluster's view of servers must be updated, i.e., the lost servers must be replaced by new servers.
104-
The steps here identify the lost servers and safely detach them from the cluster.
105+
The steps here identify the lost servers, safely detach them from the cluster, and recreate any databases that, for whatever reason, cannot be moved.
105106

106107
. Run `SHOW SERVERS`.
107108
If *all* servers show health `AVAILABLE` and status `ENABLED`, continue to xref:clustering/disaster-recovery.adoc#recover-databases[Verify recovery of databases].
108109
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
109-
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
110-
. For each server that failed to deallocate with one of the following messages:
111-
.. `Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.\*'. +
112-
Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries. +
113-
Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.`
114-
+
115-
or
116-
+
117-
`Could not deallocate server(s) `serverId`.
118-
Database [database] has lost quorum of servers, only found [existing number of primaries] of [expected number of primaries].
119-
Cannot be safely reallocated.`
120-
+
121-
First ensure that there is a backup for the database in question (see xref:backup-restore/online-backup.adoc[Online backup]), and then drop the database by running `DROP DATABASE database-name`.
122-
Return to step 3.
123-
.. `Could not deallocate server [server]. Cannot change allocations for database [stopped-db] because it is offline.`
124-
+
125-
Try to start the offline database by running `START DATABASE stopped-db WAIT`.
126-
If it starts successfully, return to step 3.
127-
Otherwise, ensure that there is a backup for the database before dropping it with `DROP DATABASE stopped-db`.
128-
Return to step 3.
110+
. For each `CORDONED` server, make sure a new unconstrained server has been added to the cluster to take its place. See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for how to add servers.
111+
If no servers were added in xref:clustering/disaster-recovery.adoc#restore-the-system-database[Restore the system database], the number of servers that need to be added is equal to the number of `CORDONED` servers.
112+
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeeded, skip to step 6.
113+
. Make sure deallocating the servers is possible by doing the following steps:
114+
.. Run `SHOW DATABASES`.
115+
.. Fix `QUARANTINED` databases.
116+
.. Try to start the offline databases allocated on any of the `CORDONED` servers by running `START DATABASE stopped-db WAIT`.
129117
+
130118
[NOTE]
131119
====
132120
To avoid updates on a database that is meant to be stopped, set it to read-only mode before starting it by running:
133121
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
134122
====
135-
136-
.. `Could not deallocate server [server]. Reallocation of [database] not possible, no new target found. All existing servers: [existing-servers]. Actual allocated server with mode [mode] is [current-hostings].`
137-
+
138-
Add new servers and enable them and then return to step 3, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
139-
. Run `SHOW SERVERS YIELD *` once all enabled servers host the requested databases (`hosting`-field contains exactly the databases in the `requestedHosting` field), and proceed to the next step.
140-
Note that this may take a few minutes.
123+
.. Run `CALL statusCheck()` for all databases, and recreate all databases that failed replication.
124+
See REF for more information on how to recreate databases. Make sure there are recent backups for the databases; see xref:backup-restore/online-backup.adoc[Online backup] for more information.
125+
.. Return to step 4 to retry deallocating all servers.
141126
. For each deallocated server, run `DROP SERVER deallocated-server-id`.
142-
. Return to step 1.
127+
. Return to step 1 to make sure all servers in the cluster are `AVAILABLE`.
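The cordon and deallocate steps above can be sketched as a small helper that turns a `SHOW SERVERS` snapshot into the statements to run next. This is only an illustration, not part of the documented procedure: the function name `plan_server_recovery`, the row keys `name`, `health`, and `state`, and the double-quoted server identifiers are assumptions made for the example.

```python
from typing import Dict, List


def plan_server_recovery(servers: List[Dict[str, str]]) -> List[str]:
    """Suggest Cypher statements for a snapshot of SHOW SERVERS rows.

    Assumption: each row is a dict with the hypothetical keys
    "name", "health" and "state"; adapt them to the real result columns.
    """
    statements: List[str] = []
    for row in servers:
        health = row["health"].upper()
        state = row["state"].upper()
        if state == "CORDONED":
            # Already cordoned: the next action is to move its databases away.
            statements.append(f'DEALLOCATE DATABASES FROM SERVER "{row["name"]}"')
        elif health == "UNAVAILABLE":
            # Lost server that has not been cordoned yet.
            statements.append(f'CALL dbms.cluster.cordonServer("{row["name"]}")')
    return statements


snapshot = [
    {"name": "server-1", "health": "Available", "state": "Enabled"},
    {"name": "server-2", "health": "Unavailable", "state": "Enabled"},
    {"name": "server-3", "health": "Unavailable", "state": "Cordoned"},
]
plan = plan_server_recovery(snapshot)
```

An empty plan corresponds to the case where no server needs cordoning or deallocating. The helper only builds strings; running them, and handling a failed deallocation as described in step 5, is left to the operator.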
128+
129+
130+
`Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.\*'. +
131+
Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries. +
132+
Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.`
133+
134+
-> What does this error message mean? IS THIS QUARANTINE? However, drop would not have worked here either.
135+
143136

144137
[[recover-databases]]
145-
=== Recover databases
138+
=== Verify recovery of databases
139+
140+
Once the `system` database is verified to be available and all servers are online, verify that all databases are in their desired state.
141+
142+
. Run `SHOW DATABASES`. If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
143+
+
144+
[NOTE]
145+
====
146+
Recreating a database can take an unbounded amount of time since it may involve copying the store to a new server, as described in REF(Recreate docs).
147+
Therefore, an allocation in `STARTING` state might reach the `requestedStatus` given some time.
148+
====
149+
+
150+
[NOTE]
151+
====
152+
Deallocating databases can take an unbounded amount of time since it involves copying the store to a server.
153+
Therefore, an allocation in `STORE_COPY` state should reach the `requestedStatus` given some time.
154+
====
146155

147-
Once the `system` database is verified available, and all servers are online, the databases can be managed.
148-
The steps here aim to make the unavailable databases available.
156+
. For any recreated databases in `STARTING` state with one of the following messages displayed in the message field:
157+
** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
158+
** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
159+
** `No store found on any of the seeders ServerId1, ServerId2...`
160+
+
161+
Recreate them from backup using REF(recreate with seed from URI) or define seeding servers in the recreate procedure so that problematic allocations are excluded.
162+
. Return to step 1 to make sure all databases are in their desired state.
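The completion check in step 1 (`requestedStatus`=`currentStatus` for every allocation) can be sketched as follows. This is a minimal illustration, assuming the `SHOW DATABASES` rows have already been fetched as dicts; the function name `pending_allocations` and the `name` and `address` keys are assumptions made for the example, while `requestedStatus` and `currentStatus` are the fields named in the step.

```python
from typing import Dict, List, Tuple


def pending_allocations(rows: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (database, address, currentStatus) for every allocation
    whose currentStatus differs from its requestedStatus."""
    return [
        (row["name"], row.get("address", ""), row["currentStatus"])
        for row in rows
        if row["requestedStatus"] != row["currentStatus"]
    ]


rows = [
    {"name": "neo4j", "address": "server-1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "sales", "address": "server-2:7687",
     "requestedStatus": "online", "currentStatus": "starting"},
]
pending = pending_allocations(rows)
recovery_complete = not pending
```

Because store copies and recreations can take a long time, a non-empty result is not necessarily an error; it only lists the allocations worth checking again later.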
149163

150-
. If you have previously dropped databases as part of this guide, re-create each one from a backup.
151-
See the xref:database-administration/standard-databases/create-databases.adoc[Create databases] section for more information on how to create a database.
152-
. Run `SHOW DATABASES`.
153-
If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
154-
// . For each database that remains unavailable, refer to <<unavailable-databases, Managing unavailable databases in a cluster>>.
155-
// Perform the actions required to get the database available then return to step 2.
