Commit 3930931

Update disaster-recovery.adoc
1 parent dd49ab3 commit 3930931

1 file changed

+31
-25
lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 31 additions & 25 deletions
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
:description: This section describes how to recover databases that have become unavailable.
1+
:description: This section describes how to recover databases that have become unavailable. How to heal a cluster.
22
[role=enterprise-edition]
33
[[cluster-recovery]]
44
= Disaster recovery
@@ -30,32 +30,32 @@ In this guide the following terms are used:
3030
3131
* An _offline_ server is a server that is not running but may be restartable.
3232
* A _lost_ server, however, is a server that is currently not running and cannot be restarted.
33-
* A _write available_ database is able to serve writes, while a _write unavailable_ database is not.
33+
* A _write-available_ database is able to serve writes, while a _write unavailable_ database is not.
3434
====
3535

3636
There are four steps to recovering a cluster from a disaster:
3737

3838
. Start the Neo4j process on all servers which are not _lost_.
39-
See xref:start-the-neo4j-process[Start the neo4j process] for more information.
40-
. Make the `system` database write available, so that the cluster can be modified.
41-
See xref:make-the-system-database-write-available[Make the `system` database write available] for more information.
39+
See xref:start-the-neo4j-process[Start the Neo4j process] for more information.
40+
. Make the `system` database able to accept write operations, so that the cluster can be modified.
41+
See xref:make-the-system-database-write-available[Make the `system` database write-available] for more information.
4242
. Detach any potential lost servers from the cluster and replace them by new ones.
4343
See xref:make-servers-available[Make servers available] for more information.
44-
. Finish disaster recovery by starting or continuing to manage databases and verify that they are write available.
45-
See xref:make-databases-write-available[Make databases write available] for more information.
44+
. Finish disaster recovery by starting or continuing to manage databases and verify that they are write-available.
45+
See xref:make-databases-write-available[Make databases write-available] for more information.
4646

4747
Each step is described in the following three sections:
4848

4949
. Objective -- a state that the cluster needs to be in, with optional motivation.
50-
. Verifying the state -- An example of how the state can be verified.
50+
. Verifying the state -- an example of how the state can be verified.
5151
. Path to correct state -- a proposed series of steps to get to the correct state.
5252

5353
[CAUTION]
5454
====
5555
Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
5656
====
5757

58-
58+
[[disaster-recovery-steps]]
5959
== Disaster recovery steps
6060

6161
[NOTE]
@@ -69,30 +69,33 @@ See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] f
6969
=== Start the Neo4j process
7070

7171
==== Objective
72+
7273
====
73-
The Neo4j process is started on all servers which are not _lost_.
74+
The Neo4j process is started on all servers that are not _lost_.
7475
====
7576

7677
==== Path to correct state
78+
7779
Start the Neo4j process on all servers that are _offline_.
7880
If a server is unable to start, inspect the logs and contact support personnel.
7981
The server may have to be considered indefinitely lost.
8082

8183
[[make-the-system-database-write-available]]
82-
=== Make the `system` database write available
84+
=== Make the `system` database write-available
8385

8486
==== Objective
8587
====
86-
The `system` database is write available.
88+
The `system` database is able to accept write operations.
8789
====
8890

8991
The `system` database contains the view of the cluster.
9092
This includes which servers and databases are present, where they live and how they are configured.
9193
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
9294
Databases might also need to be recreated to regain write availability.
93-
Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
95+
Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery.
9496

9597
==== Verifying the state
98+
9699
The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
97100

98101
[source, shell]
@@ -107,6 +110,7 @@ Instead, check that the primary is allocated on an available server and that it
107110
=====
108111

109112
==== Path to correct state
113+
110114
Use the following steps to regain write availability for the `system` database if it has been lost.
111115
They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
112116
It is important to get a `system` database that is as up-to-date as possible, so it corresponds to the view before the disaster closely.
@@ -167,9 +171,9 @@ SHOW SERVERS;
167171
----
168172

169173
==== Path to correct state
170-
The following steps can be used to remove lost servers and add new ones to the cluster.
171-
To be able to remove lost servers, any allocations it should host need to be moved to available servers in the cluster.
172-
This is done in two different ways:
174+
Use the following steps to remove lost servers and add new ones to the cluster.
175+
To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster.
176+
This can be done in two different ways:
173177

174178
* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
175179
* Any allocations that can move will be instructed to do so by deallocating the server.
@@ -179,8 +183,10 @@ This is done in two different ways:
179183
====
180184
. For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
181185
This prevents new database allocations from being moved to this server.
182-
. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
183-
If servers were added in the 'System database write availability' step of this guide, additional servers might not be needed here.
186+
. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place.
187+
See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
188+
+
189+
If servers were added in the <<make-the-system-database-write-available, Make the `system` database write-available>> step of this guide, additional servers might not be needed here.
184190
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
185191
+
186192
[NOTE]
@@ -210,7 +216,7 @@ The status check procedure cannot verify the write availability of a database co
210216
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
211217
=====
212218
213-
. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
219+
. For each database that is not write-available, recreate it to move it from lost servers and regain write availability.
214220
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
215221
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
216222
If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
@@ -235,11 +241,11 @@ This removes the server from the cluster's view.
235241

236242

237243
[[make-databases-write-available]]
238-
=== Make databases write available
244+
=== Make databases write-available
239245

240246
==== Objective
241247
====
242-
All databases that are desired to be started are write available.
248+
All databases that are desired to be started are write-available.
243249
====
244250

245251
Once this state is verified, disaster recovery is complete.
@@ -271,16 +277,16 @@ A stricter verification can be done to verify that all databases are in their de
271277
For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
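
The stricter check could also be scripted. The following is a minimal Python sketch, not part of the commit. It assumes the rows returned by `SHOW DATABASES` have already been fetched into dictionaries with `name`, `address`, `requestedStatus`, and `currentStatus` keys. The helper name `find_mismatched_allocations` and the sample rows are hypothetical.

```python
from typing import Dict, List


def find_mismatched_allocations(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the allocations whose currentStatus differs from requestedStatus."""
    return [row for row in rows if row["requestedStatus"] != row["currentStatus"]]


# Hypothetical sample rows, shaped like the output of SHOW DATABASES.
sample_rows = [
    {"name": "system", "address": "server1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "foo", "address": "server2:7687",
     "requestedStatus": "online", "currentStatus": "offline"},
]

mismatched = find_mismatched_allocations(sample_rows)
for row in mismatched:
    print(f"{row['name']} on {row['address']}: "
          f"requested {row['requestedStatus']}, current {row['currentStatus']}")
```

An empty result means every allocation in the supplied rows is in its requested state. How the rows are fetched (driver, `cypher-shell`, or otherwise) is left open here.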
272278

273279
==== Path to correct state
274-
The following steps can be used to make all databases in the cluster write available again.
275-
They include recreating any databases that are not write available, as well as identifying any recreations which will not complete.
280+
Use the following steps to make all databases in the cluster write-available again.
281+
They include recreating any databases that are not write-capable and identifying any recreations that will not complete.
276282
Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.
277283

278284
.Guide
279285
[%collapsible]
280286
====
281287
. Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
282288
Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
283-
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
289+
. Recreate every database that is not write-available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
284290
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
285291
If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
286292
+
@@ -289,7 +295,7 @@ If any database has `currentStatus` = `quarantined` on an available server, recr
289295
If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
290296
=====
291297
292-
. Run `SHOW DATABASES` and check any recreated databases which are not write available.
298+
. Run `SHOW DATABASES` and check any recreated databases that are not write-available.
293299
Recreating a database will not complete if one of the following messages is displayed in the message field:
294300
** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
295301
** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
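
The check for recreations that will not complete could be automated by matching the message text. The following is a minimal Python sketch, not part of the commit. It covers only the two messages visible in this diff hunk (the full list in the file may be longer). The names `BLOCKING_PATTERNS` and `classify_message` are hypothetical, and the message strings are assumed to have been read from the message field of the `SHOW DATABASES` output.

```python
import re
from typing import Optional

# Patterns for the two messages quoted above. Server and transaction
# identifiers vary, so they are matched as runs of non-whitespace.
BLOCKING_PATTERNS = {
    "checksum mismatch": re.compile(
        r"Seeders \S+ and \S+ have different checksums for transaction \S+\."
    ),
    "incompatible storeIds": re.compile(
        r"Seeders \S+ and \S+ have incompatible storeIds\."
    ),
}


def classify_message(message: str) -> Optional[str]:
    """Return a label if the message indicates a recreation that will not complete."""
    for label, pattern in BLOCKING_PATTERNS.items():
        if pattern.search(message):
            return label
    return None


example = (
    "Seeders ServerId1 and ServerId2 have incompatible storeIds. "
    "All seeders must have compatible storeIds."
)
print(classify_message(example))
```

A return value of `None` means only that the message matched neither pattern. It does not by itself show that the recreation will complete.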
