modules/ROOT/pages/clustering/disaster-recovery.adoc
31 additions & 25 deletions
@@ -1,4 +1,4 @@
-:description: This section describes how to recover databases that have become unavailable.
+:description: This section describes how to recover databases that have become unavailable. How to heal a cluster.
 [role=enterprise-edition]
 [[cluster-recovery]]
 = Disaster recovery
@@ -30,32 +30,32 @@ In this guide the following terms are used:
 
 * An _offline_ server is a server that is not running but may be restartable.
 * A _lost_ server, however, is a server that is currently not running and cannot be restarted.
-* A _writeavailable_ database is able to serve writes, while a _write unavailable_ database is not.
+* A _write-available_ database is able to serve writes, while a _write unavailable_ database is not.
 ====
 
 There are four steps to recovering a cluster from a disaster:
 
 . Start the Neo4j process on all servers which are not _lost_.
-See xref:start-the-neo4j-process[Start the neo4j process] for more information.
-. Make the `system` database write available, so that the cluster can be modified.
-See xref:make-the-system-database-write-available[Make the `system` database writeavailable] for more information.
+See xref:start-the-neo4j-process[Start the Neo4j process] for more information.
+. Make the `system` database able to accept write operations, so that the cluster can be modified.
+See xref:make-the-system-database-write-available[Make the `system` database write-available] for more information.
 . Detach any potential lost servers from the cluster and replace them by new ones.
 See xref:make-servers-available[Make servers available] for more information.
-. Finish disaster recovery by starting or continuing to manage databases and verify that they are writeavailable.
-See xref:make-databases-write-available[Make databases writeavailable] for more information.
+. Finish disaster recovery by starting or continuing to manage databases and verify that they are write-available.
+See xref:make-databases-write-available[Make databases write-available] for more information.
 
 Each step is described in the following three sections:
 
 . Objective -- a state that the cluster needs to be in, with optional motivation.
-. Verifying the state -- An example of how the state can be verified.
+. Verifying the state -- an example of how the state can be verified.
 . Path to correct state -- a proposed series of steps to get to the correct state.
 
 [CAUTION]
 ====
 Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
 ====
 
-
+[[disaster-recovery-steps]]
 == Disaster recovery steps
 
 [NOTE]
@@ -69,30 +69,33 @@ See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] f
 === Start the Neo4j process
 
 ==== Objective
+
 ====
-The Neo4j process is started on all servers which are not _lost_.
+The Neo4j process is started on all servers that are not _lost_.
 ====
 
 ==== Path to correct state
+
 Start the Neo4j process on all servers that are _offline_.
 If a server is unable to start, inspect the logs and contact support personnel.
 The server may have to be considered indefinitely lost.
 
 [[make-the-system-database-write-available]]
-=== Make the `system` database writeavailable
+=== Make the `system` database write-available
 
 ==== Objective
 ====
-The `system` database is write available.
+The `system` database is able to accept write operations.
 ====
 
 The `system` database contains the view of the cluster.
 This includes which servers and databases are present, where they live and how they are configured.
 During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
 Databases might also need to be recreated to regain write availability.
-Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
+Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery.
 
 ==== Verifying the state
+
 The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
 
 [source, shell]
@@ -107,6 +110,7 @@ Instead, check that the primary is allocated on an available server and that it
 =====
 
 ==== Path to correct state
+
 Use the following steps to regain write availability for the `system` database if it has been lost.
 They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
 It is important to get a `system` database that is as up-to-date as possible, so it corresponds to the view before the disaster closely.
@@ -167,9 +171,9 @@ SHOW SERVERS;
 ----
 
 ==== Path to correct state
-The following steps can be used to remove lost servers and add new ones to the cluster.
-To be able to remove lost servers, any allocations it should host need to be moved to available servers in the cluster.
-This is done in two different ways:
+Use the following steps to remove lost servers and add new ones to the cluster.
+To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster.
+This can be done in two different ways:
 
 * Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
 * Any allocations that can move will be instructed to do so by deallocating the server.
@@ -179,8 +183,10 @@ This is done in two different ways:
 ====
 . For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
 This prevents new database allocations from being moved to this server.
-. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
-If servers were added in the 'System database write availability' step of this guide, additional servers might not be needed here.
+. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place.
+See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
++
+If servers were added in the <<make-the-system-database-write-available, Make the `system` database write-available>> step of this guide, additional servers might not be needed here.
 It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
 +
 [NOTE]
@@ -210,7 +216,7 @@ The status check procedure cannot verify the write availability of a database co
 Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
 =====
 
-. For each database that is not writeavailable, recreate it to move it from lost servers and regain write availability.
+. For each database that is not write-available, recreate it to move it from lost servers and regain write availability.
 Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
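
The two `SHOW DATABASES` checks in this hunk (a primary allocated on an available server with `currentStatus` = `online`, and databases that are `quarantined` on an available server) could be automated roughly as follows. This is a minimal sketch and not part of the change: it assumes each row of `SHOW DATABASES` has been loaded into a dictionary, and the `name`, `address`, and `role` keys as well as the `available_addresses` set are assumptions made for illustration. Only `currentStatus` is named in the text above.

```python
def primary_is_online(rows, database, available_addresses):
    """Return True if `database` has a primary allocation that sits on an
    available server and reports currentStatus == "online".

    `rows` is a list of dicts, one per allocation (assumed shape).
    """
    for row in rows:
        if row["name"] != database:
            continue
        if row.get("role") != "primary":  # `role` key is an assumption
            continue
        if row["address"] in available_addresses and row["currentStatus"] == "online":
            return True
    return False


def quarantined_databases(rows, available_addresses):
    """Return the sorted names of databases that have at least one
    quarantined allocation on an available server."""
    return sorted(
        {
            row["name"]
            for row in rows
            if row["currentStatus"] == "quarantined"
            and row["address"] in available_addresses
        }
    )
```

Databases returned by `quarantined_databases` are the candidates for recreation from backup that the last context line describes.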
@@ -235,11 +241,11 @@ This removes the server from the cluster's view.
 
 
 [[make-databases-write-available]]
-=== Make databases writeavailable
+=== Make databases write-available
 
 ==== Objective
 ====
-All databases that are desired to be started are writeavailable.
+All databases that are desired to be started are write-available.
 ====
 
 Once this state is verified, disaster recovery is complete.
@@ -271,16 +277,16 @@ A stricter verification can be done to verify that all databases are in their de
 For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
 
 ==== Path to correct state
-The following steps can be used to make all databases in the cluster writeavailable again.
-They include recreating any databases that are not write available, as well as identifying any recreations which will not complete.
+Use the following steps to make all databases in the cluster write-available again.
+They include recreating any databases that are not write-capable and identifying any recreations that will not complete.
 Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.
 
 .Guide
 [%collapsible]
 ====
 . Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
 Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
-. Recreate every database that is not writeavailable and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
+. Recreate every database that is not write-available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
 +
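
The stricter check at the top of this hunk (`requestedStatus` = `currentStatus` for all database allocations on all servers) could be scripted along these lines. This is a minimal sketch, assuming the `SHOW DATABASES` rows have already been fetched into dictionaries. The `name` and `address` keys are assumptions used only to label each allocation in the result.

```python
def mismatched_allocations(rows):
    """Return one tuple per allocation whose current status differs from the
    requested status: (name, address, requestedStatus, currentStatus).

    An empty list means the stricter check passes.
    """
    return [
        (row["name"], row["address"], row["requestedStatus"], row["currentStatus"])
        for row in rows
        if row["requestedStatus"] != row["currentStatus"]
    ]
```

An empty result corresponds to the state in which every allocation has reached its desired status. Any tuple returned points at an allocation that still needs attention.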
@@ -289,7 +295,7 @@ If any database has `currentStatus` = `quarantined` on an available server, recr
 If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
 =====
 
-. Run `SHOW DATABASES` and check any recreated databases which are not writeavailable.
+. Run `SHOW DATABASES` and check any recreated databases that are not write-available.
 Recreating a database will not complete if one of the following messages is displayed in the message field:
 ** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
 ** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
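
Spotting the two messages above across many databases could be done with a small filter. This is a minimal sketch, assuming the `SHOW DATABASES` rows are available as dictionaries and that the message field is stored under a `statusMessage` key. That key name and the `name` key are assumptions. The fragments are copied from the messages quoted above, leaving out the server and transaction identifiers, which vary.

```python
# Fixed parts of the two messages quoted in the step above.
BLOCKING_FRAGMENTS = (
    "have different checksums for transaction",
    "have incompatible storeIds",
)


def stuck_recreations(rows):
    """Return (name, message) for every row whose message field contains one
    of the fragments that indicate a recreation will not complete."""
    stuck = []
    for row in rows:
        message = row.get("statusMessage") or ""  # key name is an assumption
        if any(fragment in message for fragment in BLOCKING_FRAGMENTS):
            stuck.append((row["name"], message))
    return stuck
```

Substring matching is used instead of an exact comparison because the identifiers in the real messages differ per cluster.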