modules/ROOT/pages/clustering/disaster-recovery.adoc
Lines changed: 17 additions & 30 deletions
@@ -39,7 +39,7 @@ Finish disaster recovery by starting or continuing to manage databases and verif
 Every step consists of the following three sections:

-. A state that needs to be verified, with optional motivation.
+. A state that the cluster needs to be in, with optional motivation.
 . An example of how the state can be verified.
 . A proposed series of steps to get to the correct state.
@@ -60,7 +60,7 @@ See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] f
 === Neo4j process started

-==== State
+==== Objective
 ====
 The Neo4j process is started on all servers which are not _lost_.
 ====
@@ -73,7 +73,7 @@ The server may have to be considered indefinitely lost.
 [[restore-the-system-database]]
 === `System` database write availability

-==== State
+==== Objective
 ====
 The `system` database is write available.
 ====
@@ -86,10 +86,6 @@ Because both of these steps are executed by modifying the `system` database, mak
 ==== Example verification
 The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
-The procedure should be called on all remaining primary allocations of the `system` database, in order to provide the correct view.
-The default timeout for the procedure is 1 second, but depending on the network latency in the environment it might need to be extended to produce an accurate result.
-If any of the primary `system` allocations report `replicationSuccessful` = `TRUE`, the `system` database is write available.
 The write availability of a database configured to have a single primary cannot be checked with the status check; instead, check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
-The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
 =====

 ==== Path to correct state
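
As a minimal sketch of the example verification for the `system` database, the status check could be called as follows.
Passing `"system"` in the list assumes that the list argument takes database names (the guide itself only shows the empty list), and the second argument is assumed to be a timeout in milliseconds.
Confirm both against the Status check page before relying on them.

[source, cypher]
----
// Run on each remaining server that hosts a primary allocation of `system`.
// Assumption: the list argument holds database names.
CALL dbms.cluster.statusCheck(["system"]);

// Assumption: an optional second argument sets the timeout in milliseconds.
// 3000 is an arbitrary illustrative value, not a recommendation.
CALL dbms.cluster.statusCheck(["system"], 3000);
----

According to the lines removed in this change, `replicationSuccessful` = `TRUE` on any primary allocation of `system` means that the database is write available.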
@@ -141,7 +136,7 @@ Be aware that not replacing servers can cause cluster overload when databases ar
 [[recover-servers]]
 === Server availability

-==== State
+==== Objective
 ====
 All servers in the cluster's view are available and enabled.
 ====
@@ -162,8 +157,9 @@ SHOW SERVERS;
 ==== Path to correct state
 The following steps can be used to remove lost servers and add new ones to the cluster.
-That includes moving any potential database allocations from lost servers to available servers.
-These steps might also recreate some databases, since a database which has lost a majority of its primary allocations cannot be moved from one server to another.
+To remove lost servers, any allocations they are supposed to host need to be moved to available servers in the cluster.
+This is done in two steps: first, any databases that cannot move by themselves need to be recreated so that they are forced to move.
+Then, any allocations that can move are told to do so by deallocating the server.

 .Guide
 [%collapsible]
@@ -182,35 +178,32 @@ Furthermore, it might require the topology for a database to be altered to make
 =====

 . For each stopped database (`currentStatus` = `offline`), start it by running `START DATABASE stopped-db`.
-This is necessary since stopped databases cannot be moved from one server to another.
-Verify that they are in `currentStatus` = `started` on all servers which are not lost before moving to the next step, otherwise they might be recreated unnecessarily.
+This is necessary since stopped databases cannot be deallocated from a server.
+It is also necessary for the status check procedure to accurately indicate if this database should be recreated or not.
+Verify that all allocations are in `currentStatus` = `started` on servers which are not lost before moving to the next step.
 If a database fails to start, leave it to be recreated in the next step of this guide.
 +
 [NOTE]
 =====
-A database can be set to `READ-ONLY` before it is started to avoid updates on a database that is desired to be stopped with the following command:
+A database can be set to `READ-ONLY` before it is started to avoid updates on the database with the following command:
 `ALTER DATABASE database-name SET ACCESS READ ONLY`.
 =====

 . On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases running in primary mode on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
-Depending on the network latency in the environment, consider extending the timeout for this procedure to produce an accurate result.
-If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
 +
 [NOTE]
 =====
 The write availability of a database configured to have a single primary cannot be checked with the status check; instead, check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
-The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
 =====

 . For each database that is not write available, recreate it to move it from lost servers and regain write availability.
 Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 If any database has `currentStatus` = `QUARANTINED` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
 +
-[NOTE]
+[CAUTION]
 =====
-By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store will be replaced by the most up-to-date copy according to the cluster's view without manual intervention.
-Furthermore, this option will automatically recreate the database based on a backup if no available allocation can be found.
+By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
 =====

 . For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
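
The statements named in the guide steps above can be collected into one sketch.
`stopped-db`, `database-name`, and `cordoned-server-id` are the placeholders used in the guide, not real identifiers, and the quoting around them is an assumption about how such names need to be escaped.
The recreate step is left out, since its options are described on the Recreate databases page.

[source, cypher]
----
// Optional: make the database read-only before starting it, to avoid updates.
ALTER DATABASE `database-name` SET ACCESS READ ONLY;

// Start a stopped database so that it can be deallocated and status-checked.
START DATABASE `stopped-db`;

// List allocations with their states to confirm that the database has started.
SHOW DATABASES YIELD name, address, requestedStatus, currentStatus;

// On each server: check write availability of the databases in primary mode there.
CALL dbms.cluster.statusCheck([]);

// After recreating the databases that are not write available (not shown),
// move the remaining allocations off each cordoned server.
DEALLOCATE DATABASES FROM SERVER 'cordoned-server-id';
----

Running the statements one at a time, and checking the result of each before continuing, matches the step-by-step intent of the guide better than running them as a batch.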
@@ -230,7 +223,7 @@ This removes the server from the cluster's view.
 [[recover-databases]]
 === Database availability

-==== State
+==== Objective
 ====
 All databases which are desired to be started are write available.
 ====
@@ -248,10 +241,6 @@ Therefore, an allocation with `currentStatus` = `STARTING` will probably reach t
 [[example-verification]]
 ==== Example verification
 All databases' write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
-The procedure should be called on all servers in the cluster, in order to provide the correct view.
-The default timeout for the procedure is 1 second, but depending on the network latency in the environment it might need to be extended to produce an accurate result.
-If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
-Therefore, the desired state has been verified when this is true for all *started* databases.
 The write availability of a database configured to have a single primary cannot be checked with the status check; instead, check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
-The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
 =====

 A stricter verification can be done to verify that all databases are in their desired states on all servers.
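
A minimal sketch of the two kinds of check mentioned above follows.
The status check call is the one used in the guide; the column selection for `SHOW DATABASES` is an illustrative choice, not something this page prescribes.

[source, cypher]
----
// Run on every server: checks write availability for the databases
// running in primary mode on that server.
CALL dbms.cluster.statusCheck([]);

// For databases with a single primary, and for the stricter check:
// list every allocation with its server address, role, and states.
SHOW DATABASES YIELD name, address, role, requestedStatus, currentStatus, statusMessage;
----

The second statement returns one row per allocation, so a database hosted on several servers appears several times.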
@@ -270,7 +258,7 @@ For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` =
 ==== Path to correct state
 The following steps can be used to make all databases in the cluster write available again.
 They include recreating any databases that are not write available, as well as identifying any recreations which will not complete.
-Recreations might fail for different reasons, but one example is that the checksums does not match for the same transaction on different copies.
+Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.

 .Guide
 [%collapsible]
@@ -280,10 +268,9 @@ Recreations might fail for different reasons, but one example is that the checks
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 If any database has `currentStatus` = `QUARANTINED` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
 +
-[NOTE]
+[CAUTION]
 =====
-By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store will be replaced by the most up-to-date copy according to the cluster's view without manual intervention.
-Furthermore, this option will automatically recreate the database based on a backup if no available allocation can be found.
+By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
 =====

 . Run `SHOW DATABASES` and check any recreated databases which are not write available.