Commit 43d60b6

committed
Review comments.
1 parent 3140906 commit 43d60b6

File tree

1 file changed

+17
-30
lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 17 additions & 30 deletions
@@ -39,7 +39,7 @@ Finish disaster recovery by starting or continuing to manage databases and verif
3939

4040
Every step consists of the following three sections:
4141

42-
. A state that needs to be verified, with optional motivation.
42+
. A state that the cluster needs to be in, with optional motivation.
4343
. An example of how the state can be verified.
4444
. A proposed series of steps to get to the correct state.
4545

@@ -60,7 +60,7 @@ See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] f
6060

6161
=== Neo4j process started
6262

63-
==== State
63+
==== Objective
6464
====
6565
The Neo4j process is started on all servers which are not _lost_.
6666
====
@@ -73,7 +73,7 @@ The server may have to be considered indefinitely lost.
7373
[[restore-the-system-database]]
7474
=== `System` database write availability
7575

76-
==== State
76+
==== Objective
7777
====
7878
The `system` database is write available.
7979
====
@@ -86,10 +86,6 @@ Because both of these steps are executed by modifying the `system` database, mak
8686

8787
==== Example verification
8888
The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
89-
The procedure should be called on all remaining primary allocations of the `system` database, in order to provide the correct view.
90-
The default timeout for the procedure is 1 second, but depending on the network latency in the environment it might need to be extended to produce an accurate result.
91-
If any of the primary `system` allocations report `replicationSuccessful` = `TRUE`, the `system` database is write available.
92-
Therefore, the desired state has been verified.
9389

9490
[source, shell]
9591
----
@@ -99,7 +95,6 @@ CALL dbms.cluster.statusCheck(["system"]);
9995
[NOTE]
10096
=====
10197
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
102-
The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
10398
=====
10499

105100
==== Path to correct state
@@ -141,7 +136,7 @@ Be aware that not replacing servers can cause cluster overload when databases ar
141136
[[recover-servers]]
142137
=== Server availability
143138

144-
==== State
139+
==== Objective
145140
====
146141
All servers in the cluster's view are available and enabled.
147142
====
@@ -162,8 +157,9 @@ SHOW SERVERS;
162157

163158
==== Path to correct state
164159
The following steps can be used to remove lost servers and add new ones to the cluster.
165-
That includes moving any potential database allocations from lost servers to available servers.
166-
These steps might also recreate some databases, since a database which has lost a majority of its primary allocations cannot be moved from one server to another.
160+
To be able to remove a lost server, any allocations it hosts need to be moved to available servers in the cluster.
161+
This is done in two steps: first, any databases that cannot move by themselves need to be recreated so that they are forced to move.
162+
Then, any allocations that can move will be told to do so by deallocating the server.
167163

168164
.Guide
169165
[%collapsible]
@@ -182,35 +178,32 @@ Furthermore, it might require the topology for a database to be altered to make
182178
=====
183179
184180
. For each stopped database (`currentStatus`= `offline`), start them by running `START DATABASE stopped-db`.
185-
This is necessary since stopped databases cannot be moved from one server to another.
186-
Verify that they are in `currentStatus` = `started` on all servers which are not lost before moving to the next step, otherwise they might be recreated unnecessarily.
181+
This is necessary since stopped databases cannot be deallocated from a server.
182+
It is also necessary for the status check procedure to accurately indicate if this database should be recreated or not.
183+
Verify that all allocations are in `currentStatus` = `started` on servers which are not lost before moving to the next step.
187184
If a database fails to start, leave it to be recreated in the next step of this guide.
188185
+
189186
[NOTE]
190187
=====
191-
A database can be set to `READ-ONLY` before it is started to avoid updates on a database that is desired to be stopped with the following command:
188+
A database can be set to `READ-ONLY` before it is started to avoid updates on the database with the following command:
192189
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
193190
=====
194191
195192
. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases running in primary mode on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
196-
Depending on the network latency in the environment, consider extending the timeout for this procedure to produce an accurate result.
197-
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
198193
+
199194
[NOTE]
200195
=====
201196
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
202-
The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
203197
=====
204198
205199
. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
206200
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
207201
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
208202
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
209203
+
210-
[NOTE]
204+
[CAUTION]
211205
=====
212-
By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store will be replaced by the most up-to-date copy according to the cluster's view without manual intervention.
213-
Furthermore, this option will automatically recreate the database based on a backup if no available allocation can be found.
206+
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
214207
=====
215208
216209
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
@@ -230,7 +223,7 @@ This removes the server from the cluster's view.
230223
[[recover-databases]]
231224
=== Database availability
232225

233-
==== State
226+
==== Objective
234227
====
235228
All databases which are desired to be started are write available.
236229
====
@@ -248,10 +241,6 @@ Therefore, an allocation with `currentStatus` = `STARTING` will probably reach t
248241
[[example-verification]]
249242
==== Example verification
250243
All databases' write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
251-
The procedure should be called on all servers in the cluster, in order to provide the correct view.
252-
The default timeout for the procedure is 1 second, but depending on the network latency in the environment it might need to be extended to produce an accurate result.
253-
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
254-
Therefore, the desired state has been verified when this is true for all *started* databases.
255244

256245
[source, shell]
257246
----
@@ -261,7 +250,6 @@ CALL dbms.cluster.statusCheck([]);
261250
[NOTE]
262251
=====
263252
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
264-
The procedure will still produce an accurate result if all but one primary have been lost during a disaster.
265253
=====
266254

267255
A stricter verification can be done to verify that all databases are in their desired states on all servers.
@@ -270,7 +258,7 @@ For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` =
270258
==== Path to correct state
271259
The following steps can be used to make all databases in the cluster write available again.
272260
They include recreating any databases that are not write available, as well as identifying any recreations which will not complete.
273-
Recreations might fail for different reasons, but one example is that the checksums does not match for the same transaction on different copies.
261+
Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.
274262

275263
.Guide
276264
[%collapsible]
@@ -280,10 +268,9 @@ Recreations might fail for different reasons, but one example is that the checks
280268
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
281269
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
282270
+
283-
[NOTE]
271+
[CAUTION]
284272
=====
285-
By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store will be replaced by the most up-to-date copy according to the cluster's view without manual intervention.
286-
Furthermore, this option will automatically recreate the database based on a backup if no available allocation can be found.
273+
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
287274
=====
288275
289276
. Run `SHOW DATABASES` and check any recreated databases which are not write available.
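
The guide steps in this diff (run the status check on each server, then recreate every database that is not write available) can be sketched as a small aggregation. This is only an illustration, not part of the commit: it assumes the rule stated in the sentences this commit removes from the page, namely that a database is write available if any of its primary allocations reports `replicationSuccessful` = `TRUE`. The row layout, the `database` and `server` keys, and the function name `write_availability` are hypothetical; only `replicationSuccessful` appears in the source. As the NOTE in the diff says, databases configured with a single primary cannot be checked this way.

```python
from collections import defaultdict


def write_availability(rows):
    """Map each database name to True if any primary reported a successful replication.

    `rows` is a hypothetical list of dicts, one per status check result,
    gathered from every server the procedure was called on.
    """
    available = defaultdict(bool)
    for row in rows:
        # One successful replication from any primary is enough (assumed rule).
        available[row["database"]] |= bool(row["replicationSuccessful"])
    return dict(available)


# Hypothetical results collected by calling the status check on two servers.
rows = [
    {"server": "server-1", "database": "system", "replicationSuccessful": True},
    {"server": "server-2", "database": "system", "replicationSuccessful": False},
    {"server": "server-1", "database": "orders", "replicationSuccessful": False},
    {"server": "server-2", "database": "orders", "replicationSuccessful": False},
]

status = write_availability(rows)
# Databases that would be candidates for recreation in the next guide step.
needs_recreate = sorted(db for db, ok in status.items() if not ok)
print(needs_recreate)
```

In this made-up data set, `system` counts as write available because one primary succeeded, while `orders` would be a candidate for recreation.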
