Commit dd49ab3

Review comments and make the output examples from e.g. SHOW SERVERS the same as the actual output.
1 parent 43d60b6 commit dd49ab3

1 file changed

+66
-50
lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 66 additions & 50 deletions
@@ -18,38 +18,45 @@ You have to create a new cluster and restore the databases, see xref:clustering/
1818

1919
== Faults in clusters
2020

21-
Databases in clusters follow an allocation strategy.
22-
This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries.
21+
Databases in clusters may be allocated differently within the cluster and may also have different numbers of primaries and secondaries.
2322
The consequence of this is that all servers may be different in which databases they are hosting.
2423
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
2524
Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
2625

27-
== Guide structure
26+
== Guide overview
2827
[NOTE]
2928
====
30-
In this guide, an _offline_ server is a server that is not running but may be restartable.
31-
A _lost_ server, however, is a server that is currently not running and cannot be restarted.
32-
A _write available_ database is able to serve writes, while a _write unavailable_ database is not.
29+
In this guide the following terms are used:
30+
31+
* An _offline_ server is a server that is not running but may be restartable.
32+
* A _lost_ server, however, is a server that is currently not running and cannot be restarted.
33+
* A _write available_ database is able to serve writes, while a _write unavailable_ database is not.
3334
====
3435

35-
There are three main steps to recovering a cluster from a disaster.
36-
First, ensure the `system` database is write available.
37-
Then, detach any potential lost servers from the cluster and replace them by new ones.
38-
Finish disaster recovery by starting or continuing to manage databases and verify that they are write available.
36+
There are four steps to recovering a cluster from a disaster:
37+
38+
. Start the Neo4j process on all servers which are not _lost_.
39+
See xref:start-the-neo4j-process[Start the Neo4j process] for more information.
40+
. Make the `system` database write available, so that the cluster can be modified.
41+
See xref:make-the-system-database-write-available[Make the `system` database write available] for more information.
42+
. Detach any potential lost servers from the cluster and replace them with new ones.
43+
See xref:make-servers-available[Make servers available] for more information.
44+
. Finish disaster recovery by starting or continuing to manage databases and verify that they are write available.
45+
See xref:make-databases-write-available[Make databases write available] for more information.
3946

40-
Every step consists of the following three sections:
47+
Each step is described in the following three sections:
4148

42-
. A state that the cluster needs to be in, with optional motivation.
43-
. An example of how the state can be verified.
44-
. A proposed series of steps to get to the correct state.
49+
. Objective -- a state that the cluster needs to be in, with optional motivation.
50+
. Verifying the state -- an example of how the state can be verified.
51+
. Path to correct state -- a proposed series of steps to get to the correct state.
4552

4653
[CAUTION]
4754
====
4855
Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
4956
====
5057

5158

52-
== Guide to disaster recovery
59+
== Disaster recovery steps
5360

5461
[NOTE]
5562
====
@@ -58,7 +65,8 @@ One way to remedy this is to connect directly to the server using `bolt` instead
5865
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
5966
====
6067

61-
=== Neo4j process started
68+
[[start-the-neo4j-process]]
69+
=== Start the Neo4j process
6270

6371
==== Objective
6472
====
@@ -70,8 +78,8 @@ Start the Neo4j process on all servers that are _offline_.
7078
If a server is unable to start, inspect the logs and contact support personnel.
7179
The server may have to be considered indefinitely lost.
7280

73-
[[restore-the-system-database]]
74-
=== `System` database write availability
81+
[[make-the-system-database-write-available]]
82+
=== Make the `system` database write available
7583

7684
==== Objective
7785
====
@@ -80,11 +88,11 @@ The `system` database is write available.
8088

8189
The `system` database contains the view of the cluster.
8290
This includes which servers and databases are present, where they live and how they are configured.
83-
During a disaster, the view of the cluster might need to change to reflect a new reality, for example by removing lost servers.
91+
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
8492
Databases might also need to be recreated to regain write availability.
8593
Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
8694

87-
==== Example verification
95+
==== Verifying the state
8896
The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
8997

9098
[source, cypher]
@@ -94,11 +102,12 @@ CALL dbms.cluster.statusCheck(["system"]);
94102

95103
[NOTE]
96104
=====
97-
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
105+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
106+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
98107
=====
99108

100109
==== Path to correct state
101-
The following steps can be used to regain write availability for the `system` database if it has been lost.
110+
Use the following steps to regain write availability for the `system` database if it has been lost.
102111
They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
103112
It is important to get a `system` database that is as up-to-date as possible, so it corresponds to the view before the disaster closely.
104113

@@ -108,7 +117,8 @@ It is important to get a `system` database that is as up-to-date as possible, so
108117
109118
[NOTE]
110119
=====
111-
This section of the disaster recovery guide uses `neo4j-admin`, for more information about the used commands, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands].
120+
This section of the disaster recovery guide uses `neo4j-admin` commands.
121+
For more information about the used commands, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands].
112122
=====
113123
114124
. Shut down the Neo4j process on all servers.
@@ -123,7 +133,8 @@ It is important that the new servers are unconstrained, or deallocating servers
123133
=====
124134
While recommended, it is not strictly necessary to add new servers in this step.
125135
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database.
126-
The amount of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
136+
The number of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`.
137+
See the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
127138
Be aware that not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
128139
=====
129140
+
@@ -133,8 +144,8 @@ Be aware that not replacing servers can cause cluster overload when databases ar
133144
====
134145

135146

136-
[[recover-servers]]
137-
=== Server availability
147+
[[make-servers-available]]
148+
=== Make servers available
138149

139150
==== Objective
140151
====
@@ -146,9 +157,9 @@ Furthermore, according to the view of the cluster, these lost servers are still
146157
Therefore, informing the cluster of servers which are lost is not enough.
147158
The databases hosted on lost servers also need to be moved onto available servers in the cluster, before the lost servers can be removed.
148159

149-
==== Example verification
160+
==== Verifying the state
150161
The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
151-
The state has been verified if *all* servers show `health` = `AVAILABLE` and `status` = `ENABLED`.
162+
The state has been verified if *all* servers show `health` = `Available` and `status` = `Enabled`.
152163

153164
[source, cypher]
154165
----
@@ -157,16 +168,18 @@ SHOW SERVERS;
157168

158169
==== Path to correct state
159170
The following steps can be used to remove lost servers and add new ones to the cluster.
160-
To be able to remove lost servers, any allocations it should host needs to be moved to available servers in the cluster.
161-
This is done in two steps, first any databases that cannot move by themselves needs to be recreated so that they are forced to move.
162-
Then, any allocations that can move will be told to do so by deallocating the server.
171+
To be able to remove lost servers, any allocations they should host need to be moved to available servers in the cluster.
172+
This is done in two different ways:
173+
174+
* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
175+
* Any allocations that can move will be instructed to do so by deallocating the server.
163176

164177
.Guide
165178
[%collapsible]
166179
====
167-
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
180+
. For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
168181
This prevents new database allocations from being moved to this server.
169-
. For each `CORDONED` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
182+
. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
170183
If servers were added in the 'Make the `system` database write available' step of this guide, additional servers might not be needed here.
171184
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
172185
+
@@ -180,7 +193,7 @@ Furthermore, it might require the topology for a database to be altered to make
180193
. For each stopped database (`currentStatus` = `offline`), start it by running `START DATABASE stopped-db`.
181194
This is necessary since stopped databases cannot be deallocated from a server.
182195
It is also necessary for the status check procedure to accurately indicate if this database should be recreated or not.
183-
Verify that all allocations are in `currentStatus` = `started` on servers which are not lost before moving to the next step.
196+
Verify that all allocations are in `currentStatus` = `online` on servers which are not lost before moving to the next step.
184197
If a database fails to start, leave it to be recreated in the next step of this guide.
185198
+
186199
[NOTE]
@@ -193,39 +206,40 @@ A database can be set to `READ-ONLY` before it is started to avoid updates on th
193206
+
194207
[NOTE]
195208
=====
196-
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
209+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
210+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
197211
=====
198212
199213
. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
200214
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
201215
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
202-
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
216+
If any database has `currentStatus` = `quarantined` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
203217
+
204218
[CAUTION]
205219
=====
206-
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
220+
If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
207221
=====
208222
209-
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
210-
This will try to move all database allocations from this server to an available server in the cluster.
223+
. For each `Cordoned` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
224+
This will move all database allocations from this server to an available server in the cluster.
211225
+
212226
[NOTE]
213227
=====
214228
This operation might fail if not enough unconstrained servers were added to the cluster to replace lost servers.
215-
Another reason is that some available servers are also `CORDONED`.
229+
Another reason is that some available servers are also `Cordoned`.
216230
=====
217231
218232
. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
219233
This removes the server from the cluster's view.
220234
====
221235

222236

223-
[[recover-databases]]
224-
=== Database availability
237+
[[make-databases-write-available]]
238+
=== Make databases write available
225239

226240
==== Objective
227241
====
228-
All databases which are desired to be started are write available.
242+
All databases that are desired to be started are write available.
229243
====
230244

231245
Once this state is verified, disaster recovery is complete.
@@ -235,12 +249,12 @@ If they are still desired to be in stopped state, run `STOP DATABASE started-db
235249
[CAUTION]
236250
====
237251
Remember, recreating a database takes an unbounded amount of time since it may involve copying the store to a new server, as described in xref:clustering/databases.adoc#recreate-databases[Recreate databases].
238-
Therefore, an allocation with `currentStatus` = `STARTING` will probably reach the `requestedStatus` given some time.
252+
Therefore, an allocation with `currentStatus` = `starting` will probably reach the `requestedStatus` given some time.
239253
====
240254

241255
[[example-verification]]
242-
==== Example verification
243-
All databases' write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
256+
==== Verifying the state
257+
You can verify all clustered databases' write availability by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[status check] procedure.
244258

245259
[source, cypher]
246260
----
@@ -249,7 +263,8 @@ CALL dbms.cluster.statusCheck([]);
249263

250264
[NOTE]
251265
=====
252-
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
266+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
267+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
253268
=====
254269

255270
A stricter verification can be done to verify that all databases are in their desired states on all servers.
@@ -263,14 +278,15 @@ Recreations might fail for different reasons, but one example is that the checks
263278
.Guide
264279
[%collapsible]
265280
====
266-
. Identify all write unavailable databases that are desired to be `STARTED` by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
281+
. Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Verifying the state] part of this disaster recovery step.
282+
Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
267283
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
268284
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
269-
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
285+
If any database has `currentStatus` = `quarantined` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
270286
+
271287
[CAUTION]
272288
=====
273-
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
289+
If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
274290
=====
275291
276292
. Run `SHOW DATABASES` and check any recreated databases which are not write available.
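
Two of the "Verifying the state" checks in this diff (every server showing `health` = `Available` and `status` = `Enabled`, and every database allocation reaching its requested status) could be sketched in Python roughly as follows. This is an illustrative sketch and not part of the commit: the rows are hypothetical dictionaries standing in for the output of `SHOW SERVERS` and `SHOW DATABASES`, the function names are invented for this example, and the column names (`name`, `health`, `status`, `address`, `requestedStatus`, `currentStatus`) are assumed to match the actual output.

```python
# Illustrative sketch only. The rows below are hypothetical stand-ins for the
# output of SHOW SERVERS and SHOW DATABASES; the column names are assumptions.


def unhealthy_servers(server_rows):
    """Return the names of servers that are not both Available and Enabled."""
    return [
        row["name"]
        for row in server_rows
        if row["health"] != "Available" or row["status"] != "Enabled"
    ]


def allocations_not_in_requested_state(database_rows):
    """Return (database, address) pairs whose currentStatus differs from requestedStatus."""
    return [
        (row["name"], row["address"])
        for row in database_rows
        if row["currentStatus"] != row["requestedStatus"]
    ]


# Hypothetical sample data: one healthy server and one lost, cordoned server.
servers = [
    {"name": "server-1", "health": "Available", "status": "Enabled"},
    {"name": "server-2", "health": "Unavailable", "status": "Cordoned"},
]

# Hypothetical sample data: one allocation is still starting.
databases = [
    {"name": "system", "address": "server-1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "foo", "address": "server-1:7687",
     "requestedStatus": "online", "currentStatus": "starting"},
]

print(unhealthy_servers(servers))
print(allocations_not_in_requested_state(databases))
```

An empty list from both functions would correspond to the verified state; any entries point at servers or allocations that still need attention before moving to the next step.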
