File: modules/ROOT/pages/clustering/disaster-recovery.adoc
Lines changed: 66 additions & 50 deletions
@@ -18,38 +18,45 @@ You have to create a new cluster and restore the databases, see xref:clustering/
18
18
19
19
== Faults in clusters
20
20
21
-
Databases in clusters follow an allocation strategy.
22
-
This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries.
21
+
Databases in clusters may be allocated differently within the cluster and may also have different numbers of primaries and secondaries.
23
22
The consequence of this is that all servers may be different in which databases they are hosting.
24
23
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
25
24
Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
26
25
27
-
== Guide structure
26
+
== Guide overview
28
27
[NOTE]
29
28
====
30
-
In this guide, an _offline_ server is a server that is not running but may be restartable.
31
-
A _lost_ server, however, is a server that is currently not running and cannot be restarted.
32
-
A _write available_ database is able to serve writes, while a _write unavailable_ database is not.
29
+
In this guide the following terms are used:
30
+
31
+
* An _offline_ server is a server that is not running but may be restartable.
32
+
* A _lost_ server, however, is a server that is currently not running and cannot be restarted.
33
+
* A _write available_ database is able to serve writes, while a _write unavailable_ database is not.
33
34
====
34
35
35
-
There are three main steps to recovering a cluster from a disaster.
36
-
First, ensure the `system` database is write available.
37
-
Then, detach any potential lost servers from the cluster and replace them by new ones.
38
-
Finish disaster recovery by starting or continuing to manage databases and verify that they are write available.
36
+
There are four steps to recovering a cluster from a disaster:
37
+
38
+
. Start the Neo4j process on all servers which are not _lost_.
39
+
See xref:start-the-neo4j-process[Start the Neo4j process] for more information.
40
+
. Make the `system` database write available, so that the cluster can be modified.
41
+
See xref:make-the-system-database-write-available[Make the `system` database write available] for more information.
42
+
. Detach any lost servers from the cluster and replace them with new ones.
43
+
See xref:make-servers-available[Make servers available] for more information.
44
+
. Finish disaster recovery by starting or continuing to manage databases and verify that they are write available.
45
+
See xref:make-databases-write-available[Make databases write available] for more information.
39
46
40
-
Every step consists of the following three sections:
47
+
Each step is described in the following three sections:
41
48
42
-
. A state that the cluster needs to be in, with optional motivation.
43
-
. An example of how the state can be verified.
44
-
. A proposed series of steps to get to the correct state.
49
+
. Objective -- a state that the cluster needs to be in, with optional motivation.
50
+
. Verifying the state -- an example of how the state can be verified.
51
+
. Path to correct state -- a proposed series of steps to get to the correct state.
45
52
46
53
[CAUTION]
47
54
====
48
55
Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
49
56
====
50
57
51
58
52
-
== Guide to disaster recovery
59
+
== Disaster recovery steps
53
60
54
61
[NOTE]
55
62
====
@@ -58,7 +65,8 @@ One way to remedy this is to connect directly to the server using `bolt` instead
58
65
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
59
66
====
60
67
61
-
=== Neo4j process started
68
+
[[start-the-neo4j-process]]
69
+
=== Start the Neo4j process
62
70
63
71
==== Objective
64
72
====
@@ -70,8 +78,8 @@ Start the Neo4j process on all servers that are _offline_.
70
78
If a server is unable to start, inspect the logs and contact support personnel.
71
79
The server may have to be considered indefinitely lost.
72
80
73
-
[[restore-the-system-database]]
74
-
=== `System` database write availability
81
+
[[make-the-system-database-write-available]]
82
+
=== Make the `system` database write available
75
83
76
84
==== Objective
77
85
====
@@ -80,11 +88,11 @@ The `system` database is write available.
80
88
81
89
The `system` database contains the view of the cluster.
82
90
This includes which servers and databases are present, where they live and how they are configured.
83
-
During a disaster, the view of the cluster might need to change to reflect a new reality, for example by removing lost servers.
91
+
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
84
92
Databases might also need to be recreated to regain write availability.
85
93
Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
86
94
87
-
==== Example verification
95
+
==== Verifying the state
88
96
The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
105
+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
106
+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
98
107
=====
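The single-primary check described in the note can be sketched as follows, using the `system` database as the example.
This is a minimal sketch: the columns are a suggested subset of the `SHOW DATABASES` output, and the `address` value still has to be compared manually against the available servers reported by `SHOW SERVERS`.

[source, cypher]
----
// Show where the system database is allocated and the status of each allocation
SHOW DATABASE system YIELD name, address, role, writer, currentStatus;
----

The primary allocation should report `currentStatus` = `online` on a server that is available.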
99
108
100
109
==== Path to correct state
101
-
The following steps can be used to regain write availability for the `system` database if it has been lost.
110
+
Use the following steps to regain write availability for the `system` database if it has been lost.
102
111
They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
103
112
It is important to get a `system` database that is as up-to-date as possible, so that it closely corresponds to the view before the disaster.
104
113
@@ -108,7 +117,8 @@ It is important to get a `system` database that is as up-to-date as possible, so
108
117
109
118
[NOTE]
110
119
=====
111
-
This section of the disaster recovery guide uses `neo4j-admin`, for more information about the used commands, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands].
120
+
This section of the disaster recovery guide uses `neo4j-admin` commands.
121
+
For more information about the commands used, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands].
112
122
=====
113
123
114
124
. Shut down the Neo4j process on all servers.
@@ -123,7 +133,8 @@ It is important that the new servers are unconstrained, or deallocating servers
123
133
=====
124
134
While recommended, it is not strictly necessary to add new servers in this step.
125
135
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database.
126
-
The amount of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
136
+
The number of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`.
137
+
See the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
127
138
Be aware that not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
128
139
=====
129
140
+
@@ -133,8 +144,8 @@ Be aware that not replacing servers can cause cluster overload when databases ar
133
144
====
134
145
135
146
136
-
[[recover-servers]]
137
-
=== Server availability
147
+
[[make-servers-available]]
148
+
=== Make servers available
138
149
139
150
==== Objective
140
151
====
@@ -146,9 +157,9 @@ Furthermore, according to the view of the cluster, these lost servers are still
146
157
Therefore, informing the cluster of servers which are lost is not enough.
147
158
The databases hosted on lost servers also need to be moved onto available servers in the cluster, before the lost servers can be removed.
148
159
149
-
==== Example verification
160
+
==== Verifying the state
150
161
The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
151
-
The state has been verified if *all* servers show `health` = `AVAILABLE` and `status` = `ENABLED`.
162
+
The state has been verified if *all* servers show `health` = `Available` and `status` = `Enabled`.
152
163
153
164
[source, cypher]
154
165
----
@@ -157,16 +168,18 @@ SHOW SERVERS;
157
168
158
169
==== Path to correct state
159
170
The following steps can be used to remove lost servers and add new ones to the cluster.
160
-
To be able to remove lost servers, any allocations it should host needs to be moved to available servers in the cluster.
161
-
This is done in two steps, first any databases that cannot move by themselves needs to be recreated so that they are forced to move.
162
-
Then, any allocations that can move will be told to do so by deallocating the server.
171
+
To be able to remove lost servers, any allocations they should host need to be moved to available servers in the cluster.
172
+
This is done in two different ways:
173
+
174
+
* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
175
+
* Any allocations that can move will be instructed to do so by deallocating the server.
163
176
164
177
.Guide
165
178
[%collapsible]
166
179
====
167
-
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
180
+
. For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
168
181
This prevents new database allocations from being moved to this server.
169
-
. For each `CORDONED` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
182
+
. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
170
183
If servers were added in the xref:make-the-system-database-write-available[Make the `system` database write available] step of this guide, additional servers might not be needed here.
171
184
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
172
185
+
@@ -180,7 +193,7 @@ Furthermore, it might require the topology for a database to be altered to make
180
193
. For each stopped database (`currentStatus` = `offline`), start it by running `START DATABASE stopped-db`.
181
194
This is necessary since stopped databases cannot be deallocated from a server.
182
195
It is also necessary for the status check procedure to accurately indicate if this database should be recreated or not.
183
-
Verify that all allocations are in `currentStatus` = `started` on servers which are not lost before moving to the next step.
196
+
Verify that all allocations are in `currentStatus` = `online` on servers which are not lost before moving to the next step.
184
197
If a database fails to start, leave it to be recreated in the next step of this guide.
185
198
+
186
199
[NOTE]
@@ -193,39 +206,40 @@ A database can be set to `READ-ONLY` before it is started to avoid updates on th
193
206
+
194
207
[NOTE]
195
208
=====
196
-
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
209
+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
210
+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
197
211
=====
198
212
199
213
. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
200
214
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
201
215
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
202
-
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
216
+
If any database has `currentStatus` = `quarantined` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
203
217
+
204
218
[CAUTION]
205
219
=====
206
-
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
220
+
If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
207
221
=====
208
222
209
-
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
210
-
This will try to move all database allocations from this server to an available server in the cluster.
223
+
. For each `Cordoned` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
224
+
This will move all database allocations from this server to an available server in the cluster.
211
225
+
212
226
[NOTE]
213
227
=====
214
228
This operation might fail if not enough unconstrained servers were added to the cluster to replace lost servers.
215
-
Another reason is that some available servers are also `CORDONED`.
229
+
Another reason is that some available servers are also `Cordoned`.
216
230
=====
217
231
218
232
. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
219
233
This removes the server from the cluster's view.
220
234
====
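The cordon, deallocate, and drop sequence from the guide above can be sketched as follows.
This is a minimal sketch, not a complete procedure: `lost-server-id` is a hypothetical placeholder for an ID reported by `SHOW SERVERS`, and the intermediate steps (adding replacement servers, starting stopped databases, recreating write unavailable databases) are omitted.

[source, cypher]
----
// List the servers and identify the ones that are not available
SHOW SERVERS;

// Cordon a lost server so that no new allocations are placed on it
// ("lost-server-id" is a placeholder, not a real ID)
CALL dbms.cluster.cordonServer("lost-server-id");

// Once write unavailable databases have been recreated,
// move the remaining allocations away from the cordoned server
DEALLOCATE DATABASES FROM SERVER "lost-server-id";

// Remove the deallocated server from the cluster's view
DROP SERVER "lost-server-id";
----

Run the statements one at a time on an available server, and list the servers again afterwards to confirm that only available servers remain.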
221
235
222
236
223
-
[[recover-databases]]
224
-
=== Database availability
237
+
[[make-databases-write-available]]
238
+
=== Make databases write available
225
239
226
240
==== Objective
227
241
====
228
-
All databases which are desired to be started are write available.
242
+
All databases that are desired to be started are write available.
229
243
====
230
244
231
245
Once this state is verified, disaster recovery is complete.
@@ -235,12 +249,12 @@ If they are still desired to be in stopped state, run `STOP DATABASE started-db
235
249
[CAUTION]
236
250
====
237
251
Remember, recreating a database takes an unbounded amount of time since it may involve copying the store to a new server, as described in xref:clustering/databases.adoc#recreate-databases[Recreate databases].
238
-
Therefore, an allocation with `currentStatus` = `STARTING` will probably reach the `requestedStatus` given some time.
252
+
Therefore, an allocation with `currentStatus` = `starting` will probably reach the `requestedStatus` given some time.
239
253
====
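One way to watch for allocations that have not yet reached their requested state is to filter the output of `SHOW DATABASES`.
The following is a sketch under the assumption that comparing `currentStatus` with `requestedStatus` is sufficient for this purpose; the selection of columns is only a suggestion.

[source, cypher]
----
// List allocations whose current state differs from the requested state,
// for example allocations that are still starting after a recreate
SHOW DATABASES YIELD name, address, role, requestedStatus, currentStatus, statusMessage
WHERE currentStatus <> requestedStatus;
----

An empty result suggests that every allocation has reached its requested state; rows that stay in the result for a long time may need further investigation, starting with `statusMessage`.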
240
254
241
255
[[example-verification]]
242
-
==== Example verification
243
-
All databases' write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
256
+
==== Verifying the state
257
+
You can verify all clustered databases' write availability by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[status check] procedure.
The write availability of a database configured to have a single primary cannot be checked with the status check, instead check that the primary is allocated on an available server and that it has `currentStatus` = `STARTED`.
266
+
The status check procedure cannot verify the write availability of a database configured to have a single primary.
267
+
Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`.
253
268
=====
254
269
255
270
A stricter verification can be done to verify that all databases are in their desired states on all servers.
@@ -263,14 +278,15 @@ Recreations might fail for different reasons, but one example is that the checks
263
278
.Guide
264
279
[%collapsible]
265
280
====
266
-
. Identify all write unavailable databases that are desired to be `STARTED` by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
281
+
. Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Verifying the state] part of this disaster recovery step.
282
+
Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
267
283
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
268
284
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
269
-
If any database has `currentStatus` = `QUARANTINED` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
285
+
If any database has `currentStatus` = `quarantined` on an available server, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
270
286
+
271
287
[CAUTION]
272
288
=====
273
-
By using recreate with xref:clustering/databases.adoc#undefined-servers[Undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in some edge cases where the system database has been restored.
289
+
If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
274
290
=====
275
291
276
292
. Run `SHOW DATABASES` and check any recreated databases which are not write available.