modules/ROOT/pages/clustering/disaster-recovery.adoc
35 additions & 33 deletions
@@ -7,8 +7,8 @@ A database can become unavailable due to issues on different system levels.
 For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
 
 This section contains a step-by-step guide on how to recover *unavailable databases* that are incapable of serving writes, while possibly still being able to serve reads.
+The guide recovers the unavailable databases and makes them fully operational, with minimal impact on the other databases in the cluster.
 However, if a database is not performing as expected for other reasons, this section cannot help.
-By following the steps outlined here, you can recover the unavailable databases and make them fully operational, with minimal impact on the other databases in the cluster.
 
 [CAUTION]
 ====
@@ -53,15 +53,22 @@ Verifying each state before continuing to the next step, regardless of the disas
 
 [NOTE]
 ====
-Before beginning this guide, start the Neo4j process on all servers that are _offline_.
-If a server is unable to start, inspect the logs and contact support personnel.
-The server may have to be considered indefinitely lost.
-====
-
 Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the `neo4j` scheme for routing.
 One way to remedy this is to connect directly to the server using `bolt` instead of `neo4j`.
 See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
+====
+
+=== Neo4j process started
+
+==== State
+====
+The Neo4j process is started on all servers which are not _lost_.
+====
 
+==== Path to correct state
+Start the Neo4j process on all servers that are _offline_.
+If a server is unable to start, inspect the logs and contact support personnel.
+The server may have to be considered indefinitely lost.
 
 [[restore-the-system-database]]
 === `System` database write availability
@@ -110,14 +117,14 @@ This causes downtime for all databases in the cluster until the processes are st
 . On each server, run `bin/neo4j-admin database info system` and compare the `lastCommittedTransaction` to find out which server has the most up-to-date copy of the `system` database.
 . On the most up-to-date server, run `bin/neo4j-admin database dump system --to-path=[path-to-dump]` to take a dump of the current `system` database and store it in an accessible location.
 . For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
-It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
+It is important that the new servers are unconstrained, or deallocating servers in the next step of this guide might be blocked, even though enough servers were added.
 
 [NOTE]
 =====
 While recommended, it is not strictly necessary to add new servers in this step.
 There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database.
 The amount of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
-Not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
+Be aware that not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
 =====
 
 . On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
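The server-selection step above can be sketched as a small helper: collect each server's `lastCommittedTransaction` for the `system` database (as reported by `bin/neo4j-admin database info system`) and dump from the server with the highest value. This is an editor's illustrative sketch, not part of the guide; the server names and transaction IDs are invented.

```python
# Hypothetical per-server results of `bin/neo4j-admin database info system`,
# reduced to the single field the guide compares. All values are invented.
info = {
    "server-1": 1037,  # lastCommittedTransaction
    "server-2": 1041,
    "server-3": 1029,
}

def most_up_to_date(last_committed: dict) -> str:
    """Return the server holding the highest lastCommittedTransaction."""
    return max(last_committed, key=last_committed.get)

source = most_up_to_date(info)
print(source)
# The dump would then be taken on that server:
#   bin/neo4j-admin database dump system --to-path=[path-to-dump]
```

The subsequent `database load` step then distributes that dump to every server, so only the dump source needs to be chosen this way.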
@@ -136,8 +143,8 @@ All servers in the cluster's view are available and enabled.
 
 A lost server will still be in the `system` database's view of the cluster, but in an unavailable state.
 According to the view of the cluster, these lost servers are still hosting the databases they had before they became lost.
-Therefore, removing lost servers is not as easy as informing the `system` database that they are lost.
-It also includes moving requested allocations on the lost servers onto servers which are actually in the cluster, so that those databases' topologies are still satisfied.
+Therefore, informing the cluster of servers which are lost is not enough.
+The databases hosted on the lost servers also need to be moved onto servers which are actually in the cluster.
 
 ==== Example verification
 The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
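The verification above could be automated by scanning the listed servers for entries that are still in the cluster's view but not healthy. A minimal sketch, assuming rows shaped like the `name` and `health` columns of `SHOW SERVERS`; the row values are invented for illustration:

```python
# Invented sample of SHOW SERVERS rows, reduced to name and health.
servers = [
    {"name": "server-1", "health": "AVAILABLE"},
    {"name": "server-2", "health": "UNAVAILABLE"},  # a lost server
    {"name": "server-3", "health": "AVAILABLE"},
]

def lost_servers(rows):
    """Servers still in the cluster's view but reported as not available."""
    return [r["name"] for r in rows if r["health"] != "AVAILABLE"]

# These are the servers the following steps cordon, deallocate, and drop.
print(lost_servers(servers))
```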
@@ -150,7 +157,7 @@ SHOW SERVERS;
 
 ==== Path to correct state
 The following steps can be used to remove lost servers and add new ones to the cluster.
-They include moving any potential database allocations from lost servers to available servers in the cluster.
+That includes moving any potential database allocations from lost servers to available servers in the cluster.
 These steps might also recreate some databases, since a database which has lost a majority of its primary allocations cannot be moved from one server to another.
 
 .Guide
@@ -170,44 +177,38 @@ However, not adding new servers reduces the capacity of the cluster to handle wo
 Furthermore, it might require the topology for a database to be altered to make deallocating servers and recreating databases possible.
 =====
 
-// ? from here
-. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
-This will try to move all database allocations from this server to an available server in the cluster.
-Once a server is `DEALLOCATED`, all allocated user databases on this server has been moved successfully.
-
-[NOTE]
-=====
-Remember, moving databases can take an unbounded amount of time since it involves copying the store to a new server.
-Therefore, an allocation with `currentStatus` = `DEALLOCATING` should reach the `requestedStatus` = `DEALLOCATED` given some time.
-=====
-. If any deallocations failed, make them possible by executing the following steps:
-.. Run `SHOW DATABASES`. If a database show `currentStatus`= `offline` this database has been stopped.
-.. For each stopped database that has at least one allocation on any of the `CORDONED` servers, start them by running `START DATABASE stopped-db WAIT`.
+. Run `SHOW DATABASES`. If a database shows `currentStatus` = `offline`, this database has been stopped.
+. For each stopped database that has at least one allocation on any of the `CORDONED` servers, start them by running `START DATABASE stopped-db WAIT`.
 This is necessary since stopped databases cannot be moved from one server to another.
 
 [NOTE]
 =====
 A database can be set to `READ-ONLY` before it is started to avoid updates on a database that is desired to be stopped with the following command:
 `ALTER DATABASE database-name SET ACCESS READ ONLY`.
 =====
-.. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
+. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
 Depending on the environment, consider extending the timeout for this procedure.
 If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
-
-.. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
+. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
 Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
+If any allocation has `currentStatus` = `QUARANTINED`, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed] or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 
 [NOTE]
 =====
 By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], also databases which have lost all allocation can be recreated.
 Otherwise, recreating with xref:clustering/databases.adoc#uri-seed[Backup as seed] must be used for that specific case.
 =====
-.. Return to step 3 to retry deallocating all servers.
-. For each deallocated server, run `DROP SERVER deallocated-server-id`.
-This safely removes the server from the cluster's view.
-
-// ? to here really
+. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
+This will try to move all database allocations from this server to an available server in the cluster.
+
+[NOTE]
+=====
+This operation might fail if enough unconstrained servers were not added to the cluster to replace lost servers.
+Another reason is that some available servers are also `CORDONED`.
+=====
+. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
+This removes the server from the cluster's view.
 ====
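The write-availability test used in the guide reduces to a simple rule: a database is write available if at least one of its primary allocations reports successful replication. A sketch of that rule, assuming `dbms.cluster.statusCheck` results flattened to (database, role, replicationSuccessful) rows; the sample data is invented:

```python
# Invented per-allocation rows, reduced to the fields the guide inspects
# from CALL dbms.cluster.statusCheck([]).
status_rows = [
    {"database": "neo4j", "role": "primary", "replicationSuccessful": True},
    {"database": "neo4j", "role": "primary", "replicationSuccessful": False},
    {"database": "sales", "role": "primary", "replicationSuccessful": False},
    {"database": "sales", "role": "primary", "replicationSuccessful": False},
]

def write_unavailable(rows):
    """Databases where no primary allocation replicated successfully."""
    all_dbs = {r["database"] for r in rows}
    writable = {r["database"] for r in rows
                if r["role"] == "primary" and r["replicationSuccessful"]}
    return sorted(all_dbs - writable)

# Databases returned here are the candidates for recreation.
print(write_unavailable(status_rows))
```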
@@ -242,7 +243,7 @@ Therefore, the desired state has been verified when this is true for all databas
 CALL dbms.cluster.statusCheck([]);
 ----
 
-A stricter verification could be done to verify if all databases are in desired states on all servers.
+A stricter verification can be done to verify that all databases are in their desired states on all servers.
 For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
 
 ==== Path to correct state
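The stricter check above amounts to asserting `requestedStatus` = `currentStatus` for every allocation row returned by `SHOW DATABASES`. A small sketch with invented rows (the column names match the check the text describes; the values are illustrative):

```python
# Invented sample of SHOW DATABASES allocation rows.
allocations = [
    {"name": "neo4j", "server": "server-1",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "neo4j", "server": "server-2",
     "requestedStatus": "online", "currentStatus": "starting"},  # not settled yet
]

def all_in_desired_state(rows):
    """True when every allocation has reached its requested status."""
    return all(r["requestedStatus"] == r["currentStatus"] for r in rows)

print(all_in_desired_state(allocations))
```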
@@ -255,6 +256,7 @@ Recreations might fail for different reasons, but one example is that the checks
 ====
 . Run `CALL dbms.cluster.statusCheck([])` on all servers to identify write unavailable databases, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
 . Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
+If any allocation has `currentStatus` = `QUARANTINED`, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed] or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
 Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
 . Run `SHOW DATABASES` and check any recreated databases which are not write available.
 Recreating a database will not complete if one of the following messages is displayed in the message field: