Commit d24af9b

WIP
1 parent ccc0826 commit d24af9b

1 file changed

+35
-33
lines changed

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 35 additions & 33 deletions
@@ -7,8 +7,8 @@ A database can become unavailable due to issues on different system levels.
77
For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
88

99
This section contains a step-by-step guide on how to recover *unavailable databases* that are incapable of serving writes, while possibly still being able to serve reads.
10+
The guide recovers the unavailable databases and makes them fully operational, with minimal impact on the other databases in the cluster.
1011
However, if a database is not performing as expected for other reasons, this section cannot help.
11-
By following the steps outlined here, you can recover the unavailable databases and make them fully operational, with minimal impact on the other databases in the cluster.
1212

1313
[CAUTION]
1414
====
@@ -53,15 +53,22 @@ Verifying each state before continuing to the next step, regardless of the disas
5353

5454
[NOTE]
5555
====
56-
Before beginning this guide, start the Neo4j process on all servers that are _offline_.
57-
If a server is unable to start, inspect the logs and contact support personnel.
58-
The server may have to be considered indefinitely lost.
59-
====
60-
6156
Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the `neo4j` scheme for routing.
6257
One way to remedy this is to connect directly to the server using `bolt` instead of `neo4j`.
6358
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
59+
====
60+
61+
=== Neo4j process started
62+
63+
==== State
64+
====
65+
The Neo4j process is started on all servers which are not _lost_.
66+
====
6467

68+
==== Path to correct state
69+
Start the Neo4j process on all servers that are _offline_.
70+
If a server is unable to start, inspect the logs and contact support personnel.
71+
The server may have to be considered indefinitely lost.
6572

6673
[[restore-the-system-database]]
6774
=== `System` database write availability
@@ -110,14 +117,14 @@ This causes downtime for all databases in the cluster until the processes are st
110117
. On each server, run `bin/neo4j-admin database info system` and compare the `lastCommittedTransaction` to find out which server has the most up-to-date copy of the `system` database.
111118
. On the most up-to-date server, run `bin/neo4j-admin database dump system --to-path=[path-to-dump]` to take a dump of the current `system` database and store it in an accessible location.
112119
. For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
113-
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
120+
It is important that the new servers are unconstrained, or deallocating servers in the next step of this guide might be blocked, even though enough servers were added.
114121
+
115122
[NOTE]
116123
=====
117124
While recommended, it is not strictly necessary to add new servers in this step.
118125
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database.
119126
The number of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
120-
Not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
127+
Be aware that not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide.
121128
=====
122129
+
123130
. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
@@ -136,8 +143,8 @@ All servers in the cluster's view are available and enabled.
136143

137144
A lost server will still be in the `system` database's view of the cluster, but in an unavailable state.
138145
According to the view of the cluster, these lost servers are still hosting the databases they had before they became lost.
139-
Therefore, removing lost servers is not as easy as informing the `system` database that they are lost.
140-
It also includes moving requested allocations on the lost servers onto servers which are actually in the cluster, so that those databases' topologies are still satisfied.
146+
Therefore, informing the cluster of servers which are lost is not enough.
147+
The databases hosted on the lost servers also need to be moved onto servers which are actually in the cluster.
141148

142149
==== Example verification
143150
The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
@@ -150,7 +157,7 @@ SHOW SERVERS;
150157

151158
==== Path to correct state
152159
The following steps can be used to remove lost servers and add new ones to the cluster.
153-
They include moving any potential database allocations from lost servers to available servers in the cluster.
160+
That includes moving any potential database allocations from lost servers to available servers in the cluster.
154161
These steps might also recreate some databases, since a database which has lost a majority of its primary allocations cannot be moved from one server to another.
155162

156163
.Guide
@@ -170,44 +177,38 @@ However, not adding new servers reduces the capacity of the cluster to handle wo
170177
Furthermore, it might require the topology for a database to be altered to make deallocating servers and recreating databases possible.
171178
=====
172179
173-
// ? from here
174-
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
175-
This will try to move all database allocations from this server to an available server in the cluster.
176-
Once a server is `DEALLOCATED`, all allocated user databases on this server has been moved successfully.
177-
+
178-
[NOTE]
179-
=====
180-
Remember, moving databases can take an unbounded amount of time since it involves copying the store to a new server.
181-
Therefore, an allocation with `currentStatus` = `DEALLOCATING` should reach the `requestedStatus` = `DEALLOCATED` given some time.
182-
=====
183-
. If any deallocations failed, make them possible by executing the following steps:
184-
.. Run `SHOW DATABASES`. If a database show `currentStatus`= `offline` this database has been stopped.
185-
.. For each stopped database that has at least one allocation on any of the `CORDONED` servers, start them by running `START DATABASE stopped-db WAIT`.
180+
. Run `SHOW DATABASES`. If a database shows `currentStatus` = `offline`, this database has been stopped.
181+
. For each stopped database that has at least one allocation on any of the `CORDONED` servers, start it by running `START DATABASE stopped-db WAIT`.
186182
This is necessary since stopped databases cannot be moved from one server to another.
187183
+
188184
[NOTE]
189185
=====
190186
To avoid updates on a database that is meant to stay stopped, set it to read-only before starting it, using the following command:
191187
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
192188
=====
193-
.. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
189+
. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
194190
Depending on the environment, consider extending the timeout for this procedure.
195191
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
196-
197-
.. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
192+
. For each database that is not write available, recreate it to move it from lost servers and regain write availability.
198193
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
194+
If any allocation of a database has `currentStatus` = `QUARANTINED`, recreate that database from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed] or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
199195
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
200196
+
201197
[NOTE]
202198
=====
203199
By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], databases which have lost all allocations can also be recreated.
204200
Otherwise, recreating with xref:clustering/databases.adoc#uri-seed[Backup as seed] must be used for that specific case.
205201
=====
206-
.. Return to step 3 to retry deallocating all servers.
207-
. For each deallocated server, run `DROP SERVER deallocated-server-id`.
208-
This safely removes the server from the cluster's view.
209-
210-
// ? to here really
202+
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
203+
This will try to move all database allocations from this server to an available server in the cluster.
204+
+
205+
[NOTE]
206+
=====
207+
This operation might fail if not enough unconstrained servers were added to the cluster to replace lost servers.
208+
It might also fail if some of the available servers are `CORDONED`.
209+
=====
210+
. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
211+
This removes the server from the cluster's view.
211212
====
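The write-availability rule used in the guide above (a database is write available if any of its primary allocations reports `replicationSuccessful` = `TRUE`) can be sketched as follows.
This is only an illustrative sketch: it assumes the rows returned by `CALL dbms.cluster.statusCheck([])` on every server have already been collected into a list of dictionaries, and the `database` and `serverId` keys, as well as the function name, are assumptions made for this example rather than something this page defines.

```python
from collections import defaultdict


def write_available_databases(status_rows):
    """Map each database name to True if any row reports successful replication.

    `status_rows` is assumed to be a list of dicts, one per primary allocation,
    gathered from the status check on all servers (hypothetical shape).
    """
    availability = defaultdict(bool)
    for row in status_rows:
        # One successful replication is enough to call the database write available.
        availability[row["database"]] |= bool(row["replicationSuccessful"])
    return dict(availability)


# Hypothetical sample rows, not real output.
rows = [
    {"database": "neo4j", "serverId": "a", "replicationSuccessful": True},
    {"database": "neo4j", "serverId": "b", "replicationSuccessful": False},
    {"database": "foo", "serverId": "a", "replicationSuccessful": False},
]

# Databases that would be candidates for recreation.
needs_recreate = sorted(
    name for name, ok in write_available_databases(rows).items() if not ok
)
print(needs_recreate)
```

With the sample rows, only `foo` has no allocation reporting successful replication, so it is the only candidate for recreation.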
212213

213214

@@ -242,7 +243,7 @@ Therefore, the desired state has been verified when this is true for all databas
242243
CALL dbms.cluster.statusCheck([]);
243244
----
244245

245-
A stricter verification could be done to verify if all databases are in desired states on all servers.
246+
A stricter verification can be done to verify that all databases are in their desired states on all servers.
246247
For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
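The stricter check can be sketched as a simple comparison over the rows of `SHOW DATABASES`.
This is a minimal sketch, assuming the result has already been fetched into a list of dictionaries keyed by column name; the function name and the sample rows are hypothetical.

```python
def mismatched_allocations(show_databases_rows):
    """Return (name, address) for every allocation not yet in its requested state."""
    return [
        (row["name"], row["address"])
        for row in show_databases_rows
        if row["requestedStatus"] != row["currentStatus"]
    ]


# Hypothetical sample rows, not real output.
sample = [
    {"name": "neo4j", "address": "server1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "neo4j", "address": "server2:7687",
     "requestedStatus": "online", "currentStatus": "offline"},
]

pending = mismatched_allocations(sample)
print(pending)
```

An empty result means every listed allocation has reached its requested status; any remaining entries point to the allocations that still need attention.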
247248

248249
==== Path to correct state
@@ -255,6 +256,7 @@ Recreations might fail for different reasons, but one example is that the checks
255256
====
256257
. Run `CALL dbms.cluster.statusCheck([])` on all servers to identify write unavailable databases, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
257258
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
259+
If any allocation of a database has `currentStatus` = `QUARANTINED`, recreate that database from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed] or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
258260
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
259261
. Run `SHOW DATABASES` and check any recreated databases which are not write available.
260262
Recreating a database will not complete if one of the following messages is displayed in the message field:
