Commit 81332e9

committed
Move to new structure.
1 parent a3e8f2f commit 81332e9

File tree

1 file changed: +127 −75 lines changed


modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 127 additions & 75 deletions
@@ -7,7 +7,7 @@ A database can become unavailable due to issues on different system levels.
For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.

This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes, while they may still be able to serve reads.
- However, if a database is unavailable because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
+ However, if a database is not performing as expected for other reasons, this section cannot help.
By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.

[NOTE]
@@ -21,135 +21,187 @@ See xref:clustering/setup/deploy.adoc[Deploy a basic cluster] and xref:clusterin

Databases in clusters follow an allocation strategy.
This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries.
- The consequence of this is that all servers are different in which databases they are hosting.
+ Furthermore, some databases may not be allowed to be allocated to some servers because of user-defined strategies.
+ The consequence of this is that all servers may differ in which databases they are hosting and are allowed to host.
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
- Therefore, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
-
- == Guide to disaster recovery
+ Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.

+ == Guide structure
There are three main steps to recovering a cluster from a disaster.
- Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
+ First, ensure the `system` database is write available, i.e. able to accept writes.
+ Then, detach any potentially lost servers and replace them with new ones.
+ Finish disaster recovery by starting or continuing to manage databases and verifying that they are available.

- [NOTE]
- ====
- Any potential quarantined databases need to be handled before executing this guide, see xref:database-administration/standard-databases/errors.adoc#quarantine[Quarantined databases] for more information.
+ Every step consists of the following four sections:
+
+ . State that needs to be verified.
+ . Example of how the state can be verified.
+ . Motivation for why this state is necessary.
+ . Path to correct state.
+
+ [CAUTION]
====
+ Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.

- . Ensure the `system` database is available in the cluster.
- The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
+ ====

- . After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements.
- This process also starts the managing of databases.
+ In this section, an _offline_ server is a server that is not running but may be _restartable_.
+ A _lost_ server, however, is a server that is currently not running and cannot be restarted.

- . After the `system` database is available and the cluster's topology is satisfied, start or continue managing databases and verify that they are available.

- The steps are described in detail in the following sections.
+ == Guide to disaster recovery

[NOTE]
====
- In this section, an _offline_ server is a server that is not running but may be _restartable_.
- A _lost_ server, however, is a server that is currently not running and cannot be restarted.
+ Before beginning this guide, start the Neo4j process on all servers that are _offline_.
+ If a server is unable to start, inspect the logs and contact support personnel.
+ The server may have to be considered indefinitely lost.
====

- [NOTE]
- ====
Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the `neo4j` scheme for routing.
One way to remedy this is to connect directly to the server using `bolt` instead of `neo4j`.
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme.
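
For illustration, a direct connection over the `bolt` scheme could be opened with Cypher Shell as sketched below; the hostname, port, and credentials are placeholders, not values from this commit:

[source, shell]
----
# Connect straight to one server, bypassing routing via the `neo4j` scheme.
cypher-shell -a bolt://server01.example.com:7687 -u neo4j -p <password>
----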
- ====
+

[[restore-the-system-database]]
- === `System` database availability
+ === `System` database write availability

- The first step of recovery is to ensure that the `system` database is able to accept writes.
- The `system` database is required for clusters to function properly.
+ ==== State
+ ====
+ The `system` database is write available, i.e. able to accept writes.
+ ====

- . Start the Neo4j process on all servers that are _offline_.
- If a server is unable to start, inspect the logs and contact support personnel.
- The server may have to be considered indefinitely lost.
- . Validate the `system` database's write availability by running `CALL dbms.cluster.statusCheck(["system"])` on all remaining system primaries, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
- Depending on the environment, consider extending the timeout for this procedure.
- If any of the system primaries report `replicationSuccessful` = `TRUE`, the system database is write available and does not need to be recovered.
- Therefore, skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Server availability].
+ ==== Motivation
+ The `system` database contains the view of the cluster, including which servers and databases are present and how they are configured.
+ During a disaster, the goal is to change the view of the cluster, for example by removing and adding servers or recreating databases.
+ In order for the view to be updated, the `system` database needs to be write available.
+ Therefore, it is vital to ensure its write availability so that the next steps can be executed.

- +
- . Regain availability by restoring the `system` database.
- +
- [NOTE]
- ====
- Only do the steps below if the `system` database's write availability cannot be validated by the first two steps in this section.
+ ==== Example verification
+ The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure.
+ The procedure should be called on all remaining primary allocations of the `system` database, in order to provide the correct view.
+ The status check procedure writes a dummy transaction, and therefore the correctness of the result depends on the given timeout.
+ The default timeout is 1 second, but depending on the network latency in the environment it might need to be extended.
+ If any of the primary `system` allocations report `replicationSuccessful` = `TRUE`, the `system` database is write available.
+ Therefore, the desired state has been verified.
88+
[source, shell]
89+
----
90+
CALL dbms.cluster.statusCheck(["system"]);
91+
----
92+
93+
==== Path to correct state
94+
The following steps can be used to regain write availability for the `system` database if it has been lost.
95+
They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
96+
It is important to get a `system` database that is as up-to-date as possible, so that future commands operate on state that is as correct as possible.
97+
98+
.Guide
99+
[%collapsible]
81100
====
82-
+
83101
84-
The following steps create a new `system` database from a backup of the current `system` database.
85-
This is required since the current `system` database has lost too many members to be able to accept writes.
86-
87-
.. Shut down the Neo4j process on all servers.
88-
Note that this causes downtime for all databases in the cluster.
89-
.. On each server, run the following `neo4j-admin` command `bin/neo4j-admin dbms unbind-system-db` to reset the `system` database state on the servers.
90-
See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands] for more information.
91-
.. On each server, run the following `neo4j-admin` command `bin/neo4j-admin database info system` to find out which server is most up-to-date, ie. has the highest last-committed transaction id.
92-
.. On the most up-to-date server, take a dump of the current `system` database by running `bin/neo4j-admin database dump system --to-path=[path-to-dump]` and store the dump in an accessible location.
93-
See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands] for more information.
94-
.. For every _lost_ server, add a new unconstrained one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
102+
[NOTE]
103+
=====
104+
This section of the disaster recovery guide uses `neo4j-admin`, for more information about the used commands, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands].
105+
=====
106+
107+
. Shut down the Neo4j process on all servers.
108+
This causes downtime for all databases in the cluster until the processes are started again at the end of this section.
109+
. On each server, run `bin/neo4j-admin dbms unbind-system-db` to reset the `system` database state on the servers.
110+
. On each server, run `bin/neo4j-admin database info system` and compare the `lastCommittedTransaction` to find out which server has the most up-to-date copy of the `system` database.
111+
. On the most up-to-date server, run `bin/neo4j-admin database dump system --to-path=[path-to-dump]` to take a dump of the current `system` database and store it in an accessible location.
112+
. For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
113+
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers was added.
95114
+
96115
[NOTE]
97-
====
116+
=====
98117
While recommended to avoid cluster overload, it is not strictly necessary to add servers in this step.
99-
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primaries for the new `system` database.
100-
The amount of primaries needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
101-
====
118+
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database.
119+
The amount of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
120+
=====
102121
+
103-
.. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
104-
.. Ensure that the discovery settings are correct on all servers, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
105-
.. Return to step 1, to start all servers and confirm the `system` database is now available.
122+
. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
123+
. On each server, ensure that the discovery settings are correct, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
124+
. Start the Neo4j process on all servers.
125+
====
106126

107127

108128
[[recover-servers]]
109129
=== Server availability
110130

111-
Once the `system` database is available, the cluster can be managed.
112-
Following the loss of one or more servers, the cluster's view of servers must be updated, ie. the lost servers must be replaced by new ones.
113-
The steps here identify the lost servers and safely detach them from the cluster, while recreating any databases that cannot be moved from the lost servers because they have lost availability.
131+
==== State
132+
====
133+
All servers in the cluster's view are available and enabled.
134+
====
135+
136+
==== Motivation
137+
// different stuffs here
138+
Following the loss of one or more servers, the cluster's view of servers must be updated.
139+
This includes moving allocations on the lost servers onto servers which are actually in the cluster
140+
This includes identifying the lost servers and replacing them by new ones.
141+
142+
==== Example verification
143+
The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
144+
The state has been verified if *all* servers show `health` = `AVAILABLE` and `status` = `ENABLED`.
114145

115-
. Run `SHOW SERVERS`.
116-
If *all* servers show health `AVAILABLE` and status `ENABLED` continue to xref:clustering/disaster-recovery.adoc#recover-databases[Database availability].
146+
[source, cypher]
147+
----
148+
SHOW SERVERS;
149+
----
150+
151+
==== Path to correct state
152+
Detach lost servers and add new ones to the cluster
153+
154+
.Guide
155+
[%collapsible]
156+
====
117157
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
158+
This prevents new database allocations from being moved to this server.
118159
. For each `CORDONED` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
119-
If servers were added in the xref:clustering/disaster-recovery.adoc#restore-the-system-database[System database availability] step, the amount of servers that needs to be added in this step is less than the number of `CORDONED` servers.
160+
If servers were added in the 'System database write availability' step of this guide, additional servers might not be needed here.
120161
121162
+
122163
[NOTE]
123-
====
164+
=====
124165
While recommended, it is not strictly necessary to add new servers in this step.
125166
However, not adding new servers reduces the capacity of the cluster to handle work and might require the topology for a database to be altered to make deallocations and recreations possible.
126-
====
167+
=====
127168
128-
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeeded, skip to step 6.
129-
. If any deallocations failed, make them possible by the following steps:
169+
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
170+
This will try to move all database allocations from this server to another server in the cluster.
171+
Once a server is `DEALLOCATED`, all allocated user databases on this server has been moved successfully.
172+
+
173+
[NOTE]
174+
=====
175+
Remember, moving databases can take an unbounded amount of time since it involves copying the store to a new server.
176+
Therefore, an allocation with `currentStatus` = `DEALLOCATING` should reach the `requestedStatus` = `DEALLOCATED` given some time.
177+
=====
178+
. If any deallocations failed, make them possible by executing the following steps:
130179
.. Run `SHOW DATABASES`. If a database show `currentStatus`= `offline` this database has been stopped.
131-
.. For each stopped database that is allocated on any of the `CORDONED` servers, start them by running `START DATABASE stopped-db WAIT`.
180+
.. For each stopped database that has at least one allocation on any of the `CORDONED` servers, start them by running `START DATABASE stopped-db WAIT`.
132181
+
133182
[NOTE]
134-
====
135-
A database can be set to `READ-ONLY` before it is started to avoid updates on a database that is desired to be stopped with the following:
183+
=====
184+
A database can be set to `READ-ONLY` before it is started to avoid updates on a database that is desired to be stopped with the following command:
136185
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
137-
====
138-
.. Run `CALL dbms.cluster.statusCheck([])` on all servers, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
186+
=====
187+
.. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
139188
Depending on the environment, consider extending the timeout for this procedure.
140189
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
141190
142-
.. Recreate every database that is not write available, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
191+
.. For each database that is not write available, recreate it to regain write availability.
192+
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
143193
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
144194
+
145195
[NOTE]
146-
====
196+
=====
147197
By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], also databases which have lost all allocation can be recreated.
148198
Otherwise, recreating with xref:clustering/databases.adoc#uri-seed[Backup as seed] must be used for that specific case.
149-
====
150-
.. Return to step 4 to retry deallocating all servers.
199+
=====
200+
.. Return to step 3 to retry deallocating all servers.
151201
. For each deallocated server, run `DROP SERVER deallocated-server-id`.
152-
. Return to step 1 to make sure all servers in the cluster are `AVAILABLE`.
202+
This safely removes the server from the cluster view.
203+
204+
====
153205

154206

155207
[[recover-databases]]

0 commit comments

Comments
 (0)