modules/ROOT/pages/clustering/disaster-recovery.adoc

A database can become unavailable due to issues on different system levels.
For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes but may still be able to serve reads.
However, if a database is unavailable because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.
Databases in clusters follow an allocation strategy.
This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries.
As a consequence, servers differ in which databases they host.
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
Therefore, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
== Guide to disaster recovery

Completing each step, regardless of the disaster scenario, is recommended to ensure that the cluster is fully operational.

[NOTE]
====
Any quarantined databases need to be handled before executing this guide, see xref:database-administration/standard-databases/errors.adoc#quarantine[Quarantined databases] for more information.
====
. Ensure the `system` database is available in the cluster.
The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements.
This process also starts the management of databases.
. After the `system` database is available and the cluster's topology is satisfied, start or continue managing databases and verify that they are available.
The steps are described in detail in the following sections.
See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information.

[[restore-the-system-database]]
=== `System` database availability
The first step of recovery is to ensure that the `system` database is able to accept writes.
The `system` database is required for clusters to function properly.
. Start the Neo4j process on all servers that are _offline_.
If a server is unable to start, inspect the logs and contact support personnel.
The server may have to be considered indefinitely lost.
. Validate the `system` database's write availability by running `CALL dbms.cluster.statusCheck(["system"])` on all remaining system primaries, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
Depending on the environment, consider extending the timeout for this procedure.
If any of the `system` primaries report `replicationSuccessful` = `TRUE`, the `system` database is write available and does not need to be recovered.
Therefore, skip to xref:clustering/disaster-recovery.adoc#recover-servers[Server availability].
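+
For example, a minimal check might look as follows (the optional second argument, a timeout in milliseconds, is an assumption; verify the procedure signature in your version):
+
[source,cypher]
----
// Check write availability of the system database, allowing extra time for replication.
CALL dbms.cluster.statusCheck(["system"], 10000);
----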
. Regain availability by restoring the `system` database.
.. On each server, run the `neo4j-admin` command `bin/neo4j-admin database info system` to find out which server is most up-to-date, i.e. has the highest last-committed transaction ID.
.. On the most up-to-date server, take a dump of the current `system` database by running `bin/neo4j-admin database dump system --to-path=[path-to-dump]` and store the dump in an accessible location.
See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands] for more information.
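+
A sketch of these two commands, with an illustrative dump directory:
+
[source,shell]
----
# On every server: print store information, including the last committed transaction id.
bin/neo4j-admin database info system

# On the most up-to-date server only: dump the system database.
bin/neo4j-admin database dump system --to-path=/mnt/backups
----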
.. For every _lost_ server, add a new unconstrained one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
+
[NOTE]
====
While adding servers is recommended to avoid overloading the cluster, it is not strictly necessary in this step.
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primaries for the new `system` database.
The number of primaries needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
====
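+
If the configuration route is chosen instead, the change on each promoted secondary is a sketch along these lines in `neo4j.conf` (the value `3` is illustrative and matches the setting's default):
+
[source,properties]
----
# Host the new system database as a primary on this server.
server.cluster.system_database_mode=PRIMARY

# Minimum number of system primaries required to form the new system database.
dbms.cluster.minimum_initial_system_primaries_count=3
----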
.. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
.. Ensure that the discovery settings are correct on all servers, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
.. Return to step 1 to start all servers and confirm the `system` database is now available.
[[recover-servers]]
=== Server availability
Once the `system` database is available, the cluster can be managed.
Following the loss of one or more servers, the cluster's view of servers must be updated, i.e. the lost servers must be replaced by new ones.
The steps here identify the lost servers and safely detach them from the cluster, while recreating any databases that cannot be moved off the lost servers because those databases have lost availability.
. Run `SHOW SERVERS`.
If *all* servers show health `AVAILABLE` and status `ENABLED`, continue to xref:clustering/disaster-recovery.adoc#recover-databases[Database availability].
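+
For example, to list only the servers that need attention (`toUpper` guards against case differences in the returned values):
+
[source,cypher]
----
// Servers that are not both enabled and available require action.
SHOW SERVERS YIELD name, address, state, health
WHERE toUpper(state) <> "ENABLED" OR toUpper(health) <> "AVAILABLE"
RETURN name, address, state, health;
----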
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
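+
For example, with an illustrative server id:
+
[source,cypher]
----
// Prevent new allocations from being placed on the lost server.
CALL dbms.cluster.cordonServer("25a7efc7-d063-44b8-bdee-f23357f89f01");
----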
. For each `CORDONED` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
If servers were added in the xref:clustering/disaster-recovery.adoc#restore-the-system-database[System database availability] step, the number of servers that need to be added in this step is less than the number of `CORDONED` servers.
+
[NOTE]
====
While recommended, it is not strictly necessary to add new servers in this step.
However, not adding new servers reduces the capacity of the cluster to handle work and might require the topology for a database to be altered to make deallocations and recreations possible.
====
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeed, skip to step 6.
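+
For example, with an illustrative server id:
+
[source,cypher]
----
// Move all database allocations off the cordoned server.
DEALLOCATE DATABASES FROM SERVER "25a7efc7-d063-44b8-bdee-f23357f89f01";
----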
. If any deallocations failed, make them possible by performing the following steps:
.. Run `SHOW DATABASES`. If a database shows `currentStatus` = `offline`, this database has been stopped.
.. For each stopped database that is allocated on any of the `CORDONED` servers, start it by running `START DATABASE stopped-db WAIT`.
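+
For example, to find the stopped databases first:
+
[source,cypher]
----
// List allocations that are currently offline.
SHOW DATABASES YIELD name, address, currentStatus
WHERE currentStatus = "offline"
RETURN name, address, currentStatus;
----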
+
[NOTE]
====
A database can be set to read-only access before it is started, to avoid updates on a database that is intended to be stopped:
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
====
.. Run `CALL dbms.cluster.statusCheck([])` on all servers, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
Depending on the environment, consider extending the timeout for this procedure.
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
.. Recreate every database that is not write available, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
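+
As an illustration only, recreation is a procedure call along these lines; the procedure name is an assumption here, so verify it in xref:clustering/databases.adoc#recreate-databases[Recreate databases] first:
+
[source,cypher]
----
// Recreate the database with its current topology, seeding from the available allocations.
CALL dbms.recreateDatabase("neo4j");
----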
+
[NOTE]
====
By recreating with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], even databases that have lost all their allocations can be recreated.
Otherwise, recreating with xref:clustering/databases.adoc#uri-seed[Backup as seed] must be used for that specific case.
====
.. Return to step 4 to retry deallocating all servers.
. For each deallocated server, run `DROP SERVER deallocated-server-id`.
. Return to step 1 to make sure all servers in the cluster are `AVAILABLE`.
147
-
=== Verify recovery of databases
156
+
=== Database availability
148
157
149
-
Once the `system` database is verified available, and all servers are online, manage and verify that all databases are in a desirable state.
158
+
Once the `system` database and all servers are available, manage and verify that all databases are in the desired state.
150
159
151
-
. Run `SHOW DATABASES`. If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
160
+
. Run `CALL dbms.cluster.statusCheck([])` on all servers, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
161
+
Depending on the environment, consider extending the timeout for this procedure.
162
+
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
163
+
If all databases are write available, disaster recovery is complete.
152
164
+
153
165
[NOTE]
154
166
====
155
-
Recreating a database can take an unbounded amount of time since it may involve copying the store to a new server, as described in REF(Recreate docs).
156
-
Therefore, an allocation in STARTING state might reach the requestedStatus given some time.
167
+
Remember that previously stopped databases might have been started during this process.
157
168
====
169
+
170
+
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
171
+
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
172
+
. Run `SHOW DATABASES` and check any recreated databases which are not write available.
173
+
158
174
+
159
175
[NOTE]
160
176
====
161
-
Deallocating databases can take an unbounded amount of time since it involves copying the store to a server.
162
-
Therefore, an allocation in STORE_COPY state should reach the requestedStatus given some time.
177
+
Remember, recreating a database can take an unbounded amount of time since it may involve copying the store to a new server, as described in xref:clustering/databases.adoc#recreate-databases[Recreate databases].
178
+
Therefore, an allocation with `currentStatus` = `STARTING` might reach the `requestedStatus` given some time.
163
179
====
164
-
165
-
. For any databases in
166
-
. For any recreated databases in `STARTING` state with one of the following messages displayed in the message field:
180
+
Recreating a database will not complete if one of the following messages is displayed in the message field:
167
181
** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
168
182
** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
169
183
** `No store found on any of the seeders ServerId1, ServerId2...`
170
184
+
171
-
Recreate them from backup using REF(recreate with seed from URI) or define seeding servers in the recreate procedure so that problematic allocations are excluded.
185
+
186
+
. For each database that will not complete recreation, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed] or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
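+
A sketch of the two options (the procedure name, option keys, and values are assumptions; see the linked sections for the exact forms):
+
[source,cypher]
----
// Option 1: seed the recreated database from a backup URI.
CALL dbms.recreateDatabase("neo4j", {seedURI: "s3://my-bucket/neo4j-backup"});

// Option 2: seed only from named servers, excluding problematic allocations.
CALL dbms.recreateDatabase("neo4j", {seedingServers: ["server-id-1", "server-id-2"]});
----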
. Return to step 1 to make sure all databases are in their desired state.