Commit 94fa307

WIP
1 parent 92737ea commit 94fa307

File tree

2 files changed: 68 additions, 53 deletions


modules/ROOT/pages/clustering/databases.adoc

Lines changed: 1 addition & 1 deletion
@@ -234,7 +234,7 @@ Neo4j 5.24 introduces the xref:reference/procedures.adoc#procedure_dbms_cluster_
234234
* To change the database store to a specified backup, while keeping all the associated privileges for the database.
235235

236236
* To make your database write-available again after it has been lost (for example, due to a disaster).
237-
// See xref:clustering/disaster-recovery.adoc[] for more information.
237+
See xref:clustering/disaster-recovery.adoc[Disaster recovery] for more information.
238238

239239
[CAUTION]
240240
====

modules/ROOT/pages/clustering/disaster-recovery.adoc

Lines changed: 67 additions & 52 deletions
@@ -7,7 +7,7 @@ A database can become unavailable due to issues on different system levels.
77
For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable.
88

99
This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes, while they may still be able to serve reads.
10-
However, if a database is _unavailable_ because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
10+
However, if a database is unavailable because some members are in a quarantined state or if a database is not performing as expected for other reasons, this section cannot help.
1111
By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.
1212

1313
[NOTE]
@@ -23,7 +23,7 @@ Databases in clusters follow an allocation strategy.
2323
This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries.
2424
The consequence of this is that all servers are different in which databases they are hosting.
2525
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
26-
Consequently, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
26+
Therefore, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
2727

2828
== Guide to disaster recovery
2929

@@ -32,16 +32,16 @@ Completing each step, regardless of the disaster scenario, is recommended to ens
3232

3333
[NOTE]
3434
====
35-
Any potential quarantined databases need to be handled before executing this guide, see REF for more information.
35+
Any quarantined databases must be handled before executing this guide, see xref:database-administration/standard-databases/errors.adoc#quarantine[Quarantined databases] for more information.
3636
====
3737

3838
. Ensure the `system` database is available in the cluster.
3939
The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
4040

41-
. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements
42-
This process starts the managing of databases by default.
41+
. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements.
42+
This process also starts the management of databases.
4343

44-
. After the `system` database is available, the cluster's topology is satisfied and the databases has been managed, continue managing databases and verify that they are available.
44+
. After the `system` database is available and the cluster's topology is satisfied, start or continue managing databases and verify that they are available.
4545

4646
The steps are described in detail in the following sections.
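As a rough, illustrative map of these three phases, the checks described in the following sections can be summarized as in the sketch below; the exact arguments, output fields, and timeouts are covered in the linked sections.

[source, cypher]
----
// Phase 1: verify that the `system` database can accept writes.
CALL dbms.cluster.statusCheck(["system"]);

// Phase 2: verify that the servers in the cluster are enabled and available.
SHOW SERVERS;

// Phase 3: verify that the user databases are write available and in their desired state.
CALL dbms.cluster.statusCheck([]);
SHOW DATABASES;
----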
4747

@@ -59,23 +59,21 @@ See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] f
5959
====
6060

6161
[[restore-the-system-database]]
62-
=== Restore the `system` database
62+
=== `System` database availability
6363

64-
The first step of recovery is to ensure that the `system` database is available.
64+
The first step of recovery is to ensure that the `system` database is able to accept writes.
6565
The `system` database is required for clusters to function properly.
6666

67-
. *Start all servers that are _offline_*.
67+
. Start the Neo4j process on all servers that are _offline_.
6868
If a server is unable to start, inspect the logs and contact support personnel.
6969
The server may have to be considered indefinitely lost.
70-
. *Validate the `system` database's availability.* Use one of the following options:
71-
** Run `SHOW DATABASE system`.
72-
If the response contain a writer, the `system` database is write available and does not need to be recovered, skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
73-
** Create a temporary user by running `CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'`.
74-
Check if the temporary user is created by running `SHOW USERS`. If it was created as expected, the `system` database is write available and does not need to be recovered, skip to step xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers].
75-
** Use rafted status check as described in REF.
70+
. Validate the `system` database's write availability by running `CALL dbms.cluster.statusCheck(["system"])` on all remaining system primaries, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
71+
Depending on the environment, consider extending the timeout for this procedure.
72+
If any of the `system` primaries report `replicationSuccessful` = `TRUE`, the `system` database is write available and does not need to be recovered.
73+
Therefore, skip to xref:clustering/disaster-recovery.adoc#recover-servers[Server availability].
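A minimal sketch of this check is shown below, run against each remaining `system` primary; the returned fields and how to extend the timeout are described in xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication].

[source, cypher]
----
// Run on each remaining `system` primary.
// If any primary reports replicationSuccessful = TRUE,
// the `system` database is write available.
CALL dbms.cluster.statusCheck(["system"]);
----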
7674

7775
+
78-
. *Restore the `system` database.*
76+
. Regain availability by restoring the `system` database.
7977
+
8078
[NOTE]
8179
====
@@ -93,81 +91,98 @@ See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands]
9391
.. On each server, run the following `neo4j-admin` command `bin/neo4j-admin database info system` to find out which server is most up-to-date, ie. has the highest last-committed transaction id.
9492
.. On the most up-to-date server, take a dump of the current `system` database by running `bin/neo4j-admin database dump system --to-path=[path-to-dump]` and store the dump in an accessible location.
9593
See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands] for more information.
96-
.. Ensure there are enough `system` database primaries to create the new `system` database.
97-
The amount of primaries needed is equal or more than the `dbms.cluster.minimum_initial_system_primaries_count` config, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[fix link] for more information.
98-
Use one of the following options:
99-
** Add completely new servers, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
100-
** Change the `system` database mode (`server.cluster.system_database_mode`) on the current `system` database's secondary servers to allow them to be primaries for the new `system` database.
94+
.. For every _lost_ server, add a new unconstrained one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
95+
+
96+
[NOTE]
97+
====
98+
Adding servers in this step is recommended to avoid overloading the cluster, but it is not strictly necessary.
99+
There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primaries for the new `system` database.
100+
The number of primaries required is defined by `dbms.cluster.minimum_initial_system_primaries_count`, see xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information.
101+
====
102+
+
101103
.. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
102104
.. Ensure that the discovery settings are correct on all servers, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
103105
.. Return to step 1, to start all servers and confirm the `system` database is now available.
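Condensed into one illustrative sequence using the commands above (the dump path is a placeholder, and each command runs from the Neo4j installation directory on the relevant server):

[source, shell]
----
# On each server: find the most up-to-date system database
# (the one with the highest last-committed transaction id).
bin/neo4j-admin database info system

# On the most up-to-date server: dump the current system database.
bin/neo4j-admin database dump system --to-path=/path/to/dumps

# On each server: load the dump, overwriting the local system database.
bin/neo4j-admin database load system --from-path=/path/to/dumps --overwrite-destination=true
----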
104106

105107

106108
[[recover-servers]]
107-
=== Recover servers and user databases
109+
=== Server availability
108110

109111
Once the `system` database is available, the cluster can be managed.
110-
Following the loss of one or more servers, the cluster's view of servers must be updated, ie. the lost servers must be replaced by new servers.
111-
The steps here identify the lost servers and safely detach them from the cluster, while recreating any databases that cannot be moved for different reasons.
112+
Following the loss of one or more servers, the cluster's view of servers must be updated, i.e. the lost servers must be replaced by new ones.
113+
The steps here identify the lost servers and safely detach them from the cluster, while recreating any databases that cannot be moved off the lost servers because those databases have lost availability.
112114

113115
. Run `SHOW SERVERS`.
114-
If *all* servers show health `AVAILABLE` and status `ENABLED` continue to xref:clustering/disaster-recovery.adoc#recover-databases[Recover databases].
116+
If *all* servers show health `AVAILABLE` and status `ENABLED`, continue to xref:clustering/disaster-recovery.adoc#recover-databases[Database availability].
115117
. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
116-
. For each `CORDONED` server, make sure a new unconstrained server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] to add additional servers.
117-
If no servers were added in xref:clustering/disaster-recovery.adoc#restore-the-system-database[Restore the system database], the amount of servers that needs to be added is equal to the number of `CORDONED` servers.
118+
. For each `CORDONED` server, make sure a new *unconstrained* server has been added to the cluster to take its place, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
119+
If servers were added in the xref:clustering/disaster-recovery.adoc#restore-the-system-database[System database availability] step, the number of servers that need to be added in this step is less than the number of `CORDONED` servers.
120+
121+
+
118122
[NOTE]
119123
====
120-
It is not strictly necessary to add new servers in this step. However, not adding new servers might require the topology for a database to be altered via ALTER DATABASE to make deallocations possible or in the RECREATE command to make it possible.
124+
While recommended, it is not strictly necessary to add new servers in this step.
125+
However, not adding new servers reduces the cluster's capacity to handle work and might require a database's topology to be altered, for example via `ALTER DATABASE`, to make deallocations and recreations possible.
121126
====
127+
122128
. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. If all deallocations succeeded, skip to step 6.
123-
. Make sure deallocating the servers is possible by doing the following steps:
124-
.. Run `SHOW DATABASES`.
125-
.. Try to start the offline databases allocated on any of the `CORDONED` servers by running `START DATABASE stopped-db WAIT`.
129+
. If any deallocations failed, make them possible by following these steps:
130+
.. Run `SHOW DATABASES`. If a database shows `currentStatus` = `offline`, it has been stopped.
131+
.. For each stopped database that is allocated on any of the `CORDONED` servers, start it by running `START DATABASE stopped-db WAIT`.
126132
+
127133
[NOTE]
128134
====
129-
A database can be set to `READ-ONLY`-mode before it is started to avoid updates on a database that is desired to be stopped with the following:
135+
A database that is intended to remain stopped can be set to read-only access before it is started, to avoid updates:
130136
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
131137
====
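As an illustrative sketch, where `stopped-db` is a placeholder database name, the access mode is restricted before the database is started:

[source, cypher]
----
// Prevent writes to a database that is meant to remain stopped,
// then start it so that deallocation becomes possible.
ALTER DATABASE `stopped-db` SET ACCESS READ ONLY;
START DATABASE `stopped-db` WAIT;
----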
132-
.. Run CALL statusCheck() for all databases, and recreate all databases that failed replication.
133-
See REF for more information on how to recreate databases. Remember to make sure there are recent backups for the databases, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
138+
.. Run `CALL dbms.cluster.statusCheck([])` on all servers, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
139+
Depending on the environment, consider extending the timeout for this procedure.
140+
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
141+
142+
.. Recreate every database that is not write available, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
143+
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
144+
+
145+
[NOTE]
146+
====
147+
By using recreate with xref:clustering/databases.adoc#undefined-servers-backup[Undefined servers with fallback backup], even databases that have lost all their allocations can be recreated.
148+
Otherwise, recreating with xref:clustering/databases.adoc#uri-seed[Backup as seed] must be used for that specific case.
149+
====
134150
.. Return to step 4 to retry deallocating all servers.
135151
. For each deallocated server, run `DROP SERVER deallocated-server-id`.
136152
. Return to step 1 to make sure all servers in the cluster are `AVAILABLE`.
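Put together, a successful pass through this section might look like the following sketch, where the server ID is a placeholder and each command is run on one of the available servers:

[source, cypher]
----
// Identify servers that are UNAVAILABLE and note their server IDs.
SHOW SERVERS;

// Mark a lost server so that no new allocations are placed on it.
CALL dbms.cluster.cordonServer("unavailable-server-id");

// Move the cordoned server's database allocations to other servers.
DEALLOCATE DATABASES FROM SERVER "unavailable-server-id";

// Once deallocated, remove the server from the cluster's view.
DROP SERVER "unavailable-server-id";
----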
137153

138154

139-
`Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.\*'. +
140-
Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries. +
141-
Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.`
142-
143-
-> What does this error message mean? IS THIS QUARANTINE? However, drop would not have worked here either.
144-
145-
146155
[[recover-databases]]
147-
=== Verify recovery of databases
156+
=== Database availability
148157

149-
Once the `system` database is verified available, and all servers are online, manage and verify that all databases are in a desirable state.
158+
Once the `system` database and all servers are available, manage and verify that all databases are in the desired state.
150159

151-
. Run `SHOW DATABASES`. If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete.
160+
. Run `CALL dbms.cluster.statusCheck([])` on all servers, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
161+
Depending on the environment, consider extending the timeout for this procedure.
162+
If any of the primary allocations for a database report `replicationSuccessful` = `TRUE`, this database is write available.
163+
If all databases are write available, disaster recovery is complete.
152164
+
153165
[NOTE]
154166
====
155-
Recreating a database can take an unbounded amount of time since it may involve copying the store to a new server, as described in REF(Recreate docs).
156-
Therefore, an allocation in STARTING state might reach the requestedStatus given some time.
167+
Remember that previously stopped databases might have been started during this process.
157168
====
169+
170+
. Recreate every database that is not write available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
171+
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
172+
. Run `SHOW DATABASES` and check any recreated databases that are not yet write available.
173+
158174
+
159175
[NOTE]
160176
====
161-
Deallocating databases can take an unbounded amount of time since it involves copying the store to a server.
162-
Therefore, an allocation in STORE_COPY state should reach the requestedStatus given some time.
177+
Remember, recreating a database can take an unbounded amount of time since it may involve copying the store to a new server, as described in xref:clustering/databases.adoc#recreate-databases[Recreate databases].
178+
Therefore, an allocation with `currentStatus` = `STARTING` might reach the `requestedStatus` given some time.
163179
====
164-
165-
. For any databases in
166-
. For any recreated databases in `STARTING` state with one of the following messages displayed in the message field:
180+
Recreating a database will not complete if one of the following messages is displayed in the message field:
167181
** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.`
168182
** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.`
169183
** `No store found on any of the seeders ServerId1, ServerId2...`
170184
+
171-
Recreate them from backup using REF(recreate with seed from URI) or define seeding servers in the recreate procedure so that problematic allocations are excluded.
185+
186+
. For each database that cannot complete recreation, recreate it from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed], or define seeding servers in the recreate procedure using xref:clustering/databases.adoc#specified-servers[Specified seeders] so that problematic allocations are excluded.
172187
. Return to step 1 to make sure all databases are in their desired state.
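As a final illustrative check, the verification in this section amounts to something like the sketch below, with the output interpreted as described above:

[source, cypher]
----
// Every database should report replicationSuccessful = TRUE on at least one primary.
CALL dbms.cluster.statusCheck([]);

// requestedStatus should match currentStatus for all allocations,
// and the message field should not show any of the errors listed above.
SHOW DATABASES;
----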
173188
