From fdf09d08840f45b19b50b5853d786f6201df3b59 Mon Sep 17 00:00:00 2001 From: Anna Sjerling <102957391+AnnaSjerling@users.noreply.github.com> Date: Fri, 20 Dec 2024 12:07:40 +0100 Subject: [PATCH] Rewrite disaster recovery based on recreate and status check (#1890) This PR rewrites the disaster recovery docs based on the new recreate and status check procedures. With regards to the functional changes there are three major things to note. 1. The recovery of the system database is the same as before, even though the way we check if the system db is write available or not has changed. 2. The step which removes lost servers now also includes recreating some databases. It was decided this mixing of recovering servers and databases is necessary to get recreate into the guide in the best way we could see. 3. Previously it was not clear how to handle quarantined databases, even though the guide mentioned them in the intro. Now, it is more explicitly discussed how they are supposed to be handles during disaster recovery. In the past, we have also gotten feedback that the disaster recovery docs are hard to follow. Therefore, this PR also includes refactoring which introduces a new guide structure. This new structure provides more explanations of why we are asking the user to do a certain thing. --------- Co-authored-by: NataliaIvakina <82437520+NataliaIvakina@users.noreply.github.com> --- modules/ROOT/pages/clustering/databases.adoc | 4 +- .../pages/clustering/disaster-recovery.adoc | 364 ++++++++++++------ 2 files changed, 256 insertions(+), 112 deletions(-) diff --git a/modules/ROOT/pages/clustering/databases.adoc b/modules/ROOT/pages/clustering/databases.adoc index 8cf8b899a..794a7ab98 100644 --- a/modules/ROOT/pages/clustering/databases.adoc +++ b/modules/ROOT/pages/clustering/databases.adoc @@ -268,8 +268,8 @@ Neo4j 5.24 introduces the xref:procedures.adoc#procedure_dbms_cluster_recreateDa * To change the database store to a specified backup, while keeping all the associated privileges for the database. * To make your database write-available again after it has been lost (for example, due to a disaster). -// See xref:clustering/disaster-recovery.adoc[] for more information. - +See xref:clustering/disaster-recovery.adoc[Disaster recovery] for more information. + [CAUTION] ==== The recreate procedure works only for real user databases and not for composite databases, or the `system` database. diff --git a/modules/ROOT/pages/clustering/disaster-recovery.adoc b/modules/ROOT/pages/clustering/disaster-recovery.adoc index cbfe8475f..a6b88c6fe 100644 --- a/modules/ROOT/pages/clustering/disaster-recovery.adoc +++ b/modules/ROOT/pages/clustering/disaster-recovery.adoc @@ -1,51 +1,63 @@ -:description: This section describes how to recover databases that have become unavailable. +:description: This section describes how to recover databases that have become unavailable. How to heal a cluster. [role=enterprise-edition] [[cluster-recovery]] = Disaster recovery A database can become unavailable due to issues on different system levels. For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable. -It is also possible for databases to become quarantined due to a critical failure in the system, which may lead to unavailability even without the loss of servers. -This section contains a step-by-step guide on how to recover _unavailable databases_ that are incapable of serving writes, while still may be able to serve reads. +This section contains a step-by-step guide on how to recover *unavailable databases* that are incapable of serving writes, while possibly still being able to serve reads. +The guide recovers the unavailable databases and make them fully operational, with minimal impact on the other databases in the cluster. However, if a database is not performing as expected for other reasons, this section cannot help. -By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster. -[NOTE] +[CAUTION] ==== -If *all* servers in a Neo4j cluster are lost in a data center failover, it is not possible to recover the current cluster. -You have to create a new cluster and restore the databases. -See xref:clustering/setup/deploy.adoc[Deploy a basic cluster] and xref:clustering/databases.adoc#cluster-seed[Seed a database] for more information. +If *all* servers in a Neo4j cluster are lost in a disaster, it is not possible to recover the current cluster. +You have to create a new cluster and restore the databases, see xref:clustering/setup/deploy.adoc[Deploy a basic cluster] and xref:clustering/databases.adoc#cluster-seed[Seed a database] for more information. ==== == Faults in clusters -Databases in clusters follow an allocation strategy. -This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries. -The consequence of this is that all servers are different in which databases they are hosting. +Databases in clusters may be allocated differently within the cluster and may also have different numbers of primaries and secondaries. +The consequence of this is that all servers may be different in which databases they are hosting. Losing a server in a cluster may cause some databases to lose a member while others are unaffected. -Consequently, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources. +Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources. -== Guide to disaster recovery +== Guide overview +[NOTE] +==== +In this guide the following terms are used: -There are three main steps to recovering a cluster from a disaster. -Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational. +* An _offline_ server is a server that is not running but may be restartable. +* A _lost_ server, however, is a server that is currently not running and cannot be restarted. +* A _write-available_ database is able to serve writes, while a _write unavailable_ database is not. +==== -. Ensure the `system` database is available in the cluster. -The `system` database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else. +There are four steps to recovering a cluster from a disaster: -. After the `system` database's availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster's topology meets the requirements. +. Start the Neo4j process on all servers which are not _lost_. +See xref:start-the-neo4j-process[Start the Neo4j process] for more information. +. Make the `system` database able to accept write operations, so that the cluster can be modified. +See xref:make-the-system-database-write-available[Make the `system` database write-available] for more information. +. Detach any potential lost servers from the cluster and replace them by new ones. +See xref:make-servers-available[Make servers available] for more information. +. Finish disaster recovery by starting or continuing to manage databases and verify that they are write-available. +See xref:make-databases-write-available[Make databases write-available] for more information. -. After the `system` database is available and the cluster's topology is satisfied, you can manage the databases. +Each step is described in the following three sections: -The steps are described in detail in the following sections. +. Objective -- a state that the cluster needs to be in, with optional motivation. +. Verifying the state -- an example of how the state can be verified. +. Path to correct state -- a proposed series of steps to get to the correct state. -[NOTE] +[CAUTION] ==== -In this section, an _offline_ server is a server that is not running but may be _restartable_. -A _lost_ server, however, is a server that is currently not running and cannot be restarted. +Verifying each state before continuing to the next step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational. ==== +[[disaster-recovery-steps]] +== Disaster recovery steps + [NOTE] ==== Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the `neo4j` scheme for routing. @@ -53,109 +65,241 @@ One way to remedy this is to connect directly to the server using `bolt` instead See xref:clustering/setup/routing.adoc#clustering-routing[Server-side routing] for more information on the `bolt` scheme. ==== -=== Restore the `system` database +[[start-the-neo4j-process]] +=== Start the Neo4j process -The first step of recovery is to ensure that the `system` database is available. -The `system` database is required for clusters to function properly. +==== Objective -. Start all servers that are _offline_. -(If a server is unable to start, inspect the logs and contact support personnel. -The server may have to be considered indefinitely lost.) -. *Validate the `system` database's availability.* -.. Run `SHOW DATABASE system`. -If the response does not contain a writer, the `system` database is unavailable and needs to be recovered, continue to step 3. -.. Optionally, you can create a temporary user to validate the `system` database's writability by running `CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'`. -.. Confirm that the temporary user is created as expected, by running `SHOW USERS`, then continue to xref:clustering/disaster-recovery.adoc#recover-servers[Recover servers]. -If not, continue to step 3. -+ -. *Restore the `system` database.* -+ -[NOTE] ==== -Only do the steps below if the `system` database's availability cannot be validated by the first two steps in this section. +The Neo4j process is started on all servers that are not _lost_. +==== + +==== Path to correct state + +Start the Neo4j process on all servers that are _offline_. +If a server is unable to start, inspect the logs and contact support personnel. +The server may have to be considered indefinitely lost. + +[[make-the-system-database-write-available]] +=== Make the `system` database write-available + +==== Objective +==== +The `system` database is able to accept write operations. +==== + +The `system` database contains the view of the cluster. +This includes which servers and databases are present, where they live and how they are configured. +During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers. +Databases might also need to be recreated to regain write availability. +Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery. + +==== Verifying the state + +The `system` database's write availability can be verified by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[Status check] procedure. + +[source, shell] +---- +CALL dbms.cluster.statusCheck(["system"]); +---- + +[NOTE] +===== +The status check procedure cannot verify the write availability of a database configured to have a single primary. +Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`. +===== + +==== Path to correct state + +Use the following steps to regain write availability for the `system` database if it has been lost. +They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster. +It is important to get a `system` database that is as up-to-date as possible, so it corresponds to the view before the disaster closely. + +.Guide +[%collapsible] ==== + +[NOTE] +===== +This section of the disaster recovery guide uses `neo4j-admin` commands. +For more information about the used commands, see xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[neo4j-admin commands]. +===== + +. Shut down the Neo4j process on all servers. +This causes downtime for all databases in the cluster until the processes are started again at the end of this section. +. On each server, run `bin/neo4j-admin dbms unbind-system-db` to reset the `system` database state on the servers. +. On each server, run `bin/neo4j-admin database info system` and compare the `lastCommittedTransaction` to find out which server has the most up-to-date copy of the `system` database. +. On the most up-to-date server, run `bin/neo4j-admin database dump system --to-path=[path-to-dump]` to take a dump of the current `system` database and store it in an accessible location. +. For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster]. +It is important that the new servers are unconstrained, or deallocating servers in the next step of this guide might be blocked, even though enough servers were added. + [NOTE] +===== +While recommended, it is not strictly necessary to add new servers in this step. +There is also an option to change the `system` database mode (`server.cluster.system_database_mode`) on secondary allocations to make them primary allocations for the new `system` database. +The number of primary allocations needed is defined by `dbms.cluster.minimum_initial_system_primaries_count`. +See the xref:configuration/configuration-settings.adoc#config_dbms.cluster.minimum_initial_system_primaries_count[Configuration settings] for more information. +Be aware that not replacing servers can cause cluster overload when databases are moved from lost servers to available ones in the next step of this guide. +===== ++ +. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump. +. On each server, ensure that the discovery settings are correct, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information. +. Start the Neo4j process on all servers. +==== + + +[[make-servers-available]] +=== Make servers available + +==== Objective +==== +All servers in the cluster's view are available and enabled. ==== -Recall that the cluster remains fault tolerant, and thus able to serve both reads and writes, as long as a majority of the primaries are available. -In case of a disaster affecting one or more server(s), but where the majority of servers are still available, it is possible to add a new server to the cluster and recover the `system` database (and any other affected user databases) on it by copying the `system` database (and affected user databases) from one of the available servers. -This method prevents downtime for the other databases in the cluster. -If this is the case, ie. if a majority of servers are still available, follow the instructions in <>. + +A lost server will still be in the `system` database's view of the cluster, but in an unavailable state. +Furthermore, according to the view of the cluster, these lost servers are still hosting the databases they had before they became lost. +Therefore, informing the cluster of servers which are lost is not enough. +The databases hosted on lost servers also need to be moved onto available servers in the cluster, before the lost servers can be removed. + +==== Verifying the state +The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information. +The state has been verified if *all* servers show `health` = `Available` and `status` = `Enabled`. + +[source, cypher] +---- +SHOW SERVERS; +---- + +==== Path to correct state +Use the following steps to remove lost servers and add new ones to the cluster. +To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster. +This can be done in two different ways: + +* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move. +* Any allocations that can move will be instructed to do so by deallocating the server. + +.Guide +[%collapsible] ==== +. For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers. +This prevents new database allocations from being moved to this server. +. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place. +See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information. + -The following steps create a new `system` database from a backup of the current `system` database. -This is required since the current `system` database has lost too many members in the server failover. - -.. Shut down the Neo4j process on all servers. -Note that this causes downtime for all databases in the cluster. -.. On each server, run the following `neo4j-admin` command `bin/neo4j-admin dbms unbind-system-db` to reset the `system` database state on the servers. -See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[`neo4j-admin` commands] for more information. -.. On each server, run the following `neo4j-admin` command `bin/neo4j-admin database info system` to find out which server is most up-to-date, ie. has the highest last-committed transaction id. -.. On the most up-to-date server, take a dump of the current `system` database by running `bin/neo4j-admin database dump system --to-path=[path-to-dump]` and store the dump in an accessible location. -See xref:tools/neo4j-admin/index.adoc#neo4j-admin-commands[`neo4j-admin` commands] for more information. -.. Ensure there are enough `system` database primaries to create the new `system` database with fault tolerance. -Either: -... Add completely new servers (see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster]) or -... Change the `system` database mode (`server.cluster.system_database_mode`) on the current `system` database's secondary servers to allow them to be primaries for the new `system` database. -.. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump. -.. Ensure that `dbms.cluster.discovery.endpoints` are set correctly on all servers, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information. -.. Return to step 1. - - -[[recover-servers]] -=== Recover servers - -Once the `system` database is available, the cluster can be managed. -Following the loss of one or more servers, the cluster's view of servers must be updated, ie. the lost servers must be replaced by new servers. -The steps here identify the lost servers and safely detach them from the cluster. - -. Run `SHOW SERVERS`. -If *all* servers show health `AVAILABLE` and status `ENABLED` continue to xref:clustering/disaster-recovery.adoc#recover-databases[Recover databases]. -. For each `UNAVAILABLE` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers. -. For each `CORDONED` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. -. For each server that failed to deallocate with one of the following messages: -.. `Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.\*'. + -Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries. + -Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.` +If servers were added in the <> step of this guide, additional servers might not be needed here. +It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added. + -or +[NOTE] +===== +While recommended, it is not strictly necessary to add new servers in this step. +However, not adding new servers reduces the capacity of the cluster to handle work. +Furthermore, it might require the topology for a database to be altered to make deallocating servers and recreating databases possible. +===== + +. For each stopped database (`currentStatus`= `offline`), start them by running `START DATABASE stopped-db`. +This is necessary since stopped databases cannot be deallocated from a server. +It is also necessary for the status check procedure to accurately indicate if this database should be recreated or not. +Verify that all allocations are in `currentStatus` = `online` on servers which are not lost before moving to the next step. +If a database fails to start, leave it to be recreated in the next step of this guide. + -`Could not deallocate server(s) `serverId`. -Database [database] has lost quorum of servers, only found [existing number of primaries] of [expected number of primaries]. -Cannot be safely reallocated.` +[NOTE] +===== +A database can be set to `READ-ONLY` before it is started to avoid updates on the database with the following command: +`ALTER DATABASE database-name SET ACCESS READ ONLY`. +===== + +. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases running in primary mode on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information. + -First ensure that there is a backup for the database in question (see xref:backup-restore/online-backup.adoc[Online backup]), and then drop the database by running `DROP DATABASE database-name`. -Return to step 3. -.. `Could not deallocate server [server]. Cannot change allocations for database [stopped-db] because it is offline.` +[NOTE] +===== +The status check procedure cannot verify the write availability of a database configured to have a single primary. +Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`. +===== + +. For each database that is not write-available, recreate it to move it from lost servers and regain write availability. +Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options. +Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information. +If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed]. + -Try to start the offline database by running `START DATABASE stopped-db WAIT`. -If it starts successfully, return to step 3. -Otherwise, ensure that there is a backup for the database before dropping it with `DROP DATABASE stopped-db`. -Return to step 3. +[CAUTION] +===== +If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored. +===== + +. For each `Cordoned` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers. +This will move all database allocations from this server to an available server in the cluster. + [NOTE] +===== +This operation might fail if enough unconstrained servers were not added to the cluster to replace lost servers. +Another reason is that some available servers are also `Cordoned`. +===== + +. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`. +This removes the server from the cluster's view. ==== -A database can be set to `READ-ONLY`-mode before it is started to avoid updates on a database that is desired to be stopped with the following: -`ALTER DATABASE database-name SET ACCESS READ ONLY`. + + +[[make-databases-write-available]] +=== Make databases write-available + +==== Objective +==== +All databases that are desired to be started are write-available. ==== -.. `Could not deallocate server [server]. Reallocation of [database] not possible, no new target found. All existing servers: [existing-servers]. Actual allocated server with mode [mode] is [current-hostings].` +Once this state is verified, disaster recovery is complete. +However, remember that previously stopped databases might have been started during this process. +If they are still desired to be in stopped state, run `STOP DATABASE started-db WAIT`. + +[CAUTION] +==== +Remember, recreating a database takes an unbounded amount of time since it may involve copying the store to a new server, as described in xref:clustering/databases.adoc#recreate-databases[Recreate databases]. +Therefore, an allocation with `currentStatus` = `starting` will probably reach the `requestedStatus` given some time. +==== + +[[example-verification]] +==== Verifying the state +You can verify all clustered databases' write availability by using the xref:clustering/monitoring/status-check.adoc#monitoring-replication[status check] procedure. + +[source, shell] +---- +CALL dbms.cluster.statusCheck([]); +---- + +[NOTE] +===== +The status check procedure cannot verify the write availability of a database configured to have a single primary. +Instead, check that the primary is allocated on an available server and that it has `currentStatus` = `online` by running `SHOW DATABASES`. +===== + +A stricter verification can be done to verify that all databases are in their desired states on all servers. +For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers. + +==== Path to correct state +Use the following steps to make all databases in the cluster write-available again. +They include recreating any databases that are not write-capable and identifying any recreations that will not complete. +Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers. + +.Guide +[%collapsible] +==== +. Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step. +Filter out all databases desired to be stopped, so that they are not recreated unnecessarily. +. Recreate every database that is not write-available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information. +Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information. +If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed]. + -Add new servers and enable them and then return to step 3, see xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information. -. Run `SHOW SERVERS YIELD *` once all enabled servers host the requested databases (`hosting`-field contains exactly the databases in the `requestedHosting` field), and proceed to the next step. -Note that this may take a few minutes. -. For each deallocated server, run `DROP SERVER deallocated-server-id`. -. Return to step 1. - -[[recover-databases]] -=== Recover databases - -Once the `system` database is verified available, and all servers are online, the databases can be managed. -The steps here aim to make the unavailable databases available. - -. If you have previously dropped databases as part of this guide, re-create each one from a backup. -See the xref:database-administration/standard-databases/create-databases.adoc[Create databases] section for more information on how to create a database. -. Run `SHOW DATABASES`. -If all databases are in desired states on all servers (`requestedStatus`=`currentStatus`), disaster recovery is complete. -// . For each database that remains unavailable, refer to <>. -// Perform the actions required to get the database available then return to step 2. +[CAUTION] +===== +If you recreate databases using xref:clustering/databases.adoc#undefined-servers[undefined servers] or xref:clustering/databases.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored. +===== + +. Run `SHOW DATABASES` and check any recreated databases that are not write-available. +Recreating a database will not complete if one of the following messages is displayed in the message field: +** `Seeders ServerId1 and ServerId2 have different checksums for transaction TransactionId. All seeders must have the same checksum for the same append index.` +** `Seeders ServerId1 and ServerId2 have incompatible storeIds. All seeders must have compatible storeIds.` +** `No store found on any of the seeders ServerId1, ServerId2...` +. For each database which will not complete recreation, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed]. + +====