Add images to the Disaster recovery page #2619
base: dev
Changes from all commits
cc68b3f
3789059
6598d6c
2119158
8bd24c0
@@ -19,10 +19,47 @@ You have to create a new cluster and restore the databases, see xref:clustering/
== Faults in clusters

Databases in clusters may be allocated differently within the cluster and may also have different numbers of primaries and secondaries.

image::healthy-cluster.svg[width="400", title="Healthy cluster", role=popup]

As a consequence, servers may differ in which databases they host.
Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.

Figure 2 shows a disaster in which three servers are lost, demonstrating that this situation impacts databases in different ways.

image::disaster.svg[width="400", title="Example of a cluster disaster", role=popup]

.Disaster scenarios and recovery strategies
[cols="1,2,2", options=header]
|===
^|Database
^|Disaster scenario
^|Recovery strategy

|Database A
|All allocations are lost.
|The database needs to be recreated from a backup.

Review comment: @AnnaSjerling, could you add a bit more detail about the recovery strategy in each case?

|Database B
|The primary allocation is lost, and the secondary allocation is available.
|The database needs to be recreated, but can be based on available allocations in the cluster.

|Database C
|Two primary allocations and a secondary one are lost.
|The database needs to be recreated, but can be based on available allocations in the cluster.

|Database D
|One primary allocation and two secondary allocations are lost.
|The database will move when a server is deallocated.

|Database E
|Stays unaffected.
|No action is required.
|===

Although databases C and D share the same topology, their primaries and secondaries are allocated differently, requiring distinct recovery strategies in this disaster example.
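
The table above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not part of the documented procedure: the `Allocation` type, the `recovery_strategy` function, the example topologies, and the rule that a database must be recreated once it no longer has a majority of its primaries are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """Hypothetical summary of one database's allocations after a disaster."""
    primaries: int         # configured number of primaries
    secondaries: int       # configured number of secondaries
    lost_primaries: int    # primaries hosted on lost servers
    lost_secondaries: int  # secondaries hosted on lost servers


def recovery_strategy(a: Allocation) -> str:
    """Map a database's losses to one of the strategies named in the table."""
    remaining_primaries = a.primaries - a.lost_primaries
    remaining_secondaries = a.secondaries - a.lost_secondaries

    if a.lost_primaries == 0 and a.lost_secondaries == 0:
        return "no action"
    if remaining_primaries + remaining_secondaries == 0:
        return "recreate from backup"
    # Assumption: without a strict majority of primaries the database cannot
    # move by itself and has to be recreated from a surviving allocation.
    if remaining_primaries * 2 <= a.primaries:
        return "recreate from available allocations"
    return "move by deallocating the server"


# Example topologies are invented to match the scenarios in the table.
examples = {
    "A": Allocation(primaries=1, secondaries=0, lost_primaries=1, lost_secondaries=0),
    "B": Allocation(primaries=1, secondaries=1, lost_primaries=1, lost_secondaries=0),
    "C": Allocation(primaries=3, secondaries=2, lost_primaries=2, lost_secondaries=1),
    "D": Allocation(primaries=3, secondaries=2, lost_primaries=1, lost_secondaries=2),
    "E": Allocation(primaries=3, secondaries=0, lost_primaries=0, lost_secondaries=0),
}

for name, allocation in examples.items():
    print(name, "->", recovery_strategy(allocation))
```

Under these assumptions, databases C and D differ only in which allocations were lost, which is enough to put them on different branches.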

== Guide overview
[NOTE]
====

@@ -115,9 +152,6 @@ Use the following steps to regain write availability for the `system` database i
They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
It is important to get a `system` database that is as up-to-date as possible, so that it closely corresponds to the view before the disaster.

.Guide
[%collapsible]
====

[NOTE]
=====

@@ -133,6 +167,8 @@ This causes downtime for all databases in the cluster until the processes are st
. For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
It is important that the new servers are unconstrained, or deallocating servers in the next step of this guide might be blocked, even though enough servers were added.
+
In the current example, the new unconstrained servers are added in this step.
+
[NOTE]
=====
While recommended, it is not strictly necessary to add new servers in this step.

@@ -143,10 +179,12 @@ Be aware that not replacing servers can cause cluster overload when databases ar
=====
+
. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
+
image::system-db-restored.svg[width="400", title="The unconstrained servers are added and the `system` database is restored", role=popup]
+
. On each server, ensure that the discovery settings are correct.
See xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
. Start the Neo4j process on all servers.
====


[[make-servers-available]]

@@ -180,16 +218,17 @@ This is done in two different steps:
* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
* Any allocations that can move will be instructed to do so by deallocating the server.

.Guide
[%collapsible]
====

. For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
This prevents new database allocations from being moved to this server.
+
image::servers-cordoned.svg[width="400", title="Cordon unavailable servers", role=popup]

. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place.
See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
+
If servers were added in the <<make-the-system-database-write-available, Make the `system` database write-available>> step of this guide (as is done in the current disaster recovery example), additional servers might not be needed here.
It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
+
[NOTE]
=====

@@ -229,10 +268,14 @@ If any database has `currentStatus` = `quarantined` on an available server, recr
=====
If you recreate databases using xref:database-administration/standard-databases/recreate-database.adoc#undefined-servers[undefined servers] or xref:database-administration/standard-databases/recreate-database.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
=====
+
image::servers-cordoned-databases-moved.svg[width="400", title="Recreate databases", role=popup]

Review comment: I think we need to explain why databases A, B, and C were recreated in this step. Could you help me with that?

. For each `Cordoned` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
This will move all database allocations from this server to an available server in the cluster.
+
image::servers-deallocated.svg[width="400", title="Deallocate databases from unavailable servers", role=popup]
+

Review comment: Add a line explaining why database D was moved or recreated in this step.

[NOTE]
=====
This operation might fail if not enough unconstrained servers were added to the cluster to replace lost servers.

@@ -241,7 +284,7 @@ Another reason is that some available servers are also `Cordoned`.

. For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
This removes the server from the cluster's view.
====
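
The cordon, deallocate, and drop commands in this guide are run once per lost server, so they lend themselves to being generated from a list of server IDs. The Python sketch below only builds the statement text in the order the guide uses them; it does not connect to a cluster. The function name `recovery_statements`, the example server IDs, and the single-quoted form of the server ID in `DEALLOCATE` and `DROP SERVER` are assumptions made for illustration.

```python
def recovery_statements(server_ids):
    """Build the per-server statements used in this step, grouped by phase.

    The phases must be run in order, and databases that cannot move by
    themselves have to be recreated between 'cordon' and 'deallocate'.
    """
    return {
        "cordon": [
            f'CALL dbms.cluster.cordonServer("{sid}")' for sid in server_ids
        ],
        # Assumption: the server ID is passed as a quoted string literal.
        "deallocate": [
            f"DEALLOCATE DATABASES FROM SERVER '{sid}'" for sid in server_ids
        ],
        "drop": [f"DROP SERVER '{sid}'" for sid in server_ids],
    }


# Hypothetical IDs of the lost servers.
lost_servers = ["server-id-1", "server-id-2"]

statements = recovery_statements(lost_servers)
for phase in ("cordon", "deallocate", "drop"):
    print(f"-- {phase}")
    for statement in statements[phase]:
        print(statement)
```

Keeping the phases separate makes it harder to issue `DROP SERVER` before the corresponding deallocation has been requested.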



[[make-databases-write-available]]

@@ -281,13 +324,12 @@ A stricter verification can be done to verify that all databases are in their de
For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
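
With many databases and servers, the stricter check is easier to apply programmatically than by eye. The following Python sketch assumes the `SHOW DATABASES` result has already been fetched into a list of dictionaries keyed by column name; the function name `mismatched_allocations` and the sample rows are hypothetical.

```python
def mismatched_allocations(rows):
    """Return (database, server address) pairs whose status is not as requested."""
    return [
        (row["name"], row["address"])
        for row in rows
        if row["requestedStatus"] != row["currentStatus"]
    ]


# Hypothetical rows, one per database allocation, as SHOW DATABASES returns them.
rows = [
    {"name": "system", "address": "server1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "db-a", "address": "server1:7687",
     "requestedStatus": "online", "currentStatus": "quarantined"},
    {"name": "db-b", "address": "server2:7687",
     "requestedStatus": "offline", "currentStatus": "offline"},
]

for name, address in mismatched_allocations(rows):
    print(f"{name} on {address} is not in its requested state")
```

An empty result means every allocation is in its requested state, including databases that are intentionally stopped.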

==== Path to correct state

Use the following steps to make all databases in the cluster write-available again.
They include recreating any databases that are not write-available and identifying any recreations that will not complete.
Recreations might fail for different reasons; one example is that the checksums do not match for the same transaction on different servers.

.Guide
[%collapsible]
====

. Identify all write-unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the <<#example-verification, Example verification>> part of this disaster recovery step.
Filter out all databases that are intended to be stopped, so that they are not recreated unnecessarily.
. Recreate every database that is not write-available and has not been recreated previously.

@@ -308,4 +350,7 @@ Recreating a database will not complete if one of the following messages is disp
** `No store found on any of the seeders ServerId1, ServerId2...`
. For each database that will not complete recreation, recreate it from backup using xref:database-administration/standard-databases/recreate-database.adoc#uri-seed[Backup as seed].

====
image::fully-recovered-cluster.svg[width="400", title="Fully recovered cluster", role="popup"]

Review comment: I was thinking that we could add some kind of conclusion here.
What else should we say here? WDYT?