articles/service-fabric/service-fabric-disaster-recovery.md (13 additions, 10 deletions)
@@ -102,12 +102,13 @@ Determining whether a disaster occurred for a stateful service and managing it f
- For services without persisted state, a failure of a quorum or more of replicas results _immediately_ in permanent quorum loss. When Service Fabric detects quorum loss in a stateful non-persistent service, it immediately proceeds to step 3 by declaring (potential) data loss. Proceeding to data loss makes sense because Service Fabric knows there's no point in waiting for the replicas to come back: even if they were to recover, the data has already been lost because of the non-persisted nature of the service.
- For stateful persistent services, a failure of a quorum or more of replicas causes Service Fabric to wait for the replicas to come back and restore quorum. This results in a service outage for any _writes_ to the affected partition (or "replica set") of the service. However, reads may still be possible with reduced consistency guarantees. The default amount of time that Service Fabric waits for quorum to be restored is infinite, since proceeding is a (potential) data loss event and carries other risks.
3. Determining if there has been actual data loss, and restoring from backups
   When Service Fabric calls the `OnDataLossAsync` method, it is always because of _suspected_ data loss. Service Fabric ensures that this call is delivered to the _best_ remaining replica, which is whichever replica has made the most progress. The reason we always say _suspected_ data loss is that it is possible the remaining replica has all the same state as the Primary did when it went down. However, without that state to compare against, there is no good way for Service Fabric or operators to know for sure. At this point, Service Fabric also knows the other replicas are not coming back; that was the decision made when we stopped waiting for the quorum loss to resolve itself. The best course of action for the service is usually to freeze and wait for specific administrative intervention. So what does a typical implementation of the `OnDataLossAsync` method do? (A minimal sketch follows the list below.)

   1. Log that `OnDataLossAsync` has been triggered, and fire off any necessary administrative alerts.
   1. Pause and wait for further decisions and manual actions to be taken. Even if backups are available, they may need to be prepared first. For example, if two different services coordinate information, those backups may need to be modified to ensure that, once the restore happens, the information those two services care about is consistent.
   1. Examine any other telemetry or exhaust from the service. This metadata may be contained in other services or in logs, and can be used to determine whether any calls were received and processed at the primary that were not present in the backup or replicated to this particular replica. These may need to be replayed or added to the backup before restoration is feasible.
   1. Compare the remaining replica's state to that contained in any available backups. If you are using the Service Fabric reliable collections, there are tools and processes for doing so, described in [this article](service-fabric-reliable-services-backup-restore.md). The goal is to see whether the state within the replica is sufficient, and what the backup may be missing.
   1. Once the comparison is done, and the restore completed if necessary, the service code should return true if any state changes were made. If the replica determined that it was the best available copy of the state and made no changes, return false. True indicates that any _other_ remaining replicas may now be inconsistent with this one; they will be dropped and rebuilt from this replica. False indicates that no state changes were made, so the other replicas can keep what they have.
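The following is a minimal sketch of how these steps map onto an `OnDataLossAsync` override in a Reliable Services stateful service. It assumes an operator workflow that decides whether to restore and, if so, stages the chosen backup into a locally accessible folder; the `AlertOperationsAsync` and `TryGetApprovedBackupFolderAsync` helpers are hypothetical placeholders, not Service Fabric APIs.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context)
        : base(context)
    {
    }

    protected override async Task<bool> OnDataLossAsync(
        RestoreContext restoreCtx, CancellationToken cancellationToken)
    {
        // Step 1: record that suspected data loss occurred and alert operators.
        await AlertOperationsAsync(this.Context.PartitionId); // hypothetical alerting helper

        // Steps 2-4: wait for the operator decision. In this sketch the helper blocks until
        // someone has compared this replica's state with the available backups and either
        // approves a restore (returns a staged backup folder) or decides this replica is
        // already the best available copy (returns null).
        string backupFolder = await TryGetApprovedBackupFolderAsync(cancellationToken); // hypothetical helper

        if (backupFolder == null)
        {
            // No state was changed, so the other remaining replicas keep what they have.
            return false;
        }

        // Step 5: restore the approved backup into this replica. RestorePolicy.Force allows
        // the restore even if the backup is older than the replica's current state.
        await restoreCtx.RestoreAsync(
            new RestoreDescription(backupFolder, RestorePolicy.Force),
            cancellationToken);

        // True tells Service Fabric that this replica's state changed; any other remaining
        // replicas are dropped and rebuilt from this one.
        return true;
    }

    // Hypothetical helpers, stubbed out only to keep the sketch self-contained.
    private static Task AlertOperationsAsync(Guid partitionId) => Task.CompletedTask;

    private static Task<string> TryGetApprovedBackupFolderAsync(CancellationToken cancellationToken) =>
        Task.FromResult<string>(null);
}
```

`RestorePolicy.Safe` refuses a restore whose data is older than the replica's current state, while `Force` skips that check; which one is appropriate depends on the comparison made in step 4.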
It is critically important that service authors practice potential data loss and failure scenarios before services are ever deployed in production. To protect against the possibility of data loss, it is important to periodically [back up the state](service-fabric-reliable-services-backup-restore.md) of any of your stateful services to a geo-redundant store. You must also ensure that you have the ability to restore it. Since backups of many different services are taken at different times, you need to ensure that after a restore your services have a consistent view of each other. For example, consider a situation where one service generates a number and stores it, then sends it to another service that also stores it. After a restore, you might discover that the second service has the number but the first does not, because its backup didn't include that operation.
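As a rough illustration of the periodic-backup advice above, the following sketch takes a full backup on a timer from `RunAsync` and copies each completed backup folder off the node. The one-hour interval, the service class name, and the `CopyBackupToGeoRedundantStoreAsync` helper are assumptions made for the example; substitute your own schedule and storage client.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class PeriodicallyBackedUpService : StatefulService
{
    public PeriodicallyBackedUpService(StatefulServiceContext context)
        : base(context)
    {
    }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // Take a full backup. Service Fabric writes it to a local folder and then
            // invokes the callback below so the folder can be copied off the node.
            await this.BackupAsync(new BackupDescription(BackupOption.Full, BackupCallbackAsync));

            // Back up on a fixed cadence; tune the interval to your recovery point objective.
            await Task.Delay(TimeSpan.FromHours(1), cancellationToken);
        }
    }

    private async Task<bool> BackupCallbackAsync(BackupInfo backupInfo, CancellationToken cancellationToken)
    {
        // backupInfo.Directory is the local folder containing this backup.
        // Copy it to a geo-redundant store; this helper is hypothetical.
        await CopyBackupToGeoRedundantStoreAsync(backupInfo.Directory, cancellationToken);

        // Returning true tells Service Fabric the backup was handled successfully.
        return true;
    }

    private static Task CopyBackupToGeoRedundantStoreAsync(string localFolder, CancellationToken cancellationToken) =>
        Task.CompletedTask; // placeholder for an upload to blob storage or similar
}
```

Incremental backups (`BackupOption.Incremental`) can reduce the amount of data copied between full backups.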
@@ -128,11 +129,13 @@ There could be situations where it is not possible to recover replicas for examp
DO NOT use these methods to bring the service online if potential DATA LOSS is not acceptable; in that case, all efforts should be made towards recovering the physical machines.
The following steps could result in data loss, so be sure before following them:
> [!NOTE]
> It is _never_ safe to use these methods other than in a targeted way against specific partitions.
>
- Use the `Repair-ServiceFabricPartition -PartitionId` cmdlet or the `System.Fabric.FabricClient.ClusterManagementClient.RecoverPartitionAsync(Guid partitionId)` API. This API allows specifying the ID of the partition to recover from quorum loss, with potential data loss (see the sketch after this list).
- If your cluster encounters frequent failures that cause services to go into quorum loss, and potential _data loss is acceptable_, then specifying an appropriate [QuorumLossWaitDuration](https://docs.microsoft.com/powershell/module/servicefabric/update-servicefabricservice?view=azureservicefabricps) value can help your service auto-recover. Service Fabric waits for the provided QuorumLossWaitDuration (default is infinite) before performing recovery; the sketch below shows how to set it programmatically. This method is _not recommended_ because it can cause unexpected data loss.
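To make the two options concrete, here is a minimal C# sketch, assuming a default local-cluster connection, a known partition ID, and a hypothetical service name `fabric:/MyApp/MyStatefulService`; the 30-minute wait duration is purely illustrative. The PowerShell equivalents are the `Repair-ServiceFabricPartition` and `Update-ServiceFabricService` cmdlets referenced above.

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

internal static class QuorumLossRecoveryExample
{
    public static async Task Main()
    {
        // Connects to the local cluster; pass endpoints and credentials for a remote cluster.
        var fabricClient = new FabricClient();

        // Option 1: targeted recovery of a single partition stuck in quorum loss.
        // This accepts potential data loss for that partition only.
        Guid affectedPartitionId = Guid.Parse("00000000-0000-0000-0000-000000000000"); // replace with the real partition ID
        await fabricClient.ClusterManager.RecoverPartitionAsync(affectedPartitionId);

        // Option 2 (not recommended): cap how long Service Fabric waits in quorum loss
        // before accepting potential data loss, so the service auto-recovers.
        var update = new StatefulServiceUpdateDescription
        {
            QuorumLossWaitDuration = TimeSpan.FromMinutes(30) // illustrative value
        };
        await fabricClient.ServiceManager.UpdateServiceAsync(
            new Uri("fabric:/MyApp/MyStatefulService"), // hypothetical service name
            update);
    }
}
```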
## Availability of the Service Fabric cluster
Generally speaking, the Service Fabric cluster itself is a highly distributed environment with no single points of failure. A failure of any one node will not cause availability or reliability issues for the cluster, primarily because the Service Fabric system services follow the same guidelines provided earlier: they always run with three or more replicas by default, and those system services that are stateless run on all nodes. The underlying Service Fabric networking and failure detection layers are fully distributed. Most system services can be rebuilt from metadata in the cluster, or know how to resynchronize their state from other places. The availability of the cluster can become compromised if system services get into quorum loss situations like those described above. In these cases you may not be able to perform certain operations on the cluster, like starting an upgrade or deploying new services, but the cluster itself is still up. Services that are already running will remain running in these conditions unless they require writes to the system services to continue functioning. For example, if the Failover Manager is in quorum loss, all services will continue to run, but any services that fail will not be able to automatically restart, since this requires the involvement of the Failover Manager.