Commit bb19c6c

Merge pull request #110436 from markjbrown/HA-update
Update outage scenarios
2 parents f3b3564 + 8c9458f commit bb19c6c

1 file changed: +23 -16 lines changed

articles/cosmos-db/high-availability.md

Lines changed: 23 additions & 16 deletions
@@ -46,18 +46,25 @@ Regional outages aren't uncommon, and Azure Cosmos DB makes sure your database i
 
 - Single-region accounts may lose availability following a regional outage. It's always recommended to set up **at least two regions** (preferably, at least two write regions) with your Cosmos account to ensure high availability at all times.
 
-- **Multi-region accounts with a single-write region (write region outage):**
-- During a write region outage, the Cosmos account will automatically promote a secondary region to be the new primary write region when **enable automatic failover** is configured on the Azure Cosmos account. When enabled, the failover will occur to another region in the order of region priority you’ve specified.
-- Customers may also choose to use **manual failover** and monitor their Cosmos write endpoint URL's themselves using an agent built themselves. For customers with complex and sophisticated health monitoring needs, this can provide reduced RTO should a failure occur in the write region.
-- When the previously impacted region is back online, any write data that was unreplicated when the region failed, is made available through the [conflicts feed](how-to-manage-conflicts.md#read-from-conflict-feed). Applications can read the conflicts feed, resolve the conflicts based on the application-specific logic, and write the updated data back to the Azure Cosmos container as appropriate.
-- Once the previously impacted write region recovers, it becomes automatically available as a read region. You can switch back to the recovered region as the write region. You can switch the regions by using [Azure CLI or Azure portal](how-to-manage-database-account.md#manual-failover). There is **no data or availability loss** before, during or after you switch the write region and your application continues to be highly available.
-
-- **Multi-region accounts with a single-write region (read region outage):**
-- During a read region outage, these accounts will remain highly available for reads and writes.
-- The impacted region is automatically disconnected and will be marked offline. The [Azure Cosmos DB SDKs](sql-api-sdk-dotnet.md) will redirect read calls to the next available region in the preferred region list.
-- If none of the regions in the preferred region list is available, calls automatically fall back to the current write region.
-- No changes are required in your application code to handle read region outage. Eventually, when the impacted region is back online, the previously impacted read region will automatically sync with the current write region and will be available again to serve read requests.
-- Subsequent reads are redirected to the recovered region without requiring any changes to your application code. During both failover and rejoining of a previously failed region, read consistency guarantees continue to be honored by Cosmos DB.
+### Multi-region accounts with a single-write region (write region outage)
+
+- During a write region outage, the Cosmos account will automatically promote a secondary region to be the new primary write region when **enable automatic failover** is configured on the Azure Cosmos account. When enabled, the failover will occur to another region in the order of region priority you've specified.
+- When the previously impacted region is back online, any write data that was not replicated when the region failed, is made available through the [conflicts feed](how-to-manage-conflicts.md#read-from-conflict-feed). Applications can read the conflicts feed, resolve the conflicts based on the application-specific logic, and write the updated data back to the Azure Cosmos container as appropriate.
+- Once the previously impacted write region recovers, it becomes automatically available as a read region. You can switch back to the recovered region as the write region. You can switch the regions by using [PowerShell, Azure CLI or Azure portal](how-to-manage-database-account.md#manual-failover). There is **no data or availability loss** before, during or after you switch the write region and your application continues to be highly available.
+
+> [!IMPORTANT]
+> It is strongly recommended that you configure the Azure Cosmos accounts used for production workloads to **enable automatic failover**. Manual failover requires connectivity between secondary and primary write region to complete a consistency check to ensure there is no data loss during the failover. If the primary region is unavailable, this consistency check cannot complete and the manual failover will not succeed, resulting in loss of write availability.
+
+### Multi-region accounts with a single-write region (read region outage)
+
+- During a read region outage, Cosmos accounts using any consistency level or strong consistency with three or more read regions will remain highly available for reads and writes.
+- The impacted region is automatically disconnected and will be marked offline. The [Azure Cosmos DB SDKs](sql-api-sdk-dotnet.md) will redirect read calls to the next available region in the preferred region list.
+- If none of the regions in the preferred region list is available, calls automatically fall back to the current write region.
+- No changes are required in your application code to handle read region outage. When the impacted read region is back online it will automatically sync with the current write region and will be available again to serve read requests.
+- Subsequent reads are redirected to the recovered region without requiring any changes to your application code. During both failover and rejoining of a previously failed region, read consistency guarantees continue to be honored by Cosmos DB.
+
+> [!IMPORTANT]
+> Azure Cosmos accounts using strong consistency with two or fewer read regions will lose write availability during a read region outage but will maintain read availability for remaining regions.
 
 - Even in a rare and unfortunate event when the Azure region is permanently irrecoverable, there is no data loss if your multi-region Cosmos account is configured with *Strong* consistency. In the event of a permanently irrecoverable write region, a multi-region Cosmos account configured with bounded-staleness consistency, the potential data loss window is restricted to the staleness window (*K* or *T*) where K=100,000 updates and T=5 minutes. For session, consistent-prefix and eventual consistency levels, the potential data loss window is restricted to a maximum of 15 minutes. For more information on RTO and RPO targets for Azure Cosmos DB, see [Consistency levels and data durability](consistency-levels-tradeoffs.md#rto)
 

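Reviewer note: the read-region outage bullets in the hunk above describe a simple routing rule (try the preferred region list in order, then fall back to the current write region). The following is a minimal sketch of that rule as a pure function, not SDK code. The function name `pick_read_region` and its parameters are hypothetical and exist only for illustration; the real SDKs perform this routing internally.

```python
from typing import Iterable, Sequence


def pick_read_region(
    preferred_regions: Sequence[str],
    offline_regions: Iterable[str],
    write_region: str,
) -> str:
    """Model the read routing described in the article text (illustrative only).

    Reads go to the first region in the preferred list that is not offline.
    If every preferred region is offline, reads fall back to the write region.
    """
    offline = set(offline_regions)
    for region in preferred_regions:
        if region not in offline:
            return region
    # No preferred region is reachable: fall back to the current write region.
    return write_region
```

For example, `pick_read_region(["West US", "East US"], {"West US"}, "Southeast Asia")` returns `"East US"`, and with both preferred regions offline it returns `"Southeast Asia"`.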
@@ -67,7 +74,7 @@ In addition to cross region resiliency, you can now enable **zone redundancy** w
 
 With Availability Zone support, Azure Cosmos DB will ensure replicas are placed across multiple zones within a given region to provide high availability and resiliency during zonal failures. There are no changes to latency and other SLAs in this configuration. In the event of a single zone failure, zone redundancy provides full data durability with RPO=0 and availability with RTO=0.
 
-Zone redundancy is a *supplemental capability* to the [multi-master replication](how-to-multi-master.md) feature. Zone redundancy alone cannot be relied upon to achieve regional resiliency. For example, in the event of regional outages or low latency access across the regions, its advised to have multiple write regions in addition to zone redundancy.
+Zone redundancy is a *supplemental capability* to the [multi-master replication](how-to-multi-master.md) feature. Zone redundancy alone cannot be relied upon to achieve regional resiliency. For example, in the event of regional outages or low latency access across the regions, it's advised to have multiple write regions in addition to zone redundancy.
 
 When configuring multi-region writes for your Azure Cosmos account, you can opt into zone redundancy at no extra cost. Otherwise, please see the note below regarding the pricing for zone redundancy support. You can enable zone redundancy on an existing region of your Azure Cosmos account by removing the region and adding it back with the zone redundancy enabled.
 

@@ -108,12 +115,12 @@ The following table summarizes the high availability capability of various accou
 > [!NOTE]
 > To enable Availability Zone support for a multi region Azure Cosmos account, the account must have multi-master writes enabled.
 
-You can enable zone redundancy when adding a region to new or existing Azure Cosmos accounts. To enable zone redundancy on your Azure Cosmos account, you should set the `isZoneRedundant` flag to `true` for a specific location. You can set this flag within the locations property. For example, the following powershell snippet enables zone redundancy for the "Southeast Asia" region:
+You can enable zone redundancy when adding a region to new or existing Azure Cosmos accounts. To enable zone redundancy on your Azure Cosmos account, you should set the `isZoneRedundant` flag to `true` for a specific location. You can set this flag within the locations property. For example, the following PowerShell snippet enables zone redundancy for the "Southeast Asia" region:
 
 ```powershell
 $locations = @(
 @{ "locationName"="Southeast Asia"; "failoverPriority"=0; "isZoneRedundant"= "true" },
-@{ "locationName"="East US"; "failoverPriority"=1 }
+@{ "locationName"="East US"; "failoverPriority"=1; "isZoneRedundant"= "true" }
 )
 ```
 

@@ -139,7 +146,7 @@ You can enable Availability Zones by using Azure portal when creating an Azure C
 
 - For multi-region Cosmos accounts that are configured with a single-write region, [enable automatic-failover by using Azure CLI or Azure portal](how-to-manage-database-account.md#automatic-failover). After you enable automatic failover, whenever there is a regional disaster, Cosmos DB will automatically failover your account.
 
-- Even if your Cosmos account is highly available, your application may not be correctly designed to remain highly available. To test the end-to-end high availability of your application, as a part of your application testing or disaster-recovery (DR) drills, temporarily disable automatic-failover for the account, invoke the [manual failover by using Azure CLI or Azure portal](how-to-manage-database-account.md#manual-failover), then monitor your application's failover. Once complete, you can fail back over to the primary region and restore automatic-failover for the account.
+- Even if your Azure Cosmos account is highly available, your application may not be correctly designed to remain highly available. To test the end-to-end high availability of your application, as a part of your application testing or disaster recovery (DR) drills, temporarily disable automatic-failover for the account, invoke the [manual failover by using PowerShell, Azure CLI or Azure portal](how-to-manage-database-account.md#manual-failover), then monitor your application's failover. Once complete, you can fail back over to the primary region and restore automatic-failover for the account.
 
 - Within a globally distributed database environment, there is a direct relationship between the consistency level and data durability in the presence of a region-wide outage. As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after a disruptive event. The time required for an application to fully recover is known as recovery time objective (RTO). You also need to understand the maximum period of recent data updates the application can tolerate losing when recovering after a disruptive event. The time period of updates that you might afford to lose is known as recovery point objective (RPO). To see the RPO and RTO for Azure Cosmos DB, see [Consistency levels and data durability](consistency-levels-tradeoffs.md#rto)
 

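Reviewer note: the article text in this diff ties the potential data loss window for an irrecoverable write region to the consistency level (none for strong; K=100,000 updates or T=5 minutes for bounded staleness; at most 15 minutes for session, consistent-prefix and eventual). A minimal sketch of those figures as a lookup table might look like the following. The names `DataLossWindow` and `data_loss_window` are hypothetical, and the numbers are copied from the article text, not read from any Azure API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataLossWindow:
    """Upper bound on data loss; None means the article states no bound of that kind."""
    max_minutes: Optional[int]
    max_updates: Optional[int]


# Figures as quoted in the article text for a multi-region account.
_WINDOWS = {
    "strong": DataLossWindow(max_minutes=0, max_updates=0),
    "bounded_staleness": DataLossWindow(max_minutes=5, max_updates=100_000),
    "session": DataLossWindow(max_minutes=15, max_updates=None),
    "consistent_prefix": DataLossWindow(max_minutes=15, max_updates=None),
    "eventual": DataLossWindow(max_minutes=15, max_updates=None),
}


def data_loss_window(consistency_level: str) -> DataLossWindow:
    """Look up the quoted data loss window for a consistency level name."""
    # Accept spellings such as "Bounded-Staleness" or "consistent prefix".
    key = consistency_level.strip().lower().replace("-", "_").replace(" ", "_")
    if key not in _WINDOWS:
        raise ValueError(f"unknown consistency level: {consistency_level!r}")
    return _WINDOWS[key]
```

A business continuity check could then compare `data_loss_window("session").max_minutes` against the RPO the application can tolerate.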