articles/iot-hub/iot-hub-ha-dr.md
Lines changed: 21 additions & 71 deletions
@@ -5,68 +5,22 @@ author: kgremban
5
5
ms.service: azure-iot-hub
6
6
services: iot-hub
7
7
ms.topic: conceptual
8
-
ms.date: 07/20/2023
8
+
ms.date: 03/25/2025
9
9
ms.author: kgremban
10
10
ms.custom: references_regions
11
11
---
12
12
13
13
# IoT Hub high availability and disaster recovery
14
14
15
-
As a first step towards implementing a resilient IoT solution, architects, developers, and business owners must define the uptime goals for the solutions they're building. These goals can be defined primarily based on specific business objectives for each scenario. In this context, the article [Azure Business Continuity Technical Guidance](/azure/architecture/framework/resiliency/app-design) describes a general framework to help you think about business continuity and disaster recovery. The [Disaster recovery and high availability for Azure applications](/azure/architecture/reliability/disaster-recovery) paper provides architecture guidance on strategies for Azure applications to achieve High Availability (HA) and Disaster Recovery (DR).
16
-
17
-
This article discusses the HA and DR features offered specifically by the IoT Hub service. The broad areas discussed in this article are:
18
-
19
-
* Intra-region HA
20
-
* Cross region DR
21
-
* Achieving cross region HA
15
+
As a first step towards implementing a resilient IoT solution, architects, developers, and business owners must define the uptime goals for the solutions they're building. These goals can be defined primarily based on specific business objectives for each scenario. In this context, the article [Azure Business Continuity Technical Guidance](/azure/architecture/framework/resiliency/app-design) describes a general framework to help you think about business continuity and disaster recovery. The [Disaster recovery and high availability for Azure applications](/azure/architecture/reliability/disaster-recovery) paper provides architecture guidance on strategies for Azure applications to achieve high availability (HA) and disaster recovery (DR).
22
16
23
17
Depending on the uptime goals you define for your IoT solutions, you should determine which of the options outlined in this article best suit your business objectives. Incorporating any of these HA/DR alternatives into your IoT solution requires a careful evaluation of the trade-offs between the:
24
18
25
19
* Level of resiliency you require
26
20
* Implementation and maintenance complexity
27
21
* Cost of goods sold (COGS) impact
28
22
29
-
## Intra-region HA
30
-
31
-
The IoT Hub service provides intra-region HA by implementing redundancies in almost all layers of the service. The [SLA published by the IoT Hub service](https://azure.microsoft.com/support/legal/sla/iot-hub) is achieved by making use of these redundancies. No extra work is required by the developers of an IoT solution to take advantage of these HA features. Although IoT Hub offers a reasonably high uptime guarantee, transient failures can still be expected as with any distributed computing platform. If you're just getting started with migrating your solutions to the cloud from an on-premises solution, your focus needs to shift from optimizing "mean time between failures" to "mean time to recover". In other words, transient failures are to be considered normal while operating with the cloud in the mix. Appropriate [retry patterns](../iot/concepts-manage-device-reconnections.md#retry-patterns) must be built in to the components interacting with a cloud application to deal with transient failures.
32
-
33
-
## Availability zones
34
-
35
-
IoT Hub supports [Azure availability zones](../reliability/availability-zones-overview.md). An availability zone is a high-availability offering that protects your applications and data from datacenter failures. A region with availability zone support comprises three zones supporting that region. Each zone provides one or more datacenters, each in a unique physical location with independent power, cooling, and networking. This configuration provides replication and redundancy within the region.
36
-
37
-
Availability zones provide two advantages: data resiliency and smoother deployments.
38
-
39
-
*Data resiliency* comes from replacing the underlying storage services with availability-zones-supported storage. Data resilience is important for IoT solutions because these solutions often operate in complex, dynamic, and uncertain environments where failures or disruptions can have significant consequences. Whether an IoT solution supports a manufacturing floor, retail or restaurant environments, healthcare systems, or infrastructure, the availability and quality of data is necessary to recover from failures and to provide reliable and consistent services.
40
-
41
-
*Smoother deployments* come from replacing the underlying data center hardware with newer hardware that supports availability zones. These hardware improvements minimize customer impact from device disconnects and reconnects as well as other deployment-related downtime. The IoT Hub engineering team deploys multiple updates to each IoT hub every month, for both security reasons and to provide feature improvements. Availability-zones-supported hardware is split into 15 update domains so that each update goes more smoothly, with minimal impact to your workflows. For more information about update domains, see [Availability sets](/azure/virtual-machines/availability-set-overview).
42
-
43
-
Availability zone support for IoT Hub is enabled automatically for new IoT Hub resources created in the following Azure regions:
44
-
45
-
| Region | Data resiliency | Smoother deployments |
46
-
| ------ | --------------- | ------------ |
47
-
| Australia East | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
48
-
| Brazil South | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
| UK South | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
63
-
| West Europe | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
64
-
| West US 2 | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
65
-
| West US 3 | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
66
-
67
-
## Cross region DR
68
-
69
-
There could be some rare situations when a datacenter experiences extended outages due to power failures or other failures involving physical assets. Such events are rare during which the intra region HA capability described previously may not always help. IoT Hub provides multiple solutions for recovering from such extended outages.
23
+
In rare situations, a datacenter can experience extended outages due to power failures or other failures involving physical assets. IoT Hub provides multiple solutions for recovering from such extended outages.
70
24
71
25
The recovery options available to customers in such a situation are [Microsoft-initiated failover](#microsoft-initiated-failover) and [manual failover](#manual-failover). The fundamental difference between the two is that Microsoft initiates the former and the user initiates the latter. Also, manual failover provides a lower recovery time objective (RTO) compared to the Microsoft-initiated failover option. The specific RTOs offered with each option are discussed in the following sections. When either of these options to perform failover of an IoT hub from its primary region is exercised, the hub becomes fully functional in the corresponding [Azure geo-paired region](../reliability/cross-region-replication-azure.md).
72
26
@@ -91,6 +45,16 @@ Once the failover operation for the IoT hub completes, all operations from the d
91
45
>* If you use Azure Functions or Azure Stream Analytics to connect to the built-in Events endpoint, you might need to perform a **Restart**. This is because previous offsets are no longer valid after failover.
92
46
>* When routing to storage, we recommend listing the blobs or files and then iterating over them, to ensure all blobs or files are read without making any assumptions of partition. The partition range could potentially change during a Microsoft-initiated failover or manual failover. You can use the [List Blobs API](/rest/api/storageservices/list-blobs) to enumerate the list of blobs or [List ADLS Gen2 API](/rest/api/storageservices/datalakestoragegen2/filesystem/list) for the list of files. To learn more, see [Azure Storage as a routing endpoint](iot-hub-devguide-endpoints.md#azure-storage-as-a-routing-endpoint).
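The point of listing rather than constructing names is that the consumer never hard-codes a partition range. The following is a minimal sketch that works on any iterable of blob names, such as the names returned by a listing call. It assumes the blob names follow a `{iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}` layout, which is an assumption about your routing endpoint's file name format, and the helper name `group_blobs_by_partition` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def group_blobs_by_partition(blob_names: Iterable[str]) -> dict[str, list[str]]:
    """Group listed blob names by the partition segment found in each name.

    Partitions are discovered from the listing itself, so a partition
    range that changed during failover doesn't cause blobs to be skipped.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name in blob_names:
        segments = name.split("/")
        # Names that don't match the assumed layout are kept under "".
        partition = segments[1] if len(segments) > 1 else ""
        groups[partition].append(name)
    # Lexicographic order matches time order when date segments are zero-padded.
    return {partition: sorted(names) for partition, names in groups.items()}
```

Because every listed name ends up in some group, the total number of grouped blobs always equals the number of listed blobs, whatever partitions appear.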
93
47
48
+
## Choose the right HA/DR option
49
+
50
+
Here's a summary of the HA/DR options presented in this article that can be used as a frame of reference to choose the right option that works for your solution.

| HA/DR option | RTO | RPO | Requires manual intervention? | Implementation complexity | Extra cost impact |
| --- | --- | --- | --- | --- | --- |
| Microsoft-initiated failover |2 - 26 hours|Refer to the RPO table above|No|None|None|
55
+
| Manual failover |10 min - 2 hours|Refer to the RPO table above|Yes|Very low. You only need to trigger this operation from the portal.|None|
56
+
| Cross region HA |< 1 min|Depends on the replication frequency of your custom HA solution|No|High|> 1x the cost of 1 IoT hub|
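
The summary can be read as a simple lookup: given the longest outage your solution can tolerate, which options have a worst-case RTO within that bound? The following is a minimal sketch that uses only the RTO ranges listed in this summary. The names `WORST_CASE_RTO` and `options_meeting_rto` are illustrative, and the cross-region HA figure assumes your custom solution actually achieves the stated bound.

```python
from datetime import timedelta

# Worst-case RTO per option, taken from the summary in this article.
# Cross region HA is listed as "< 1 min"; one minute is used as its upper bound.
WORST_CASE_RTO = {
    "Microsoft-initiated failover": timedelta(hours=26),
    "Manual failover": timedelta(hours=2),
    "Cross region HA": timedelta(minutes=1),
}


def options_meeting_rto(max_downtime: timedelta) -> list[str]:
    """Return the options whose worst-case RTO fits within max_downtime."""
    return [
        name
        for name, rto in WORST_CASE_RTO.items()
        if rto <= max_downtime
    ]
```

For example, a four-hour tolerance rules out Microsoft-initiated failover but leaves manual failover and cross-region HA. RTO is only one axis; RPO, complexity, and cost still need to be weighed separately.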
57
+
94
58
## Microsoft-initiated failover
95
59
96
60
Microsoft-initiated failover is exercised by Microsoft in rare situations to fail over all the IoT hubs from an affected region to the corresponding geo-paired region. This process is a default option and requires no intervention from the user. Microsoft reserves the right to determine when this option is exercised. This mechanism doesn't require user consent before the user's hub is failed over. Microsoft-initiated failover has a recovery time objective (RTO) of 2-26 hours.
@@ -108,21 +72,16 @@ If your business uptime goals aren't satisfied by the RTO that Microsoft initiat
108
72
109
73
The manual failover option is always available, whether or not the primary region is experiencing downtime. Therefore, this option can be used to perform planned failovers, such as periodic failover drills. Be aware that a planned failover results in downtime for the hub for the period defined by the RTO for this option, and in data loss as defined by the RPO table above. Consider setting up a test IoT hub instance to exercise the planned failover option periodically, to gain confidence in your ability to get your end-to-end solutions up and running when a real disaster happens.
110
74
111
-
Manual failover is available at no additional cost for IoT hubs created after May 18, 2017
75
+
Manual failover is available at no additional cost for IoT hubs created after May 18, 2017.
112
76
113
77
For step-by-step instructions, see [Tutorial: Perform manual failover for an IoT hub](tutorial-manual-failover.md).
114
78
115
-
## Manual failover and Event Hubs
79
+
### Manual failover and Event Hubs
116
80
117
81
The Event Hubs-compatible name and endpoint of the IoT Hub built-in events endpoint change after manual failover. This is because the Event Hubs client doesn't have visibility into IoT Hub events. The same is true for other cloud-based clients such as Functions and Azure Stream Analytics. To retrieve the endpoint and name, you can use the Azure portal or the .NET SDK.
118
82
119
-
### Use the portal
120
-
121
-
For more information about using the portal to retrieve the Event Hub-compatible endpoint and the Event Hub-compatible name, see [Connect to the built-in endpoint](iot-hub-devguide-messages-read-builtin.md#connect-to-the-built-in-endpoint).
122
-
123
-
### Use the .NET SDK
124
-
125
-
To use the IoT Hub connection string to recapture the Event Hubs-compatible endpoint, use a sample located at [https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs](https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs). The code example uses the connection string to get the new Event Hubs endpoint and re-establish the connection. You must have Visual Studio installed.
83
+
* Use the Azure portal: For more information about using the portal to retrieve the Event Hub-compatible endpoint and the Event Hub-compatible name, see [Connect to the built-in endpoint](iot-hub-devguide-messages-read-builtin.md#connect-to-the-built-in-endpoint).
84
+
* Use the .NET SDK: To use the IoT Hub connection string to recapture the Event Hubs-compatible endpoint, use a sample located at [https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs](https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs). The code example uses the connection string to get the new Event Hubs endpoint and re-establish the connection. You must have Visual Studio installed.
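
Whichever method you use, a client that caches the Event Hubs-compatible connection string needs a way to notice that the endpoint or name changed after failover. The following is a minimal sketch that assumes a `Endpoint=...;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=...` connection-string layout. The helper names and any sample values are hypothetical, and retrieving the fresh string from the portal or SDK isn't shown.

```python
def parse_connection_string(conn_str: str) -> dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs into a dict.

    Only the first '=' in each pair is treated as the separator, because
    base64 shared access keys can end with '=' padding.
    """
    parts = {}
    for pair in conn_str.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        parts[key.strip()] = value.strip()
    return parts


def endpoint_changed(cached: str, fresh: str) -> bool:
    """Return True if the endpoint or entity name differs between the two strings."""
    old, new = parse_connection_string(cached), parse_connection_string(fresh)
    return (old.get("Endpoint"), old.get("EntityPath")) != (
        new.get("Endpoint"),
        new.get("EntityPath"),
    )
```

A consumer could call `endpoint_changed` after a failed receive and rebuild its Event Hubs client only when the result is `True`, which avoids reconnecting unnecessarily on ordinary transient errors.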
126
85
127
86
### Run test drills
128
87
@@ -132,7 +91,7 @@ Test drills shouldn't be performed on IoT hubs that are being used in your produ
132
91
133
92
Manual failover should *not* be used as a mechanism to permanently migrate your hub between the Azure geo-paired regions. Assuming that the devices are located closest to the hub's primary region, latency for operations performed against the IoT hub increases when the hub fails over to a secondary region.
134
93
135
-
## Failback
94
+
### Failback
136
95
137
96
You can fail back to the old primary region by triggering the failover action a second time. If the original failover operation was performed to recover from an extended outage in the original primary region, we recommend failing the hub back to the original location once that location has recovered from the outage.
138
97
@@ -152,7 +111,7 @@ Time to recover = RTO [10 min - 2 hours for manual failover | 2 - 26 hours for M
152
111
153
112
## Disable disaster recovery
154
113
155
-
IoT Hub provides Microsoft-Initiated Failover and Manual Failover by replicating data to the [paired region](../reliability/cross-region-replication-azure.md) for each IoT hub. For some regions, you can avoid data replication outside of the region by disabling disaster recovery when creating an IoT hub. The following regions support this feature:
114
+
IoT Hub provides Microsoft-initiated failover and manual failover by replicating data to the [paired region](../reliability/cross-region-replication-azure.md) for each IoT hub. For some regions, you can avoid data replication outside of the region by disabling disaster recovery when creating an IoT hub. The following regions support this feature:
156
115
157
116
* **Brazil South**; paired region, South Central US.
158
117
* **Southeast Asia (Singapore)**; paired region, East Asia (Hong Kong SAR).
@@ -169,9 +128,10 @@ Failover capability won't be available if you disable disaster recovery for an I
169
128
170
129
You can only disable disaster recovery to avoid data replication outside of the paired region in Brazil South or Southeast Asia when you create an IoT hub. If you want to configure your existing IoT hub to disable disaster recovery, you need to create a new IoT hub with disaster recovery disabled and manually migrate your existing IoT hub. For guidance, see [How to migrate an IoT hub](migrate-hub-state-cli.md).
171
130
172
-
## Achieve crossregion HA
131
+
## Achieve cross-region HA
173
132
174
133
If your business uptime goals aren't satisfied by the RTO that either Microsoft-initiated failover or manual failover options provide, you should consider implementing a per-device automatic cross region failover mechanism.
134
+
175
135
A complete treatment of deployment topologies in IoT solutions is outside the scope of this article. The article discusses the *regional failover* deployment model for high availability and disaster recovery.
176
136
177
137
In a regional failover model, the solution back end runs primarily in one datacenter location. A secondary IoT hub and back end are deployed in another datacenter location. If the IoT hub in the primary region suffers an outage or the network connectivity from the device to the primary region is interrupted, devices use a secondary service endpoint. You can improve the solution availability by implementing a cross-region failover model instead of staying within a single region.
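
From the device's point of view, regional failover amounts to trying the primary endpoint first and falling back to the secondary after repeated connection failures. The following is a minimal sketch, assuming a hypothetical `connect` callable that raises `ConnectionError` on failure. The function name, retry count, and backoff values are illustrative and aren't part of any IoT Hub SDK.

```python
import time
from typing import Callable, Sequence


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], object],
    attempts_per_endpoint: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Try each endpoint in order, retrying with exponential backoff.

    Returns (endpoint, connection) for the first endpoint that accepts the
    connection. Raises ConnectionError if every endpoint fails.
    """
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return endpoint, connect(endpoint)
            except ConnectionError:
                # Back off before trying again: 1s, 2s, 4s, ... by default.
                sleep(base_delay_s * (2 ** attempt))
    raise ConnectionError("all endpoints failed: " + ", ".join(endpoints))
```

Listing the primary endpoint first keeps devices on the closest hub during normal operation. The `sleep` parameter exists so the backoff can be replaced in tests.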
@@ -189,16 +149,6 @@ At a high level, to implement a regional failover model with IoT Hub, you need t
189
149
190
150
To simplify this step, you should use idempotent operations. Idempotent operations minimize the side effects from the eventually consistent distribution of events, and from duplicates or out-of-order delivery of events. In addition, the application logic should be designed to tolerate potential inconsistencies or slightly out-of-date state. This situation can occur due to the extra time it takes for the system to heal based on recovery point objectives (RPO).
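
One common way to make event handling idempotent is to key each update on a message identifier, ignore identifiers that were already applied, and drop updates that are older than the stored state. The following is a minimal sketch, assuming each event carries a hypothetical `message_id`, `device_id`, `sequence`, and `value`. The in-memory `DeviceStateStore` stands in for whatever durable store your back end uses.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceStateStore:
    """In-memory stand-in for a durable store (illustrative only)."""

    seen_ids: set = field(default_factory=set)
    latest: dict = field(default_factory=dict)  # device_id -> (sequence, value)

    def apply(self, message_id: str, device_id: str, sequence: int, value) -> bool:
        """Apply an event at most once; return True if the state changed."""
        if message_id in self.seen_ids:
            return False  # duplicate delivery: already applied
        self.seen_ids.add(message_id)
        current = self.latest.get(device_id)
        if current is not None and current[0] >= sequence:
            return False  # out-of-order delivery: newer state already stored
        self.latest[device_id] = (sequence, value)
        return True
```

Applying the same event twice, or applying an older event after a newer one, leaves the stored state unchanged, so replays after a failover don't corrupt it.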
191
151
192
-
## Choose the right HA/DR option
193
-
194
-
Here's a summary of the HA/DR options presented in this article that can be used as a frame of reference to choose the right option that works for your solution.