Commit fa33445

committed
reliability cont.
1 parent 7e4a56e commit fa33445

File tree

3 files changed

+65
-371
lines changed

articles/iot-hub/TOC.yml

Lines changed: 3 additions & 1 deletion
@@ -157,14 +157,16 @@
157157
- name: Understanding IoT hub IP address
158158
displayName: service tags, firewall rules, IP filter
159159
href: iot-hub-understand-ip-address.md
160-
- name: Scaling and availability
160+
- name: Reliability and scaling
161161
items:
162162
- name: Best practices for large-scale IoT device deployments
163163
displayName: device provisioning, staggered provisioning schedule, reprovisioning devices, monitoring devices
164164
href: ../iot-dps/concepts-deploy-at-scale.md?toc=/azure/iot-hub/toc.json&bc=/azure/iot-hub/breadcrumb/toc.json
165165
- name: High availability and disaster recovery
166166
displayName: HA, DR, availability zone, failover, failback
167167
href: iot-hub-ha-dr.md
168+
- name: Reliability
169+
href: ../reliability/reliability-iot-hub.md?toc=/azure/iot-hub/toc.json&bc=/azure/iot-hub/breadcrumb/toc.json
168170
- name: Authentication and authorization
169171
items:
170172
- name: Microsoft Entra ID

articles/iot-hub/iot-hub-ha-dr.md

Lines changed: 21 additions & 71 deletions
@@ -5,68 +5,22 @@ author: kgremban
55
ms.service: azure-iot-hub
66
services: iot-hub
77
ms.topic: conceptual
8-
ms.date: 07/20/2023
8+
ms.date: 03/25/2025
99
ms.author: kgremban
1010
ms.custom: references_regions
1111
---
1212

1313
# IoT Hub high availability and disaster recovery
1414

15-
As a first step towards implementing a resilient IoT solution, architects, developers, and business owners must define the uptime goals for the solutions they're building. These goals can be defined primarily based on specific business objectives for each scenario. In this context, the article [Azure Business Continuity Technical Guidance](/azure/architecture/framework/resiliency/app-design) describes a general framework to help you think about business continuity and disaster recovery. The [Disaster recovery and high availability for Azure applications](/azure/architecture/reliability/disaster-recovery) paper provides architecture guidance on strategies for Azure applications to achieve High Availability (HA) and Disaster Recovery (DR).
16-
17-
This article discusses the HA and DR features offered specifically by the IoT Hub service. The broad areas discussed in this article are:
18-
19-
* Intra-region HA
20-
* Cross region DR
21-
* Achieving cross region HA
15+
As a first step towards implementing a resilient IoT solution, architects, developers, and business owners must define the uptime goals for the solutions they're building. These goals can be defined primarily based on specific business objectives for each scenario. In this context, the article [Azure Business Continuity Technical Guidance](/azure/architecture/framework/resiliency/app-design) describes a general framework to help you think about business continuity and disaster recovery. The [Disaster recovery and high availability for Azure applications](/azure/architecture/reliability/disaster-recovery) paper provides architecture guidance on strategies for Azure applications to achieve high availability (HA) and disaster recovery (DR).
2216

2317
Depending on the uptime goals you define for your IoT solutions, you should determine which of the options outlined in this article best suit your business objectives. Incorporating any of these HA/DR alternatives into your IoT solution requires a careful evaluation of the trade-offs between the:
2418

2519
* Level of resiliency you require
2620
* Implementation and maintenance complexity
2721
* COGS impact
2822

29-
## Intra-region HA
30-
31-
The IoT Hub service provides intra-region HA by implementing redundancies in almost all layers of the service. The [SLA published by the IoT Hub service](https://azure.microsoft.com/support/legal/sla/iot-hub) is achieved by making use of these redundancies. No extra work is required by the developers of an IoT solution to take advantage of these HA features. Although IoT Hub offers a reasonably high uptime guarantee, transient failures can still be expected as with any distributed computing platform. If you're just getting started with migrating your solutions to the cloud from an on-premises solution, your focus needs to shift from optimizing "mean time between failures" to "mean time to recover". In other words, transient failures are to be considered normal while operating with the cloud in the mix. Appropriate [retry patterns](../iot/concepts-manage-device-reconnections.md#retry-patterns) must be built in to the components interacting with a cloud application to deal with transient failures.
32-
33-
## Availability zones
34-
35-
IoT Hub supports [Azure availability zones](../reliability/availability-zones-overview.md). An availability zone is a high-availability offering that protects your applications and data from datacenter failures. A region with availability zone support comprises three zones supporting that region. Each zone provides one or more datacenters, each in a unique physical location with independent power, cooling, and networking. This configuration provides replication and redundancy within the region.
36-
37-
Availability zones provide two advantages: data resiliency and smoother deployments.
38-
39-
*Data resiliency* comes from replacing the underlying storage services with availability-zones-supported storage. Data resilience is important for IoT solutions because these solutions often operate in complex, dynamic, and uncertain environments where failures or disruptions can have significant consequences. Whether an IoT solution supports a manufacturing floor, retail or restaurant environments, healthcare systems, or infrastructure, the availability and quality of data is necessary to recover from failures and to provide reliable and consistent services.
40-
41-
*Smoother deployments* come from replacing the underlying data center hardware with newer hardware that supports availability zones. These hardware improvements minimize customer impact from device disconnects and reconnects as well as other deployment-related downtime. The IoT Hub engineering team deploys multiple updates to each IoT hub every month, both for security reasons and to provide feature improvements. Availability-zones-supported hardware is split into 15 update domains so that each update goes more smoothly, with minimal impact to your workflows. For more information about update domains, see [Availability sets](/azure/virtual-machines/availability-set-overview).
42-
43-
Availability zone support for IoT Hub is enabled automatically for new IoT Hub resources created in the following Azure regions:
44-
45-
| Region | Data resiliency | Smoother deployments |
46-
| ------ | --------------- | ------------ |
47-
| Australia East | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
48-
| Brazil South | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
49-
| Canada Central | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
50-
| Central India | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
51-
| Central US | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
52-
| East US | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
53-
| France Central | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
54-
| Germany West Central | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
55-
| Japan East | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
56-
| Korea Central | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
57-
| North Europe | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
58-
| Norway East | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
59-
| Qatar Central | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
60-
| Southcentral US | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
61-
| Southeast Asia | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
62-
| UK South | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
63-
| West Europe | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
64-
| West US 2 | :::image type="icon" source="./media/icons/yes-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
65-
| West US 3 | :::image type="icon" source="./media/icons/no-icon.png"::: | :::image type="icon" source="./media/icons/yes-icon.png"::: |
66-
67-
## Cross region DR
68-
69-
There could be some rare situations when a datacenter experiences extended outages due to power failures or other failures involving physical assets. Such events are rare during which the intra region HA capability described previously may not always help. IoT Hub provides multiple solutions for recovering from such extended outages.
23+
In rare situations, a datacenter can experience an extended outage due to power failures or other failures involving physical assets. IoT Hub provides multiple solutions for recovering from such extended outages.
7024

7125
The recovery options available to customers in such a situation are [Microsoft-initiated failover](#microsoft-initiated-failover) and [manual failover](#manual-failover). The fundamental difference between the two is that Microsoft initiates the former and the user initiates the latter. Also, manual failover provides a lower recovery time objective (RTO) compared to the Microsoft-initiated failover option. The specific RTOs offered with each option are discussed in the following sections. When either of these options to perform failover of an IoT hub from its primary region is exercised, the hub becomes fully functional in the corresponding [Azure geo-paired region](../reliability/cross-region-replication-azure.md).
7226

@@ -91,6 +45,16 @@ Once the failover operation for the IoT hub completes, all operations from the d
9145
>* If you use Azure Functions or Azure Stream Analytics to connect the built-in Events endpoint, you might need to perform a **Restart**. This is because during failover previous offsets are no longer valid.
9246
>* When routing to storage, we recommend listing the blobs or files and then iterating over them, to ensure all blobs or files are read without making any assumptions of partition. The partition range could potentially change during a Microsoft-initiated failover or manual failover. You can use the [List Blobs API](/rest/api/storageservices/list-blobs) to enumerate the list of blobs or [List ADLS Gen2 API](/rest/api/storageservices/datalakestoragegen2/filesystem/list) for the list of files. To learn more, see [Azure Storage as a routing endpoint](iot-hub-devguide-endpoints.md#azure-storage-as-a-routing-endpoint).
9347
48+
## Choose the right HA/DR option
49+
50+
Here's a summary of the HA/DR options presented in this article that can be used as a frame of reference to choose the right option that works for your solution.
51+
52+
| HA/DR option | RTO | RPO | Requires manual intervention? | Implementation complexity | Cost impact|
53+
| --- | --- | --- | --- | --- | --- |
54+
| Microsoft-initiated failover |2 - 26 hours|Refer to the RPO table above|No|None|None|
55+
| Manual failover |10 min - 2 hours|Refer to the RPO table above|Yes|Very low. You only need to trigger this operation from the portal.|None|
56+
| Cross region HA |< 1 min|Depends on the replication frequency of your custom HA solution|No|High|> 1x the cost of 1 IoT hub|
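One way to read the table above is to compare each option's worst-case RTO against your uptime goal. The following is a minimal sketch of that comparison, not part of the IoT Hub tooling: the option names and RTO bounds are taken from the table, while the `WORST_CASE_RTO` mapping and the `options_meeting_rto` function are hypothetical names introduced for illustration. The "< 1 min" entry is approximated as an upper bound of one minute.

```python
from datetime import timedelta

# Worst-case RTO per option, copied from the summary table.
# Assumption: "< 1 min" for cross-region HA is treated as a 1-minute upper bound.
WORST_CASE_RTO = {
    "Microsoft-initiated failover": timedelta(hours=26),
    "Manual failover": timedelta(hours=2),
    "Cross region HA": timedelta(minutes=1),
}


def options_meeting_rto(target: timedelta) -> list[str]:
    """Return the options whose worst-case RTO fits within the target RTO."""
    return [name for name, rto in WORST_CASE_RTO.items() if rto <= target]


if __name__ == "__main__":
    # Example: a solution that can tolerate up to four hours of downtime.
    print(options_meeting_rto(timedelta(hours=4)))
```

A filter like this only covers RTO. RPO, implementation complexity, and cost still need to be weighed separately, as the other columns of the table indicate.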
57+
9458
## Microsoft-initiated failover
9559

9660
Microsoft-initiated failover is exercised by Microsoft in rare situations to fail over all the IoT hubs from an affected region to the corresponding geo-paired region. This process is a default option and requires no intervention from the user. Microsoft reserves the right to make a determination of when this option will be exercised. This mechanism doesn't involve a user consent before the user's hub is failed over. Microsoft-initiated failover has a recovery time objective (RTO) of 2-26 hours.
@@ -108,21 +72,16 @@ If your business uptime goals aren't satisfied by the RTO that Microsoft initiat
10872

10973
The manual failover option is always available, whether or not the primary region is experiencing downtime. Therefore, this option could potentially be used to perform planned failovers. One example usage of planned failovers is to perform periodic failover drills. Be aware, though, that a planned failover operation results in downtime for the hub for the period defined by the RTO for this option, and also results in data loss as defined by the RPO table above. You could consider setting up a test IoT hub instance to exercise the planned failover option periodically to gain confidence in your ability to get your end-to-end solutions up and running when a real disaster happens.
11074

111-
Manual failover is available at no additional cost for IoT hubs created after May 18, 2017
75+
Manual failover is available at no additional cost for IoT hubs created after May 18, 2017.
11276

11377
For step-by-step instructions, see [Tutorial: Perform manual failover for an IoT hub](tutorial-manual-failover.md).
11478

115-
## Manual failover and Event Hubs
79+
### Manual failover and Event Hubs
11680

11781
The Event Hubs-compatible name and endpoint of the IoT Hub built-in events endpoint change after manual failover. This is because the Event Hubs client doesn't have visibility into IoT Hub events. The same is true for other cloud-based clients such as Functions and Azure Stream Analytics. To retrieve the endpoint and name, you can use the Azure portal or the .NET SDK.
11882

119-
### Use the portal
120-
121-
For more information about using the portal to retrieve the Event Hub-compatible endpoint and the Event Hub-compatible name, see [Connect to the built-in endpoint](iot-hub-devguide-messages-read-builtin.md#connect-to-the-built-in-endpoint).
122-
123-
### Use the .NET SDK
124-
125-
To use the IoT Hub connection string to recapture the Event Hubs-compatible endpoint, use a sample located at [https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs](https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs). The code example uses the connection string to get the new Event Hubs endpoint and re-establish the connection. You must have Visual Studio installed.
83+
* Use the Azure portal: For more information about using the portal to retrieve the Event Hub-compatible endpoint and the Event Hub-compatible name, see [Connect to the built-in endpoint](iot-hub-devguide-messages-read-builtin.md#connect-to-the-built-in-endpoint).
84+
* Use the .NET SDK: To use the IoT Hub connection string to recapture the Event Hubs-compatible endpoint, use a sample located at [https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs](https://github.com/Azure/azure-sdk-for-net/tree/main/samples/iothub-connect-to-eventhubs). The code example uses the connection string to get the new Event Hubs endpoint and re-establish the connection. You must have Visual Studio installed.
12685

12786
### Run test drills
12887

@@ -132,7 +91,7 @@ Test drills shouldn't be performed on IoT hubs that are being used in your produ
13291

13392
Manual failover should *not* be used as a mechanism to permanently migrate your hub between the Azure geo paired regions. Assuming that the devices were located closest to the hub's primary region, latency for operations being performed against the IoT hub will increase when the hub fails over to a secondary region.
13493

135-
## Failback
94+
### Failback
13695

13796
You can fail back to the old primary region by triggering the failover action a second time. If the original failover operation was performed to recover from an extended outage in the original primary region, we recommend that you fail the hub back to the original location once that location has recovered from the outage.
13897

@@ -152,7 +111,7 @@ Time to recover = RTO [10 min - 2 hours for manual failover | 2 - 26 hours for M
152111
153112
## Disable disaster recovery
154113

155-
IoT Hub provides Microsoft-Initiated Failover and Manual Failover by replicating data to the [paired region](../reliability/cross-region-replication-azure.md) for each IoT hub. For some regions, you can avoid data replication outside of the region by disabling disaster recovery when creating an IoT hub. The following regions support this feature:
114+
IoT Hub provides Microsoft-initiated failover and manual failover by replicating data to the [paired region](../reliability/cross-region-replication-azure.md) for each IoT hub. For some regions, you can avoid data replication outside of the region by disabling disaster recovery when creating an IoT hub. The following regions support this feature:
156115

157116
* **Brazil South**; paired region, South Central US.
158117
* **Southeast Asia (Singapore)**; paired region, East Asia (Hong Kong SAR).
@@ -169,9 +128,10 @@ Failover capability won't be available if you disable disaster recovery for an I
169128

170129
You can only disable disaster recovery to avoid data replication outside of the paired region in Brazil South or Southeast Asia when you create an IoT hub. If you want to configure your existing IoT hub to disable disaster recovery, you need to create a new IoT hub with disaster recovery disabled and manually migrate your existing IoT hub. For guidance, see [How to migrate an IoT hub](migrate-hub-state-cli.md).
171130

172-
## Achieve cross region HA
131+
## Achieve cross-region HA
173132

174133
If your business uptime goals aren't satisfied by the RTO that either Microsoft-initiated failover or manual failover options provide, you should consider implementing a per-device automatic cross-region failover mechanism.
134+
175135
A complete treatment of deployment topologies in IoT solutions is outside the scope of this article. The article discusses the *regional failover* deployment model for high availability and disaster recovery.
176136

177137
In a regional failover model, the solution back end runs primarily in one datacenter location. A secondary IoT hub and back end are deployed in another datacenter location. If the IoT hub in the primary region suffers an outage or the network connectivity from the device to the primary region is interrupted, devices use a secondary service endpoint. You can improve the solution availability by implementing a cross-region failover model instead of staying within a single region.
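The device-side behavior described above, where a device uses a secondary service endpoint when the primary is unreachable, can be sketched as an ordered fallback. This is an illustrative sketch only: `connect_with_failover`, the `connect` callback, and the hostnames in the usage example are hypothetical and don't correspond to an IoT Hub SDK API. A real device would pass in whatever connection routine its SDK provides.

```python
from typing import Callable, Optional, Sequence, Tuple


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], object],
) -> Tuple[str, object]:
    """Try each endpoint in order and return the first one that connects.

    `connect` is a caller-supplied function (hypothetical here) that raises
    ConnectionError when the endpoint can't be reached.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as err:
            # Remember the failure and fall through to the next endpoint.
            last_error = err
    raise ConnectionError("no IoT hub endpoint reachable") from last_error


if __name__ == "__main__":
    def fake_connect(host: str) -> str:
        # Simulate an outage in the primary region.
        if host.startswith("primary"):
            raise ConnectionError("primary region unavailable")
        return f"session:{host}"

    print(connect_with_failover(
        ["primary-hub.example.net", "secondary-hub.example.net"],
        fake_connect,
    ))
```

Listing the primary endpoint first keeps devices on the closest hub during normal operation. The device only pays the extra latency of the secondary region while the primary is down.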
@@ -189,16 +149,6 @@ At a high level, to implement a regional failover model with IoT Hub, you need t
189149

190150
To simplify this step, you should use idempotent operations. Idempotent operations minimize the side-effects from the eventual consistent distribution of events, and from duplicates or out-of-order delivery of events. In addition, the application logic should be designed to tolerate potential inconsistencies or slightly out-of-date state. This situation can occur due to the extra time it takes for the system to heal based on recovery point objectives (RPO).
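To make the idea of idempotent operations concrete, here is a minimal sketch of a state store that tolerates duplicate and out-of-order events. It assumes each event carries a per-device, monotonically increasing sequence number. That field, the `DeviceStateStore` class, and its method names are assumptions made for illustration and aren't defined by IoT Hub.

```python
from typing import Dict, Optional, Tuple


class DeviceStateStore:
    """Keeps the latest reported value per device, applied idempotently."""

    def __init__(self) -> None:
        # device_id -> (sequence number, value) of the newest event applied
        self._latest: Dict[str, Tuple[int, str]] = {}

    def apply(self, device_id: str, sequence: int, value: str) -> bool:
        """Apply an event; return False if it's a duplicate or out of date."""
        current = self._latest.get(device_id)
        if current is not None and sequence <= current[0]:
            # Replayed or late event: applying it again must not change state.
            return False
        self._latest[device_id] = (sequence, value)
        return True

    def get(self, device_id: str) -> Optional[str]:
        """Return the latest value for a device, or None if none was applied."""
        entry = self._latest.get(device_id)
        return None if entry is None else entry[1]


if __name__ == "__main__":
    store = DeviceStateStore()
    store.apply("device-1", 1, "21.5")
    store.apply("device-1", 2, "22.0")
    store.apply("device-1", 1, "21.5")  # duplicate delivered late; ignored
    print(store.get("device-1"))
```

Because replaying any event leaves the store unchanged, the primary and secondary back ends can process the same events in different orders and still converge on the same state.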
191151

192-
## Choose the right HA/DR option
193-
194-
Here's a summary of the HA/DR options presented in this article that can be used as a frame of reference to choose the right option that works for your solution.
195-
196-
| HA/DR option | RTO | RPO | Requires manual intervention? | Implementation complexity | Cost impact|
197-
| --- | --- | --- | --- | --- | --- |
198-
| Microsoft-initiated failover |2 - 26 hours|Refer to the RPO table above|No|None|None|
199-
| Manual failover |10 min - 2 hours|Refer to the RPO table above|Yes|Very low. You only need to trigger this operation from the portal.|None|
200-
| Cross region HA |< 1 min|Depends on the replication frequency of your custom HA solution|No|High|> 1x the cost of 1 IoT hub|
201-
202152
## Next steps
203153

204154
* [What is Azure IoT Hub?](about-iot-hub.md)
