Commit c94bf1e

committed
edit
1 parent 45fb03e commit c94bf1e

File tree

1 file changed

+237
-0
lines changed

Original file line numberDiff line numberDiff line change
@@ -0,0 +1,237 @@
1+
---
2+
title: Reliability in Azure Event Hubs
3+
description: Learn about reliability in Azure Event Hubs
4+
author: anaharris-ms
5+
ms.author: anaharris
6+
ms.topic: reliability-article
7+
ms.custom: subject-reliability
8+
ms.service: event-hub
9+
ms.date: 06/12/2024
10+
---
11+
12+
<!--#Customer intent: I want to understand reliability support in Azure Event Hubs so that I can respond to and/or avoid failures in order to minimize downtime and data loss. -->
13+
14+
15+
# Reliability in Azure Event Hubs
16+
17+
This article describes reliability support in [Azure Event Hubs](../event-hubs/event-hubs-about.md), and covers both intra-regional resiliency with [availability zones](#availability-zone-support) and [cross-region disaster recovery and business continuity](#cross-region-disaster-recovery-and-business-continuity). For a more detailed overview of reliability principles in Azure, see [Azure reliability](/azure/architecture/framework/resiliency/overview).
18+
19+
## Availability zone support
20+
21+
[!INCLUDE [Reliability recommendations](includes/reliability-recommendations-include.md)]
22+
23+
24+
Event Hubs supports [availability zones](../availability-zones/az-overview.md), which provide fault-isolated locations within an Azure region. Availability zone support is available only in [Azure regions with availability zones](../availability-zones/az-region.md#azure-regions-with-availability-zones). Both metadata and data (events) are replicated across data centers in the availability zone.
25+
26+
When you create a namespace and select a region that has availability zones, the portal shows the following highlighted message.
27+
28+
:::image type="content" source="./media/event-hubs-geo-dr/eh-az.png" alt-text="Image showing the Create Namespace page with region that has availability zones":::
29+
30+
> [!NOTE]
31+
> When you use the Azure portal, zone redundancy via support for availability zones is automatically enabled. You can't disable it in the portal. To create a namespace with zone redundancy disabled, use the Azure CLI command [`az eventhubs namespace create`](/cli/azure/eventhubs/namespace#az-eventhubs-namespace-create) with `--zone-redundant=false`, or use the PowerShell command [`New-AzEventHubNamespace`](/powershell/module/az.eventhub/new-azeventhubnamespace) with `-ZoneRedundant:$false`.
32+
33+
## Private endpoints
34+
This section provides additional considerations for using Geo-disaster recovery with namespaces that use private endpoints. To learn about using private endpoints with Event Hubs in general, see [Configure private endpoints](private-link-service.md).
35+
36+
### New pairings
37+
If you try to create a pairing between a primary namespace with a private endpoint and a secondary namespace without a private endpoint, the pairing fails. The pairing succeeds only if both the primary and secondary namespaces have private endpoints. We recommend that you use the same configurations on the primary and secondary namespaces and on the virtual networks in which the private endpoints are created.
38+
39+
> [!NOTE]
40+
> When you try to pair a primary namespace that has a private endpoint with a secondary namespace, the validation process only checks whether a private endpoint exists on the secondary namespace. It doesn't check whether the endpoint works or will work after failover. It's your responsibility to ensure that the secondary namespace with its private endpoint works as expected after failover.
41+
>
42+
> To test that the private endpoint configurations are the same on the primary and secondary namespaces, send a read request (for example, [Get Event Hub](/rest/api/eventhub/get-event-hub)) to the secondary namespace from outside the virtual network, and verify that you receive an error message from the service.
43+
44+
### Existing pairings
45+
If a pairing between the primary and secondary namespaces already exists, private endpoint creation on the primary namespace fails. To resolve this issue, create a private endpoint on the secondary namespace first, and then create one for the primary namespace.
46+
47+
> [!NOTE]
48+
> While we allow read-only access to the secondary namespace, updates to the private endpoint configurations are permitted.
49+
50+
### Recommended configuration
51+
When creating a disaster recovery configuration for your application and Event Hubs namespaces, you must create private endpoints for both primary and secondary Event Hubs namespaces against virtual networks hosting both primary and secondary instances of your application.
52+
53+
Let's say you have two virtual networks, VNET-1 and VNET-2, and these primary and secondary namespaces: EventHubs-Namespace1-Primary and EventHubs-Namespace2-Secondary. Do the following steps:
54+
55+
- On EventHubs-Namespace1-Primary, create two private endpoints that use subnets from VNET-1 and VNET-2
56+
- On EventHubs-Namespace2-Secondary, create two private endpoints that use the same subnets from VNET-1 and VNET-2
57+
58+
![Private endpoints and virtual networks](./media/event-hubs-geo-dr/private-endpoints-virtual-networks.png)
59+
60+
The advantage of this approach is that failover can happen at the application layer independently of the Event Hubs namespace. Consider the following scenarios:
61+
62+
**Application-only failover:** Here, the application won't exist in VNET-1 but will move to VNET-2. As both private endpoints are configured on both VNET-1 and VNET-2 for both primary and secondary namespaces, the application will just work.
63+
64+
**Event Hubs namespace-only failover**: Here again, since both private endpoints are configured on both virtual networks for both primary and secondary namespaces, the application will just work.
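The recommended layout amounts to one private endpoint for every combination of namespace and virtual network. The following is a minimal Python sketch that enumerates that checklist using the example names from this section. The `required_private_endpoints` helper is hypothetical and only builds a list; it doesn't call Azure or create any resources.

```python
from itertools import product


def required_private_endpoints(namespaces, vnets):
    """Return one (namespace, vnet) pair per private endpoint that should exist."""
    return list(product(namespaces, vnets))


# Example names taken from this section.
namespaces = ["EventHubs-Namespace1-Primary", "EventHubs-Namespace2-Secondary"]
vnets = ["VNET-1", "VNET-2"]

endpoints = required_private_endpoints(namespaces, vnets)
for namespace, vnet in endpoints:
    print(f"{namespace}: private endpoint on a subnet in {vnet}")
```

With two namespaces and two virtual networks, the checklist contains four private endpoints, which is why either the application or the namespace can fail over on its own.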
65+
66+
> [!NOTE]
67+
> For guidance on geo-disaster recovery of a virtual network, see [Virtual Network - Business Continuity](../virtual-network/virtual-network-disaster-recovery-guidance.md).
68+
69+
## Role-based access control
70+
Microsoft Entra role-based access control (RBAC) assignments to entities in the primary namespace aren't replicated to the secondary namespace. Create role assignments manually in the secondary namespace to secure access to them.
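Because role assignments aren't copied, it can help to compare the two namespaces before you rely on the secondary. The following is a minimal Python sketch of that comparison. It assumes you have already exported the assignments for each namespace as `(principal, role, entity)` tuples by whatever tooling you use; the `missing_assignments` helper and the sample principals are hypothetical.

```python
def missing_assignments(primary, secondary):
    """Return assignments present on the primary but absent on the secondary."""
    return sorted(set(primary) - set(secondary))


# Hypothetical exported data: (principal, role, entity name).
primary = [
    ("app-producer", "Azure Event Hubs Data Sender", "orders"),
    ("app-consumer", "Azure Event Hubs Data Receiver", "orders"),
]
secondary = [
    ("app-producer", "Azure Event Hubs Data Sender", "orders"),
]

for principal, role, entity in missing_assignments(primary, secondary):
    print(f"Create on secondary: {principal} -> {role} on {entity}")
```

The comparison works by entity name because the entities themselves are replicated to the secondary namespace, while the assignments on them are not.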
71+
72+
73+
74+
## Cross-region disaster recovery and business continuity
75+
76+
[!INCLUDE [introduction to disaster recovery](includes/reliability-disaster-recovery-description-include.md)]
77+
78+
The all-active Azure Event Hubs cluster model with availability zone support provides resiliency against hardware and datacenter outages. However, in the case of a disaster where an entire region and all its zones are unavailable for a period of time, you can use Geo-disaster recovery to recover your workload and application configuration.
79+
80+
Geo-Disaster recovery ensures that the entire configuration of a namespace (Event Hubs, Consumer Groups, and settings) is continuously replicated from a primary namespace to a secondary namespace when paired.
81+
82+
The Geo-disaster recovery feature of Azure Event Hubs is a disaster recovery solution. The concepts and workflow described in this article apply to disaster scenarios, and not to transient or temporary outages. For a detailed discussion of disaster recovery in Microsoft Azure, see [this article](/azure/architecture/resiliency/disaster-recovery-azure-applications).
83+
84+
With Geo-Disaster recovery, you can initiate a once-only failover move from the primary to the secondary at any time. The failover move re-points the chosen alias name for the namespace to the secondary namespace. After the move, the pairing is then removed. The failover is nearly instantaneous once initiated.
85+
86+
> [!IMPORTANT]
87+
> - Geo-Disaster recovery **does not replicate the event data**. To learn how to recover event data from the primary event hub after the downed region is restored, see [replication guidance](../event-hubs/event-hubs-federation-overview.md).
88+
> - Microsoft Entra role-based access control (RBAC) assignments to entities in the primary namespace aren't replicated to the secondary namespace. You'll need to create role assignments manually in the secondary namespace to secure access to them.
89+
90+
91+
## Basic concepts and terms
92+
93+
Geo-Disaster recovery implements metadata disaster recovery, and relies on primary and secondary disaster recovery namespaces.
94+
95+
The Geo-disaster recovery feature is available for the [standard, premium, and dedicated SKUs](https://azure.microsoft.com/pricing/details/event-hubs/) only. You don't need to make any connection string changes, as the connection is made via an alias.
96+
97+
The following terms are used in this article:
98+
99+
- *Alias*: The name for a disaster recovery configuration that you set up. The alias provides a single stable Fully Qualified Domain Name (FQDN) connection string. Applications use this alias connection string to connect to a namespace.
100+
101+
- *Primary/secondary namespace*: The namespaces that correspond to the alias. The primary namespace is "active" and receives messages (can be an existing or new namespace). The secondary namespace is "passive" and doesn't receive messages. The metadata between both is in sync, so both can seamlessly accept messages without any application code or connection string changes. To ensure that only the active namespace receives messages, you must use the alias.
102+
- *Metadata*: Entities such as event hubs and consumer groups, together with the service properties that are associated with the namespace. Only entities and their settings are replicated automatically. Messages and events aren't replicated.
103+
- *Failover*: The process of activating the secondary namespace.
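Since applications must connect through the alias, a quick check that a connection string targets the alias host rather than a namespace host can catch misconfiguration early. The following is a minimal Python sketch of such a check. The helper names, the `contoso-*` names, and the placeholder key are hypothetical, and the sketch assumes the usual `Endpoint=sb://<host>/;...` connection string layout.

```python
from urllib.parse import urlparse


def endpoint_host(connection_string):
    """Extract the host name from the Endpoint field of a connection string."""
    fields = {}
    for part in connection_string.split(";"):
        if part:
            # Split on the first '=' only; key values can end with '=' padding.
            key, _, value = part.partition("=")
            fields[key.strip().lower()] = value.strip()
    return urlparse(fields["endpoint"]).hostname


def uses_alias(connection_string, alias):
    """Return True if the endpoint host's first label is the alias name."""
    return endpoint_host(connection_string).split(".")[0] == alias.lower()


# Hypothetical connection strings with a placeholder key.
alias_cs = (
    "Endpoint=sb://contoso-alias.servicebus.windows.net/;"
    "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=PLACEHOLDER="
)
direct_cs = (
    "Endpoint=sb://contoso-primary.servicebus.windows.net/;"
    "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=PLACEHOLDER="
)

print(uses_alias(alias_cs, "contoso-alias"))
print(uses_alias(direct_cs, "contoso-alias"))
```

A check like this could run at application startup so that a connection string pointing directly at the primary or secondary namespace is flagged before a failover exposes it.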
104+
105+
## Supported namespace pairs
106+
The following combinations of primary and secondary namespaces are supported:
107+
108+
| Primary namespace tier | Allowed secondary namespace tier |
109+
| ----------------- | -------------------- |
110+
| Standard | Standard, Dedicated |
111+
| Premium | Premium |
112+
| Dedicated | Dedicated |
113+
114+
> [!NOTE]
115+
> You can't pair namespaces that are in the same dedicated cluster. You can pair namespaces that are in separate clusters.
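The pairing rules in the table and note can be expressed as a small lookup. The following is a minimal Python sketch, intended only as a pre-check in your own tooling; the `can_pair` helper and the cluster identifiers are hypothetical, and the service remains the authority on whether a pairing is accepted.

```python
# Allowed secondary tiers per primary tier, copied from the table above.
ALLOWED_SECONDARY_TIERS = {
    "standard": {"standard", "dedicated"},
    "premium": {"premium"},
    "dedicated": {"dedicated"},
}


def can_pair(primary_tier, secondary_tier, primary_cluster=None, secondary_cluster=None):
    """Return True if the tier combination is listed as supported."""
    allowed = ALLOWED_SECONDARY_TIERS.get(primary_tier.lower(), set())
    if secondary_tier.lower() not in allowed:
        return False
    # Namespaces in the same dedicated cluster can't be paired.
    if primary_cluster is not None and primary_cluster == secondary_cluster:
        return False
    return True


print(can_pair("Standard", "Dedicated"))
print(can_pair("Premium", "Standard"))
print(can_pair("Dedicated", "Dedicated", "cluster-a", "cluster-a"))
```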
116+
117+
## Setup and failover flow
118+
119+
The following section is an overview of the failover process, and explains how to set up the initial failover.
120+
121+
:::image type="content" source="../event-hubs/media/event-hubs-geo-dr/geo1.png" alt-text="Image showing the overview of failover process ":::
122+
123+
124+
### Setup
125+
126+
You first create or use an existing primary namespace, and a new secondary namespace, then pair the two. This pairing gives you an alias that you can use to connect. Because you use an alias, you don't have to change connection strings. Only new namespaces can be added to your failover pairing.
127+
128+
1. Create the primary namespace.
129+
1. Create the secondary namespace in a different region. This step is optional. You can create the secondary namespace while creating the pairing in the next step.
130+
1. In the Azure portal, navigate to your primary namespace.
131+
1. Select **Geo-recovery** on the left menu, and select **Initiate pairing** on the toolbar.
132+
133+
:::image type="content" source="./media/event-hubs-geo-dr/primary-namspace-initiate-pairing-button.png" alt-text="Initiate pairing from the primary namespace":::
134+
1. On the **Initiate pairing** page, follow these steps:
135+
1. Select an existing secondary namespace or create one in a different region. In this example, an existing namespace is selected.
136+
1. For **Alias**, enter an alias for the geo-dr pairing.
137+
1. Then, select **Create**.
138+
139+
:::image type="content" source="./media/event-hubs-geo-dr/initiate-pairing-page.png" alt-text="Select the secondary namespace":::
140+
1. You should see the **Geo-DR Alias** page. You can also navigate to this page from the primary namespace by selecting **Geo-recovery** on the left menu.
141+
142+
:::image type="content" source="./media/event-hubs-geo-dr/geo-dr-alias-page.png" alt-text="Geo-DR alias page":::
143+
1. On the **Geo-DR Alias** page, select **Shared access policies** on the left menu to access the primary connection string for the alias. Use this connection string instead of using the connection string to the primary/secondary namespace directly.
144+
1. On this **Overview** page, you can do the following actions:
145+
1. Break the pairing between primary and secondary namespaces. Select **Break pairing** on the toolbar.
146+
1. Manually fail over to the secondary namespace. Select **Failover** on the toolbar.
147+
148+
> [!WARNING]
149+
> Failing over will activate the secondary namespace and remove the primary namespace from the Geo-Disaster Recovery pairing. Create another namespace to have a new geo-disaster recovery pair.
150+
151+
Finally, you should add some monitoring to detect if a failover is necessary. In most cases, the service is one part of a large ecosystem, thus automatic failovers are rarely possible, as often failovers must be performed in sync with the remaining subsystem or infrastructure.
152+
153+
### Example
154+
155+
In one example of this scenario, consider a Point of Sale (POS) solution that emits either messages or events. Event Hubs passes those events to some mapping or reformatting solution, which then forwards mapped data to another system for further processing. At that point, all of these systems might be hosted in the same Azure region. The decision of when and what part to fail over depends on the flow of data in your infrastructure.
156+
157+
You can automate failover either with monitoring systems, or with custom-built monitoring solutions. However, such automation takes extra planning and work, which is out of the scope of this article.
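As one illustration of what custom monitoring might track, the following minimal Python sketch records how long connectivity has been lost and flags when an outage has lasted long enough that a failover is worth considering. The `OutageTracker` class is hypothetical, the 15-minute default mirrors the example threshold mentioned under Considerations, and the sketch only raises a signal; it doesn't initiate a failover.

```python
from datetime import datetime, timedelta


class OutageTracker:
    """Tracks a continuous outage and reports when it exceeds a threshold."""

    def __init__(self, threshold=timedelta(minutes=15)):
        self.threshold = threshold
        self.outage_started = None

    def record(self, healthy, now):
        """Record the result of one health probe taken at time `now`."""
        if healthy:
            self.outage_started = None  # any success ends the outage
        elif self.outage_started is None:
            self.outage_started = now  # first failure starts the clock

    def should_consider_failover(self, now):
        """Return True once the current outage has lasted at least the threshold."""
        return (
            self.outage_started is not None
            and now - self.outage_started >= self.threshold
        )


tracker = OutageTracker()
start = datetime(2024, 6, 12, 12, 0)
tracker.record(healthy=False, now=start)
print(tracker.should_consider_failover(start + timedelta(minutes=10)))
print(tracker.should_consider_failover(start + timedelta(minutes=15)))
```

Keeping the decision separate from the action leaves room for an operator, or a wider orchestration process, to coordinate the failover with the rest of the system.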
158+
159+
### Failover flow
160+
161+
If you initiate the failover, two steps are required:
162+
163+
1. If another outage occurs, you want to be able to fail over again. Therefore, set up another passive namespace and update the pairing.
164+
165+
2. Pull messages from the former primary namespace once it's available again. After that, use that namespace for regular messaging outside of your geo-recovery setup, or delete the old primary namespace.
166+
167+
> [!NOTE]
168+
> Only fail-forward semantics are supported. In this scenario, you fail over and then re-pair with a new namespace. Failing back, as you might in a SQL cluster, for example, isn't supported.
169+
170+
:::image type="content" source="./media/event-hubs-geo-dr/geo2.png" alt-text="Image showing the failover flow":::
171+
172+
## Manual failover
173+
This section shows how to manually fail over by using the Azure portal, the Azure CLI, Azure PowerShell, or C#.
174+
175+
# [Azure portal](#tab/portal)
176+
177+
1. In the Azure portal, navigate to your primary namespace.
178+
1. Select **Geo-recovery** on the left menu.
179+
1. Manually fail over to the secondary namespace. Select **Failover** on the toolbar.
180+
181+
> [!WARNING]
182+
> Failing over will activate the secondary namespace and remove the primary namespace from the Geo-Disaster Recovery pairing. Create another namespace to have a new geo-disaster recovery pair.
183+
184+
# [Azure CLI](#tab/cli)
185+
Use the [az eventhubs georecovery-alias fail-over](/cli/azure/eventhubs/georecovery-alias#az-eventhubs-georecovery-alias-fail-over) command.
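If you drive the CLI from a script, the invocation can be assembled as an argument list. The following is a minimal Python sketch under the assumption that the command takes `--resource-group`, `--namespace-name`, and `--alias`, and that it's run against the secondary namespace; verify both against the linked command reference. The resource names are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def failover_command(resource_group, secondary_namespace, alias):
    """Build the argument list for the fail-over command (doesn't run it)."""
    return [
        "az", "eventhubs", "georecovery-alias", "fail-over",
        "--resource-group", resource_group,
        "--namespace-name", secondary_namespace,
        "--alias", alias,
    ]


cmd = failover_command("my-rg", "contoso-secondary", "contoso-alias")
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell quoting problems if a resource name contains unusual characters.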
186+
187+
# [Azure PowerShell](#tab/powershell)
188+
Use the [Set-AzEventHubGeoDRConfigurationFailOver](/powershell/module/az.eventhub/set-azeventhubgeodrconfigurationfailover) cmdlet.
189+
190+
# [C#](#tab/csharp)
191+
Use the [DisasterRecoveryConfigsOperationsExtensions.FailOverAsync](/dotnet/api/microsoft.azure.management.eventhub.disasterrecoveryconfigsoperationsextensions.failoverasync#Microsoft_Azure_Management_EventHub_DisasterRecoveryConfigsOperationsExtensions_FailOverAsync_Microsoft_Azure_Management_EventHub_IDisasterRecoveryConfigsOperations_System_String_System_String_System_String_System_Threading_CancellationToken_) method.
192+
193+
For the sample code that uses this method, see the [GeoDRClient](https://github.com/Azure/azure-event-hubs/blob/3cb13d5d87385b97121144b0615bec5109415c5a/samples/Management/DotNet/GeoDRClient/GeoDRClient/GeoDisasterRecoveryClient.cs#L137) sample in GitHub.
194+
195+
---
196+
197+
## Management
198+
199+
If you made a mistake, for example, you paired the wrong regions during the initial setup, you can break the pairing of the two namespaces at any time. If you want to use the paired namespaces as regular namespaces, delete the alias.
200+
201+
## Considerations
202+
203+
Keep the following considerations in mind:
204+
205+
1. By design, Event Hubs geo-disaster recovery does not replicate data, and therefore you cannot reuse the old offset value of your primary event hub on your secondary event hub. We recommend restarting your event receiver with one of the following methods:
206+
207+
- *EventPosition.FromStart()* - If you wish to read all data on your secondary event hub.
208+
- *EventPosition.FromEnd()* - If you wish to read all new data from the time of connection to your secondary event hub.
209+
- *EventPosition.FromEnqueuedTime(dateTime)* - If you wish to read all data received in your secondary event hub starting from a given date and time.
210+
211+
2. In your failover planning, you should also consider the time factor. For example, if you lose connectivity for longer than 15 to 20 minutes, you might decide to initiate the failover.
212+
213+
3. The fact that no data is replicated means that current active sessions aren't replicated. Additionally, duplicate detection and scheduled messages may not work. New sessions, scheduled messages, and new duplicates will work.
214+
215+
4. Failing over a complex distributed infrastructure should be [rehearsed](/azure/architecture/reliability/disaster-recovery#disaster-recovery-plan) at least once.
216+
217+
5. Synchronizing entities can take some time, approximately 50-100 entities per minute.
218+
219+
6. Some aspects of the management plane for the secondary namespace become read-only while geo-recovery pairing is active.
220+
221+
7. The data plane of the secondary namespace will be read-only while geo-recovery pairing is active. The data plane of the secondary namespace will accept GET requests to enable validation of client connectivity and access controls.
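The three restart options in the first consideration use .NET `EventPosition` names. As a rough Python counterpart, the following minimal sketch maps each option to a value for the `starting_position` argument that, to my understanding, the `azure-eventhub` package's `EventHubConsumerClient.receive` accepts: `"-1"` for the start of the stream, `"@latest"` for new events only, and a `datetime` for an enqueued time. The `starting_position_for` helper is hypothetical, and the sketch doesn't import the SDK or connect to a namespace.

```python
from datetime import datetime, timezone


def starting_position_for(mode, enqueued_time=None):
    """Map a restart option to a starting_position value for the consumer."""
    if mode == "start":
        return "-1"  # read all data available on the secondary event hub
    if mode == "end":
        return "@latest"  # read only events that arrive after connecting
    if mode == "enqueued_time":
        if enqueued_time is None:
            raise ValueError("enqueued_time is required for this mode")
        return enqueued_time  # read events enqueued from this time onward
    raise ValueError(f"unknown mode: {mode}")


print(starting_position_for("start"))
print(starting_position_for("end"))
print(starting_position_for("enqueued_time", datetime(2024, 6, 12, tzinfo=timezone.utc)))
# Assumed usage with the SDK:
#   client.receive(on_event=handler, starting_position=starting_position_for("end"))
```

Whichever option you choose, don't carry over offsets checkpointed against the primary event hub, because they don't correspond to positions in the secondary.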
222+
223+
224+
## Next steps
225+
Review the following samples or reference documentation.
226+
- [.NET GeoDR sample](https://github.com/Azure/azure-event-hubs/tree/master/samples/Management/DotNet/GeoDRClient)
227+
- [Java GeoDR sample](https://github.com/Azure-Samples/eventhub-java-manage-event-hub-geo-disaster-recovery)
228+
- [.NET - Azure.Messaging.EventHubs samples](https://github.com/Azure/azure-sdk-for-net/tree/master/sdk/eventhub/Azure.Messaging.EventHubs/samples)
229+
- [.NET - Microsoft.Azure.EventHubs samples](https://github.com/Azure/azure-event-hubs/tree/master/samples/DotNet)
230+
- [Java - azure-messaging-eventhubs samples](https://github.com/Azure/azure-sdk-for-java/tree/master/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs)
231+
- [Java - azure-eventhubs samples](https://github.com/Azure/azure-event-hubs/tree/master/samples/Java)
232+
- [Python samples](https://github.com/Azure/azure-sdk-for-python/tree/master/sdk/eventhub/azure-eventhub/samples)
233+
- [JavaScript samples](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/eventhub/event-hubs/samples/v5/javascript)
234+
- [TypeScript samples](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/eventhub/event-hubs/samples/v5/typescript)
235+
- [REST API reference](/rest/api/eventhub/)
236+
237+
[2]: ./media/event-hubs-geo-dr/geo2.png
