
Commit d15057b

Merge pull request #277250 from ccompy/patch-30

geo replication updates

2 parents 172f4fd + ce65ccd

14 files changed: +207 −2 lines

articles/event-hubs/TOC.yml

Lines changed: 6 additions & 1 deletion

@@ -152,7 +152,9 @@
       href: https://github.com/Azure-Samples/azure-messaging-replication-dotnet/tree/main/functions/config/ServiceBusCopyToEventHub
   - name: Geo-disaster recovery
     items:
-      - name: Geo-disaster recovery and Geo-replication
+      - name: Geo-replication
+        href: geo-replication.md
+      - name: Metadata only geo-disaster recovery
         href: event-hubs-geo-dr.md
   - name: Security
     items:
@@ -306,6 +308,9 @@
       href: event-hubs-auto-inflate.md
   - name: Event Hubs management libraries
     href: event-hubs-management-libraries.md
+  - name: Use geo-replication
+    href: use-geo-replication.md
+
   - name: Monitor
     items:
       - name: Monitor Azure Event Hubs

articles/event-hubs/event-hubs-geo-dr.md

Lines changed: 4 additions & 1 deletion

@@ -7,9 +7,12 @@ ms.date: 06/01/2023
 
 # Azure Event Hubs - Geo-disaster recovery
 
+> [!NOTE]
+> This article is about the GA Geo-disaster recovery feature, which replicates only metadata, not the public preview Geo-replication feature described in [Geo-replication](./geo-replication.md).
+
 Resilience against disastrous outages of data processing resources is a requirement for many enterprises and in some cases even required by industry regulations.
 
-Azure Event Hubs already spreads the risk of catastrophic failures of individual machines or even complete racks across clusters that span multiple failure domains within a datacenter. It implements transparent failure detection and failover mechanisms such that the service will continue to operate within the assured service-levels and typically without noticeable interruptions in the event of such failures. If you create an Event Hubs namespace with [availability zones](../availability-zones/az-overview.md) enabled, you reduce the risk of outage further and enable high availablity. With availability zones, the outage risk is further spread across three physically separated facilities, and the service has enough capacity reserves to instantly cope with the complete, catastrophic loss of the entire facility.
+Azure Event Hubs already spreads the risk of catastrophic failures of individual machines or even complete racks across clusters that span multiple failure domains within a datacenter. It implements transparent failure detection and failover mechanisms such that the service will continue to operate within the assured service-levels and typically without noticeable interruptions in the event of such failures. If you create an Event Hubs namespace with [availability zones](../availability-zones/az-overview.md) enabled, you reduce the risk of outage further and enable high availability. With availability zones, the outage risk is further spread across three physically separated facilities, and the service has enough capacity reserves to instantly cope with the complete, catastrophic loss of the entire facility.
 
 The all-active Azure Event Hubs cluster model with availability zone support provides resiliency against grave hardware failures and even catastrophic loss of entire datacenter facilities. Still, there might be grave situations with widespread physical destruction that even those measures cannot sufficiently defend against.

articles/event-hubs/geo-replication.md

Lines changed: 127 additions & 0 deletions

@@ -0,0 +1,127 @@
---
title: 'Azure Event Hubs geo-replication'
description: 'This article describes the Azure Event Hubs geo-replication feature'
ms.topic: article
ms.date: 06/10/2024
ms.custom: references_regions
---

# Geo-replication (Public Preview)

Two features provide Geo-disaster recovery in Azure Event Hubs: ***Geo-disaster recovery*** (Metadata DR), which replicates only metadata, and a second feature in public preview, ***Geo-replication***, which replicates both metadata and the data itself. Neither feature should be confused with Availability Zones. Both geographic recovery features provide resilience between Azure regions, such as East US and West US, while Availability Zone support provides resilience within a specific geographic region, such as East US. For more details on Availability Zones, see [Event Hubs Availability Zone support](./event-hubs-availability-and-consistency.md).

**High level feature differences**

The Metadata DR feature replicates configuration information for a namespace from a primary namespace to a secondary namespace. It supports a one-time-only failover to the secondary region. During a customer-initiated failover, the alias name for the namespace is repointed to the secondary namespace and the pairing is then broken. No data is replicated other than configuration information, nor are permission assignments replicated.

The Geo-replication feature replicates configuration information and all of the data from a primary namespace to one or more secondary namespaces. When a failover is performed, the selected secondary becomes the primary and the previous primary becomes a secondary. Users can fail back to the original primary when desired.

The rest of this document focuses on the Geo-replication feature. For details on the Metadata DR feature, read [Event Hubs Geo-disaster recovery for metadata](./event-hubs-geo-dr.md).
## Geo-replication

For public preview, Geo-replication is initially enabled in only a small subset of regions. The public preview of the Geo-replication feature is supported for namespaces in Event Hubs self-serve scaling Dedicated clusters. You can use the feature with new or existing namespaces in dedicated self-serve clusters. The following features aren't supported with Geo-replication:

- Customer Managed Keys (CMK)
- Managed Identity for Capture
- VNet features (service endpoints, or private endpoints)
- Large messages support (now in public preview)
- Kafka Transactions (now in public preview)

Some of the key aspects of the Geo-replication public preview are:

- Primary-secondary replication model – Geo-replication is built on a primary-secondary replication model, where at a given time there's only one primary namespace that serves event producers and event consumers.
- Fully managed byte-to-byte replication of metadata, event data, and consumer offsets across secondaries, with the configured consistency level.
- Stable namespace FQDN – the FQDN doesn't need to change when a promotion is performed.
- Replication consistency – there are two replication consistency settings, synchronous and asynchronous.
- User-managed promotion of a secondary to being the new primary.

Changing a secondary to be the new primary can be done in one of two ways:

- Planned: a promotion of the secondary to primary, where traffic isn't processed until the new primary catches up with all of the data held by the former primary instance.
- Forced: a failover, where the secondary becomes the primary as fast as possible.

The Geo-replication feature replicates all data and metadata from the primary region to the selected secondary regions. The namespace FQDN always points to the primary region.

:::image type="content" source="./media/geo-replication/a-as-primary.png" alt-text="Diagram showing when region A is primary, B is secondary.":::

When a customer initiates a promotion of a secondary, the FQDN points to the region selected to be the new primary, and the old primary becomes a secondary. You can promote your secondary to be the new primary for reasons other than a failover, such as application upgrades or failover testing. In those situations, it's common to switch back when those activities are completed.

:::image type="content" source="./media/geo-replication/b-as-primary.png" alt-text="Diagram showing when B is made the primary, that A becomes the new secondary.":::

Secondary regions are added or removed at the customer's discretion. There are some current limitations worth noting:

- There's no ability to support read-only views on secondary regions.
- There's no automatic promotion/failover capability; all promotions are customer initiated.
- Secondary regions must be different from the primary region. You can't select another dedicated cluster in the same region.
- Only one secondary is supported for public preview.

## Replication consistency

There are two replication consistency configurations, synchronous and asynchronous. It's important to know the differences between the two, as they affect your applications and your data consistency.

**Asynchronous replication**

With asynchronous replication enabled, all messages are committed in the primary and then sent to the secondary. Users can configure an acceptable amount of lag time for the secondary to catch up. When the lag for an active secondary exceeds the configured value, the primary region throttles incoming publish requests.

**Synchronous replication**

When synchronous replication is enabled, published events are replicated to the secondary, which must confirm the message before it's committed in the primary. With synchronous replication, your application publishes at the rate it takes to publish, replicate, acknowledge, and commit. It also means that your application is tied to the availability of both regions. If the secondary region goes down, messages can't be acknowledged or committed.

**Replication consistency comparison**

With synchronous replication:
- Latency is longer due to the distributed commit.
- Availability is tied to the availability of two regions. If one region goes down, your namespace is unavailable.
- Received data always resides in at least two regions (only two regions are supported in the initial public preview).

Synchronous replication provides the greatest assurance that your data is safe: when an event is committed, it's committed in all of the regions configured for Geo-replication. However, your application availability can be reduced, because it depends on the availability of both regions.

Enabling asynchronous replication doesn't impact latency much, and service availability isn't impacted by the loss of a secondary region. Asynchronous replication doesn't have the absolute guarantee that all regions have the data before it's committed, as synchronous replication does. You can also set the amount of time that your secondary can be out of sync before incoming traffic is throttled; the setting can be from 5 minutes to 1,440 minutes, which is one day. If you're looking to use regions with a large distance between them, asynchronous replication is likely the best option for you.
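
To make the tradeoff concrete, here's a small illustrative calculation in Python. The latency numbers are assumptions for a same-continent region pair, not measurements from the service; only the 5-to-1,440-minute lag range comes from this article.

```python
# Illustrative numbers only (assumed, not measured): ballpark latencies
# for a client near the primary and a same-continent secondary region.
intra_region_publish_ms = 5   # client -> primary publish + local commit
cross_region_rtt_ms = 60      # primary <-> secondary round trip (assumed)

# Synchronous: the publish isn't acknowledged until the secondary confirms,
# so every publish pays the cross-region round trip.
sync_publish_ms = intra_region_publish_ms + cross_region_rtt_ms

# Asynchronous: the publish is acknowledged after the local commit;
# replication happens in the background, bounded by the configured max lag.
async_publish_ms = intra_region_publish_ms
max_lag_minutes = 30          # configurable from 5 to 1,440 minutes

print(f"sync publish ~{sync_publish_ms} ms, async publish ~{async_publish_ms} ms "
      f"(async staleness bounded by {max_lag_minutes} min of allowed lag)")
```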

Replication consistency can be changed after Geo-replication is configured. You can go from synchronous to asynchronous, or from asynchronous to synchronous. If you go from synchronous to asynchronous, your latency and application availability improve. If you go from asynchronous to synchronous, your secondary becomes configured as synchronous after the lag reaches zero. If you're running with a continual lag for whatever reason, you may need to pause your publishers for the lag to reach zero so that the mode can switch to synchronous.

The general reasons to enable synchronous replication are tied to the importance of the data, specific business needs, or compliance reasons. If your primary goal is application availability rather than data assurance, asynchronous consistency is likely the better choice.

## Secondary region selection

To enable the Geo-replication feature, you need to use a primary and secondary region where the feature is enabled. You also need an existing Event Hubs cluster in both the primary and secondary regions.

The Geo-replication feature depends on being able to replicate published events from the primary to the secondary region. If the secondary region is on another continent, that has a major impact on replication lag. If you're using Geo-replication for availability and reliability reasons, you're best off with secondary regions on the same continent where possible. To get a better understanding of the latency induced by geographic distance, see [Azure network round-trip latency statistics | Microsoft Learn](../networking/azure-network-latency.md).

## Geo-replication management

The Geo-replication feature enables customers to configure a secondary region to which configuration and data are replicated. Customers can:

- Configure Geo-replication - Secondary regions can be configured on any existing namespace in a self-serve dedicated cluster in a region with the Geo-replication feature set enabled, or during namespace creation on the same dedicated clusters. To select a secondary region, you must have a dedicated cluster available in that region, and the Geo-replication feature set must also be enabled there.
- Configure the replication consistency - Synchronous or asynchronous replication is set when Geo-replication is configured, but can also be switched afterwards. With asynchronous consistency, you can configure the amount of time that a secondary region is allowed to lag.
- Trigger promotion/failover - All promotions or failovers are customer initiated. During a promotion, you can choose to make it forced from the start, or change your mind after a promotion has started and make it forced.
- Remove a secondary - If at any time you want to remove the geo-pairing between primary and secondary regions, you can do so; the data in the secondary region is then deleted.

## Monitoring data replication

Users can monitor the progress of the replication job by monitoring the replication lag metric in Application Metrics logs:

- Enable Application Metrics logs in your Event Hubs namespace by following [Monitoring Azure Event Hubs - Azure Event Hubs | Microsoft Learn](./monitor-event-hubs.md).
- Once Application Metrics logs are enabled, you need to produce and consume data from the namespace for a few minutes before the logs start to appear.
- To view Application Metrics logs, navigate to the Monitoring section of Event Hubs and select the 'Logs' blade. You can use the following query to find the replication lag (in seconds) between the primary and secondary namespaces:

```kusto
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "ApplicationMetricsLogs"
| where ActivityName_s == "ReplicationLag"
```

- The column count_d indicates the replication lag in seconds between the primary and secondary region. The sketch after this list shows one way to run the same query programmatically.
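
If you want the same lag figure programmatically (for example, to feed a dashboard or an alert script), a minimal sketch with the `azure-monitor-query` and `azure-identity` Python packages might look like the following. The workspace ID is a placeholder, and the 5-minute binning and `max_lag_seconds` alias are illustrative assumptions, not part of this article.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

# Placeholder: the Log Analytics workspace that receives the diagnostic logs.
WORKSPACE_ID = "<log-analytics-workspace-id>"

# Same query as above, aggregated into 5-minute buckets (illustrative).
QUERY = """
AzureDiagnostics
| where Category == "ApplicationMetricsLogs"
| where ActivityName_s == "ReplicationLag"
| summarize max_lag_seconds = max(count_d) by bin(TimeGenerated, 5m)
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(f"{row['TimeGenerated']}: max lag {row['max_lag_seconds']}s")
```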

## Publishing Data

Event publishing applications can publish data to geo-replicated namespaces via the stable namespace FQDN of the geo-replicated namespace. The event publishing approach is the same as in the non-Geo DR case, and no changes to client applications are required.

Event publishing may not be available during the following circumstances:

- During the failover grace period, the existing primary region rejects any new events that are published to Event Hubs.
- When replication lag between the primary and secondary regions reaches the maximum replication lag duration, the publisher ingress workload may get throttled.

Publisher applications can't directly access any namespaces in the secondary regions.
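
Because the FQDN is stable, a publisher doesn't change between normal operation and post-promotion operation. As a hedged sketch (not this article's own sample), publishing with the `azure-eventhub` Python SDK against the stable FQDN could look like this; the namespace, hub name, and retry settings are placeholders, with retries made a little more generous to ride out throttling if the replication lag limit is reached.

```python
from azure.eventhub import EventData, EventHubProducerClient
from azure.identity import DefaultAzureCredential

# Placeholders: the stable namespace FQDN keeps pointing at whichever region is primary.
FULLY_QUALIFIED_NAMESPACE = "contoso-ns.servicebus.windows.net"
EVENT_HUB_NAME = "orders"

producer = EventHubProducerClient(
    fully_qualified_namespace=FULLY_QUALIFIED_NAMESPACE,
    eventhub_name=EVENT_HUB_NAME,
    credential=DefaultAzureCredential(),
    retry_total=5,             # illustrative: extra retries to ride out throttling
    retry_backoff_factor=0.8,  # illustrative backoff between attempts
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData("sample event"))
    producer.send_batch(batch)  # the service commits per the configured consistency
```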

## Consuming Data

Event consuming applications can consume data using the stable namespace FQDN of a geo-replicated namespace. Consumer operations aren't supported from when the failover is initiated until it's completed.

### Checkpointing/Offset Management

Event consuming applications can continue to maintain offset management as they would with a single namespace.

**Kafka**

Offsets are committed to Event Hubs directly and are replicated across regions. Therefore, consumers can start consuming from where they left off in the primary region.
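
As a hedged illustration of that behavior, a Kafka consumer pointed at the Event Hubs Kafka endpoint (port 9093, SASL PLAIN with the `$ConnectionString` user) commits its offsets to Event Hubs, where they're replicated. The sketch below uses the `kafka-python` package; the namespace, topic, and group names are placeholders.

```python
from kafka import KafkaConsumer

# Placeholders: stable namespace FQDN and the event hub used as the Kafka topic.
BOOTSTRAP_SERVERS = "contoso-ns.servicebus.windows.net:9093"
CONNECTION_STRING = "<event-hubs-namespace-connection-string>"

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BOOTSTRAP_SERVERS,
    group_id="replicated-consumer",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="$ConnectionString",
    sasl_plain_password=CONNECTION_STRING,
    enable_auto_commit=True,  # offsets committed to Event Hubs are geo-replicated
)

# After a promotion, the same consumer resumes from its last committed offset.
for message in consumer:
    print(message.offset, message.value)
```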

**Event Hubs SDK/AMQP**

Clients that use the Event Hubs SDK need to upgrade to the April 2024 version of the SDK. The latest version of the Event Hubs SDK supports failover with an update to the checkpoint. The checkpoint is managed by users with a checkpoint store such as Azure Blob storage or a custom storage solution. If there's a failover, the checkpoint store must be available from the secondary region so that clients can retrieve checkpoint data and avoid loss of messages.
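
As a hedged sketch of that pattern with the Python SDK (`azure-eventhub` plus the `azure-eventhub-checkpointstoreblob` extension), the connection strings, container, and hub names below are placeholders; the key point is that the blob checkpoint store must stay reachable from whichever region is primary after a failover.

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholders: use a storage account that remains reachable after failover
# (for example, geo-redundant storage), so checkpoints survive a promotion.
STORAGE_CONNECTION_STRING = "<blob-storage-connection-string>"
EVENTHUB_CONNECTION_STRING = "<event-hubs-namespace-connection-string>"

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONNECTION_STRING, container_name="checkpoints"
)

client = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONNECTION_STRING,
    consumer_group="$Default",
    eventhub_name="orders",
    checkpoint_store=checkpoint_store,
)

def on_event(partition_context, event):
    print(f"partition {partition_context.partition_id}: {event.body_as_str()}")
    partition_context.update_checkpoint(event)  # persist progress to the blob store

with client:
    client.receive(on_event=on_event, starting_position="-1")  # "-1" = earliest
```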

## Pricing

Event Hubs dedicated clusters are priced independently of geo-replication. Use of geo-replication with Event Hubs Dedicated requires you to have at least two dedicated clusters in separate regions. The dedicated clusters used as secondary instances for geo-replication can also be used for other workloads.

There's a charge for geo-replication based on the published bandwidth multiplied by the number of secondary regions. The geo-replication charge is waived in early public preview.
[7 binary image files changed (23.8 KB, 13.3 KB, 11.3 KB, 55.5 KB, 132 KB, 134 KB, 134 KB); content not shown]
