Commit a004cc6

Merge pull request #285700 from guywi-ms/resilience-overview
resilience updates
2 parents 929fdac + 22baa62 commit a004cc6

3 files changed

+62
-21
lines changed

articles/azure-monitor/best-practices-logs.md

Lines changed: 0 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -16,10 +16,6 @@ This article provides architectural best practices for Azure Monitor Logs. The g
1616
## Reliability
1717
[Reliability](/azure/well-architected/resiliency/overview) refers to the ability of a system to recover from failures and continue to function. The goal is to minimize the effects of a single failing component. Use the following information to minimize failure of your Log Analytics workspaces and to protect the data they collect.
1818

19-
This video provides an overview of reliability and resilience options available for Log Analytics workspaces:
20-
21-
> [!VIDEO https://www.youtube.com/embed/CYspm1Yevx8?cc_load_policy=1&cc_lang_pref=auto]
22-
2319
[!INCLUDE [waf-logs-reliability](includes/waf-logs-reliability.md)]
2420

2521

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
---
2+
author: guywi-ms
3+
ms.author: guywild
4+
ms.service: azure-monitor
5+
ms.topic: include
6+
ms.date: 08/28/2024
7+
---
8+
9+
Each Azure region that supports availability zones has a set of datacenters equipped with independent power, cooling, and networking infrastructure.
10+
11+
Azure Monitor Logs availability zones are [redundant](../../reliability/availability-zones-overview.md#zonal-and-zone-redundant-services), which means that Microsoft spreads service requests and replicates data across different zones in supported regions. If an incident affects one zone, Microsoft uses a different availability zone in the region instead, automatically. You don't need to take any action because switching between zones is seamless.
12+
13+
In most regions, Azure Monitor Logs availability zones support **data resilience**, which means your stored data is protected against data loss related to zonal failures, but service operations might still be impacted by regional incidents. If the service is unable to run queries, you can't view the logs until the issue is resolved.
14+
15+
A subset of the availability zones that support data resilience also support **service resilience**, which means that Azure Monitor Logs service operations - for example, log ingestion, queries, and alerts - can continue in the event of a zone failure.
16+
17+
Availability zones protect against infrastructure-related incidents, such as storage failures. They don’t protect against application-level issues, such as faulty code deployments or certificate failures, which impact the entire region.

articles/azure-monitor/includes/waf-logs-reliability.md

Lines changed: 45 additions & 17 deletions
@@ -6,38 +6,66 @@ ms.topic: include
66
ms.date: 08/24/2023
77
---
88

9-
Log Analytics workspaces offer a high degree of reliability. Conditions where a temporary loss of access to the workspace can result in data loss are often mitigated by features such as data buffering with the Azure Monitor Agent and protection mechanisms built into the ingestion pipeline.
9+
Log Analytics workspaces offer a high degree of reliability. The ingestion pipeline, which sends collected data to the Log Analytics workspace, validates that the Log Analytics workspace successfully processes each log record before it removes the record from the pipe. If the ingestion pipeline isn’t available, the agents that send the data buffer and retry sending the logs for many hours.
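
The buffer-and-retry behavior described above can be illustrated with a minimal sketch. This is a toy model, not the Azure Monitor Agent or the ingestion pipeline itself: `BufferingAgent`, `make_flaky_pipeline`, and the boolean acknowledgment are hypothetical names and simplifications chosen for illustration. The point it shows is that a record leaves the buffer only after the receiving side confirms it.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferingAgent:
    """Toy agent that keeps records buffered until the pipeline acknowledges them."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        # Hypothetical transport: returns True only when the record is acknowledged.
        self._send = send
        self._buffer: Deque[str] = deque()

    def collect(self, record: str) -> None:
        self._buffer.append(record)

    def flush(self) -> int:
        """Deliver buffered records in order; stop at the first failed send."""
        delivered = 0
        while self._buffer:
            if not self._send(self._buffer[0]):
                break  # pipeline unavailable: keep the record and retry later
            self._buffer.popleft()  # remove only after a confirmed acknowledgment
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        return len(self._buffer)


def make_flaky_pipeline(outage_calls: int, sink: List[str]) -> Callable[[str], bool]:
    """Simulated pipeline that rejects the first `outage_calls` sends, then accepts."""
    state = {"remaining": outage_calls}

    def send(record: str) -> bool:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return False
        sink.append(record)
        return True

    return send


workspace: List[str] = []
agent = BufferingAgent(make_flaky_pipeline(outage_calls=2, sink=workspace))
for record in ("log-1", "log-2", "log-3"):
    agent.collect(record)

first = agent.flush()   # outage: nothing delivered, nothing dropped
second = agent.flush()  # still unavailable
third = agent.flush()   # pipeline recovered: all buffered records arrive in order
```

In this sketch no record is lost during the simulated outage, because removal from the buffer is tied to acknowledgment rather than to the send attempt.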
1010

11-
The resiliency features described in this section can provide additional protection from data loss and business continuity. Some are in-region solutions, and others provide cross-regional redundancy; some are applied automatically and others require manual triggering. The table below summarizes and compares these features.
1211

13-
Some availability features require a dedicated cluster, which currently requires a commitment of at least 100 GB per day from all workspaces linked to this cluster (aggregated).
12+
### Azure Monitor Logs features that enhance resilience
13+
14+
Azure Monitor Logs offers several features that enhance workspace resilience to various types of issues. You can use these features individually or in combination, depending on your needs.
15+
16+
This video provides an overview of reliability and resilience options available for Log Analytics workspaces:
17+
18+
> [!VIDEO https://www.youtube.com/embed/CYspm1Yevx8?cc_load_policy=1&cc_lang_pref=auto]
19+
20+
#### In-region protection using availability zones
21+
22+
[!INCLUDE [logs-availability-zones](../includes/logs-availability-zones.md)]
23+
24+
#### Backup of data from specific tables using continuous export
25+
26+
You can [continuously export data sent to specific tables in your Log Analytics workspace](../logs/logs-data-export.md) to Azure storage accounts.
27+
28+
The storage account you export data to must be in the same region as your Log Analytics workspace. To protect and have access to your ingested logs, even if the workspace region is down, use a geo-redundant storage account, as explained in [Configuration recommendations](#configuration-recommendations).
29+
30+
The export mechanism doesn’t provide protection from incidents impacting the ingestion pipeline or the export process itself.
31+
32+
> [!NOTE]
33+
> You can access data in a storage account from Azure Monitor Logs using the [externaldata operator](/kusto/query/externaldata-operator?view=azure-monitor). However, the exported data is stored in five-minute blobs and analyzing data spanning multiple blobs can be cumbersome. Therefore, exporting data to a storage account is a good data backup mechanism, but having the backed up data in a storage account is not ideal if you need it for analysis in Azure Monitor Logs. You can query large volumes of blob data using [Azure Data Explorer](/azure/data-explorer/query-exported-azure-monitor-data), [Azure Data Factory](/azure/data-factory/introduction#connect-and-collect), or any other storage access tool.
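
To give a sense of why analysis across blobs can be cumbersome, the following sketch counts how many five-minute windows a query time range touches. It assumes, for illustration only, that windows align to five-minute clock boundaries and that each window maps to one blob per exported table; `five_minute_windows` is a hypothetical helper, not part of any Azure SDK.

```python
from datetime import datetime, timedelta
from typing import List

# From the note above: exported data is stored in five-minute blobs.
BLOB_INTERVAL = timedelta(minutes=5)


def five_minute_windows(start: datetime, end: datetime) -> List[datetime]:
    """Return the start of every five-minute window that overlaps [start, end)."""
    # Assumption: windows are aligned to five-minute clock boundaries.
    aligned = start.replace(
        minute=start.minute - start.minute % 5, second=0, microsecond=0
    )
    windows = []
    current = aligned
    while current < end:
        windows.append(current)
        current += BLOB_INTERVAL
    return windows


one_day = five_minute_windows(datetime(2024, 8, 28), datetime(2024, 8, 29))
print(len(one_day))  # 288 windows for a single day of a single table
```

Under these assumptions, one day of data from one table spans 288 windows, which is why tools built for bulk blob access tend to be a better fit than ad hoc queries over individual blobs.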
34+
35+
#### Cross-regional data protection and service resilience using workspace replication (preview)
36+
37+
Workspace replication (preview) is the most extensive resilience solution as it replicates the Log Analytics workspace and incoming logs to another region.
38+
39+
Workspace replication protects both your logs and the service operations, and allows you to continue monitoring your systems in the event of infrastructure or application-related region-wide incidents.
40+
41+
In contrast with availability zones, which Microsoft manages end-to-end, you need to monitor your primary workspace's health and decide when to switch over to the workspace in the secondary region and back.
42+
1443

1544
### Design checklist
1645

1746
> [!div class="checklist"]
18-
> - If you collect enough data for a dedicated cluster, create a dedicated cluster in an availability zone.
19-
> - If you require the workspace to be available in the case of a region failure, or you don't collect enough data for a dedicated cluster, configure data collection to send critical data to multiple workspaces in different regions.
20-
> - If you require data to be protected in the case of datacenter or region failure, configure data export from the workspace to save data in an alternate location.
21-
> - For mission-critical workloads requiring high availability, consider implementing a federated workspace model.
47+
> - To ensure service and data resilience to region-wide incidents, enable workspace replication.
48+
> - To ensure in-region protection against datacenter failure, create your workspace in a region that supports availability zones.
49+
> - For cross-regional backup of data in specific tables, use the continuous export feature to send data to a geo-replicated storage account.
2250
> - Monitor the health of your Log Analytics workspaces.
2351
2452
### Configuration recommendations
2553

2654
| Recommendation | Benefit |
2755
|:---|:---|
28-
| If you collect enough data, create a dedicated cluster in a region that supports availability zones. | Workspaces linked to a [dedicated cluster](../logs/logs-dedicated-clusters.md) located in a region that supports [availability zones](../logs/availability-zones.md#supported-regions) remain available if a datacenter fails.<br><br> A dedicated cluster requires a commitment of at least 100 GB per day from all workspaces in the same region. If you don't collect this much data, then you need to weight the cost of this commitment with reliability features that it provides. |
29-
| If you require data in your workspace to be available in the event of a region failure, send critical data to multiple workspaces in different regions. | Send data to multiple workspaces in different regions. For example, configure DCRs to send data to multiple workspaces from Azure Monitor Agent running on virtual machines, and configure multiple diagnostic settings to collect resource logs from Azure resources to multiple workspaces. <br><br>Even though the data will be available in the alternate workspace in case of failure, resources that rely on the data, such as alerts and workbooks, won't know to use the alternate workspace. Consider storing ARM templates for critical resources with configuration for the alternate workspace in Azure DevOps or as disabled [policies](../../governance/policy/overview.md) that can quickly be enabled in a failover scenario.<br><br>Tradeoff: This configuration results in duplicate ingestion and retention charges so only use it for critical data. |
30-
| For mission-critical workloads requiring high availability, consider implementing a federated workspace model that uses multiple workspaces to provide high availability in the case of regional failure. | [Mission-critical](/azure/well-architected/mission-critical/mission-critical-overview) provides prescriptive best practice guidance for architecting highly reliable applications on Azure. The design methodology includes a federated workspace model with multiple Log Analytics workspaces to deliver [high availability](/azure/well-architected/mission-critical/mission-critical-design-methodology#select-a-reliability-tier) in the case of multiple failures, including the failure of an Azure region.<br><br> This strategy eliminates egress costs across regions and remains operational with a region failure, but it requires additional complexity that you must manage with configuration and processes described in [Health modeling and observability of mission-critical workloads on Azure](/azure/well-architected/mission-critical/mission-critical-health-modeling).|
31-
| If you require data to be protected in the case of datacenter or region failure, configure data export from the workspace to save data in an alternate location. | The [data export feature of Azure Monitor](../logs/logs-data-export.md) allows you to continuously export data sent to specific tables to Azure storage where it can be retained for extended periods. Use [Azure Storage redundancy options](../../storage/common/storage-redundancy.md#redundancy-in-a-secondary-region), including GRS and GZRS, to replicate this data to other regions. If you require export of [tables that aren't supported by data export](../logs/logs-data-export.md?tabs=portal#limitations), you can use other methods of exporting data, including Logic apps, to protect your data. This is primarily a solution to meet compliance for data retention since the data can be difficult to analyze and restore to the workspace.<br><br>This option is similar to the previous option of multicasting the data to different workspaces, but has a lower cost because the extra data is written to storage.<br><br> Data export is susceptible to regional incidents because it relies on the stability of the Azure Monitor ingestion pipeline in your region. It doesn't provide resiliency against incidents impacting the regional ingestion pipeline.|
56+
| To ensure the greatest degree of resilience, enable workspace replication. |**Cross-regional resilience for workspace data and service operations.** <br><br>[Workspace replication (preview)](../logs/workspace-replication.md) ensures high availability by creating a secondary instance of your workspace in another region and ingesting your logs to both workspaces.<br><br>When needed, switch to your secondary workspace until the issues impacting your primary workspace are resolved. You can continue ingesting logs, querying data, using dashboards, alerts, and Sentinel in your secondary workspace. You also have access to logs ingested before the region switch.<br><br>This is a paid feature, so consider whether you want to replicate all of your incoming logs, or only some data streams. |
57+
| If possible, create your workspace in a region that supports Azure Monitor service-resilience. | **In-region resilience of workspace data and service operations in the event of datacenter issues.** <br><br>Availability zones that support service resilience also support data resilience. This means that even if an entire datacenter becomes unavailable, the redundancy between zones allows Azure Monitor service operations, like ingestion and querying, to continue to work, and your ingested logs to remain available.<br><br>Availability zones provide in-region protection, but don't protect against issues that impact the entire region.<br><br>For information about which regions support data resilience, see [Enhance data and service resilience in Azure Monitor Logs with availability zones](../logs/availability-zones.md). |
58+
| Create your workspace in a region that supports data resilience. | **In-region protection against loss of the logs in your workspace in the event of datacenter issues.** <br><br>Creating your workspace in a region that supports data resilience means that even if the entire datacenter becomes unavailable, your ingested logs are safe. <br>If the service is unable to run queries, you can't view the logs until the issue is resolved.<br><br>For information about which regions support data resilience, see [Enhance data and service resilience in Azure Monitor Logs with availability zones](../logs/availability-zones.md). |
59+
| Configure data export from specific tables to a storage account that's replicated across regions. | **Maintain a backup copy of your log data in a different region.**<br><br>The [data export feature of Azure Monitor](../logs/logs-data-export.md) allows you to continuously export data sent to specific tables to Azure storage where it can be retained for extended periods. Use a geo-redundant storage (GRS) or geo-zone-redundant storage (GZRS) account to keep your data safe even if an entire region becomes unavailable. To make your data readable from the other regions, configure your storage account for read access to the secondary region. For more information, see [Azure Storage redundancy on a secondary region](/azure/storage/common/storage-redundancy#redundancy-in-a-secondary-region) and [Azure Storage read access to data in the secondary region](/azure/storage/common/storage-redundancy#read-access-to-data-in-the-secondary-region).<br><br>For [tables that don't support continuous data export](../logs/logs-data-export.md?tabs=portal#limitations), you can use other methods of exporting data, including Logic Apps, to protect your data. This is primarily a solution to meet compliance for data retention since the data can be difficult to analyze and restore to the workspace.<br><br> Data export is susceptible to regional incidents because it relies on the stability of the Azure Monitor ingestion pipeline in your region. It doesn't provide resiliency against incidents impacting the regional ingestion pipeline.|
3260
| Monitor the health of your Log Analytics workspaces. | Use [Log Analytics workspace insights](../logs/workspace-design.md) to track failed queries and create [health status alert](../logs/log-analytics-workspace-health.md#view-log-analytics-workspace-health-and-set-up-health-status-alerts) to proactively notify you if a workspace becomes unavailable because of a datacenter or regional failure. |
3361

34-
### Compare resilience features and capabilities
62+
#### Compare Azure Monitor Logs resilience features
3563

3664
| Feature | Service resilience | Data backup | High availability | Scope of protection | Setup | Cost |
37-
|------------------------|------------------------------|------------------|-----|---------------------------------|--------------------------|------------------------------------------------------------------------------|--------------------------------------------------|---------|
38-
| Availability zones | :white_check_mark: <br>In supported regions | :white_check_mark: | :white_check_mark: | In-region | Automatically enabled on dedicated clusters in supported regions. | No cost |
39-
| Continuous data export | | :white_check_mark: | | Protection from regional failure <sup>1</sup> | Enable per table. | Cost of data export + Storage blob or Event Hubs |
40-
| Dual ingestion | :white_check_mark: | :white_check_mark: | :white_check_mark: | Protection from regional failure | Enable per monitored resource. | Up to twice the cost of retention (depending on how much data you dual ingest) + egress charges. |
65+
|------------------------|--------------------|-------------|-------------------|-------------------|--------------------------|------------------------------------------------------------------------------|
66+
| Workspace replication | :white_check_mark: | :white_check_mark: | :white_check_mark: | Cross-region protection against region-wide incidents | Enable replication of the workspace and related data collection rules. Switch between regions as needed. | Based on the number of replicated GBs and region. |
67+
| Availability zones | :white_check_mark: <br>In supported regions | :white_check_mark: | :white_check_mark: | In-region protection against datacenter issues | Automatically enabled in supported regions. | No cost |
68+
| Continuous data export | | :white_check_mark: | | Protection from data loss because of a regional failure <sup>1</sup> | Enable per table. | Cost of data export + Storage blob or Event Hubs |
4169

4270

43-
<sup>1</sup> Data export provides cross-region protection if you export logs to a different region. In the event of an incident, previously exported data is backed up and readily available; however, further export might fail, depending on the nature of the incident.
71+
<sup>1</sup> Data export provides cross-region protection if you export logs to a geo-replicated storage account. In the event of an incident, previously exported data is backed up and readily available; however, further export might fail, depending on the nature of the incident.
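
As a rough illustration of how the comparison table can guide a choice, the sketch below encodes the three rows as data and filters them by requirement. The `ResilienceFeature` type, the `cross_region` flag, and the `candidates` helper are hypothetical simplifications: the flag collapses the "Scope of protection" column into a boolean, and the caveats from the table (supported regions, geo-replicated storage) are noted only in comments.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResilienceFeature:
    name: str
    service_resilience: bool
    data_backup: bool
    high_availability: bool
    cross_region: bool  # simplification of the "Scope of protection" column


FEATURES = [
    ResilienceFeature("Workspace replication", True, True, True, cross_region=True),
    # Service resilience applies only in supported regions.
    ResilienceFeature("Availability zones", True, True, True, cross_region=False),
    # Cross-region protection assumes export to a geo-replicated storage account.
    ResilienceFeature("Continuous data export", False, True, False, cross_region=True),
]


def candidates(need_service_resilience: bool, need_cross_region: bool) -> List[str]:
    """Return the features from the table that satisfy the stated requirements."""
    return [
        feature.name
        for feature in FEATURES
        if (feature.service_resilience or not need_service_resilience)
        and (feature.cross_region or not need_cross_region)
    ]


print(candidates(need_service_resilience=True, need_cross_region=True))
print(candidates(need_service_resilience=False, need_cross_region=True))
```

With both requirements set, only workspace replication remains. Relaxing the service-resilience requirement adds continuous data export as a backup-only option.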
