Commit bd0643c

Merge pull request #273071 from anaharris-ms/272901
Merge into Create reliability-hdinsight-on-aks.md #272901
2 parents 42b0f7f + 421fd82 commit bd0643c

5 files changed

+86
-8
lines changed

articles/hdinsight-aks/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,8 @@ items:
2020
href: get-started.md
2121
- name: Concepts
2222
items:
23+
- name: Reliability
24+
href: ../reliability/reliability-hdinsight-on-aks.md?toc=/azure/hdinsight-aks/toc.json
2325
- name: Versioning
2426
href: versions.md
2527
- name: Enterprise security

articles/reliability/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -320,6 +320,8 @@
320320
href: /azure/devops/organizations/security/data-protection?view=azure-devops.md&preserve-view=true#data-availability&toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
321321
- name: Azure Elastic SAN
322322
href: reliability-elastic-san.md
323+
- name: Azure HDInsight on AKS
324+
href: reliability-hdinsight-on-aks.md
323325
- name: Azure Health Data Services - Azure API for FHIR
324326
href: ../healthcare-apis/azure-api-for-fhir/disaster-recovery.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
325327
- name: Azure Health Insights

articles/reliability/overview-reliability-guidance.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -109,6 +109,7 @@ For a more detailed overview of reliability principles in Azure, see [Reliabilit
109109
|Azure Deployment Environments| [Reliability in Azure Deployment Environments](reliability-deployment-environments.md)|[Reliability in Azure Deployment Environments](reliability-deployment-environments.md)|
110110
|Azure DevOps|| [Azure DevOps Data protection - data availability](/azure/devops/organizations/security/data-protection?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json&preserve-view=true&#data-availability)|
111111
|Azure Elastic SAN|[Availability zone support](reliability-elastic-san.md#availability-zone-support)|[Disaster recovery and business continuity](reliability-elastic-san.md#disaster-recovery-and-business-continuity)|
112+
|Azure HDInsight on AKS |[Reliability in HDInsight on AKS](reliability-hdinsight-on-aks.md) | [Reliability in HDInsight on AKS](reliability-hdinsight-on-aks.md) |
112113
|Azure Health Data Services - Azure API for FHIR|| [Disaster recovery for Azure API for FHIR](../healthcare-apis/azure-api-for-fhir/disaster-recovery.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json) |
113114
|Azure Health Insights|[Reliability in Azure Health Insights](reliability-health-insights.md)|[Reliability in Azure Health Insights](reliability-health-insights.md)|
114115
|Azure IoT Hub| [IoT Hub high availability and disaster recovery](../iot-hub/iot-hub-ha-dr.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)| [IoT Hub high availability and disaster recovery](../iot-hub/iot-hub-ha-dr.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json) |

articles/reliability/reliability-hdinsight-on-aks.md

Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,72 @@
1+
---
2+
title: Reliability in Azure HDInsight on Azure Kubernetes Service
3+
description: Find out about reliability in Azure HDInsight on Azure Kubernetes Service.
4+
author: fengzhou-msft
5+
ms.author: fenzhou
6+
ms.topic: reliability-article
7+
ms.custom: subject-reliability, references_regions
8+
ms.service: hdinsight-aks
9+
ms.date: 04/15/2024
10+
CustomerIntent: As a cloud architect/engineer, I want to understand reliability support for Azure HDInsight on Azure Kubernetes Service so that I can respond to and/or avoid failures in order to minimize downtime and data loss.
11+
---
12+
13+
# Reliability in Azure HDInsight on Azure Kubernetes Service
14+
15+
This article describes reliability support in [Azure HDInsight on Azure Kubernetes Service (AKS)](../hdinsight-aks/overview.md), and covers both [specific reliability recommendations](#reliability-recommendations) and [disaster recovery and business continuity](#disaster-recovery-and-business-continuity). For a more detailed overview of reliability principles in Azure, see [Azure reliability](/azure/architecture/framework/resiliency/overview).
16+
17+
## Reliability recommendations
18+
19+
[!INCLUDE [Reliability recommendations](includes/reliability-recommendations-include.md)]
20+
21+
### Reliability recommendations summary
22+
23+
| Category | Priority |Recommendation |
24+
|---------------|--------|---|
25+
| Availability |:::image type="icon" source="media/icon-recommendation-medium.svg":::| [Default and minimum virtual machine size recommendations](../hdinsight-aks/virtual-machine-recommendation-capacity-planning.md#clusters) |
26+
| |:::image type="icon" source="media/icon-recommendation-low.svg":::| [Auto Scale HDInsight on AKS Clusters](../hdinsight-aks/hdinsight-on-aks-autoscale-clusters.md) |
27+
| Monitoring |:::image type="icon" source="media/icon-recommendation-low.svg"::: |[How to integrate with Log Analytics](../hdinsight-aks/how-to-azure-monitor-integration.md) |
28+
| |:::image type="icon" source="media/icon-recommendation-low.svg"::: |[Monitoring with Azure Managed Prometheus and Grafana](../hdinsight-aks/monitor-with-prometheus-grafana.md) |
29+
| Security |:::image type="icon" source="media/icon-recommendation-low.svg":::| [Use NSG to restrict traffic to HDInsight on AKS](../hdinsight-aks/secure-traffic-by-nsg.md) |
30+
31+
## Availability zone support
32+
33+
[!INCLUDE [next step](includes/reliability-availability-zone-description-include.md)]
34+
35+
Currently, Azure HDInsight on AKS doesn't support availability zones in its service offerings.
36+
37+
## Disaster recovery and business continuity
38+
39+
[!INCLUDE [introduction to disaster recovery](includes/reliability-disaster-recovery-description-include.md)]
40+
41+
Currently, the Azure HDInsight on AKS control plane (CP) service and its databases are deployed across Azure regions. The service instances and database instances in each region are isolated from those in other regions. When a region-level outage occurs, all resources in that region are unavailable, including the resource provider (RP) of the Azure HDInsight on AKS control plane, the control plane database, and all customer clusters in the region. In this case, the only option is to wait for the regional outage to end. After the region recovers, the Azure HDInsight on AKS service and all customer clusters come back online. Data inconsistencies caused by the outage might remain and need to be fixed manually.
42+
43+
### Multi-region disaster recovery
44+
45+
Azure HDInsight on AKS currently doesn't support cross-region failover. Improving business continuity with cross-region high availability disaster recovery requires more complex and more costly architectural designs. Customers can design their own solution to back up key data and job status across regions.
46+
47+
#### Outage detection, notification, and management
48+
49+
- Use Azure monitoring tools on HDInsight on AKS to detect abnormal behavior in the cluster and set up corresponding alert notifications. You can enable Log Analytics in various ways, and use the managed Prometheus service with Azure Grafana dashboards for monitoring. For more information, see [Azure Monitor integration](../hdinsight-aks/concept-azure-monitor-integration.md).
50+
51+
- Subscribe to Azure health alerts to be notified about service issues, planned maintenance, and health and security advisories for a subscription, service, or region. Health notifications that include the issue cause and the resolution ETA help you to better execute failovers and failbacks. For more information, see [Manage service health](../hdinsight-aks/service-health.md) and [Azure Service Health documentation](../service-health/index.yml).
52+
53+
### Single-region disaster recovery
54+
55+
Currently, Azure HDInsight on AKS has only one standard service offering, and clusters are created in a single-region geography. Customers are responsible for disaster recovery.
56+
57+
### Capacity and proactive disaster recovery resiliency
58+
59+
Azure HDInsight on AKS and its customers operate under the shared responsibility model, which means that customers must address disaster recovery (DR) for the services they deploy and control. To ensure that recovery is proactive, customers should always predeploy secondaries, because capacity isn't guaranteed at the time of impact for those who haven't preallocated it.
60+
61+
Unlike the original version of HDInsight, the virtual machines used in HDInsight on AKS clusters require the same quota as Azure VMs. For more information, see [Capacity planning](../hdinsight-aks/virtual-machine-recommendation-capacity-planning.md#capacity-planning).
62+
63+
## Related content
64+
65+
To learn more about the items discussed in this article, see:
66+
67+
* [What is Azure HDInsight on AKS](../hdinsight-aks/overview.md)
68+
* [Get started with one-click deployment](../hdinsight-aks/get-started.md)
69+
70+
71+
* [Reliability for HDInsight](./reliability-hdinsight.md)
72+
* [Reliability in Azure](./overview.md)

articles/reliability/reliability-hdinsight.md

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
title: Reliability in Azure HDInsight
3-
description: Find out about reliability in Azure HDInsight
3+
description: Find out about reliability in Azure HDInsight.
44
author: apurbasroy
55
ms.service: azure
66
ms.topic: reliability-article
@@ -129,7 +129,7 @@ Improving business continuity using cross region high availability disaster reco
129129
|Data Storage|Duplicating primary data/tables in a secondary region|Replicate only curated data|
130130
|Data Egress|Outbound cross region data transfers come at a price. Review Bandwidth pricing guidelines|Replicate only curated data to reduce the region egress footprint|
131131
|Cluster Compute|Additional HDInsight cluster/s in secondary region|Use automated scripts to deploy secondary compute after primary failure. Use Autoscaling to keep secondary cluster size to a minimum. Use cheaper VM SKUs. Create secondaries in regions where VM SKUs may be discounted.|
132-
|Authentication |Multiuser scenarios in secondary region will incur additional Microsoft Entra Domain Services setups|Avoid multiuser setups in secondary region.|
132+
|Authentication |Multiuser scenarios in the secondary region incur extra Microsoft Entra Domain Services setups|Avoid multiuser setups in secondary region.|
133133

134134
### Complexity optimizations
135135

@@ -143,7 +143,7 @@ Improving business continuity using cross region high availability disaster reco
143143

144144
When you create your multi region disaster recovery plan, consider the following recommendations:
145145

146-
* Determine the minimal business functionality you will need if there is a disaster and why. For example, evaluate if you need failover capabilities for the data transformation layer (shown in yellow) *and* the data serving layer (shown in blue), or if you only need failover for the data service layer.
146+
* Determine the minimal business functionality you need if there is a disaster and why. For example, evaluate if you need failover capabilities for the data transformation layer (shown in yellow) *and* the data serving layer (shown in blue), or if you only need failover for the data serving layer.
147147

148148
:::image type="content" source="../hdinsight/media/hdinsight-business-continuity/data-layers.png" alt-text="data transformation and data serving layers":::
149149

@@ -192,7 +192,7 @@ functionality. Service incidents in one or more of the following services in a s
192192
To learn more, see [high availability services supported by Azure HDInsight](../hdinsight/hdinsight-high-availability-components.md).
193193

194194

195-
- **Metastore(s): Azure SQL Database**. HDInsight uses [Azure SQL Database](https://azure.microsoft.com/support/legal/sla/azure-sql-database/v1_4/) as a metastore, which provides an SLA of 99.99%. Three replicas of data persist within a data center with synchronous replication. If there is a replica loss, an alternate replica is served seamlessly. [Active geo-replication](/azure/azure-sql/database/active-geo-replication-overview) is supported out of the box with a maximum of four data centers. When there is a failover, either manual or data center, the first replica in the hierarchy will automatically become read-write capable. For more information, see [Azure SQL Database business continuity](/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview).
195+
- **Metastore(s): Azure SQL Database**. HDInsight uses [Azure SQL Database](https://azure.microsoft.com/support/legal/sla/azure-sql-database/v1_4/) as a metastore, which provides an SLA of 99.99%. Three replicas of data persist within a data center with synchronous replication. If there is a replica loss, an alternate replica is served seamlessly. [Active geo-replication](/azure/azure-sql/database/active-geo-replication-overview) is supported out of the box with a maximum of four data centers. When there is a failover, either manual or data center, the first replica in the hierarchy automatically becomes read-write capable. For more information, see [Azure SQL Database business continuity](/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview).
196196

197197

198198
- **Storage: Azure Data Lake Gen2 or Blob storage**. HDInsight recommends Azure Data Lake Storage Gen2 as the underlying storage layer. [Azure Storage](https://azure.microsoft.com/support/legal/sla/storage/v1_5/), including Azure Data Lake Storage Gen2, provides an SLA of 99.9%. HDInsight uses the LRS service in which three replicas of data persist within a data center, and replication is synchronous. When there is a replica loss, a replica is served seamlessly.
@@ -208,13 +208,14 @@ functionality. Service incidents in one or more of the following services in a s
208208
:::image type="content" source="../hdinsight/media/hdinsight-business-continuity/hdinsight-components.png" alt-text="HDInsight components":::
209209

210210

211-
## Next steps
211+
## Related content
212212

213-
To learn more about the items discussed in this article, see:
214213

215214
* [Azure HDInsight business continuity architectures](../hdinsight/hdinsight-business-continuity-architecture.md)
216215
* [Azure HDInsight highly available solution architecture case study](../hdinsight/hdinsight-high-availability-case-study.md)
217216
* [What is Apache Hive and HiveQL on Azure HDInsight?](../hdinsight/hadoop/hdinsight-use-hive.md)
218217

219-
> [!div class="nextstepaction"]
220-
> [Reliability in Azure](availability-zones-overview.md)
218+
219+
* [Reliability for HDInsight on AKS](./reliability-hdinsight-on-aks.md)
220+
* [Reliability in Azure](./overview.md)
221+
