Commit 922347b

authored
Merge pull request #223811 from ericd-mst-github/erd-vm-resiliency
Erd vm resiliency
2 parents 7ca6d43 + 6924ecf commit 922347b

File tree

5 files changed

+186
-5
lines changed

articles/reliability/TOC.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -238,7 +238,7 @@
238238
- name: Azure Virtual Machine Scale Sets
239239
href: ../virtual-machines/availability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
240240
- name: Azure Virtual Machines
241-
href: ../virtual-machines/availability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
241+
href: ../virtual-machines/virtual-machines-reliability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
242242
- name: Azure Virtual Network
243243
href: ../vpn-gateway/create-zone-redundant-vnet-gateway.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json
244244
- name: Azure VPN Gateway

articles/reliability/reliability-guidance-overview.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ Azure reliability guidance is a collection of service-specific reliability guide
3232
[Azure SQL](/azure/azure-sql/database/high-availability-sla?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
3333
[Azure Storage: Blob Storage](../storage/common/storage-disaster-recovery-guidance.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
3434
[Azure Virtual Machine Scale Sets](../virtual-machines/availability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
35-
[Azure Virtual Machines](../virtual-machines/availability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
35+
[Azure Virtual Machines](../virtual-machines/virtual-machines-reliability.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
3636
[Azure Virtual Network](../vpn-gateway/create-zone-redundant-vnet-gateway.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
3737
[Azure VPN Gateway](../vpn-gateway/about-zone-redundant-vnet-gateways.md?toc=/azure/reliability/toc.json&bc=/azure/reliability/breadcrumb/toc.json)|
3838

articles/virtual-machines/TOC.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1906,6 +1906,11 @@
19061906
- name: Overview
19071907
displayName: Backup and recovery
19081908
href: backup-recovery.md
1909+
- name: Reliability in Virtual Machines
1910+
items:
1911+
- name: Reliability in Virtual Machines
1912+
displayName: Reliability in Virtual Machines
1913+
href: virtual-machines-reliability.md
19091914
- name: Service disruptions
19101915
href: virtual-machines-disaster-recovery-guidance.md
19111916
- name: Back up VMs

articles/virtual-machines/backup-recovery.md

Lines changed: 5 additions & 3 deletions
@@ -5,7 +5,7 @@ author: cynthn
55
ms.service: virtual-machines
66
ms.subservice: recovery
77
ms.topic: conceptual
8-
ms.date: 10/22/2021
8+
ms.date: 01/12/2023
99
ms.author: cynthn
1010
---
1111

@@ -28,7 +28,7 @@ For more information on how Azure Backup works, see [Plan your VM backup infrast
2828

2929
Azure Site Recovery protects your VMs from a major disaster scenario. These scenarios may include widespread service interruptions or regional outages caused by natural disasters. You can configure Azure Site Recovery for your VMs so that your applications are recoverable in a matter of minutes with a single click. You can replicate to an Azure region of your choice, since recovery isn't restricted to paired regions.
3030

31-
You can run disaster-recovery drills with on-demand test failovers, without affecting your production workloads or ongoing replication. Create recovery plans to orchestrate failover and failback of the entire application running on multiple VMs. The recovery plan feature is integrated with Azure automation runbooks.
31+
You can run disaster-recovery drills with on-demand test failovers, without affecting your production workloads or ongoing replication. Create recovery plans to orchestrate failover and failback of the entire application running on multiple VMs. The recovery plan feature is integrated with Azure Automation runbooks.
3232

3333
You can get started by [replicating your virtual machines](../site-recovery/azure-to-azure-quickstart.md).
3434

@@ -53,4 +53,6 @@ Once created, VM restore points can then be used to restore individual disks. To
5353
Learn more about [working with VM restore points](virtual-machines-create-restore-points.md) and the [restore point collections](/rest/api/compute/restore-point-collections) API.
5454

5555
## Next steps
56-
You can try out Azure Backup by following the [Azure Backup quickstart](../backup/quick-backup-vm-portal.md).
56+
You can try out Azure Backup by following the [Azure Backup quickstart](../backup/quick-backup-vm-portal.md).
57+
58+
You can also plan and implement reliability for your virtual machine configuration. For more information, see [Virtual Machine Reliability](./virtual-machines-reliability.md).
Lines changed: 174 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,174 @@
1+
---
2+
title: Reliability in Azure Virtual Machines
3+
description: Find out about reliability in Azure Virtual Machines
4+
author: ericd-mst-github
5+
ms.author: erd
6+
ms.topic: overview
7+
ms.custom: subject-reliability
8+
ms.service: virtual-machines
9+
ms.date: 01/12/2023
10+
---
11+
12+
# What is reliability in Virtual Machines?
13+
14+
This article describes reliability support in Virtual Machines (VMs) and covers both regional resiliency with availability zones and cross-region resiliency with disaster recovery. For a more detailed overview of reliability in Azure, see [Azure reliability](/azure/architecture/framework/resiliency/overview).
15+
16+
17+
## Availability zone support
18+
19+
Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. Availability zones are designed so that if one zone is affected by a local failure, the remaining two zones support regional services, capacity, and high availability.
20+
21+
Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see [Regions and availability zones](/azure/availability-zones/az-overview.md).
22+
23+
Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways: zone-redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see [Build solutions with availability zones](/azure/architecture/high-availability/building-solutions-for-high-availability).
24+
25+
Virtual machines support availability zones, with three availability zones per supported Azure region, and can be both zone-redundant and zonal. For more information, see [availability zones support](/azure/reliability/availability-zones-service-support). Customers are responsible for configuring and migrating their virtual machines for availability. Use the following readiness options to enable availability zones:
26+
27+
- See [availability options for VMs](/azure/virtual-machines/availability)
28+
- Review [availability zone service and region support](/azure/reliability/availability-zones-service-support)
29+
- [Migrate existing VMs](/azure/reliability/migrate-vm) to availability zones
30+
31+
32+
### Prerequisites
33+
34+
Your virtual machine SKUs must be available across the zones in your region. To review which regions support availability zones, see the [list of supported regions](/azure/reliability/availability-zones-service-support#azure-regions-with-availability-zone-support). Check for VM SKU availability by using PowerShell or the Azure CLI, or by reviewing the list of foundational services. For more information, see [reliability prerequisites](/azure/reliability/migrate-vm#prerequisites).
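
The check described above can be sketched as follows. This is a minimal illustration, not an official tool: it assumes you have already exported SKU records as JSON (for example, from the Azure CLI) and that each record carries a `name` and a `locationInfo` list with `location` and `zones` fields. The function names and the sample records are hypothetical.

```python
from typing import Iterable, List, Set


def zones_for_sku(skus: Iterable[dict], sku_name: str, region: str) -> Set[str]:
    """Collect the zones in `region` where `sku_name` is offered.

    Assumes each record looks like:
    {"name": ..., "locationInfo": [{"location": ..., "zones": [...]}]}
    """
    zones: Set[str] = set()
    for sku in skus:
        if sku.get("name") != sku_name:
            continue
        for info in sku.get("locationInfo", []):
            # Region names are compared case-insensitively.
            if info.get("location", "").lower() == region.lower():
                zones.update(info.get("zones") or [])
    return zones


def sku_covers_zones(skus: Iterable[dict], sku_name: str, region: str,
                     required_zones: List[str]) -> bool:
    """Return True only if the SKU is offered in every required zone."""
    return set(required_zones) <= zones_for_sku(skus, sku_name, region)


# Hypothetical sample data; real SKU names and zone lists will differ.
SAMPLE_SKUS = [
    {"name": "Example_SKU_A",
     "locationInfo": [{"location": "eastus", "zones": ["1", "2", "3"]}]},
    {"name": "Example_SKU_B",
     "locationInfo": [{"location": "eastus", "zones": ["1"]}]},
]

if __name__ == "__main__":
    for name in ("Example_SKU_A", "Example_SKU_B"):
        ok = sku_covers_zones(SAMPLE_SKUS, name, "eastus", ["1", "2", "3"])
        print(name, "covers all three zones:", ok)
```

A SKU that is missing from any required zone fails the check, which is the condition the prerequisite asks you to rule out before you migrate.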
35+
36+
### SLA improvements
37+
38+
Because availability zones are physically separate and have independent power, networking, and cooling, the service-level agreement (SLA) improves. For more information, see the [SLA for Virtual Machines](https://azure.microsoft.com/support/legal/sla/virtual-machines/v1_9/).
39+
40+
#### Create a resource with availability zone enabled
41+
42+
Get started by creating a virtual machine (VM) with availability zones enabled, using one of the following deployment options:
43+
- [Azure CLI](/azure/virtual-machines/linux/create-cli-availability-zone)
44+
- [PowerShell](/azure/virtual-machines/windows/create-powershell-availability-zone)
45+
- [Azure portal](/azure/virtual-machines/create-portal-availability-zone)
46+
47+
### Zonal failover support
48+
49+
Customers can set up virtual machines to fail over to another zone using the Site Recovery service. For more information, see [Site Recovery](/azure/site-recovery/site-recovery-overview).
50+
51+
### Fault tolerance
52+
53+
Virtual machines can fail over to another server in a cluster, with the VM's operating system restarting on the new server. To make sure their fault tolerance solution works, customers should review the failover process for disaster recovery, group virtual machines in recovery plans, and run disaster recovery drills.
54+
55+
For more information, see the [site recovery processes](/azure/site-recovery/site-recovery-failover#before-you-start).
56+
57+
58+
### Zone down experience
59+
60+
During a zone-wide outage, customers should expect a brief degradation of performance until the virtual machine service's self-healing rebalances underlying capacity to the healthy zones. This doesn't depend on zone restoration; the Microsoft-managed self-healing is expected to compensate for a lost zone by using capacity from other zones.
61+
62+
Customers should also prepare for the possibility that there's an outage of an entire region. If there's a service disruption for an entire region, the locally redundant copies of your data would temporarily be unavailable. If geo-replication is enabled, three additional copies of your Azure Storage blobs and tables are stored in a different region. In the event of a complete regional outage or a disaster in which the primary region isn't recoverable, Azure remaps all of the DNS entries to the geo-replicated region.
63+
64+
65+
66+
67+
#### Zone outage preparation and recovery
68+
69+
The following guidance is provided for Azure virtual machines in the case of a service disruption of the entire region where your Azure virtual machine application is deployed:
70+
71+
- Configure [Azure Site Recovery](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance#option-1-initiate-a-failover-by-using-azure-site-recovery) for your VMs
72+
- Check the [Azure Service Health Dashboard](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance#option-2-wait-for-recovery) status if Azure Site Recovery hasn't been configured
73+
- Review how the [Azure Backup service](/azure/backup/backup-azure-vms-introduction) works for VMs
74+
- See the [support matrix](/azure/backup/backup-support-matrix-iaas) for Azure VM backups
75+
- Determine which [VM restore option and scenario](/azure/backup/about-azure-vm-restore) will work best for your environment
76+
77+
78+
79+
### Low-latency design
80+
81+
Cross Region (secondary region), Cross Subscription (preview), and Cross Zonal (preview) are available options to consider when designing a low-latency virtual machine solution. For more information on these options, see the [supported restore methods](/azure/backup/backup-support-matrix-iaas#supported-restore-methods).
82+
83+
>[!IMPORTANT]
84+
>By opting out of zone-aware deployment, you forgo protection from isolation of underlying faults. Use of SKUs that don't support availability zones or opting out from availability zone configuration forces reliance on resources that don't obey zone placement and separation (including underlying dependencies of these resources). These resources shouldn't be expected to survive zone-down scenarios. Solutions that leverage such resources should define a disaster recovery strategy and configure a recovery of the solution in another region.
85+
86+
### Safe deployment techniques
87+
88+
When you opt for availability zones isolation, you should utilize safe deployment techniques for application code, as well as application upgrades. In addition to configuring Azure Site Recovery, below are recommended safe deployment techniques for VMs:
89+
90+
- [Virtual Machine Scale Sets](/azure/virtual-machines/flexible-virtual-machine-scale-sets)
91+
- [Availability Sets](/azure/virtual-machines/availability-set-overview)
92+
- [Azure Load Balancer](/azure/load-balancer/load-balancer-overview)
93+
- [Azure Storage Redundancy](/azure/storage/common/storage-redundancy)
94+
95+
96+
97+
As Microsoft periodically performs planned maintenance updates, there may be rare instances when these updates require a reboot of your virtual machine to apply the required updates to the underlying infrastructure. To learn more, see [availability considerations](/azure/virtual-machines/maintenance-and-updates#availability-considerations-during-scheduled-maintenance) during scheduled maintenance.
98+
99+
Monitor the following health signals before upgrading your next set of nodes in another zone:
100+
101+
- Check the [Azure Service Health Dashboard](https://azure.microsoft.com/status/) for the virtual machines service status for your expected regions
102+
- Ensure that [replication](/azure/site-recovery/azure-to-azure-quickstart) is enabled on your VMs
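
The zone-by-zone upgrade pattern described above can be sketched as follows. This is an illustration only: `rollout_by_zone`, the `upgrade` callback, and the `zone_is_healthy` callback are hypothetical names, and the health check stands in for whatever signals you monitor (service health status, replication state, and so on).

```python
from typing import Callable, Dict, List


def rollout_by_zone(nodes_by_zone: Dict[str, List[str]],
                    upgrade: Callable[[str], None],
                    zone_is_healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade one zone at a time, halting if a zone is unhealthy afterwards.

    Returns the zones that were upgraded and then verified healthy.
    """
    verified: List[str] = []
    for zone in sorted(nodes_by_zone):
        for node in nodes_by_zone[zone]:
            upgrade(node)
        # Gate the next zone on the health of the one just upgraded.
        if not zone_is_healthy(zone):
            break
        verified.append(zone)
    return verified


if __name__ == "__main__":
    upgraded: List[str] = []
    result = rollout_by_zone(
        {"1": ["vm-a"], "2": ["vm-b"], "3": ["vm-c"]},
        upgrade=upgraded.append,
        zone_is_healthy=lambda zone: zone != "2",  # simulate a failure in zone 2
    )
    print("verified zones:", result)
    print("nodes touched:", upgraded)
```

Stopping at the first unhealthy zone keeps the remaining zones on the known-good version, so a bad upgrade stays confined to one zone.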
103+
104+
105+
106+
107+
### Availability zone redeployment and migration
108+
109+
To migrate existing virtual machine resources to a zone-redundant configuration, see the following resources:
110+
111+
- Move a VM to another subscription or resource group
112+
- [CLI](/azure/virtual-machines/linux/move-vm)
113+
- [PowerShell](/azure/virtual-machines/windows/move-vm)
114+
- [Azure Resource Mover](/resource-mover/tutorial-move-region-virtual-machines)
115+
- [Move Azure VMs to availability zones](/azure/site-recovery/move-azure-vms-avset-azone)
116+
- [Move region maintenance configuration resources](/azure/virtual-machines/move-region-maintenance-configuration-resources)
117+
118+
119+
120+
121+
## Disaster recovery: cross-region failover
122+
123+
In the case of a region-wide disaster, Azure can provide protection from regional or large geography disasters with disaster recovery by making use of another region. For more information on Azure disaster recovery architecture, see [Azure to Azure disaster recovery architecture](/azure/site-recovery/azure-to-azure-architecture).
124+
125+
Customers can use Cross Region to restore Azure VMs via paired regions. You can restore all the Azure VMs for the selected recovery point if the backup is done in the secondary region. For more details on Cross Region restore, refer to the Cross Region table row entry in our [restore options](/azure/backup/backup-azure-arm-restore-vms#restore-options).
126+
127+
128+
### Cross-region disaster recovery in multi-region geography
129+
130+
While Microsoft is working diligently to restore the virtual machine service for region-wide service disruptions, customers will have to rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on [Data strategies for disaster recovery](/azure/architecture/reliability/disaster-recovery#disaster-recovery-plan).
131+
132+
#### Outage detection, notification, and management
133+
134+
The hardware or physical infrastructure for the virtual machine can fail unexpectedly. This can include local network failures, local disk failures, or other rack-level failures. When a failure is detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same datacenter. During the healing procedure, virtual machines experience downtime (reboot) and in some cases loss of the temporary drive. The attached OS and data disks are always preserved.
135+
136+
For more detailed information on virtual machine service disruptions, see [disaster recovery guidance](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance).
137+
138+
#### Set up disaster recovery and outage detection
139+
140+
When setting up disaster recovery for virtual machines, understand what [Azure Site Recovery provides](/azure/site-recovery/site-recovery-overview#what-does-site-recovery-provide). Enable disaster recovery for virtual machines with the following methods:
141+
142+
- Set up disaster recovery to a [secondary Azure region for an Azure VM](/azure/site-recovery/azure-to-azure-quickstart)
143+
- Create a Recovery Services vault
144+
- [Bicep](/azure/site-recovery/quickstart-create-vault-bicep)
145+
- [ARM template](/azure/site-recovery/quickstart-create-vault-template)
146+
- Enable disaster recovery for [Linux virtual machines](/azure/virtual-machines/linux/tutorial-disaster-recovery)
147+
- Enable disaster recovery for [Windows virtual machines](/azure/virtual-machines/windows/tutorial-disaster-recovery)
148+
- Fail over virtual machines to [another region](/azure/site-recovery/azure-to-azure-tutorial-failover-failback)
149+
- Fail back virtual machines to the [primary region](/azure/site-recovery/azure-to-azure-tutorial-failback#fail-back-to-the-primary-region)
150+
151+
### Single-region geography disaster recovery
152+
153+
154+
With disaster recovery set up, Azure VMs will continuously replicate to a different target region. If an outage occurs, you can fail over VMs to the secondary region, and access them from there.
155+
156+
For more information, see [Azure VMs architectural components](/azure/site-recovery/azure-to-azure-architecture#architectural-components) and [region pairing](/azure/virtual-machines/regions#region-pairs).
157+
158+
### Capacity and proactive disaster recovery resiliency
159+
160+
Microsoft and its customers operate under the shared responsibility model. This means that for customer-enabled DR (customer-responsible services), the customer must address DR for any service they deploy and control. To ensure that recovery is proactive, customers should always pre-deploy secondaries, because there's no guarantee of capacity at the time of impact for those who haven't pre-allocated.
161+
162+
For deploying virtual machines, customers can use [flexible orchestration](/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes#scale-sets-with-flexible-orchestration) mode on Virtual Machine Scale Sets. All VM sizes can be used with flexible orchestration mode. Flexible orchestration mode also offers high availability guarantees (up to 1000 VMs) by spreading VMs across fault domains in a region or within an Availability Zone.
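
To illustrate why spreading matters, the following sketch distributes VMs round-robin across a number of fault domains. It is not Azure's placement algorithm, which the platform manages for you; the function name and the domain count are assumptions chosen for the example.

```python
from typing import Dict, List


def spread_across_domains(vm_names: List[str], domain_count: int) -> Dict[int, List[str]]:
    """Assign VMs to fault domains round-robin so no domain holds far more than another."""
    if domain_count < 1:
        raise ValueError("domain_count must be at least 1")
    placement: Dict[int, List[str]] = {d: [] for d in range(domain_count)}
    for index, vm in enumerate(vm_names):
        placement[index % domain_count].append(vm)
    return placement


if __name__ == "__main__":
    vms = [f"vm{i}" for i in range(5)]
    for domain, members in spread_across_domains(vms, 3).items():
        print(f"fault domain {domain}: {members}")
```

With an even spread, the loss of one fault domain removes roughly `1 / domain_count` of the VMs rather than all of them.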
163+
164+
## Additional guidance
165+
166+
- [Well-Architected Framework for virtual machines](/azure/architecture/framework/services/compute/virtual-machines/virtual-machines-review)
167+
- [Azure to Azure disaster recovery architecture](/azure/site-recovery/azure-to-azure-architecture)
168+
- [Accelerated networking with Azure VM disaster recovery](/azure-vm-disaster-recovery-with-accelerated-networking)
169+
- [ExpressRoute with Azure VM disaster recovery](/azure/site-recovery/azure-vm-disaster-recovery-with-expressroute)
170+
- [Virtual Machine Scale Sets](/azure/virtual-machine-scale-sets/)
171+
172+
## Next steps
173+
> [!div class="nextstepaction"]
174+
> [Resiliency in Azure](/azure/availability-zones/overview.md)
