| 1 | +--- |
| 2 | +title: Reliability in Azure Virtual Machines |
| 3 | +description: Find out about reliability in Azure Virtual Machines |
| 4 | +author: ericd-mst-github |
| 5 | +ms.author: erd |
| 6 | +ms.topic: overview |
| 7 | +ms.custom: subject-reliability |
| 8 | +ms.service: virtual-machines |
| 9 | +ms.date: 01/12/2023 |
| 10 | +--- |
| 11 | + |
| 12 | +# What is reliability in Virtual Machines? |
| 13 | + |
| 14 | +This article describes reliability support in Azure Virtual Machines (VMs). It covers both regional resiliency with availability zones and cross-region resiliency with disaster recovery. For a more detailed overview of reliability in Azure, see [Azure reliability](/azure/architecture/framework/resiliency/overview). |
| 15 | + |
| 16 | + |
| 17 | +## Availability zone support |
| 18 | + |
| 19 | +Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. Availability zones are designed so that if one zone is affected by a local failure, regional services, capacity, and high availability are supported by the remaining two zones. |
| 20 | + |
| 21 | +Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see [Regions and availability zones](/azure/availability-zones/az-overview.md). |
| 22 | + |
| 23 | +Azure services that support availability zones are designed to provide the right level of reliability and flexibility. They can be configured in two ways: zone-redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see [Build solutions with availability zones](/azure/architecture/high-availability/building-solutions-for-high-availability). |
| 24 | + |
| 25 | +Virtual machines support availability zones, with three availability zones per supported Azure region, and support both zone-redundant and zonal configurations. For more information, see [availability zones support](/azure/reliability/availability-zones-service-support). The customer is responsible for configuring and migrating their virtual machines for availability. Refer to the following readiness options for availability zone enablement: |
| 26 | + |
| 27 | +- See [availability options for VMs](/azure/virtual-machines/availability) |
| 28 | +- Review [availability zone service and region support](/azure/reliability/availability-zones-service-support) |
| 29 | +- [Migrate existing VMs](/azure/reliability/migrate-vm) to availability zones |
| 30 | + |
| 31 | + |
| 32 | +### Prerequisites |
| 33 | + |
| 34 | +Your virtual machine SKUs must be available across the zones in your region. To review which regions support availability zones, see the [list of supported regions](/azure/reliability/availability-zones-service-support#azure-regions-with-availability-zone-support). Check for VM SKU availability by using PowerShell or the Azure CLI, or by reviewing the list of foundational services. For more information, see [reliability prerequisites](/azure/reliability/migrate-vm#prerequisites). |
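
As a rough illustration of the SKU check described above, the following Python sketch filters SKU data that has already been exported to JSON, for example from the Azure CLI. The field names `name`, `locationInfo`, `location`, and `zones` are assumptions about the shape of that export, and the sample records are invented for the example. The sketch ignores any subscription-level restrictions that the real data may also list.

```python
from typing import Iterable


def zones_for_sku(skus: Iterable[dict], sku_name: str, region: str) -> set[str]:
    """Return the zones in `region` where `sku_name` is listed."""
    zones: set[str] = set()
    for sku in skus:
        if sku.get("name") != sku_name:
            continue
        for info in sku.get("locationInfo", []):
            # Compare region names case-insensitively.
            if info.get("location", "").lower() == region.lower():
                zones.update(info.get("zones", []))
    return zones


def sku_available_in_zones(skus, sku_name, region, required_zones) -> bool:
    """True when the SKU is listed in every required zone of the region."""
    return set(required_zones) <= zones_for_sku(skus, sku_name, region)


# Hypothetical sample data, shaped like an assumed SKU export.
sample_skus = [
    {"name": "Standard_D2s_v3",
     "locationInfo": [{"location": "eastus", "zones": ["1", "2", "3"]}]},
    {"name": "Standard_NC6",
     "locationInfo": [{"location": "eastus", "zones": ["2"]}]},
]

print(sku_available_in_zones(sample_skus, "Standard_D2s_v3", "eastus", ["1", "2", "3"]))
print(sku_available_in_zones(sample_skus, "Standard_NC6", "eastus", ["1", "2"]))
```

A check like this is only as current as the exported data, so treat the linked prerequisites as the authoritative procedure.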
| 35 | + |
| 36 | +### SLA improvements |
| 37 | + |
| 38 | +Because availability zones are physically separate and provide distinct power, network, and cooling, service-level agreements (SLAs) increase. For more information, see the [SLA for Virtual Machines](https://azure.microsoft.com/support/legal/sla/virtual-machines/v1_9/). |
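
To make the effect of a higher SLA concrete, the following Python sketch converts an availability percentage into the downtime it permits over a period. The percentages used are illustrative inputs only, not figures quoted from the SLA; see the linked SLA for the values that apply to a given configuration.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability percentage over `days`."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Illustrative percentages; check the SLA for the figures that apply to you.
for percent in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(percent)
    print(f"{percent}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Each additional "nine" cuts the permitted downtime by roughly a factor of ten, which is why the jump between tiers matters more than the small numeric difference suggests.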
| 39 | + |
| 40 | +#### Create a resource with availability zone enabled |
| 41 | + |
| 42 | +Get started by creating a virtual machine (VM) with availability zones enabled by using one of the following deployment options: |
| 43 | +- [Azure CLI](/azure/virtual-machines/linux/create-cli-availability-zone) |
| 44 | +- [PowerShell](/azure/virtual-machines/windows/create-powershell-availability-zone) |
| 45 | +- [Azure portal](/azure/virtual-machines/create-portal-availability-zone) |
| 46 | + |
| 47 | +### Zonal failover support |
| 48 | + |
| 49 | +Customers can set up virtual machines to fail over to another zone by using the Site Recovery service. For more information, see [Site Recovery](/azure/site-recovery/site-recovery-overview). |
| 50 | + |
| 51 | +### Fault tolerance |
| 52 | + |
| 53 | +Virtual machines can fail over to another server in a cluster, with the VM's operating system restarting on the new server. Customers should review the failover process for disaster recovery, gather virtual machines into recovery plans, and run disaster recovery drills to ensure that their fault tolerance solution is successful. |
| 54 | + |
| 55 | +For more information, see the [site recovery processes](/azure/site-recovery/site-recovery-failover#before-you-start). |
| 56 | + |
| 57 | + |
| 58 | +### Zone down experience |
| 59 | + |
| 60 | +During a zone-wide outage, the customer should expect a brief degradation of performance until the virtual machine service's self-healing rebalances underlying capacity to the healthy zones. This rebalancing doesn't depend on zone restoration; the Microsoft-managed self-healing is expected to compensate for a lost zone by using capacity from other zones. |
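
For workloads that you spread across zones yourself, a common planning step is to size each zone so that the surviving zones can still carry the full load. The following Python sketch shows that arithmetic. It is a customer-side planning aid under the stated assumption of an even spread, not a description of how the Microsoft-managed self-healing works, and the function name and parameters are hypothetical.

```python
import math


def instances_per_zone(required_instances: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Instances to place in each zone so the surviving zones still meet demand."""
    surviving = zones - zones_lost
    if required_instances < 0 or surviving < 1:
        raise ValueError("need a non-negative load and at least one surviving zone")
    # Each surviving zone must carry an equal share of the required load.
    return math.ceil(required_instances / surviving)


per_zone = instances_per_zone(required_instances=10)
print(f"Deploy {per_zone} instances per zone, {per_zone * 3} in total")
```

With three zones and a tolerance of one lost zone, this approach over-provisions by about half of the required load, which is the cost of keeping full capacity through a zone outage.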
| 61 | + |
| 62 | +Customers should also prepare for the possibility that there's an outage of an entire region. If there's a service disruption for an entire region, the locally redundant copies of your data would temporarily be unavailable. If geo-replication is enabled, three additional copies of your Azure Storage blobs and tables are stored in a different region. In the event of a complete regional outage or a disaster in which the primary region isn't recoverable, Azure remaps all of the DNS entries to the geo-replicated region. |
| 63 | + |
| 64 | + |
| 65 | + |
| 66 | + |
| 67 | +#### Zone outage preparation and recovery |
| 68 | + |
| 69 | +The following guidance is provided for Azure virtual machines in the case of a service disruption of the entire region where your Azure virtual machine application is deployed: |
| 70 | + |
| 71 | +- Configure [Azure Site Recovery](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance#option-1-initiate-a-failover-by-using-azure-site-recovery) for your VMs |
| 72 | +- Check the [Azure Service Health Dashboard](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance#option-2-wait-for-recovery) status if Azure Site Recovery hasn't been configured |
| 73 | +- Review how the [Azure Backup service](/azure/backup/backup-azure-vms-introduction) works for VMs |
| 74 | + - See the [support matrix](/azure/backup/backup-support-matrix-iaas) for Azure VM backups |
| 75 | +- Determine which [VM restore option and scenario](/azure/backup/about-azure-vm-restore) will work best for your environment |
| 76 | + |
| 77 | + |
| 78 | + |
| 79 | +### Low-latency design |
| 80 | + |
| 81 | +Cross Region (secondary region), Cross Subscription (preview), and Cross Zonal (preview) are available options to consider when designing a low-latency virtual machine solution. For more information on these options, see the [supported restore methods](/azure/backup/backup-support-matrix-iaas#supported-restore-methods). |
| 82 | + |
| 83 | +>[!IMPORTANT] |
| 84 | +>By opting out of zone-aware deployment, you forgo protection from isolation of underlying faults. Use of SKUs that don't support availability zones or opting out from availability zone configuration forces reliance on resources that don't obey zone placement and separation (including underlying dependencies of these resources). These resources shouldn't be expected to survive zone-down scenarios. Solutions that leverage such resources should define a disaster recovery strategy and configure a recovery of the solution in another region. |
| 85 | + |
| 86 | +### Safe deployment techniques |
| 87 | + |
| 88 | +When you opt for availability zone isolation, you should use safe deployment techniques for application code and application upgrades. In addition to configuring Azure Site Recovery, the following safe deployment techniques are recommended for VMs: |
| 89 | + |
| 90 | +- [Virtual Machine Scale Sets](/azure/virtual-machines/flexible-virtual-machine-scale-sets) |
| 91 | +- [Availability Sets](/azure/virtual-machines/availability-set-overview) |
| 92 | +- [Azure Load Balancer](/azure/load-balancer/load-balancer-overview) |
| 93 | +- [Azure Storage Redundancy](/azure/storage/common/storage-redundancy) |
| 94 | + |
| 95 | + |
| 96 | + |
| 97 | +Microsoft periodically performs planned maintenance updates. In rare instances, these updates require a reboot of your virtual machine so that they can be applied to the underlying infrastructure. To learn more, see [availability considerations](/azure/virtual-machines/maintenance-and-updates#availability-considerations-during-scheduled-maintenance) during scheduled maintenance. |
| 98 | + |
| 99 | +Check the following health signals before upgrading your next set of nodes in another zone: |
| 100 | + |
| 101 | +- Check the [Azure Service Health Dashboard](https://azure.microsoft.com/status/) for the virtual machines service status for your expected regions |
| 102 | +- Ensure that [replication](/azure/site-recovery/azure-to-azure-quickstart) is enabled on your VMs |
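
The zone-by-zone pattern above can be sketched as a loop that upgrades one zone, checks health, and stops before touching the next zone if the check fails. In this Python sketch, `upgrade_zone` and `is_healthy` are hypothetical callbacks that you would supply yourself, for example a deployment step and a check of service health and replication status. Neither is an Azure API.

```python
from typing import Callable, Iterable, List


def rollout_by_zone(
    zones: Iterable[str],
    upgrade_zone: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade zones one at a time; return the zones that finished healthy."""
    completed: List[str] = []
    for zone in zones:
        upgrade_zone(zone)
        if not is_healthy(zone):
            # Stop here so the remaining zones keep running the old version.
            break
        completed.append(zone)
    return completed


# Hypothetical stand-ins for a real deployment step and health check.
upgraded: List[str] = []
done = rollout_by_zone(
    zones=["1", "2", "3"],
    upgrade_zone=upgraded.append,
    is_healthy=lambda zone: zone != "2",  # pretend zone 2 fails its check
)
print("upgraded:", upgraded, "healthy:", done)
```

Stopping at the first unhealthy zone limits the impact of a bad upgrade to a single zone, which is the point of combining zone isolation with staged deployment.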
| 103 | + |
| 104 | + |
| 105 | + |
| 106 | + |
| 107 | +### Availability zone redeployment and migration |
| 108 | + |
| 109 | +To migrate existing virtual machine resources to a zone-redundant configuration, refer to the following resources: |
| 110 | + |
| 111 | +- Move a VM to another subscription or resource group |
| 112 | + - [CLI](/azure/virtual-machines/linux/move-vm) |
| 113 | + - [PowerShell](/azure/virtual-machines/windows/move-vm) |
| 114 | +- [Azure Resource Mover](/resource-mover/tutorial-move-region-virtual-machines) |
| 115 | +- [Move Azure VMs to availability zones](/azure/site-recovery/move-azure-vms-avset-azone) |
| 116 | +- [Move region maintenance configuration resources](/azure/virtual-machines/move-region-maintenance-configuration-resources) |
| 117 | + |
| 118 | + |
| 119 | + |
| 120 | + |
| 121 | +## Disaster recovery: cross-region failover |
| 122 | + |
| 123 | +Azure can protect against regional or large-geography disasters by using another region for disaster recovery. For more information on Azure disaster recovery architecture, see [Azure to Azure disaster recovery architecture](/azure/site-recovery/azure-to-azure-architecture). |
| 124 | + |
| 125 | +Customers can use Cross Region to restore Azure VMs via paired regions. You can restore all the Azure VMs for the selected recovery point if the backup is done in the secondary region. For more details on Cross Region restore, refer to the Cross Region table row entry in our [restore options](/azure/backup/backup-azure-arm-restore-vms#restore-options). |
| 126 | + |
| 127 | + |
| 128 | +### Cross-region disaster recovery in multi-region geography |
| 129 | + |
| 130 | +While Microsoft is working diligently to restore the virtual machine service for region-wide service disruptions, customers will have to rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on [Data strategies for disaster recovery](/azure/architecture/reliability/disaster-recovery#disaster-recovery-plan). |
| 131 | + |
| 132 | +#### Outage detection, notification, and management |
| 133 | + |
| 134 | +The hardware or the physical infrastructure for the virtual machine can fail unexpectedly. Such failures include local network failures, local disk failures, and other rack-level failures. When a failure is detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same datacenter. During the healing procedure, virtual machines experience downtime (reboot) and, in some cases, loss of the temporary drive. The attached OS and data disks are always preserved. |
| 135 | + |
| 136 | +For more detailed information on virtual machine service disruptions, see [disaster recovery guidance](/azure/virtual-machines/virtual-machines-disaster-recovery-guidance). |
| 137 | + |
| 138 | +#### Set up disaster recovery and outage detection |
| 139 | + |
| 140 | +When setting up disaster recovery for virtual machines, understand what [Azure Site Recovery provides](/azure/site-recovery/site-recovery-overview#what-does-site-recovery-provide). Enable disaster recovery for virtual machines by using the following methods: |
| 141 | + |
| 142 | +- Set up disaster recovery to a [secondary Azure region for an Azure VM](/azure/site-recovery/azure-to-azure-quickstart) |
| 143 | +- Create a Recovery Services vault |
| 144 | + - [Bicep](/azure/site-recovery/quickstart-create-vault-bicep) |
| 145 | + - [ARM template](/azure/site-recovery/quickstart-create-vault-template) |
| 146 | +- Enable disaster recovery for [Linux virtual machines](/azure/virtual-machines/linux/tutorial-disaster-recovery) |
| 147 | +- Enable disaster recovery for [Windows virtual machines](/azure/virtual-machines/windows/tutorial-disaster-recovery) |
| 148 | +- Fail over virtual machines to [another region](/azure/site-recovery/azure-to-azure-tutorial-failover-failback) |
| 149 | +- Fail over virtual machines to the [primary region](/azure/site-recovery/azure-to-azure-tutorial-failback#fail-back-to-the-primary-region) |
| 150 | + |
| 151 | +### Single-region geography disaster recovery |
| 152 | + |
| 153 | + |
| 154 | +With disaster recovery set up, Azure VMs will continuously replicate to a different target region. If an outage occurs, you can fail over VMs to the secondary region, and access them from there. |
| 155 | + |
| 156 | +For more information, see [Azure VMs architectural components](/azure/site-recovery/azure-to-azure-architecture#architectural-components) and [region pairing](/azure/virtual-machines/regions#region-pairs). |
| 157 | + |
| 158 | +### Capacity and proactive disaster recovery resiliency |
| 159 | + |
| 160 | +Microsoft and its customers operate under the shared responsibility model. This means that for customer-enabled DR (customer-responsible services), the customer must address DR for any service they deploy and control. To ensure that recovery is proactive, customers should always pre-deploy secondaries, because there's no guarantee of capacity at the time of impact for those who haven't pre-allocated. |
| 161 | + |
| 162 | +For deploying virtual machines, customers can use [flexible orchestration](/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes#scale-sets-with-flexible-orchestration) mode on Virtual Machine Scale Sets. All VM sizes can be used with flexible orchestration mode. Flexible orchestration mode also offers high availability guarantees (up to 1000 VMs) by spreading VMs across fault domains in a region or within an Availability Zone. |
| 163 | + |
| 164 | +## Additional guidance |
| 165 | + |
| 166 | +- [Well-Architected Framework for virtual machines](/azure/architecture/framework/services/compute/virtual-machines/virtual-machines-review) |
| 167 | +- [Azure to Azure disaster recovery architecture](/azure/site-recovery/azure-to-azure-architecture) |
| 168 | +- [Accelerated networking with Azure VM disaster recovery](/azure-vm-disaster-recovery-with-accelerated-networking) |
| 169 | +- [ExpressRoute with Azure VM disaster recovery](/azure/site-recovery/azure-vm-disaster-recovery-with-expressroute) |
| 170 | +- [Virtual Machine Scale Sets](/azure/virtual-machine-scale-sets/) |
| 171 | + |
| 172 | +## Next steps |
| 173 | +> [!div class="nextstepaction"] |
| 174 | +> [Resiliency in Azure](/azure/availability-zones/overview.md) |