articles/reliability/concept-business-continuity-high-availability-disaster-recovery.md
Lines changed: 38 additions & 23 deletions
@@ -160,8 +160,8 @@ To achieve HA requirements, a workload can include a number of design elements.
160
160
161
161
Many Azure services are designed to be highly available, and can be used to build highly available workloads. Here are some examples:
162
162
163
-
- Azure Virtual Machine Scale Sets provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
164
-
- Azure App Service provides high availability through a variety of approaches, including automatically moving workers from an unhealthy node to a healthy node, and by providing capabilities for self-healing from many common fault types.
163
+
- [Azure Virtual Machine Scale Sets](/azure/reliability/reliability-virtual-machine-scale-sets) provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
164
+
- [Azure App Service](/azure/reliability/reliability-app-service) provides high availability through a variety of approaches, including automatically moving workers from an unhealthy node to a healthy node, and by providing capabilities for self-healing from many common fault types.
165
165
166
166
Use each [service reliability guide](./overview-reliability-guidance.md) to understand the capabilities of the service, decide which tiers to use, and determine which capabilities to include in your high availability strategy.
167
167
@@ -185,9 +185,9 @@ Redundancy can be achieved by distributing replicas or redundant instances in on
185
185
186
186
Here are some examples of how Azure services provide redundancy options:
187
187
188
-
- Azure App Service enables you to run multiple instances of your application, to ensure that the application remains available even if one instance fails. If you enable zone redundancy, those instances are spread across multiple availability zones in the Azure region you use.
189
-
- Azure Storage provides high availability by automatically replicating data at least three times. You can distribute those replicas across availability zones by enabling zone-redundant storage (ZRS), and in many regions you can also replicate your storage data across regions by using geo-redundant storage (GRS).
190
-
- Azure SQL Database has multiple replicas to ensure that the data remains available even if one replica fails.
188
+
- [Azure App Service](/azure/reliability/reliability-app-service) enables you to run multiple instances of your application, to ensure that the application remains available even if one instance fails. If you enable zone redundancy, those instances are spread across multiple availability zones in the Azure region you use.
189
+
- [Azure Storage](/azure/storage/common/storage-disaster-recovery-guidance) provides high availability by automatically replicating data at least three times. You can distribute those replicas across availability zones by enabling zone-redundant storage (ZRS), and in many regions you can also replicate your storage data across regions by using geo-redundant storage (GRS).
190
+
- [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) has multiple replicas to ensure that the data remains available even if one replica fails.
191
191
192
192
To learn more about redundancy, see [Recommendations for designing for redundancy](/azure/well-architected/reliability/redundancy) and [Recommendations for using availability zones and regions](/azure/well-architected/reliability/regions-availability-zones).
193
193
@@ -197,41 +197,51 @@ Scalability and elasticity are the abilities of a system to handle increased loa
197
197
198
198
Many Azure services support scalability. Here are some examples:
199
199
200
-
- Azure Virtual Machine Scale Sets, Azure API Management, and several other services support Azure Monitor autoscale, which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
201
-
- Azure Functions can dynamically provision instances to serve your requests.
202
-
- Azure Cosmos DB supports autoscale throughput, where the service can automatically manage the resources assigned to your databases based on policies you specify.
200
+
- [Azure Virtual Machine Scale Sets](/azure/virtual-machine-scale-sets/overview), [Azure API Management](/azure/api-management/api-management-key-concepts), and several other services support Azure Monitor autoscale, which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
201
+
- [Azure Functions](/azure/azure-functions/functions-overview) can dynamically provision instances to serve your requests.
202
+
- [Azure Cosmos DB](/azure/cosmos-db/introduction) supports autoscale throughput, where the service can automatically manage the resources assigned to your databases based on policies you specify.
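As a rough illustration of a policy like "when my CPU consistently goes above 80%, add another instance", the following Python sketch evaluates a sliding window of CPU samples. It isn't the Azure Monitor autoscale API. The `ScaleOutRule` class, its parameters, and the sample values are all hypothetical, and "consistently" is assumed to mean "every sample in the window exceeds the threshold".

```python
from collections import deque


class ScaleOutRule:
    """Hypothetical rule: add one instance when CPU stays above a threshold."""

    def __init__(self, threshold_percent=80.0, window_size=5, max_instances=10):
        self.threshold_percent = threshold_percent
        self.max_instances = max_instances
        # Keep only the most recent samples; older ones drop off automatically.
        self.samples = deque(maxlen=window_size)

    def observe(self, cpu_percent, current_instances):
        """Record one CPU sample and return the desired instance count."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        sustained = window_full and all(
            sample > self.threshold_percent for sample in self.samples
        )
        if sustained and current_instances < self.max_instances:
            # Clear the window so the next scale-out needs fresh evidence.
            self.samples.clear()
            return current_instances + 1
        return current_instances


rule = ScaleOutRule(threshold_percent=80.0, window_size=3)
instances = 2
for cpu in [85, 90, 70, 88, 91, 95]:
    instances = rule.observe(cpu, instances)
print(instances)  # 3: only the last three samples are all above 80%
```

Requiring a full window of high samples avoids reacting to a single spike, and clearing the window after scaling acts as a simple cooldown.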
203
203
204
204
Scalability is a key factor to consider during partial or complete malfunction. If a replica or compute instance is unavailable, the remaining components might need to bear more load to handle the load that was previously being handled by the faulted node. Consider *overprovisioning* if your system can't scale quickly enough to handle your expected changes in load.
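The overprovisioning idea can be sketched with simple arithmetic: run enough instances that the ones remaining after a failure can still carry peak load. The function name, parameters, and figures below are hypothetical, and the sketch assumes load is spread evenly across identical instances.

```python
import math


def instances_required(peak_load_rps, capacity_per_instance_rps, tolerated_failures=1):
    """Return how many instances to run so that survivors can carry peak load."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    # Instances needed to serve peak load when nothing has failed.
    needed_for_load = math.ceil(peak_load_rps / capacity_per_instance_rps)
    # Extra instances so the system still copes after the tolerated failures.
    return needed_for_load + tolerated_failures


# 900 requests/second at peak, 250 requests/second per instance,
# and one instance allowed to fail.
print(instances_required(900, 250, tolerated_failures=1))  # 5
```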
205
205
206
206
For more information on how to design a scalable and elastic system, see [Recommendations for designing a reliable scaling strategy](/azure/well-architected/reliability/scaling).
207
207
208
208
#### Zero-downtime deployment techniques
209
209
210
-
Zero-downtime deployments enable you to deploy updates and make configuration changes without requiring downtime. Deployments and other changes introduce significant risk of downtime. Achieving high availability requires that you deploy in a controlled way, such as by updating a subset of your resources at a time, controlling the amount of traffic that reaches the new deployment, monitoring for any impact to your users, and rapidly remediating the issue or rolling back to a previous known-good deployment. To learn more about zero-downtime deployment techniques, see [Safe deployment practices](/devops/operate/safe-deployment-practices).
210
+
Deployments and other system changes usually introduce a significant risk of downtime. Because that risk works against high availability requirements, it's important to use zero-downtime deployment practices, which let you make updates and configuration changes without taking the system offline.
211
+
212
+
Zero-downtime deployment techniques can include:
213
+
214
+
- Updating a subset of your resources at a time.
215
+
- Controlling the amount of traffic that reaches the new deployment.
216
+
- Monitoring for any impact to your users.
217
+
- Rapidly remediating any issues that arise.
218
+
- Rolling back to a previous known-good deployment.
219
+
220
+
To learn more about zero-downtime deployment techniques, see [Safe deployment practices](/devops/operate/safe-deployment-practices).
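The techniques in the list above can be combined into a staged rollout: shift traffic to the new deployment in steps, check the impact at each step, and roll back if the error rate is too high. The following Python sketch is a simplified simulation, not an Azure API. The `staged_rollout` function, the traffic percentages, and the 1% error threshold are all assumptions for illustration.

```python
def staged_rollout(traffic_steps, error_rate_at, max_error_rate=0.01):
    """Shift traffic to a new deployment step by step.

    traffic_steps: increasing percentages of traffic sent to the new version.
    error_rate_at: callable that returns the observed error rate (0.0 to 1.0)
                   while the new version serves the given percentage.
    Returns a (status, final_percentage) tuple.
    """
    if not traffic_steps:
        raise ValueError("traffic_steps must not be empty")
    for percent in traffic_steps:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            # Impact detected: send all traffic back to the known-good version.
            return ("rolled_back", 0)
    return ("completed", traffic_steps[-1])


steps = [5, 25, 50, 100]

# A healthy release keeps a low error rate at every step.
print(staged_rollout(steps, lambda percent: 0.001))

# A faulty release starts failing once it serves half of the traffic.
print(staged_rollout(steps, lambda percent: 0.05 if percent >= 50 else 0.001))
```

In a real system, `error_rate_at` would be replaced by a query against your monitoring data after each step has run for a meaningful period.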
211
221
212
222
Azure itself uses zero-downtime deployment approaches for our own services. When you build your own applications, you can adopt zero-downtime deployments through a variety of approaches, such as:
213
223
214
-
- Azure Container Apps provides multiple revisions of your application, which can be used to achieve zero-downtime deployments.
215
-
- Azure Kubernetes Service (AKS) supports a variety of zero-downtime deployment techniques.
224
+
- [Azure Container Apps](/azure/container-apps/overview) provides multiple revisions of your application, which can be used to achieve zero-downtime deployments.
225
+
- [Azure Kubernetes Service](/azure/aks/what-is-aks) (AKS) supports a variety of zero-downtime deployment techniques.
216
226
217
-
While zero-downtime deployments are often associated with application deployments, they should also be used for configuration changes too. Here are some ways you can apply configuration changes safely:
227
+
While zero-downtime deployments are often associated with application deployments, they should also be used for configuration changes. Here are some ways you can apply configuration changes safely:
218
228
219
-
- Azure Storage enables you to change your storage account access keys in multiple stages, which prevents downtime during key rotation operations.
220
-
- Azure App Configuration provides feature flags, snapshots, and other capabilities to help you to control how configuration changes are applied.
229
+
- [Azure Storage](/azure/storage/common/storage-introduction) enables you to change your storage account access keys in multiple stages, which prevents downtime during key rotation operations.
230
+
- [Azure App Configuration](/azure/azure-app-configuration/overview) provides feature flags, snapshots, and other capabilities to help you control how configuration changes are applied.
221
231
222
-
If you decide not to implement zero-downtime deployments, define *maintenance windows* so you can make system changes at a time your users expect.
232
+
If you decide not to implement zero-downtime deployments, make sure that you define *maintenance windows* so that you can make system changes at a time when your users expect it.
223
233
224
234
#### Automated testing
225
235
226
-
It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. *Chaos engineering* involves testing your solution's ability to automatically tolerate or recover from a variety of fault types. Chaos engineering is critical for mature organizations with stringent standards for HA. Azure Chaos Studio is a chaos engineering tool that can simulate some common fault types.
236
+
It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. Testing your solution's ability to automatically tolerate or recover from a variety of fault types is called *chaos engineering*. Chaos engineering is critical for mature organizations with stringent standards for HA. [Azure Chaos Studio](/azure/chaos-studio/chaos-studio-overview) is a chaos engineering tool that can simulate some common fault types.
227
237
228
238
To learn more, see [Recommendations for designing a reliability testing strategy](/azure/well-architected/reliability/testing-strategy).
229
239
230
240
#### Monitoring and alerting
231
241
232
-
Monitoring lets you know the health of your system, even when automated mitigations take place. Monitoring is critical to understand how your solution is behaving, and to watch for early signals of failures like increased error rates or high resource consumption. Alerts enable you to be proactively notified of important changes in your environment.
242
+
Monitoring lets you know the health of your system, even when automated mitigations take place. Monitoring is critical for understanding how your solution is behaving, and for watching for early signals of failures like increased error rates or high resource consumption. With alerts, you can be proactively notified of important changes in your environment.
233
243
234
-
Use Azure Service Health, Azure Resource Health, and Azure Monitor, as well as Scheduled Events for virtual machines.
244
+
Use [Azure Service Health](/azure/service-health/overview), [Azure Resource Health](/azure/service-health/resource-health-overview), and [Azure Monitor](/azure/azure-monitor/overview), as well as [Scheduled Events](/azure/virtual-machines/windows/scheduled-event-service) for virtual machines.
235
245
236
246
For more information, see [Recommendations for designing a reliable monitoring and alerting strategy](/azure/well-architected/reliability/monitoring-alerting-strategy).
237
247
@@ -284,25 +294,30 @@ It's also important to consider *failback*, which is the process by which you re
284
294
285
295
#### Backups
286
296
287
-
Backups involve taking a copy of your data and storing it safely for a defined period of time. Backups help you to recover from disasters when you can't automatically fail over to another replica, or where data corruption has occurred. Restoring from a backup is usually a last resort, because it involves data loss. A disaster recovery plan should specify the sequence of steps and recovery attempts that must take place before restoring from a backup.
297
+
Backups involve taking a copy of your data and storing it safely for a defined period of time. With backups, you can recover from disasters when automatic failover to another replica isn't possible, or when data corruption has occurred.
298
+
299
+
When you use backups as part of a disaster recovery plan, it's important to consider the following:
300
+
301
+
- *Data loss*. Because backups are typically taken infrequently, backup restoration usually involves data loss. For this reason, backup recovery should be used as a last resort, and a disaster recovery plan should specify the sequence of steps and recovery attempts that must take place *before* restoring from a backup.
302
+
303
+
- *RPO alignment*. It's important to make sure that the workload RPO is aligned with the backup interval. Also, because backup restoration often takes time, it's critical to test your backups and restoration processes to verify their integrity and understand how long the restoration process takes.
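To make the two considerations above concrete, the following Python sketch checks a backup schedule against recovery objectives. It assumes that the worst-case data loss from periodic backups is roughly one backup interval, and it ignores the time a backup takes to complete. The function names and the durations are hypothetical.

```python
from datetime import timedelta


def backup_meets_rpo(backup_interval, rpo):
    """Return True if worst-case data loss (one interval) fits within the RPO."""
    return backup_interval <= rpo


def restore_meets_rto(measured_restore_time, rto):
    """Return True if a restore time measured in a test fits within the RTO."""
    return measured_restore_time <= rto


rpo = timedelta(hours=4)
rto = timedelta(hours=2)

# Daily backups can lose up to about 24 hours of data, which exceeds a 4-hour RPO.
print(backup_meets_rpo(timedelta(hours=24), rpo))  # False

# Hourly backups keep worst-case loss within the RPO.
print(backup_meets_rpo(timedelta(hours=1), rpo))  # True

# A restore drill that took 90 minutes fits within a 2-hour RTO.
print(restore_meets_rto(timedelta(minutes=90), rto))  # True
```

The restore time passed to `restore_meets_rto` should come from an actual restoration test rather than an estimate.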
288
304
289
305
Many Azure data and storage services support backups, such as the following:
290
306
291
307
- [Azure Backup](/azure/reliability/reliability-backup) provides automated backups for virtual machine disks, storage accounts, AKS, and a variety of other sources.
292
308
- Many Azure database services, including [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) and [Azure Cosmos DB](/azure/reliability/reliability-cosmos-db-nosql), have an automated backup capability for your databases.
293
309
- [Azure Key Vault](/azure/key-vault/general/disaster-recovery-guidance) provides features to back up your secrets, certificates, and keys.
294
310
295
-
Because backups are typically taken infrequently, restoring a backup can result in some data loss. Ensure that the RPO of the workload is aligned with the backup interval. Also, restoring a backup often takes time. It's important to test your backups and restoration processes so you verify their integrity and understand how long the restoration process takes.
296
311
297
312
#### Automated deployments
298
313
299
-
Infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration files, should be part of a disaster recovery strategy. If you need to deploy new resources to respond to a disaster, then using IaC can help you to rapidly deploy and configure those resources based on your requirements, and reduce your RTO compared to manually deploying and configuring resources.
314
+
To rapidly deploy and configure required resources in the event of a disaster, use infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration files. Using IaC reduces your RTO and the potential for error, compared to manually deploying and configuring resources.
300
315
301
316
#### Testing and drills
302
317
303
-
It's critical that you routinely validate and test your DR plans, and your wider reliability strategy. If you haven't tested your recovery processes then there can be major problems when you need to use them in a real disaster.
318
+
It's critical to routinely validate and test your DR plans, as well as your wider reliability strategy. If you haven't tested your recovery processes in a disaster simulation, you're more likely to face major problems when using them in an actual disaster.
304
319
305
-
By testing your DR plans, you also help to validate that your RTO is feasible based on the processes you need to follow.
320
+
Also, by testing your DR plans and required processes, you can validate the feasibility of your RTO.
306
321
307
322
To learn more, see [Recommendations for designing a reliability testing strategy](/azure/well-architected/reliability/testing-strategy).