Commit 5f9c570

Apply suggestions from code review
Co-authored-by: Anastasia Harris <[email protected]>
1 parent e6f5b70 commit 5f9c570

File tree

1 file changed

+38
-23
lines changed

articles/reliability/concept-business-continuity-high-availability-disaster-recovery.md

Lines changed: 38 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -160,8 +160,8 @@ To achieve HA requirements, a workload can include a number of design elements.
160160

161161
Many Azure services are designed to be highly available, and can be used to build highly available workloads. Here are some examples:
162162

163-
- Azure Virtual Machine Scale Sets provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
164-
- Azure App Service provides high availability through a variety of approaches, including automatically moving workers from an unhealthy node to a healthy node, and by providing capabilities for self-healing from many common fault types.
163+
- [Azure Virtual Machine Scale Sets](/azure/reliability/reliability-virtual-machine-scale-sets) provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
164+
- [Azure App Service](/azure/reliability/reliability-app-service) provides high availability through a variety of approaches, including automatically moving workers from an unhealthy node to a healthy node, and by providing capabilities for self-healing from many common fault types.
165165

166166
Use each [service reliability guide](./overview-reliability-guidance.md) to understand the capabilities of the service, decide which tiers to use, and determine which capabilities to include in your high availability strategy.
167167

@@ -185,9 +185,9 @@ Redundancy can be achieved by distributing replicas or redundant instances in on
185185

186186
Here are some examples of how Azure services provide redundancy options:
187187

188-
- Azure App Service enables you to run multiple instances of your application, to ensure that the application remains available even if one instance fails. If you enable zone redundancy, those instances are spread across multiple availability zones in the Azure region you use.
189-
- Azure Storage provides high availability by automatically replicating data at least three times. You can distribute those replicas across availability zones by enabling zone-redundant storage (ZRS), and in many regions you can also replicate your storage data across regions by using geo-redundant storage (GRS).
190-
- Azure SQL Database has multiple replicas to ensure that the data remains available even if one replica fails.
188+
- [Azure App Service](/azure/reliability/reliability-app-service) enables you to run multiple instances of your application, to ensure that the application remains available even if one instance fails. If you enable zone redundancy, those instances are spread across multiple availability zones in the Azure region you use.
189+
- [Azure Storage](/azure/storage/common/storage-disaster-recovery-guidance) provides high availability by automatically replicating data at least three times. You can distribute those replicas across availability zones by enabling zone-redundant storage (ZRS), and in many regions you can also replicate your storage data across regions by using geo-redundant storage (GRS).
190+
- [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) has multiple replicas to ensure that the data remains available even if one replica fails.
191191

192192
To learn more about redundancy, see [Recommendations for designing for redundancy](/azure/well-architected/reliability/redundancy) and [Recommendations for using availability zones and regions](/azure/well-architected/reliability/regions-availability-zones).
193193

@@ -197,41 +197,51 @@ Scalability and elasticity are the abilities of a system to handle increased loa
197197

198198
Many Azure services support scalability. Here are some examples:
199199

200-
- Azure Virtual Machine Scale Sets, Azure API Management, and several other services support Azure Monitor autoscale, which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
201-
- Azure Functions can dynamically provision instances to serve your requests.
202-
- Azure Cosmos DB supports autoscale throughput, where the service can automatically manage the resources assigned to your databases based on policies you specify.
200+
- [Azure Virtual Machine Scale Sets](/azure/virtual-machine-scale-sets/overview), [Azure API Management](/azure/api-management/api-management-key-concepts), and several other services support Azure Monitor autoscale, which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
201+
- [Azure Functions](/azure/azure-functions/functions-overview) can dynamically provision instances to serve your requests.
202+
- [Azure Cosmos DB](/azure/cosmos-db/introduction) supports autoscale throughput, where the service can automatically manage the resources assigned to your databases based on policies you specify.
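
As an illustration of the kind of policy quoted above ("when my CPU consistently goes above 80%, add another instance"), the following Python sketch evaluates a scale-out rule against recent CPU samples. It is not the Azure Monitor autoscale API: `ScaleOutRule`, `desired_instance_count`, the five-sample window, and the instance cap are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScaleOutRule:
    """Hypothetical rule shape; not an Azure Monitor type."""
    cpu_threshold_percent: float = 80.0
    sustained_samples: int = 5  # consecutive samples that count as "consistently"
    max_instances: int = 10     # assumed upper bound on the instance count


def desired_instance_count(
    rule: ScaleOutRule, cpu_samples: Sequence[float], current_instances: int
) -> int:
    """Return the instance count the rule asks for, given recent CPU samples."""
    recent = cpu_samples[-rule.sustained_samples:]
    if len(recent) < rule.sustained_samples:
        # Not enough data yet to call the load "consistent".
        return current_instances
    if all(sample > rule.cpu_threshold_percent for sample in recent):
        # Every recent sample is above the threshold: add one instance, up to the cap.
        return min(current_instances + 1, rule.max_instances)
    return current_instances
```

Requiring every sample in the window to exceed the threshold is one simple way to express "consistently"; a real policy might instead average over a time window and add a cool-down period between scale actions.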
203203

204204
Scalability is a key factor to consider during partial or complete malfunction. If a replica or compute instance is unavailable, the remaining components might need to bear more load to handle the load that was previously being handled by the faulted node. Consider *overprovisioning* if your system can't scale quickly enough to handle your expected changes in load.
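
To make the overprovisioning point concrete, here is a rough Python sketch of how utilization rises on the surviving instances when some instances fail. It assumes load is spread evenly and is fully redistributed, which real workloads only approximate; `utilization_after_failures` is a hypothetical helper written for this example.

```python
def utilization_after_failures(
    instances: int, utilization: float, failed: int
) -> float:
    """Estimate per-instance utilization (0.0-1.0) after `failed` instances are lost.

    Assumes total load stays constant and is redistributed evenly.
    """
    if instances <= 0 or not 0 <= failed < instances:
        raise ValueError("need at least one surviving instance")
    total_load = instances * utilization
    return total_load / (instances - failed)


# Four instances at 60% each: losing one pushes the other three to about 80%.
after = utilization_after_failures(instances=4, utilization=0.60, failed=1)
```

If the estimate exceeds what an instance can sustain and new instances can't be added quickly enough, that is a signal to run more instances than steady-state load requires.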
205205

206206
For more information on how to design a scalable and elastic system, see [Recommendations for designing a reliable scaling strategy](/azure/well-architected/reliability/scaling).
207207

208208
#### Zero-downtime deployment techniques
209209

210-
Zero-downtime deployments enable you to deploy updates and make configuration changes without requiring downtime. Deployments and other changes introduce significant risk of downtime. Achieving high availability requires that you deploy in a controlled way, such as by updating a subset of your resources at a time, controlling the amount of traffic that reaches the new deployment, monitoring for any impact to your users, and rapidly remediating the issue or rolling back to a previous known-good deployment. To learn more about zero-downtime deployment techniques, see [Safe deployment practices](/devops/operate/safe-deployment-practices).
210+
Deployments and other system changes usually introduce a significant risk of downtime. Because that risk works against high availability requirements, it's important to use zero-downtime deployment practices to make updates and configuration changes without requiring downtime.
211+
212+
Zero-downtime deployment techniques can include:
213+
214+
- Updating a subset of your resources at a time.
215+
- Controlling the amount of traffic that reaches the new deployment.
216+
- Monitoring for any impact to your users.
217+
- Rapidly remediating the issue.
218+
- Rolling back to a previous known-good deployment.
219+
220+
To learn more about zero-downtime deployment techniques, see [Safe deployment practices](/devops/operate/safe-deployment-practices).
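
The techniques listed above can be combined into a progressive rollout loop. The Python sketch below is only an illustration under stated assumptions: `progressive_rollout`, the traffic steps, and the 1% error budget are hypothetical, and `observe_error_rate` stands in for whatever monitoring signal your platform provides.

```python
from typing import Callable, List, Sequence, Tuple


def progressive_rollout(
    traffic_steps: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[bool, List[int]]:
    """Shift traffic to a new deployment in stages, rolling back on errors.

    Returns (succeeded, history), where history lists the traffic percentage
    sent to the new deployment at each step.
    """
    history: List[int] = []
    for percent in traffic_steps:
        history.append(percent)  # control how much traffic reaches the new version
        if observe_error_rate(percent) > max_error_rate:  # monitor for user impact
            history.append(0)    # roll back to the previous known-good deployment
            return False, history
    return True, history


# Example: the new version starts failing once it receives half of the traffic.
ok, steps = progressive_rollout(
    [5, 25, 50, 100], lambda percent: 0.05 if percent >= 50 else 0.001
)
```

In this example the rollout stops at the 50% step and returns traffic to the previous deployment, so most users never reach the faulty version.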
211221

212222
Azure itself uses zero-downtime deployment approaches for our own services. When you build your own applications, you can adopt zero-downtime deployments through a variety of approaches, such as:
213223

214-
- Azure Container Apps provides multiple revisions of your application, which can be used to achieve zero-downtime deployments.
215-
- Azure Kubernetes Service (AKS) supports a variety of zero-downtime deployment techniques.
224+
- [Azure Container Apps](/azure/container-apps/overview) provides multiple revisions of your application, which can be used to achieve zero-downtime deployments.
225+
- [Azure Kubernetes Service](/azure/aks/what-is-aks) (AKS) supports a variety of zero-downtime deployment techniques.
216226

217-
While zero-downtime deployments are often associated with application deployments, they should also be used for configuration changes too. Here are some ways you can apply configuration changes safely:
227+
While zero-downtime deployments are often associated with application deployments, they should also be used for configuration changes. Here are some ways you can apply configuration changes safely:
218228

219-
- Azure Storage enables you to change your storage account access keys in multiple stages, which prevents downtime during key rotation operations.
220-
- Azure App Configuration provides feature flags, snapshots, and other capabilities to help you to control how configuration changes are applied.
229+
- [Azure Storage](/azure/storage/common/storage-introduction) enables you to change your storage account access keys in multiple stages, which prevents downtime during key rotation operations.
230+
- [Azure App Configuration](/azure/azure-app-configuration/overview) provides feature flags, snapshots, and other capabilities to help you to control how configuration changes are applied.
221231

222-
If you decide not to implement zero-downtime deployments, define *maintenance windows* so you can make system changes at a time your users expect.
232+
If you decide not to implement zero-downtime deployments, make sure that you define *maintenance windows* so that you can make system changes at a time when your users expect it.
223233

224234
#### Automated testing
225235

226-
It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. *Chaos engineering* involves testing your solution's ability to automatically tolerate or recover from a variety of fault types. Chaos engineering is critical for mature organizations with stringent standards for HA. Azure Chaos Studio is a chaos engineering tool that can simulate some common fault types.
236+
It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. Testing your solution's ability to automatically tolerate or recover from a variety of fault types is called *chaos engineering*. Chaos engineering is critical for mature organizations with stringent standards for HA. [Azure Chaos Studio](/azure/chaos-studio/chaos-studio-overview) is a chaos engineering tool that can simulate some common fault types.
227237

228238
To learn more, see [Recommendations for designing a reliability testing strategy](/azure/well-architected/reliability/testing-strategy).
229239

230240
#### Monitoring and alerting
231241

232-
Monitoring lets you know the health of your system, even when automated mitigations take place. Monitoring is critical to understand how your solution is behaving, and to watch for early signals of failures like increased error rates or high resource consumption. Alerts enable you to be proactively notified of important changes in your environment.
242+
Monitoring lets you know the health of your system, even when automated mitigations take place. Monitoring is critical for understanding how your solution is behaving, and for watching for early signals of failures like increased error rates or high resource consumption. With alerts, you can be proactively notified of important changes in your environment.
233243

234-
Use Azure Service Health, Azure Resource Health, and Azure Monitor, as well as Scheduled Events for virtual machines.
244+
Use [Azure Service Health](/azure/service-health/overview), [Azure Resource Health](/azure/service-health/resource-health-overview), and [Azure Monitor](/azure/azure-monitor/overview), as well as [Scheduled Events](/azure/virtual-machines/windows/scheduled-event-service) for virtual machines.
235245

236246
For more information, see [Recommendations for designing a reliable monitoring and alerting strategy](/azure/well-architected/reliability/monitoring-alerting-strategy).
237247

@@ -284,25 +294,30 @@ It's also important to consider *failback*, which is the process by which you re
284294

285295
#### Backups
286296

287-
Backups involve taking a copy of your data and storing it safely for a defined period of time. Backups help you to recover from disasters when you can't automatically fail over to another replica, or where data corruption has occurred. Restoring from a backup is usually a last resort, because it involves data loss. A disaster recovery plan should specify the sequence of steps and recovery attempts that must take place before restoring from a backup.
297+
Backups involve taking a copy of your data and storing it safely for a defined period of time. With backups, you can recover from disasters when automatic failover to another replica isn't possible, or when data corruption has occurred.
298+
299+
When using backups as part of a disaster recovery plan, it's important to take the following into consideration:
300+
301+
- *Data loss*. Because backups are typically taken infrequently, backup restoration usually involves data loss. For this reason, backup recovery should be used as a last resort and a disaster recovery plan should specify the sequence of steps and recovery attempts that must take place *before* restoring from a backup.
302+
303+
- *RPO alignment*. It's important to make sure that the workload RPO is aligned with the backup interval. Also, because backup restoration often takes time, it's critical to test your backups and restoration processes to verify their integrity and understand how long the restoration process takes.
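
The two considerations above can be expressed as a simple check. This Python sketch assumes a simplified model in which the worst-case data loss equals one full backup interval (it ignores backup duration and any replication lag); `worst_case_data_loss` and `backup_plan_gaps` are hypothetical helpers, not part of any Azure SDK.

```python
from datetime import timedelta
from typing import List


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Simplifying assumption: a failure just before the next backup loses
    # everything written since the previous backup, that is, one full interval.
    return backup_interval


def backup_plan_gaps(
    backup_interval: timedelta,
    rpo: timedelta,
    measured_restore_time: timedelta,
    rto: timedelta,
) -> List[str]:
    """List the ways a backup plan misses its recovery objectives."""
    gaps: List[str] = []
    if worst_case_data_loss(backup_interval) > rpo:
        gaps.append("backup interval exceeds RPO")
    if measured_restore_time > rto:
        gaps.append("restore time exceeds RTO")
    return gaps


# Daily backups can't satisfy a 4-hour RPO, even if the restore fits the RTO.
gaps = backup_plan_gaps(
    backup_interval=timedelta(hours=24),
    rpo=timedelta(hours=4),
    measured_restore_time=timedelta(hours=2),
    rto=timedelta(hours=8),
)
```

The `measured_restore_time` value is meant to come from an actual restoration test rather than an estimate, which is why testing your restoration process matters.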
288304

289305
Many Azure data and storage services support backups, such as the following:
290306

291307
- [Azure Backup](/azure/reliability/reliability-backup) provides automated backups for virtual machine disks, storage accounts, AKS, and a variety of other sources.
292308
- Many Azure database services, including [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) and [Azure Cosmos DB](/azure/reliability/reliability-cosmos-db-nosql), have an automated backup capability for your databases.
293309
- [Azure Key Vault](/azure/key-vault/general/disaster-recovery-guidance) provides features to back up your secrets, certificates, and keys.
294310

295-
Because backups are typically taken infrequently, restoring a backup can result in some data loss. Ensure that the RPO of the workload is aligned with the backup interval. Also, restoring a backup often takes time. It's important to test your backups and restoration processes so you verify their integrity and understand how long the restoration process takes.
296311

297312
#### Automated deployments
298313

299-
Infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration files, should be part of a disaster recovery strategy. If you need to deploy new resources to respond to a disaster, then using IaC can help you to rapidly deploy and configure those resources based on your requirements, and reduce your RTO compared to manually deploying and configuring resources.
314+
To rapidly deploy and configure required resources in the event of a disaster, use infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration files. Using IaC reduces your RTO and the potential for error, compared to manually deploying and configuring resources.
300315

301316
#### Testing and drills
302317

303-
It's critical that you routinely validate and test your DR plans, and your wider reliability strategy. If you haven't tested your recovery processes then there can be major problems when you need to use them in a real disaster.
318+
It's critical to routinely validate and test your DR plans, as well as your wider reliability strategy. If you haven't tested your recovery processes in a disaster simulation, you're more likely to face major problems when using them in an actual disaster.
304319

305-
By testing your DR plans, you also help to validate that your RTO is feasible based on the processes you need to follow.
320+
Also, by testing your DR plans and required processes, you can validate the feasibility of your RTO.
306321

307322
To learn more, see [Recommendations for designing a reliability testing strategy](/azure/well-architected/reliability/testing-strategy).
308323
