Commit 71ee6b6

Updates
1 parent 5f9c570 commit 71ee6b6

1 file changed

+15
-17
lines changed

articles/reliability/concept-business-continuity-high-availability-disaster-recovery.md

Lines changed: 15 additions & 17 deletions
@@ -4,7 +4,7 @@ description: Understand business continuity, high availability, and disaster rec
 author: anaharris-ms
 ms.service: azure
 ms.topic: conceptual
-ms.date: 01/15/2025
+ms.date: 01/16/2025
 ms.author: anaharris
 ms.custom: subject-reliability
 ms.subservice: azure-reliability
@@ -160,7 +160,7 @@ To achieve HA requirements, a workload can include a number of design elements.

 Many Azure services are designed to be highly available, and can be used to build highly available workloads. Here are some examples:

-- [Azure Virtual Machine Scale Sets ](/azure/reliability/reliability-virtual-machine-scale-sets)provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
+- [Azure Virtual Machine Scale Sets](/azure/reliability/reliability-virtual-machine-scale-sets) provide high availability for virtual machines (VMs) by automatically creating and managing VM instances, and distributing those VM instances to reduce the impact of infrastructure failures.
 - [Azure App Service](/azure/reliability/reliability-app-service) provides high availability through a variety of approaches, including automatically moving workers from an unhealthy node to a healthy node, and by providing capabilities for self-healing from many common fault types.

 Use each [service reliability guide](./overview-reliability-guidance.md) to understand the capabilities of the service, decide which tiers to use, and determine which capabilities to include in your high availability strategy.
@@ -197,51 +197,50 @@ Scalability and elasticity are the abilities of a system to handle increased loa

 Many Azure services support scalability. Here are some examples:

-- [Azure Virtual Machine Scale Sets,](/azure/virtual-machine-scale-sets/overview) [Azure API Management](/azure/api-management/api-management-key-concepts), and several other services support Azure Monitor autoscale, which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
+- [Azure Virtual Machine Scale Sets](/azure/virtual-machine-scale-sets/overview), [Azure API Management](/azure/api-management/api-management-key-concepts), and several other services support [Azure Monitor autoscale](/azure/azure-monitor/autoscale/autoscale-overview), which enables you to specify policies like "when my CPU consistently goes above 80%, add another instance".
 - [Azure Functions](/azure/azure-functions/functions-overview) can dynamically provision instances to serve your requests.
--[ Azure Cosmos DB ](/azure/cosmos-db/introduction)supports autoscale throughput, where the service can automatically manage the resources assigned to your databases based on policies you specify.
+- [Azure Cosmos DB](/azure/cosmos-db/introduction) supports [autoscale throughput](/azure/cosmos-db/how-to-choose-offer), where the service can automatically manage the resources assigned to your databases based on policies you specify.

 Scalability is a key factor to consider during partial or complete malfunction. If a replica or compute instance is unavailable, the remaining components might need to bear more load to handle the load that was previously being handled by the faulted node. Consider *overprovisioning* if your system can't scale quickly enough to handle your expected changes in load.

 For more information on how to design a scalable and elastic system, see [Recommendations for designing a reliable scaling strategy](/azure/well-architected/reliability/scaling).

 #### Zero-downtime deployment techniques

-Deployments and other system changes usually introduce a significant risk of downtime. Because downtime risk is a challenge to high availability requirements, it's important to use zero-downtime deployment practices to make updates and configuration changes without any required downtime.
+Deployments and other system changes introduce a significant risk of downtime. Because downtime risk is a challenge to high availability requirements, it's important to use zero-downtime deployment practices to make updates and configuration changes without any required downtime.

 Zero-downtime deployment techniques can include:

 - Updating a subset of your resources at a time.
 - Controlling the amount of traffic that reaches the new deployment.
 - Monitoring for any impact to your users.
-- Rapidly remediating the issue.
-- Rolling back to a previous known-good deployment.
+- Rapidly remediating the issue, such as by rolling back to a previous known-good deployment.

 To learn more about zero-downtime deployment techniques, see [Safe deployment practices](/devops/operate/safe-deployment-practices).

 Azure itself uses zero-downtime deployment approaches for our own services. When you build your own applications, you can adopt zero-downtime deployments through a variety of approaches, such as:

-- [Azure Container Apps](/azure/container-apps/overview) provides multiple revisions of your application, which can be used to achieve zero-downtime deployments.
+- [Azure Container Apps](/azure/container-apps/overview) provides [multiple revisions of your application](/azure/container-apps/revisions), which can be used to achieve zero-downtime deployments.
 - [Azure Kubernetes Service](/azure/aks/what-is-aks) (AKS) supports a variety of zero-downtime deployment techniques.

 While zero-downtime deployments are often associated with application deployments, they should also be used for configuration changes. Here are some ways you can apply configuration changes safely:

-- [Azure Storage ](/azure/storage/common/storage-introduction)enables you to change your storage account access keys in multiple stages, which prevents downtime during key rotation operations.
-- [Azure App Configuration ](/azure/azure-app-configuration/overview)provides feature flags, snapshots, and other capabilities to help you to control how configuration changes are applied.
+- [Azure Storage](/azure/storage/common/storage-introduction) enables you to change your [storage account access keys](/azure/storage/common/storage-account-keys-manage) in multiple stages, which prevents downtime during key rotation operations.
+- [Azure App Configuration](/azure/azure-app-configuration/overview) provides [feature flags](/azure/azure-app-configuration/concept-feature-management), [snapshots](/azure/azure-app-configuration/concept-snapshots), and other capabilities to help you to control how configuration changes are applied.

 If you decide not to implement zero-downtime deployments, make sure that you define *maintenance windows* so that you can make system changes at a time when your users expect it.

 #### Automated testing

-It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. Testing your solution's ability to automatically tolerate or recover from a variety of fault types is called *chaos engineering*. Chaos engineering is critical for mature organizations with stringent standards for HA. [Azure Chaos Studio ](/azure/chaos-studio/chaos-studio-overview)is a chaos engineering tool that can simulate some common fault types.
+It's important to test your solution's ability to withstand the outages and failures that you consider to be in scope for HA. Many of these failures can be simulated in test environments. Testing your solution's ability to automatically tolerate or recover from a variety of fault types is called *chaos engineering*. Chaos engineering is critical for mature organizations with stringent standards for HA. [Azure Chaos Studio](/azure/chaos-studio/chaos-studio-overview) is a chaos engineering tool that can simulate some common fault types.

 To learn more, see [Recommendations for designing a reliability testing strategy](/azure/well-architected/reliability/testing-strategy).

 #### Monitoring and alerting

 Monitoring lets you know the health of your system, even when automated mitigations take place. Monitoring is critical for understanding how your solution is behaving, and to watch for early signals of failures like increased error rates or high resource consumption. With alerts, you can proactively receive important changes in your environment.

-Use [Azure Service Health](/azure/service-health/overview), [Azure Resource Health](/azure/service-health/resource-health-overview), and [Azure Monitor](/azure/azure-monitor/overview), as well as [Scheduled Events ](/azure/virtual-machines/windows/scheduled-event-service)for virtual machines.
+Use [Azure Service Health](/azure/service-health/overview), [Azure Resource Health](/azure/service-health/resource-health-overview), and [Azure Monitor](/azure/azure-monitor/overview), as well as [Scheduled Events](/azure/virtual-machines/windows/scheduled-event-service) for virtual machines.

 For more information, see [Recommendations for designing a reliable monitoring and alerting strategy](/azure/well-architected/reliability/monitoring-alerting-strategy).

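Reviewer illustration (not part of the commit): the scalability bullet in this hunk quotes an autoscale policy, "when my CPU consistently goes above 80%, add another instance". The Python sketch below shows one way to read "consistently": scale out only when every sample in a recent window exceeds the threshold. It does not call Azure Monitor autoscale; the class name `SustainedCpuRule`, the three-sample window, and the sample values are all hypothetical and assumed only for this example.

```python
from collections import deque


class SustainedCpuRule:
    """Hypothetical scale-out rule: fire only when CPU stays above a threshold."""

    def __init__(self, threshold_percent=80.0, window=3):
        self.threshold_percent = threshold_percent
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent):
        """Record one CPU sample; return True if an instance should be added."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        # "Consistently" is modeled as: every sample in a full window is high.
        return window_full and all(
            sample > self.threshold_percent for sample in self.samples
        )


rule = SustainedCpuRule(threshold_percent=80.0, window=3)
# A single dip to 70% resets the streak, so only the last sample triggers.
decisions = [rule.observe(cpu) for cpu in [85, 90, 70, 82, 88, 91]]
print(decisions)
```

Requiring a full window of high samples avoids adding instances on a brief spike, at the cost of reacting more slowly. The article's advice about overprovisioning applies when that delay is too long for your expected load changes.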
@@ -298,20 +297,19 @@ Backups involve taking a copy of your data and storing it safely for a defined p

 When using backups as part of a disaster recovery plan it's important to take the following into consideration:

-- *Data loss*. Because backups are typically taken infrequently, backup restoration usually involves data loss. For this reason, backup recovery should be used as a last resort and a disaster recovery plan should specify the sequence of steps and recovery attempts that must take place *before* restoring from a backup.
+- *Data loss*. Because backups are typically taken infrequently, backup restoration usually involves data loss. For this reason, backup recovery should be used as a last resort and a disaster recovery plan should specify the sequence of steps and recovery attempts that must take place *before* restoring from a backup. It's important to make sure that the workload RPO is aligned with the backup interval.

-- *RPO alignment*. It's important to make sure that the workload RPO is aligned with the backup interval. Also, because backup restoration often takes time, it's critical to test your backups and restoration processes to verify their integrity and understand how long the restoration process takes.
+- *Recovery time*. Because backup restoration often takes time, it's critical to test your backups and restoration processes to verify their integrity and understand how long the restoration process takes. Ensure the workload's RTO accounts for the time it takes to restore your backup.

 Many Azure data and storage services support backups, such as the following:

 - [Azure Backup](/azure/reliability/reliability-backup) provides automated backups for virtual machine disks, storage accounts, AKS, and a variety of other sources.
-- Many Azure database services, including [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) and [Azure Cosmos DB](/azure/reliability/reliability-cosmos-db-nosql) , have an automated backup capability for your databases.
+- Many Azure database services, including [Azure SQL Database](/azure/azure-sql/database/high-availability-sla-local-zone-redundancy) and [Azure Cosmos DB](/azure/reliability/reliability-cosmos-db-nosql), have an automated backup capability for your databases.
 - [Azure Key Vault](/azure/key-vault/general/disaster-recovery-guidance) provides features to back up your secrets, certificates, and keys.

-
 #### Automated deployments

-To rapidly deploy and configure required resources in the event of a disaster, use Infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration file. Using IaC reduces your RTO and potential for error, compared to manually deploying and configuring resources.
+To rapidly deploy and configure required resources in the event of a disaster, use Infrastructure as code (IaC) assets, such as Bicep files, ARM templates, or Terraform configuration file. Using IaC reduces your recovery time and potential for error, compared to manually deploying and configuring resources.

 #### Testing and drills

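Reviewer illustration (not part of the commit): the reworded backup bullets in the last hunk tie the backup interval to the workload RPO and the restore duration to the workload RTO. The Python sketch below expresses both checks, assuming that worst-case data loss is roughly one full backup interval and that the restore time was measured in a drill. The function name `check_backup_plan` and all durations are hypothetical.

```python
from datetime import timedelta


def check_backup_plan(backup_interval, measured_restore_time, rpo, rto):
    """Return a list of findings; an empty list means both targets are met.

    Assumption: a failure just before the next backup loses up to one
    full backup interval of data, so the interval must not exceed the RPO.
    """
    findings = []
    if backup_interval > rpo:
        findings.append("backup interval exceeds RPO")
    if measured_restore_time > rto:
        findings.append("restore time exceeds RTO")
    return findings


# Hypothetical workload: daily backups, 3-hour restore, 4-hour RPO and RTO.
findings = check_backup_plan(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(findings)
```

In this hypothetical case the restore fits within the RTO, but daily backups cannot meet a 4-hour RPO. That mismatch is the one the revised *Data loss* bullet asks you to check for.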