Merge pull request #291930 from anaharris-ms/rh-shared-responsibility

JamesJBarnett · web-flow · commit 3f17e68b33b3 · 2024-12-14T15:30:31.000-07:00
Reliability Hub - Add shared responsibility article
diff --git a/articles/reliability/TOC.yml b/articles/reliability/TOC.yml
@@ -2,6 +2,12 @@
   href: index.yml
 - name: What is reliability?
   href: overview.md
+- name: Reliability fundamentals
+  items:
+  - name: Shared responsibility for resiliency
+    href: concept-shared-responsibility.md
+  - name: Azure service incident response
+    href: incident-response.md
 - name: Availability zone support
   items:
   - name: What are Azure availability zones?
@@ -444,8 +450,6 @@
     href: /azure/well-architected/resiliency/chaos-engineering
   - name: Reliability in Microsoft Azure Well-Architected Framework
     href: /azure/well-architected/reliability
-  - name: Azure service incident response
-    href: ./incident-response.md
   - name:  Azure Service Manager retirement
     items:
     - name: Overview
diff --git a/articles/reliability/availability-zones-overview.md b/articles/reliability/availability-zones-overview.md
@@ -74,7 +74,7 @@ Many regions also have a [*paired region*](./cross-region-replication-azure.md#a
 
 ## Shared responsibility model
 
-The [shared responsibility model](/azure/security/fundamentals/shared-responsibility) describes how responsibilities are divided between the cloud provider (Microsoft) and you. Depending on the type of services you use, you might take on more or less responsibility for operating the service.
+The [shared responsibility model](./concept-shared-responsibility.md) describes how responsibilities are divided between the cloud provider (Microsoft) and you. Depending on the type of services you use, you might take on more or less responsibility for operating the service.
 
 Microsoft provides availability zones and regions to give you flexibility in how you design your solution to meet your requirements. When you use managed services, Microsoft takes on more of the management responsibilities for your resources, which might even include data replication, failover, failback, and other tasks related to operating a distributed system.
 
diff --git a/articles/reliability/business-continuity-management-program.md b/articles/reliability/business-continuity-management-program.md
@@ -39,7 +39,7 @@ A good example of the shared responsibility model is the deployment of virtual m
 
 Customer-enabled disaster recovery services all have public-facing documentation to guide you. For an example of public-facing documentation for customer-enabled disaster recovery, see [Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-disaster-recovery.md).
 
-For more information on the shared responsibility model, see [Microsoft Trust Center](../security/fundamentals/shared-responsibility.md).
+For more information, see [Shared responsibility for resiliency](./concept-shared-responsibility.md).
 
 ## Business continuity compliance: Service-level responsibility
 
diff --git a/articles/reliability/concept-shared-responsibility.md b/articles/reliability/concept-shared-responsibility.md
@@ -0,0 +1,78 @@
+---
+title: Shared responsibility for resiliency
+description: Learn about the shared responsibility model for resiliency in the Azure cloud platform.
+ms.service: azure
+ms.subservice: azure-availability-zones
+ms.topic: conceptual
+ms.date: 12/14/2024
+ms.author: anaharris
+author: anaharris-ms
+ms.custom: subject-reliability
+---
+
+# Shared responsibility for resiliency
+
+In the Azure public cloud platform, resiliency is a shared responsibility between Microsoft and you. Because there are different levels of resiliency in each workload that you design and deploy, it's important that you understand who has primary responsibility for each one of those levels from a resiliency perspective.
+
+To help you better understand how shared responsibility works, especially when confronting an outage or disaster, this article describes the shared responsibility *model* for resiliency. For more information on how to actually use this model to plan for disaster recovery, see [Recommendations for designing a disaster recovery strategy](/azure/well-architected/reliability/disaster-recovery).
+
+## Shared responsibility model for resiliency
+
+The shared responsibility model for resiliency consists of three levels:
+
+- [Core platform reliability](#core-platform-reliability). The Azure platform provides a base level of reliability for all customers and all services through the underlying infrastructure, services, and processes.
+- [Resilience-enhancing capabilities](#resilience-enhancing-capabilities). Azure offers a suite of built-in features and services that enhance resiliency, such as using availability zones, deploying across multiple regions, and implementing backup strategies. While Azure provides these capabilities, it's your responsibility to evaluate and configure them to align with your specific requirements. Requirements can include reliability, cost, performance, and compliance with regulatory standards.
+- [Applications](#applications). To make effective use of the other levels, your application and workload must be designed for resiliency.
+
+:::image type="content" source="media/shared-responsibility/shared-responsibility-model.jpg" alt-text="Diagram showing shared responsibility model for resiliency: Core platform reliability, resilience-enhancing capabilities, and applications." border="false":::
+
+Microsoft is solely responsible for core platform reliability. Microsoft is also responsible for providing resilience-enhancing capabilities that you can use. You're responsible for selecting and using the appropriate components.
+
+The service category you choose (SaaS, PaaS, or IaaS) determines the kinds of decisions you need to make. For example, if you use a SaaS service, you typically don't need to opt into using availability zones. If you use PaaS services for your data tier, you might have automated capabilities for backup available to you. If you use IaaS services, you typically need to plan and implement many resiliency capabilities yourself.
+
+> [!NOTE]
+> Service categories (SaaS, PaaS, and IaaS) are useful as a broad grouping of services, but it's important to understand your responsibilities for each individual service you use.
+>
+> The [reliability guides](./overview-reliability-guidance.md) provide an overview of how each service works from a resiliency perspective, and help you to make informed decisions about how to configure your services to meet your needs.
+
+You're also responsible for your application and workload design, and for defining your reliability requirements, which helps you to decide how to design and configure your solution.
+
+### Core platform reliability
+
+The Microsoft cloud platform consists of a large amount of infrastructure, hardware, software, and processes to support service deployment and management. Each component is designed to be highly resilient, with multiple redundancies for hardware and with research-based software processes. Together, these components comprise the core platform reliability level. Some examples of how Microsoft provides a reliable platform include the following:
+
+- Networks have redundant links and can dynamically bypass faulty segments.
+- Within each region, datacenters are connected through a low-latency network, which enables a variety of data replication approaches.
+- Datacenter facilities have redundant power, cooling, and network connections. They're operated by onsite teams who secure, monitor, and manage them.
+- Hardware, including clusters and racks, has redundancy at multiple layers.
+- Updates to compute clusters, racks, and hosts follow a controlled process. We use techniques like hotpatching to reduce or eliminate impact to hosts.
+- Software platform updates and configuration changes are applied by following our safe deployment practices.
+- Microsoft audits critical external suppliers to ensure that a third-party outage doesn't disrupt Azure services.
+- Each Azure service must have a detailed disaster recovery plan. We conduct full-region down drills in regions that match production environments.
+
+All Azure services benefit from these core platform reliability capabilities, and from the ongoing improvements Microsoft makes.
+
+### Resilience-enhancing capabilities
+
+Azure provides many different resilience-enhancing capabilities. Although Microsoft is responsible for providing these capabilities, you're entirely responsible for selecting and using the appropriate ones for your needs. Some examples of these capabilities include:
+
+- **Regions.** Azure has over 60 regions, and you can use multiple regions in a single solution to achieve geo-redundancy, meet your data residency needs, and enable low-latency communication to users globally.
+
+- **Availability zones.** Many Azure regions support availability zones, which enable you to distribute your workloads across multiple independent sets of datacenters. Azure services support availability zones in a way that suits their intended purpose, usually by supporting zonal deployments (pinned to a single zone) and/or zone-redundant deployments (spread across multiple zones). To learn more about availability zones, see [What are availability zones?](./availability-zones-overview.md).
+
+- **Service tiers.** Services provide a range of offerings and tiers that suit different requirements. For example, when you create a virtual machine, you can choose between a standard disk, which provides a low-cost option, or a premium disk to achieve a higher level of availability.
+
+- **Backups.** Many Azure services that store data support backups, which might be automatic, manual, or both. With backups, you can protect your workload against outages as well as data corruption and other data loss events.
+
+- **Governance.** Platform capabilities like Azure Policy, role-based access control, and Microsoft Entra ID identity protection can be configured to enforce your organization's requirements consistently. With these approaches, you can protect your workloads against security incidents and accidental changes that might cause downtime or other problems with your workload.
+
+> [!IMPORTANT]
+> It's important to understand the *service level agreements* (SLAs) for each Azure service. SLAs provide important information on the expected uptime of the service, and any conditions you need to meet to be eligible for the SLA. For SLAs for each service, see [Service Level Agreements (SLA) for Online Services](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services).
+
+### Applications
+
+It's your responsibility to make sure that your applications are designed to be resilient. Use the [Azure Well-Architected Framework](/azure/well-architected) pillars to drive architectural excellence at the fundamental level of a workload. The [reliability pillar](/azure/well-architected/reliability/) focuses on how you can make your workload and applications resilient to different types of failures, and how you can enable recovery when failures occur.
+
+## Next steps
+
+The shared responsibility model applies to other parts of your solution beyond resiliency. For more information on the shared responsibility model for security, see [Shared responsibility in the cloud](../security/fundamentals/shared-responsibility.md).
diff --git a/articles/reliability/cross-region-replication-azure.md b/articles/reliability/cross-region-replication-azure.md
@@ -22,7 +22,7 @@ Some Azure services support cross-region replication to ensure business continui
 
 ## Shared responsibility
 
-Not all Azure services automatically replicate data or automatically fall back from a failed region to cross-replicate to another enabled region. In these scenarios, you are responsible for recovery and replication. These examples are illustrations of the *shared responsibility model*. It's a fundamental pillar in your disaster recovery strategy. For more information about the shared responsibility model and to learn about business continuity and disaster recovery in Azure, see [Business continuity management in Azure](business-continuity-management-program.md).
+Not all Azure services automatically replicate data or automatically fall back from a failed region to cross-replicate to another enabled region. In these scenarios, you are responsible for recovery and replication. These examples are illustrations of the *shared responsibility model*. It's a fundamental pillar in your disaster recovery strategy. For more information, see [Shared responsibility for resiliency](./concept-shared-responsibility.md).
 
 Shared responsibility becomes the crux of your strategic decision-making when it comes to disaster recovery. Azure doesn't require you to use cross-region replication, and you can use services to build resiliency without cross-replicating to another enabled region. But we strongly recommend that you configure your essential services across regions to benefit from [isolation](../security/fundamentals/isolation-choices.md) and improve [availability](availability-zones-overview.md). 
 
diff --git a/articles/reliability/disaster-recovery-overview.md b/articles/reliability/disaster-recovery-overview.md
@@ -4,7 +4,7 @@ description: Disaster recovery overview for Microsoft Azure products and service
 author: anaharris-ms
 ms.service: azure
 ms.topic: conceptual
-ms.date: 08/25/2023
+ms.date: 12/06/2024
 ms.author: anaharris
 ms.custom: subject-reliability, subject-reliability
 ms.subservice: azure-reliability
@@ -30,8 +30,7 @@ Each major process or workload that an application implements should have separa
 
 ## Design for disaster recovery
 
-Disaster recovery isn't an automatic feature, but must be designed, built, and tested. To support a solid DR strategy, you must build an application with DR in mind from the ground up. Azure offers services, features, and guidance to help you support DR when you create apps.
-
+Disaster recovery isn't an automatic feature, but must be designed, built, and tested. To support a solid DR strategy, you must build an application with DR in mind from the ground up. Azure offers services, features, and guidance to help you support DR when you create apps. To understand what you need to do to support DR, you must first understand the shared responsibility model for resiliency. For more information, see [Shared responsibility for resiliency](./concept-shared-responsibility.md).
 
 
 
@@ -70,6 +69,8 @@ Most services that run on Azure platform as a service (PaaS) offerings like [Azu
 
 ## Next steps
 
+- [Shared responsibility for resiliency](./concept-shared-responsibility.md)
+
 - [Disaster recovery guidance by service](./disaster-recovery-guidance-overview.md)
 
 - [Cloud Adaption Framework for Azure - Business continuity and disaster recovery](/azure/cloud-adoption-framework/ready/landing-zone/design-area/management-business-continuity-disaster-recovery)
diff --git a/articles/reliability/includes/reliability-disaster-recovery-description-include.md b/articles/reliability/includes/reliability-disaster-recovery-description-include.md
@@ -1,6 +1,6 @@
 ---
- title: include file
- description: include file
+ title: Description of disaster recovery
+ description: Description of disaster recovery
  author: anaharris-ms
  ms.service: azure
  ms.topic: include
@@ -10,10 +10,10 @@
 ---
 
 
-Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR.  Before you begin to think about creating your disaster recovery plan, see [Recommendations for designing a disaster recovery strategy](/azure/well-architected/reliability/disaster-recovery). 
+Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see [Recommendations for designing a disaster recovery strategy](/azure/well-architected/reliability/disaster-recovery). 
 
 
-When it comes to DR, Microsoft uses the [shared responsibility model](../business-continuity-management-program.md#shared-responsibility-model). In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fall back from a failed region to cross-replicate to another enabled region. For those services, you are responsible for setting up a disaster recovery plan that works for your workload.  Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR and you can use [service-specific features to support fast recovery](../reliability-guidance-overview.md) to help develop your DR plan.
+When it comes to DR, Microsoft uses the [shared responsibility model](../concept-shared-responsibility.md). In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fall back from a failed region to cross-replicate to another enabled region. For those services, you're responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR and you can use [service-specific features to support fast recovery](../reliability-guidance-overview.md) to help develop your DR plan.
 
 
 
diff --git a/articles/reliability/media/shared-responsibility/shared-responsibility-model.jpg b/articles/reliability/media/shared-responsibility/shared-responsibility-model.jpg