Commit 819317b

incorporated tech review edits
1 parent 0cffdb7 commit 819317b

1 file changed: 4 additions, 159 deletions

articles/lab-services/reliability-in-azure-lab-services.md

Lines changed: 4 additions & 159 deletions
@@ -3,55 +3,15 @@ title: Reliability in Azure Lab Services
 description: Learn about reliability in Azure Lab Services
 ms.topic: overview
 ms.custom: subject-resiliency
-ms.date: 07/12/2022
+ms.date: 08/11/2022
 ---

-<!--#Customer intent: As a < type of user >, I want to understand resiliency support for [TODO-service-name] so that I can respond to and/or avoid failures in order to minimize downtime and data loss. -->
-
-<!--
-
-Template for the main resiliency article for Azure services.
-Keep the required sections and add/modify any content for any information specific to your service.
-This article should be in your TOC, under a Resiliency node. The name of this page should be *resiliency-[TODO-service-name].md* and the TOC title should be "Resiliency in [TODO-service-name]".
-Keep the headings in this order.
-
-This template uses comment pseudo code to indicate where you must choose between two options or more.
-
-Conditions are used in this document in the following manner and can be easily searched for:
--->
-
-<!-- IF (AZ SUPPORTED) -->
-<!-- some text -->
-<!-- END IF (AZ SUPPORTED)-->
-
-<!-- BEGIN IF (SLA INCREASE) -->
-<!-- some text -->
-<!-- END IF (SLA INCREASE) -->
-
-<!-- IF (SERVICE IS ZONAL) -->
-<!-- some text -->
-<!-- END IF (SERVICE IS ZONAL) -->
-
-<!-- IF (SERVICE IS ZONE REDUNDANT) -->
-<!-- some text -->
-<!-- END IF (SERVICE IS ZONAL) -->
-
-<!--
-
-IMPORTANT:
-- Do a search and replace of TODO-service-name with the name of your service. That will make the template easier to read
-- ALL sections are required unless noted otherwise
-- MAKE SURE YOU REMOVE ALL COMMENTS BEFORE PUBLISH!!!!!!!!
-
--->
-
 # What is reliability in Azure Lab Services?

 Reliability is a system’s ability to recover from failures and continue to function. It’s not only about avoiding failures but also involves responding to failures in a way that minimizes downtime or data loss. Because failures can occur at various levels, it’s important to have protection for all types based on service reliability requirements. Reliability in Azure supports and advances capabilities that respond to outages in real time to ensure continuous service and data protection assurance for mission-critical applications that require near-zero downtime and high customer confidence.

-This article describes reliability support in Azure Lab Services, and covers <!-- IF (AZ SUPPORTED) --> both regional resiliency with availability zones and <!-- END IF (AZ SUPPORTED)--> cross-region resiliency with disaster recovery. For a more detailed overview of reliability in Azure, see [Azure resiliency](/azure/availability-zones/overview.md).
+This article describes reliability support in Azure Lab Services, and covers regional resiliency with availability zones. For a more detailed overview of reliability in Azure, see [Azure resiliency](/azure/availability-zones/overview.md).

-<!-- IF (AZ SUPPORTED) -->
 ## Availability zone support

 Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. In the case of a local zone failure, availability zones are designed so that if the one zone is affected, regional services, capacity, and high availability are supported by the remaining two zones. Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see [Regions and availability zones](/azure/availability-zones/az-overview.md).
@@ -62,46 +22,30 @@ Azure Lab Services provide zone redundancy for service infrastructure for specif

 Currently, the service is not zonal. That is, you can’t configure a lab or the VMs in the lab to align to a specific zone. A lab and VMs may be distributed across zones in a region.

-### Prerequisites
-
-N/A - single product SKU - remove?
-
-<!-- List any specific SKUs that are supported. If all are supported or if the service has only one default SKU, mention this. -->
-
-<!-- List regions that support availability zones, or regions that don't support availability zones (whichever is less) -->
-
-<!-- Indicate any workflows or scenarios that aren't supported or ones that are, whichever is less. Provide links to any relevant information. -->
-
 ### SLA improvements

 There are no increased SLAs for Azure Lab Services. For more information on the Azure Lab Services SLAs, see [SLA for Azure Lab Services](https://azure.microsoft.com/support/legal/sla/lab-services/v1_0/).

-#### Create a resource with availability zone enabled
-
-N/A - remove
-
-<!-- IF (SERVICE IS ZONE REDUNDANT) -->
 ### Zone down experience

 #### Azure Lab Services infrastructure

-Azure Lab Services is zone-redundant. The Azure Lab Services infrastructure uses Cosmos DB storage, which has redundancy enabled for the following regions:
+Azure Lab Services infrastructure is zone-redundant. The Azure Lab Services infrastructure uses Cosmos DB storage, which has redundancy enabled for the following regions:

 - Australia East
 - Canada Central
 - France Central
 - Korea Central
 - East Asia

-For these regions, resources are distributed across zones automatically. The storage region is the same as the region where the lab plan is located.
+Resources apart from the Lab resources and virtual machines are zone redundant in these regions. The storage region is the same as the region where the lab plan is located.

 In the event of a zone outage in these regions, you can still perform the following tasks:

 - Access the Azure Lab Services website
 - Create/manage lab plans
 - Create Users
 - Configure lab schedules
-- Configure lab policies
 - Create new labs and VMs in regions unaffected by the zone outage.

 Data loss may occur only with an unrecoverable disaster in the Cosmos DB region. For more information, see [Region Outages](/azure/cosmos-db/high-availability#region-outages).
@@ -126,21 +70,6 @@ As a result, the following operations are not guaranteed in the event of a zone
 If there's a zone outage in the region, there's no expectation that you can use any labs or VMs in the region.
 Labs and VMs in other regions will be unaffected by the outage.

-
-<!-- Select the scenario that best describes customer experience or combine/provide your own description:
-- During a zone-wide outage, no action is required during zone recovery, Offering will self-heal and re-balance itself to take advantage of the healthy zone automatically.
-
-- During a zone-wide outage, the customer should expect brief degradation of performance, until the service self-healing re-balances underlying capacity to adjust to healthy zones. This is not dependent on zone restoration; it is expected that the Microsoft-managed service self-healing state will compensate for a lost zone, leveraging capacity from other zones.
-
-- In a zone-wide outage scenario, users should experience no impact on provisioned resources in a zone-redundant deployment. During a zone-wide outage , customers should be prepared to experience brief interruption for communication to provisioned resources; typically, this is manifested by client receiving 409 error code; this prompts re-try logic with appropriate intervals. New requests will be directed to healthy nodes with zero impact on user. During zone-wide outages, users will be able to create new offering resources and successfully scale existing ones.
-
-The table may contain:
-
-- CRUD and Scale-out operations (Create Read Update Delete)
-- Application communication scenarios – data plane operations (for example, insert/update/delete for a database).
-
--->
-
 #### Zone outage preparation and recovery

 Lab and VM services will be restored as soon as the zone availability is restored (the outage is resolved).
@@ -151,90 +80,6 @@ If infrastructure is impacted, it will be restored when the zone availability is

 If you want to preserve access to Azure Lab Services infrastructure during a zone outage, create the lab plan in one of the zone-redundant regions listed above.

-<!-- To prepare for availability zone failure, customers should over-provision capacity of service to ensure that the solution can tolerate ⅓ loss of capacity and continue to function without degraded performance during zone-wide outages. Provide any information as to how customers should achieve this. -->
-
-### Safe deployment techniques
-
-N/A - application safe deployment is not relevant - remove?
-
-### Availability zone redeployment and migration
-
-N/A - no migration - remove?
-
-<!-- END IF (AZ SUPPORTED)-->
-
-## Disaster recovery: cross-region failover
-
-N/A - remove? And remove all sub-sections.
-
-### Cross-region disaster recovery in multi-region geography
-
-Microsoft is 100% responsible for outage detection, notifications, and support for outage scenarios.
-
-<!-- If (MICROSOFT 100% RESPONSIBLE) -->
-
-#### Outage detection, notification, and management
-<!--
-
-- Explain how Microsoft detects and handles outages for this offering.
-
-- Explain when the offering will be failed to another region (decision guidance).
-
-- If service is deployed geographically, explain how this works.
-
-- Specify whether the offering runs Active-Active with auto-failover or Active-Passive.
-
-- Explain how customer is notified or how customer can check service health.
--->
-
-<!--
-- Explain how customer storage is handled, how much data loss occurs and whether R/W or only R/O for 00:__ (duration).
-
-- If this single offering fails over, indicate whether it continues to support primary region or only secondary region.
-
-- Provide all other guidance of what the customer can expect in region loss scenario.
-
-- Describe service SLA and high availability.
-
-- Define RTO and RPO expectations.
-
-
-<!-- END IF (MICROSOFT 100% RESPONSIBLE) -->
-
-### Single-region geography disaster recovery
-
-<!--
-Explain how offering supports single-region geography and how it differs from other multi-regions geography (for example, if offering is in a multi-region geography, DR is Microsoft-responsible; if in a single-region geography, DR is customer-responsible.)
-
-If DR is the identical for single-region and multi-region geographies, state this explicitly. (for example, CEDR for both 3+1 and 3+0.)
-
-If single-region DR is customer-responsible, can both data plane and control plane be configured by customers or only data plane?
-
-Clarify customer implication when recovery classification is CEDR: Is customer losing data/features/functions when recovery classification is CEDR in region-down scenario?
-
-Specify SLA and availability consideration in this configuration?
-
-Specify RTO and RPO expectations in 3+0 scenario.
-
-Provide instructions on setup for cross-region/outside geography DR.
-
-Provide instructions to set up and execute DR using Portal, Azure CLI, PowerShell. Does documentation provide options to configure DR via portal, CLI, PowerShell?
-
-Provide instructions to test DR plan and failover to simulate disaster.
-
-Provide detailed instructions for customer to clean up DR setup to free up resources.
--->
-
-### Capacity and proactive disaster recovery resiliency
-
-<!-- Microsoft and its customers operate under the Shared responsibility model. This means that for customer-enabled DR (customer-responsible services), the customer must address DR for any service they deploy and control. To ensure that recovery is proactive, customers should always pre-deploy secondaries because there is no guarantee of capacity at time of impact for those who have not pre-allocated.
-
-In this section, provide details of customer knowledge required re: capacity planning and proactive deployments.-->
-
-### Additional guidance
-
-<!-- Provide any additional guidance here -->
-
 ## Next steps

 > [!div class="nextstepaction"]
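
The article text in this diff advises creating the lab plan in one of the listed zone-redundant regions to keep infrastructure access during a zone outage. As an illustration only, that check could be sketched in Python as below. This is not part of the commit and does not call any Azure SDK: the region set is copied by hand from the list in the diff (so it may go stale), and the names `ZONE_REDUNDANT_INFRA_REGIONS`, `normalize_region`, and `infrastructure_is_zone_redundant` are hypothetical.

```python
# Regions where, per the article text in this diff, the Azure Lab Services
# infrastructure is zone-redundant. Hard-coded for illustration only; this
# is an assumption copied from the article, not queried from Azure.
ZONE_REDUNDANT_INFRA_REGIONS = {
    "australiaeast",
    "canadacentral",
    "francecentral",
    "koreacentral",
    "eastasia",
}


def normalize_region(name: str) -> str:
    """Remove all whitespace and lower-case a region display name."""
    return "".join(name.split()).lower()


def infrastructure_is_zone_redundant(lab_plan_region: str) -> bool:
    """Hypothetical helper: True if the lab plan region is in the set above.

    This says nothing about labs or VMs, which the article states are not
    guaranteed to stay available during a zone outage.
    """
    return normalize_region(lab_plan_region) in ZONE_REDUNDANT_INFRA_REGIONS
```

A caller might run `infrastructure_is_zone_redundant("Korea Central")` before choosing a lab plan region. Matching on a normalized name is a convenience assumed here so that both display names and compact names are accepted.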
