articles/lab-services/reliability-in-azure-lab-services.md
4 additions & 159 deletions
@@ -3,55 +3,15 @@ title: Reliability in Azure Lab Services
 description: Learn about reliability in Azure Lab Services
 ms.topic: overview
 ms.custom: subject-resiliency
-ms.date: 07/12/2022
+ms.date: 08/11/2022
 ---

-<!--#Customer intent: As a < type of user >, I want to understand resiliency support for [TODO-service-name] so that I can respond to and/or avoid failures in order to minimize downtime and data loss. -->
-
-<!--
-
-Template for the main resiliency article for Azure services.
-Keep the required sections and add/modify any content for any information specific to your service.
-This article should be in your TOC, under a Resiliency node. The name of this page should be *resiliency-[TODO-service-name].md* and the TOC title should be "Resiliency in [TODO-service-name]".
-Keep the headings in this order.
-
-This template uses comment pseudo code to indicate where you must choose between two options or more.
-
-Conditions are used in this document in the following manner and can be easily searched for:
--->
-
-<!-- IF (AZ SUPPORTED) -->
-<!-- some text -->
-<!-- END IF (AZ SUPPORTED)-->
-
-<!-- BEGIN IF (SLA INCREASE) -->
-<!-- some text -->
-<!-- END IF (SLA INCREASE) -->
-
-<!-- IF (SERVICE IS ZONAL) -->
-<!-- some text -->
-<!-- END IF (SERVICE IS ZONAL) -->
-
-<!-- IF (SERVICE IS ZONE REDUNDANT) -->
-<!-- some text -->
-<!-- END IF (SERVICE IS ZONAL) -->
-
-<!--
-
-IMPORTANT:
-- Do a search and replace of TODO-service-name with the name of your service. That will make the template easier to read
-- ALL sections are required unless noted otherwise
-- MAKE SURE YOU REMOVE ALL COMMENTS BEFORE PUBLISH!!!!!!!!
-
--->
-
 # What is reliability in Azure Lab Services?

 Reliability is a system’s ability to recover from failures and continue to function. It’s not only about avoiding failures but also involves responding to failures in a way that minimizes downtime or data loss. Because failures can occur at various levels, it’s important to have protection for all types based on service reliability requirements. Reliability in Azure supports and advances capabilities that respond to outages in real time to ensure continuous service and data protection assurance for mission-critical applications that require near-zero downtime and high customer confidence.

-This article describes reliability support in Azure Lab Services, and covers <!-- IF (AZ SUPPORTED) --> both regional resiliency with availability zones and <!-- END IF (AZ SUPPORTED)--> cross-region resiliency with disaster recovery. For a more detailed overview of reliability in Azure, see [Azure resiliency](/azure/availability-zones/overview.md).
+This article describes reliability support in Azure Lab Services, and covers regional resiliency with availability zones. For a more detailed overview of reliability in Azure, see [Azure resiliency](/azure/availability-zones/overview.md).

-<!-- IF (AZ SUPPORTED) -->
 ## Availability zone support

 Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. In the case of a local zone failure, availability zones are designed so that if the one zone is affected, regional services, capacity, and high availability are supported by the remaining two zones. Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see [Regions and availability zones](/azure/availability-zones/az-overview.md).
@@ -62,46 +22,30 @@ Azure Lab Services provide zone redundancy for service infrastructure for specif

 Currently, the service is not zonal. That is, you can’t configure a lab or the VMs in the lab to align to a specific zone. A lab and VMs may be distributed across zones in a region.

-### Prerequisites
-
-N/A - single product SKU - remove?
-
-<!-- List any specific SKUs that are supported. If all are supported or if the service has only one default SKU, mention this. -->
-
-<!-- List regions that support availability zones, or regions that don't support availability zones (whichever is less) -->
-
-<!-- Indicate any workflows or scenarios that aren't supported or ones that are, whichever is less. Provide links to any relevant information. -->
-
 ### SLA improvements

 There are no increased SLAs for Azure Lab Services. For more information on the Azure Lab Services SLAs, see [SLA for Azure Lab Services](https://azure.microsoft.com/support/legal/sla/lab-services/v1_0/).

-#### Create a resource with availability zone enabled
-
-N/A - remove
-
-<!-- IF (SERVICE IS ZONE REDUNDANT) -->
 ### Zone down experience

 #### Azure Lab Services infrastructure

-Azure Lab Services is zone-redundant. The Azure Lab Services infrastructure uses Cosmos DB storage, which has redundancy enabled for the following regions:
+Azure Lab Services infrastructure is zone-redundant. The Azure Lab Services infrastructure uses Cosmos DB storage, which has redundancy enabled for the following regions:

 - Australia East
 - Canada Central
 - France Central
 - Korea Central
 - East Asia

-For these regions, resources are distributed across zones automatically. The storage region is the same as the region where the lab plan is located.
+Resources apart from the Lab resources and virtual machines are zone redundant in these regions. The storage region is the same as the region where the lab plan is located.

 In the event of a zone outage in these regions, you can still perform the following tasks:

 - Access the Azure Lab Services website
 - Create/manage lab plans
 - Create Users
 - Configure lab schedules
-- Configure lab policies
 - Create new labs and VMs in regions unaffected by the zone outage.

 Data loss may occur only with an unrecoverable disaster in the Cosmos DB region. For more information, see [Region Outages](/azure/cosmos-db/high-availability#region-outages).
@@ -126,21 +70,6 @@ As a result, the following operations are not guaranteed in the event of a zone
 If there's a zone outage in the region, there's no expectation that you can use any labs or VMs in the region.
 Labs and VMs in other regions will be unaffected by the outage.

-
-<!-- Select the scenario that best describes customer experience or combine/provide your own description:
-- During a zone-wide outage, no action is required during zone recovery, Offering will self-heal and re-balance itself to take advantage of the healthy zone automatically.
-
-- During a zone-wide outage, the customer should expect brief degradation of performance, until the service self-healing re-balances underlying capacity to adjust to healthy zones. This is not dependent on zone restoration; it is expected that the Microsoft-managed service self-healing state will compensate for a lost zone, leveraging capacity from other zones.
-
-- In a zone-wide outage scenario, users should experience no impact on provisioned resources in a zone-redundant deployment. During a zone-wide outage , customers should be prepared to experience brief interruption for communication to provisioned resources; typically, this is manifested by client receiving 409 error code; this prompts re-try logic with appropriate intervals. New requests will be directed to healthy nodes with zero impact on user. During zone-wide outages, users will be able to create new offering resources and successfully scale existing ones.
-
-The table may contain:
-
-- CRUD and Scale-out operations (Create Read Update Delete)
-- Application communication scenarios – data plane operations (for example, insert/update/delete for a database).
-
--->
-
 #### Zone outage preparation and recovery

 Lab and VM services will be restored as soon as the zone availability is restored (the outage is resolved).
@@ -151,90 +80,6 @@ If infrastructure is impacted, it will be restored when the zone availability is

 If you want to preserve access to Azure Lab Services infrastructure during a zone outage, create the lab plan in one of the zone-redundant regions listed above.

-<!-- To prepare for availability zone failure, customers should over-provision capacity of service to ensure that the solution can tolerate ⅓ loss of capacity and continue to function without degraded performance during zone-wide outages. Provide any information as to how customers should achieve this. -->
-
-### Safe deployment techniques
-
-N/A - application safe deployment is not relevant - remove?
-
-### Availability zone redeployment and migration
-
-N/A - no migration - remove?
-
-<!-- END IF (AZ SUPPORTED)-->
-
-## Disaster recovery: cross-region failover
-
-N/A - remove? And remove all sub-sections.
-
-### Cross-region disaster recovery in multi-region geography
-
-Microsoft is 100% responsible for outage detection, notifications, and support for outage scenarios.
-
-<!-- If (MICROSOFT 100% RESPONSIBLE) -->
-
-#### Outage detection, notification, and management
-<!--
-
-- Explain how Microsoft detects and handles outages for this offering.
-
-- Explain when the offering will be failed to another region (decision guidance).
-
-- If service is deployed geographically, explain how this works.
-
-- Specify whether the offering runs Active-Active with auto-failover or Active-Passive.
-
-- Explain how customer is notified or how customer can check service health.
--->
-
-<!--
-- Explain how customer storage is handled, how much data loss occurs and whether R/W or only R/O for 00:__ (duration).
-
-- If this single offering fails over, indicate whether it continues to support primary region or only secondary region.
-
-- Provide all other guidance of what the customer can expect in region loss scenario.
-
-- Describe service SLA and high availability.
-
-- Define RTO and RPO expectations.
-
-
-<!-- END IF (MICROSOFT 100% RESPONSIBLE) -->
-
-### Single-region geography disaster recovery
-
-<!--
-Explain how offering supports single-region geography and how it differs from other multi-regions geography (for example, if offering is in a multi-region geography, DR is Microsoft-responsible; if in a single-region geography, DR is customer-responsible.)
-
-If DR is the identical for single-region and multi-region geographies, state this explicitly. (for example, CEDR for both 3+1 and 3+0.)
-
-If single-region DR is customer-responsible, can both data plane and control plane be configured by customers or only data plane?
-
-Clarify customer implication when recovery classification is CEDR: Is customer losing data/features/functions when recovery classification is CEDR in region-down scenario?
-
-Specify SLA and availability consideration in this configuration?
-
-Specify RTO and RPO expectations in 3+0 scenario.
-
-Provide instructions on setup for cross-region/outside geography DR.
-
-Provide instructions to set up and execute DR using Portal, Azure CLI, PowerShell. Does documentation provide options to configure DR via portal, CLI, PowerShell?
-
-Provide instructions to test DR plan and failover to simulate disaster.
-
-Provide detailed instructions for customer to clean up DR setup to free up resources.
--->
-
-### Capacity and proactive disaster recovery resiliency
-
-<!-- Microsoft and its customers operate under the Shared responsibility model. This means that for customer-enabled DR (customer-responsible services), the customer must address DR for any service they deploy and control. To ensure that recovery is proactive, customers should always pre-deploy secondaries because there is no guarantee of capacity at time of impact for those who have not pre-allocated.
-
-In this section, provide details of customer knowledge required re: capacity planning and proactive deployments.-->
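
The retained guidance in this change (create the lab plan in one of the zone-redundant regions if infrastructure access must survive a zone outage) can be checked before a lab plan is created. The following is a minimal Python sketch and is not part of the changed file or of any Azure SDK. The region set is taken from the list under "Azure Lab Services infrastructure", written as Azure short region names; `normalize_region` and `is_infrastructure_zone_redundant` are hypothetical helper names introduced for illustration.

```python
# Regions that the article lists as having zone-redundant Lab Services
# infrastructure, written as Azure short region names.
# Assumption: the list reflects the article at the time of this change
# and may be out of date.
ZONE_REDUNDANT_REGIONS = frozenset({
    "australiaeast",
    "canadacentral",
    "francecentral",
    "koreacentral",
    "eastasia",
})


def normalize_region(name: str) -> str:
    """Turn a display name such as 'Australia East' into 'australiaeast'."""
    return "".join(name.split()).lower()


def is_infrastructure_zone_redundant(region: str) -> bool:
    """Return True if the lab plan region is in the zone-redundant list.

    Hypothetical helper: this covers only the service infrastructure.
    Labs and VMs themselves are not zone-redundant according to the article.
    """
    return normalize_region(region) in ZONE_REDUNDANT_REGIONS
```

A deployment script could call `is_infrastructure_zone_redundant("Korea Central")` before creating the lab plan and warn when the result is `False`. The check says nothing about labs and VMs, which the article states may be unavailable during a zone outage in any region.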