You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/machine-learning/how-to-high-availability-machine-learning.md
+19-13Lines changed: 19 additions & 13 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -9,15 +9,15 @@ ms.topic: how-to
9
9
ms.author: larryfr
10
10
author: blackmist
11
11
ms.reviewer: andyaviles
12
-
ms.date: 03/28/2024
12
+
ms.date: 04/17/2024
13
13
monikerRange: 'azureml-api-2'
14
14
---
15
15
16
16
# Failover for business continuity and disaster recovery
17
17
18
18
To maximize your uptime, plan ahead to maintain business continuity and prepare for disaster recovery with Azure Machine Learning.
19
19
20
-
Microsoft strives to ensure that Azure services are always available. However, unplanned service outages may occur. We recommend having a disaster recovery plan in place for handling regional service outages. In this article, you'll learn how to:
20
+
Microsoft strives to ensure that Azure services are always available. However, unplanned service outages might occur. We recommend having a disaster recovery plan in place for handling regional service outages. In this article, you learn how to:
21
21
22
22
* Plan for a multi-regional deployment of Azure Machine Learning and associated resources.
23
23
* Maximize chances to recover logs, notebooks, docker images, and other metadata.
@@ -71,8 +71,8 @@ The rest of this article describes the actions you need to take to make each of
71
71
72
72
A multi-regional deployment relies on creation of Azure Machine Learning and other resources (infrastructure) in two Azure regions. If a regional outage occurs, you can switch to the other region. When planning on where to deploy your resources, consider:
73
73
74
-
*__Regional availability__: Use regions that are close to your users. To check regional availability for Azure Machine Learning, see [Azure products by region](https://azure.microsoft.com/global-infrastructure/services/).
75
-
*__Azure paired regions__: Paired regions coordinate platform updates and prioritize recovery efforts where needed. For more information, see [Azure paired regions](/azure/reliability/cross-region-replication-azure).
74
+
*__Regional availability__: If possible, use a region in the same geographic area, not necessarily the one that is closest. To check regional availability for Azure Machine Learning, see [Azure products by region](https://azure.microsoft.com/global-infrastructure/services/).
75
+
*__Azure paired regions__: Paired regions coordinate platform updates and prioritize recovery efforts where needed. However, not all regions support paired regions. For more information, see [Azure paired regions](/azure/reliability/cross-region-replication-azure).
76
76
*__Service availability__: Decide whether the resources used by your solution should be hot/hot, hot/warm, or hot/cold.
77
77
78
78
*__Hot/hot__: Both regions are active at the same time, with one region ready to begin use immediately.
@@ -91,7 +91,7 @@ Azure Machine Learning builds on top of other services. Some services can be con
91
91
| Machine Learning registry | You | Create the registry in multiple regions. |
92
92
| Key Vault | Microsoft | Use the same Key Vault instance with the Azure Machine Learning workspace and resources in both regions. Key Vault automatically fails over to a secondary region. For more information, see [Azure Key Vault availability and redundancy](/azure/key-vault/general/disaster-recovery-guidance).|
93
93
| Container Registry | Microsoft | Configure the Container Registry instance to geo-replicate registries to the paired region for Azure Machine Learning. Use the same instance for both workspace instances. For more information, see [Geo-replication in Azure Container Registry](/azure/container-registry/container-registry-geo-replication). |
94
-
| Storage Account | You | Azure Machine Learning does not support __default storage-account__ failover using geo-redundant storage (GRS), geo-zone-redundant storage (GZRS), read-access geo-redundant storage (RA-GRS), or read-access geo-zone-redundant storage (RA-GZRS). Create a separate storage account for the default storage of each workspace. </br>Create separate storage accounts or services for other data storage. For more information, see [Azure Storage redundancy](/azure/storage/common/storage-redundancy). |
94
+
| Storage Account | You | Azure Machine Learning doesn't support __default storage-account__ failover using geo-redundant storage (GRS), geo-zone-redundant storage (GZRS), read-access geo-redundant storage (RA-GRS), or read-access geo-zone-redundant storage (RA-GZRS). Create a separate storage account for the default storage of each workspace. </br>Create separate storage accounts or services for other data storage. For more information, see [Azure Storage redundancy](/azure/storage/common/storage-redundancy). |
95
95
| Application Insights | You | Create Application Insights for the workspace in both regions. To adjust the data-retention period and details, see [Data collection, retention, and storage in Application Insights](/azure/azure-monitor/logs/data-retention-archive). |
96
96
97
97
To enable fast recovery and restart in the secondary region, we recommend the following development practices:
@@ -104,11 +104,11 @@ To enable fast recovery and restart in the secondary region, we recommend the fo
104
104
105
105
### Compute and data services
106
106
107
-
Depending on your needs, you may have more compute or data services that are used by Azure Machine Learning. For example, you may use Azure Kubernetes Services or Azure SQL Database. Use the following information to learn how to configure these services for high availability.
107
+
Depending on your needs, you might have more compute or data services that are used by Azure Machine Learning. For example, you might use Azure Kubernetes Services or Azure SQL Database. Use the following information to learn how to configure these services for high availability.
108
108
109
109
__Compute resources__
110
110
111
-
***Azure Kubernetes Service**: See [Best practices for business continuity and disaster recovery in Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview) and [Create an Azure Kubernetes Service (AKS) cluster that uses availability zones](/azure/aks/availability-zones). If the AKS cluster was created by using the Azure Machine Learning studio, SDK, or CLI, cross-region high availability is not supported.
111
+
***Azure Kubernetes Service**: See [Best practices for business continuity and disaster recovery in Azure Kubernetes Service (AKS)](/azure/aks/ha-dr-overview) and [Create an Azure Kubernetes Service (AKS) cluster that uses availability zones](/azure/aks/availability-zones). If the AKS cluster was created by using the Azure Machine Learning studio, SDK, or CLI, cross-region high availability isn't supported.
112
112
***Azure Databricks**: See [Regional disaster recovery for Azure Databricks clusters](/azure/databricks/scenarios/howto-regional-disaster-recovery).
113
113
***Container Instances**: An orchestrator is responsible for failover. See [Azure Container Instances and container orchestrators](/azure/container-instances/container-instances-orchestrator-relationship).
114
114
***HDInsight**: See [High availability services supported by Azure HDInsight](/azure/hdinsight/hdinsight-high-availability-components).
@@ -123,9 +123,15 @@ __Data services__
123
123
124
124
## Design for high availability
125
125
126
+
### Availability zones
127
+
128
+
Certain Azure services support availability zones. For regions that support availability zones, if a zone goes down any workload pauses and data should be saved. However, the data is unavailable to refresh until the zone is back online.
129
+
130
+
For more information, see [Availability zone service and regional support](/azure/reliability/availability-zones-service-support).
131
+
126
132
### Deploy critical components to multiple regions
127
133
128
-
Determine the level of business continuity that you are aiming for. The level may differ between the components of your solution. For example, you may want to have a hot/hot configuration for production pipelines or model deployments, and hot/cold for experimentation.
134
+
Determine the level of business continuity that you're aiming for. The level might differ between the components of your solution. For example, you might want to have a hot/hot configuration for production pipelines or model deployments, and hot/cold for experimentation.
129
135
130
136
### Manage training data on isolated storage
131
137
@@ -154,9 +160,9 @@ Jobs in Azure Machine Learning are defined by a job specification. This specific
154
160
155
161
### Continue work in the failover workspace
156
162
157
-
When your primary workspace becomes unavailable, you can switch over the secondary workspace to continue experimentation and development. Azure Machine Learning does not automatically submit jobs to the secondary workspace if there is an outage. Update your code configuration to point to the new workspace resource. We recommend to avoiding hardcoding workspace references. Instead, use a [workspace config file](how-to-configure-environment.md#local-and-dsvm-only-create-a-workspace-configuration-file) to minimize manual user steps when changing workspaces. Make sure to also update any automation, such as continuous integration and deployment pipelines to the new workspace.
163
+
When your primary workspace becomes unavailable, you can switch over the secondary workspace to continue experimentation and development. Azure Machine Learning doesn't automatically submit jobs to the secondary workspace if there's an outage. Update your code configuration to point to the new workspace resource. We recommend to avoiding hardcoding workspace references. Instead, use a [workspace config file](how-to-configure-environment.md#local-and-dsvm-only-create-a-workspace-configuration-file) to minimize manual user steps when changing workspaces. Make sure to also update any automation, such as continuous integration and deployment pipelines to the new workspace.
158
164
159
-
Azure Machine Learning cannot sync or recover artifacts or metadata between workspace instances. Dependent on your application deployment strategy, you might have to move artifacts or recreate experimentation inputs, such as data assets, in the failover workspace in order to continue job submission. In case you have configured your primary workspace and secondary workspace resources to share associated resources with geo-replication enabled, some objects might be directly available to the failover workspace. For example, if both workspaces share the same docker images, configured datastores, and Azure Key Vault resources. The following diagram shows a configuration where two workspaces share the same images (1), datastores (2), and Key Vault (3).
165
+
Azure Machine Learning can't sync or recover artifacts or metadata between workspace instances. Dependent on your application deployment strategy, you might have to move artifacts or recreate experimentation inputs, such as data assets, in the failover workspace in order to continue job submission. In case you have configured your primary workspace and secondary workspace resources to share associated resources with geo-replication enabled, some objects might be directly available to the failover workspace. For example, if both workspaces share the same docker images, configured datastores, and Azure Key Vault resources. The following diagram shows a configuration where two workspaces share the same images (1), datastores (2), and Key Vault (3).
@@ -186,9 +192,9 @@ The following artifacts can be exported and imported between workspaces by using
186
192
187
193
If you accidentally deleted your workspace, you might able to recover it. For recovery steps, see [Recover workspace data after accidental deletion with soft delete](concept-soft-delete.md).
188
194
189
-
Even if your workspace cannot be recovered, you may still be able to retrieve your notebooks from the workspace-associated Azure storage resource by following these steps:
190
-
* In the [Azure portal](https://portal.azure.com) navigate to the storage account that was linked to the deleted Azure Machine Learning workspace.
191
-
* In the Data storage section on the left, click on**File shares**.
195
+
Even if your workspace can't be recovered, you might still be able to retrieve your notebooks from the workspace-associated Azure storage resource by following these steps:
196
+
* In the [Azure portal](https://portal.azure.com), navigate to the storage account that was linked to the deleted Azure Machine Learning workspace.
197
+
* In the Data storage section on the left, select**File shares**.
192
198
* Your notebooks are located on the file share with the name that contains your workspace ID.
0 commit comments