Commit baed5a8

Author: Cory Fowler
Merge pull request #50231 from SyntaxC4-MSFT/live
Merging to Live
2 parents f6e2a03 + e42b54d commit baed5a8

3 files changed: +267 -0 lines changed
articles/azure-databricks/TOC.yml: 2 additions & 0 deletions

@@ -40,6 +40,8 @@
      href: databricks-cli-from-azure-cloud-shell.md
    - name: Use BI tools
      href: https://docs.azuredatabricks.net/user-guide/bi/index.html
+   - name: Plan disaster recovery
+     href: howto-regional-disaster-recovery.md
    - name: Reference
      items:
      - name: REST API
articles/azure-databricks/howto-regional-disaster-recovery.md: 265 additions & 0 deletions
---
title: Regional disaster recovery for Azure Databricks
description: This article describes an approach to disaster recovery in Azure Databricks.
services: azure-databricks
author: jasonwhowell
ms.author: jasonh
ms.service: azure-databricks
ms.workload: big-data
ms.topic: conceptual
ms.date: 08/27/2018
---
# Regional disaster recovery for Azure Databricks clusters

This article describes a disaster recovery architecture that is useful for Azure Databricks clusters, and the steps to implement that design.

## Control plane architecture

At a high level, when you create an Azure Databricks workspace from the Azure portal, a [managed appliance](../managed-applications/overview.md) is deployed as an Azure resource in your subscription, in the chosen Azure region (for example, West US). The appliance is deployed in an [Azure Virtual Network](../virtual-network/virtual-networks-overview.md) with a [Network Security Group](../virtual-network/manage-network-security-group.md) and an Azure Storage account, available in your subscription. The virtual network provides perimeter-level security for the Databricks workspace and is protected via the network security group. Within the workspace, you can create Databricks clusters by providing the worker and driver VM type and the Databricks runtime version. The persisted data is available in your storage account, which can be Azure Blob Storage or Azure Data Lake Store. Once the cluster is created, you can run jobs via notebooks, REST APIs, or ODBC/JDBC endpoints by attaching them to a specific cluster.

The Databricks control plane manages and monitors the Databricks workspace environment. Any management operation, such as creating a cluster, is initiated from the control plane. All metadata, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.

![Databricks control plane architecture](media/howto-regional-disaster-recovery/databricks-control-plane.png)

One of the advantages of this architecture is that users can connect Azure Databricks to any storage resource in their account. A key benefit is that both compute (Azure Databricks) and storage can be scaled independently of each other.

## How to create a regional disaster recovery topology

As the preceding architecture description shows, a number of components are used for a Big Data pipeline with Azure Databricks: Azure Storage, Azure Database, and other data sources. Azure Databricks is the *compute* for the Big Data pipeline. It is *ephemeral* in nature, meaning that while your data is still available in Azure Storage, the *compute* (the Azure Databricks cluster) can be terminated so that you don't have to pay for compute when you don't need it. The *compute* (Azure Databricks) and storage sources must be in the same region so that jobs don't experience high latency.
To create your own regional disaster recovery topology, follow these requirements:

1. Provision multiple Azure Databricks workspaces in separate Azure regions. For example, create the primary Azure Databricks workspace in East US 2, and create the secondary disaster-recovery Azure Databricks workspace in a separate region, such as West US.

2. Use [geo-redundant storage](../storage/common/storage-redundancy-grs.md#read-access-geo-redundant-storage). By default, the data associated with Azure Databricks is stored in Azure Storage. The results from Databricks jobs are also stored in Azure Blob Storage, so that the processed data is durable and remains highly available after the cluster is terminated. Because the storage and the Databricks cluster are co-located, you must use geo-redundant storage so that the data can be accessed in the secondary region if the primary region is no longer accessible.

3. Once the secondary region is created, you must migrate the users, user folders, notebooks, cluster configuration, jobs configuration, libraries, storage, and init scripts, and reconfigure access control. Additional details are outlined in the following section.
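With read-access geo-redundant storage (RA-GRS), the replicated copy of a storage account is readable at a well-known secondary endpoint derived from the account name. As a minimal sketch, the following derives both Blob endpoints (the account name is a hypothetical example):

```python
def storage_endpoints(account_name):
    """Derive the primary and RA-GRS secondary Blob endpoints for a storage account.

    With read-access geo-redundant storage (RA-GRS), Azure exposes the
    replicated data at the "-secondary" suffix endpoint.
    """
    primary = "https://{0}.blob.core.windows.net".format(account_name)
    secondary = "https://{0}-secondary.blob.core.windows.net".format(account_name)
    return primary, secondary

# Example with a hypothetical account name
primary, secondary = storage_endpoints("mydbstorage")
```

During a regional outage, read workloads can be pointed at the secondary endpoint while the primary region is unavailable.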
## Detailed migration steps

1. **Set up the Databricks command-line interface on your computer**

This article shows a number of code examples that use the command-line interface for most of the automated steps, since it is an easy-to-use wrapper over the Azure Databricks REST API.

Before performing any migration steps, install the databricks-cli on your desktop computer or a virtual machine where you plan to do the work. For more information, see [Install Databricks CLI](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html).

```bash
pip install databricks-cli
```

> [!NOTE]
> Any Python scripts provided in this article are expected to work with Python 2.7 and above, but below Python 3.x.
2. **Configure two profiles**

Configure one profile for the primary workspace, and another one for the secondary workspace:

```bash
databricks configure --profile primary
databricks configure --profile secondary
```

The code blocks in this article switch between profiles in each subsequent step using the corresponding workspace command. Be sure that the names of the profiles you create are substituted into each code block.

```python
EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"
```

You can manually switch at the command line if needed:

```bash
databricks workspace ls --profile primary
databricks workspace ls --profile secondary
```
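The `databricks configure` command stores each profile as a section in `~/.databrickscfg`, an INI-style file with `host` and `token` entries. Assuming that layout, a small sketch that lists the configured profile names:

```python
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2


def list_profiles(cfg_path):
    """Return the profile names defined in a .databrickscfg-style INI file."""
    parser = ConfigParser()
    parser.read(cfg_path)
    return parser.sections()
```

For example, `list_profiles(os.path.expanduser("~/.databrickscfg"))` should include both `primary` and `secondary` once the two profiles are configured.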
3. **Migrate the Azure Active Directory users**

Manually add to the secondary workspace the same Azure Active Directory users that exist in the primary workspace.

4. **Migrate the user folders and notebooks**

Use the following Python code to migrate the sandboxed user environments, which include the nested folder structure and the notebooks per user.

> [!NOTE]
> Libraries are not copied over in this step, as the underlying API doesn't support them.

Copy and save the following Python script to a file, and run it at your command line. For example, `python scriptname.py`.
```python
from subprocess import call, check_output

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"

# Get a list of all users
user_list_out = check_output(["databricks", "workspace", "ls", "/Users", "--profile", EXPORT_PROFILE])
user_list = user_list_out.splitlines()

# Export sandboxed environment (folders, notebooks) for each user and import into new workspace.
# Libraries are not included with these APIs / commands.

for user in user_list:
    print "Trying to migrate workspace for user " + user

    call("mkdir -p " + user, shell=True)
    export_exit_status = call("databricks workspace export_dir /Users/" + user + " ./" + user + " --profile " + EXPORT_PROFILE, shell=True)

    if export_exit_status == 0:
        print "Export Success"
        import_exit_status = call("databricks workspace import_dir ./" + user + " /Users/" + user + " --profile " + IMPORT_PROFILE, shell=True)
        if import_exit_status == 0:
            print "Import Success"
        else:
            print "Import Failure"
    else:
        print "Export Failure"

print "All done"
```
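Before running a script like the one above against a real workspace, the per-user command construction can be factored out and inspected on its own; a sketch (the user name below is a hypothetical example):

```python
def build_migration_commands(user, export_profile="primary", import_profile="secondary"):
    """Build the export_dir/import_dir command lines for one user's folder (not executed)."""
    export_cmd = ["databricks", "workspace", "export_dir",
                  "/Users/" + user, "./" + user, "--profile", export_profile]
    import_cmd = ["databricks", "workspace", "import_dir",
                  "./" + user, "/Users/" + user, "--profile", import_profile]
    return export_cmd, import_cmd

# Preview the commands for one hypothetical user without touching either workspace
export_cmd, import_cmd = build_migration_commands("alice@contoso.com")
```

Printing the command lists for each user is a cheap dry run that helps catch profile or path mistakes before the actual export.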
5. **Migrate the cluster configurations**

Once the notebooks have been migrated, you can optionally migrate the cluster configurations to the new workspace. It's an almost fully automated step using the databricks-cli, unless you want to migrate selected cluster configurations rather than all of them.

> [!NOTE]
> Unfortunately, there is no "create cluster configuration" endpoint, so this script tries to create each cluster right away. If there aren't enough cores available in your subscription, the cluster creation may fail. The failure can be ignored, as long as the configuration is transferred successfully.

The following script prints a mapping from old to new cluster IDs, which can be used later for job migration (for jobs that are configured to use existing clusters).

Copy and save the following Python script to a file, and run it at your command line. For example, `python scriptname.py`.
```python
from subprocess import call, check_output
import json

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"

# Get all clusters info from old workspace
clusters_out = check_output(["databricks", "clusters", "list", "--profile", EXPORT_PROFILE])
clusters_info_list = clusters_out.splitlines()

# Create a list of all cluster ids
clusters_list = []
for cluster_info in clusters_info_list:
    clusters_list.append(cluster_info.split(None, 1)[0])

# Optionally filter cluster ids out manually, so as to create only required ones in new workspace

# Create a list of mandatory / optional create request elements
cluster_req_elems = ["num_workers", "autoscale", "cluster_name", "spark_version", "spark_conf",
                     "node_type_id", "driver_node_type_id", "custom_tags", "cluster_log_conf",
                     "spark_env_vars", "autotermination_minutes", "enable_elastic_disk"]

# Try creating all / selected clusters in new workspace with same config as in old one.
cluster_old_new_mappings = {}
for cluster in clusters_list:
    print "Trying to migrate cluster " + cluster

    cluster_get_out = check_output(["databricks", "clusters", "get", "--cluster-id", cluster, "--profile", EXPORT_PROFILE])
    print "Got cluster config from old workspace"

    # Remove extra content from the config, as we need to build create request with allowed elements only
    cluster_req_json = json.loads(cluster_get_out)
    cluster_json_keys = cluster_req_json.keys()

    for key in cluster_json_keys:
        if key not in cluster_req_elems:
            cluster_req_json.pop(key, None)

    # Create the cluster, and store the mapping from old to new cluster ids
    cluster_create_out = check_output(["databricks", "clusters", "create", "--json", json.dumps(cluster_req_json), "--profile", IMPORT_PROFILE])
    cluster_create_out_json = json.loads(cluster_create_out)
    cluster_old_new_mappings[cluster] = cluster_create_out_json['cluster_id']

    print "Sent cluster create request to new workspace successfully"

print "Cluster mappings: " + json.dumps(cluster_old_new_mappings)
print "All done"
```
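Instead of copying the printed mapping by hand into the jobs-migration script, you can persist it to a file and load it back in the next step; a small sketch (the file name is arbitrary):

```python
import json


def save_mappings(mappings, path):
    """Write the old-to-new cluster id mapping to a JSON file."""
    with open(path, "w") as f:
        json.dump(mappings, f, indent=2)


def load_mappings(path):
    """Read the mapping back, for example from the jobs-migration script."""
    with open(path) as f:
        return json.load(f)
```

Calling `save_mappings(cluster_old_new_mappings, "cluster-mappings.json")` at the end of the cluster script makes the mapping reusable without manual copying.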
6. **Migrate the jobs configuration**

If you migrated cluster configurations in the previous step, you can opt to migrate job configurations to the new workspace. It is a fully automated step using the databricks-cli, unless you want to migrate selected job configurations rather than all of them.

> [!NOTE]
> The configuration for a scheduled job also contains the "schedule" information, so by default a migrated job starts running as per its configured timing as soon as it is migrated. Hence, the following code block removes any schedule information during the migration (to avoid duplicate runs across the old and new workspaces). Configure the schedules for such jobs once you're ready for cutover.

The job configuration requires settings for a new or an existing cluster. If a job uses an existing cluster, the script below attempts to replace the old cluster ID with the new cluster ID.

Copy and save the following Python script to a file. Replace the values for `old_cluster_id` and `new_cluster_id` with the output from the cluster migration done in the previous step. Run it at your databricks-cli command line, for example, `python scriptname.py`.
```python
import sys
import json
from subprocess import call, check_output

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"

# Please replace the old-to-new cluster id mappings from the cluster migration output
cluster_old_new_mappings = {"old_cluster_id": "new_cluster_id"}

# Get all jobs info from old workspace
try:
    jobs_out = check_output(["databricks", "jobs", "list", "--profile", EXPORT_PROFILE])
    jobs_info_list = jobs_out.splitlines()
except:
    print "No jobs to migrate"
    sys.exit(0)

# Create a list of all job ids
jobs_list = []
for jobs_info in jobs_info_list:
    jobs_list.append(jobs_info.split(None, 1)[0])

# Optionally filter job ids out manually, so as to create only required ones in new workspace

# Create each job in the new workspace based on corresponding settings in the old workspace

for job in jobs_list:
    print "Trying to migrate " + job

    job_get_out = check_output(["databricks", "jobs", "get", "--job-id", job, "--profile", EXPORT_PROFILE])
    print "Got job config from old workspace"

    job_req_json = json.loads(job_get_out)
    job_req_settings_json = job_req_json['settings']

    # Remove schedule information so job doesn't start before proper cutover
    job_req_settings_json.pop('schedule', None)

    # Replace old cluster id with new cluster id, if job configured to run against an existing cluster
    if 'existing_cluster_id' in job_req_settings_json:
        if job_req_settings_json['existing_cluster_id'] in cluster_old_new_mappings:
            job_req_settings_json['existing_cluster_id'] = cluster_old_new_mappings[job_req_settings_json['existing_cluster_id']]
        else:
            print "Mapping not available for old cluster id " + job_req_settings_json['existing_cluster_id']
            continue

    call(["databricks", "jobs", "create", "--json", json.dumps(job_req_settings_json), "--profile", IMPORT_PROFILE])
    print "Sent job create request to new workspace successfully"

print "All done"
```
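The per-job settings transformation (dropping the schedule and remapping the cluster ID) can be isolated into a pure function and checked on sample data before running against a workspace; a sketch:

```python
def prepare_job_settings(settings, cluster_mappings):
    """Return job settings ready for the new workspace, or None if the
    job targets an existing cluster with no known id mapping."""
    settings = dict(settings)  # work on a copy, leave the caller's dict intact
    # Remove schedule information so the job doesn't start before cutover
    settings.pop('schedule', None)
    if 'existing_cluster_id' in settings:
        old_id = settings['existing_cluster_id']
        if old_id not in cluster_mappings:
            return None  # the caller should skip this job
        settings['existing_cluster_id'] = cluster_mappings[old_id]
    return settings
```

Keeping this logic separate makes it easy to verify that schedules are stripped and unmapped jobs are skipped, independently of any CLI calls.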
7. **Migrate libraries**

There's currently no straightforward way to migrate libraries from one workspace to another, so reinstall those libraries into the new workspace manually. Hence this step is mostly manual. It is possible to automate it using a combination of the [DBFS CLI](https://github.com/databricks/databricks-cli#dbfs-cli-examples), to upload custom libraries to the workspace, and the [Libraries CLI](https://github.com/databricks/databricks-cli#libraries-cli).
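That combination can be scripted. The sketch below only builds the command lines rather than executing them; the jar path, DBFS folder, and cluster ID are hypothetical examples, and the exact flags may vary by CLI version:

```python
def build_library_commands(local_jar, dbfs_dir, cluster_id, profile="secondary"):
    """Build the DBFS upload and library install commands for one custom jar (not executed)."""
    dbfs_path = dbfs_dir.rstrip("/") + "/" + local_jar.split("/")[-1]
    upload_cmd = ["dbfs", "cp", local_jar, dbfs_path, "--profile", profile]
    install_cmd = ["databricks", "libraries", "install",
                   "--cluster-id", cluster_id, "--jar", dbfs_path,
                   "--profile", profile]
    return upload_cmd, install_cmd
```

Feeding each pair to `subprocess.call` would upload the jar to DBFS and attach it to the named cluster in the secondary workspace.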
8. **Migrate Azure Blob storage and Azure Data Lake Store mounts**

Manually remount all [Azure Blob storage](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html) and [Azure Data Lake Store (Gen 1)](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html) mount points using a notebook-based solution. The storage resources were mounted in the primary workspace, and that has to be repeated in the secondary workspace. There is no external API for mounts.

9. **Migrate the cluster init scripts**

Any cluster initialization scripts can be migrated from the old to the new workspace using the [DBFS CLI](https://github.com/databricks/databricks-cli#dbfs-cli-examples). First, copy the needed scripts from `dbfs:/databricks/init/..` to your local desktop or virtual machine. Next, copy those scripts into the new workspace at the same path.
```bash
# Copy init scripts from the primary workspace to local storage
dbfs cp -r dbfs:/databricks/init ./old-ws-init-scripts --profile primary

# Copy init scripts from local storage to the secondary workspace
dbfs cp -r old-ws-init-scripts dbfs:/databricks/init --profile secondary
```
10. **Manually reconfigure and reapply access control**

If your existing primary workspace is configured to use the Premium tier (SKU), it's likely that you are also using the [Access Control feature](https://docs.azuredatabricks.net/administration-guide/admin-settings/index.html#manage-access-control).

If you do use the Access Control feature, manually reapply the access control to the resources (notebooks, clusters, jobs, tables).

## Next steps

For more information, see [Azure Databricks documentation](https://docs.azuredatabricks.net/user-guide/index.html).