---
title: "Efficient Data Migration: From Hive Metastore to Unity Catalog in Databricks"
authorId: "arihant"
date: 2024-11-09
draft: false
featured: true
weight: 1
---
<img src="/images/blog/hive-to-unity-catalog-data-migration-databricks/cover.png" alt="Hive Metastore to Unity Catalog Data Migration in Databricks">

## **TL;DR**
We migrated a large volume of data (~22 TB) from Hive Metastore to Unity Catalog in a regulated, large-scale enterprise environment. The task required precise planning, synchronization across teams, and expertise to ensure zero downtime and no data loss. By removing outdated jobs, cutting costs, and streamlining data workflows, we optimized our data infrastructure and established a robust governance framework. The migration also gave us a deep understanding of complex data systems and the nuances of managing dependencies, aligning stakeholders, and ensuring regulatory compliance.
## **Project Background: Migrating Enterprise-Scale Data**

Our team recently undertook the task of upgrading data management from Hive Metastore to Unity Catalog within Databricks. The Hive Metastore posed challenges in data governance, access control, and lineage tracking. Unity Catalog, with its advanced security and governance features, was the natural choice to address these limitations and to meet our standards for compliance and operational efficiency.
## **Initial Setup and Core Challenges**

### **Hive Metastore Constraints**

Our existing reliance on Hive Metastore introduced several constraints:

- **Limited Security and Governance:** The lack of native auditing, inadequate access controls, and missing lineage tracking created compliance challenges.
- **Environment Restrictions:** Hive did not allow sharing schemas across separate environments, complicating the setup of development, staging, and production. Unity Catalog, by contrast, supports schema sharing across environments with fine-grained access controls, which is essential for regulated enterprises.
### **Migration Requirements**

With around **87 schemas** and **1,920-1,950 tables** totaling **~22 TB** to migrate, our objective was clear: move active data to Unity Catalog while maintaining zero downtime and safeguarding against data divergence. This meant not only handling the transfer but also ensuring no production jobs were disrupted.
## **Strategy and Planning**

### **Assessing Existing Data and Stakeholder Engagement**

Our first steps were a thorough data assessment and stakeholder outreach. We leveraged Databricks utility tools to catalog and assess tables within the Hive Metastore, which enabled us to filter out obsolete tables and prioritize essential data.
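
For illustration, here's a minimal PySpark sketch of the kind of inventory pass such an assessment involves. This is not our production script: it assumes a Databricks notebook (where `spark` is predefined) and Delta tables, since `DESCRIBE DETAIL` reports size and last-modified time only for Delta.

```python
# Sketch: inventory Hive Metastore tables with size and last-update time.
rows = []
for db_row in spark.sql("SHOW DATABASES IN hive_metastore").collect():
    db = db_row[0]
    for t in spark.sql(f"SHOW TABLES IN hive_metastore.{db}").collect():
        try:
            d = spark.sql(f"DESCRIBE DETAIL hive_metastore.{db}.{t.tableName}").collect()[0]
            rows.append((db, t.tableName, d.sizeInBytes, d.lastModified))
        except Exception:
            # Views and non-Delta tables need a different lookup; record them without stats.
            rows.append((db, t.tableName, None, None))

inventory = spark.createDataFrame(
    rows, "schema string, table string, size_bytes long, last_modified timestamp"
)
inventory.orderBy("schema", "table").show(truncate=False)
```

Combined with owner feedback, a listing like this is enough to separate actively used tables from deletion candidates.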
We collaborated with stakeholders, from schema owners to data engineers, to determine which tables were critical for daily operations. This information-gathering stage identified the schemas to migrate and the non-essential ones to remove, improving the organization's overall data efficiency.
## **Execution Phase**

### **Detailed Migration Workflow**

- **Schema and Table Identification:**
  Using scripts, we generated a comprehensive list of schemas and tables, detailing each table's size, last update time, and the owner's feedback on whether it needed to be migrated. This narrowed our focus to essential data only, avoiding the migration of dead jobs and outdated test schemas.
  To gather this information efficiently, we reached out to schema owners for their input. Here's an example of a Slack message we sent to one of the Hive schema owners requesting the details needed for the migration:

  <img src="/images/blog/hive-to-unity-catalog-data-migration-databricks/sample_slack_message.png" alt="Sample Slack message requesting schema owners to update the Hive schema details">
- **Scheduling and Timing:**
  Migrating tables with read-only workloads was straightforward, as both Hive and Unity Catalog could be read simultaneously. Tables with active write operations, however, presented a challenge: if a job continued writing to Hive after migration, the data in Hive and Unity Catalog would diverge, potentially requiring a costly and complex re-migration given the volume of tables. To avoid this, we scheduled migrations during job downtime to ensure data consistency.
- **Automation and Scalability:**
  We automated much of the migration by identifying windows when jobs were inactive and initiating migrations with minimal manual intervention. Our scripts handled different scenarios based on table type and data size, so both large and small datasets could be transferred efficiently, scaling resources as needed to maintain performance (see the sketch below).
**Note:** The jobs in this migration were primarily batch-oriented, which allowed us to perform migrations during scheduled downtimes without impacting production workloads.
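
To make the per-table step concrete, here is a simplified sketch under stated assumptions: the source tables are Delta, the target Unity Catalog schema already exists, and `DEEP CLONE` is used as the copy mechanism, one common option for upgrading managed Hive tables. Our actual scripts handled more scenarios (non-Delta formats, external locations, very large tables), so treat this as the shape of the step, not the implementation. The `prod` catalog name is a placeholder.

```python
def migrate_table(schema: str, table: str, dst_catalog: str = "prod"):
    """Copy one Hive Metastore Delta table into Unity Catalog via DEEP CLONE.

    Sketch only: assumes the target schema exists, the table is Delta,
    and no job is writing to the source during the copy.
    """
    src = f"hive_metastore.{schema}.{table}"
    dst = f"{dst_catalog}.{schema}.{table}"
    spark.sql(f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}")
```

In practice, each call was gated on confirming that no job was actively writing to the source table, which is exactly what the downtime scheduling described above guaranteed.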
## **Results and Key Improvements**

### **Data and Cost Efficiency Gains**

#### **Data Migration:**

We successfully transferred **~22 TB** to Unity Catalog and removed **~75 TB** of deprecated data from Hive.
#### **Cost Savings:**

By removing 75 TB of deprecated data, we reduced both storage costs and data-handling overhead. Here's a rough breakdown of the annual savings:

| **Cost Type**          | **Details**                                                       | **Annual Cost** |
|------------------------|-------------------------------------------------------------------|-----------------|
| **EC2 Compute**        | 20 × i3.xlarge instances, 1 hour/day at $0.252/hour per instance  | **$1,840.56**   |
| **Databricks Compute** | 20 instances, 1 hour/day at $0.55 per DBU-hour per instance       | **$4,015**      |
| **Storage**            | 75 TB stored on AWS S3 at $0.025/GB/month                         | **$23,040**     |
| **Total Annual Cost**  | Sum of all costs                                                  | **$28,895.56**  |
**Note:** Additional costs are incurred for reading data from upstream sources like Redshift and writing to S3. These read/write operations are not included in the breakdown above but contribute to overall data processing costs.
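
For readers who want to sanity-check the table, here is the back-of-the-envelope arithmetic behind it. The per-line figures are rounded (the Databricks line assumes ~1 DBU per instance-hour), so a plain 365-day calculation lands within about a dollar of the EC2 figure above.

```python
ec2 = 20 * 0.252 * 365          # 20 i3.xlarge instances, 1 hour/day, $0.252/hour
dbx = 20 * 0.55 * 365           # 20 instances, 1 hour/day, ~1 DBU/hour at $0.55/DBU
s3 = 75 * 1024 * 0.025 * 12     # 75 TB in GB, $0.025/GB/month, 12 months
print(ec2, dbx, s3, ec2 + dbx + s3)  # ~1839.60, 4015.0, 23040.0, ~28894.60
```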
109+
110+
### **Enhanced Governance and Operational Efficiency**
111+
112+
- **Improved Data Governance:** Unity Catalog introduced clear data lineage and granular access control, essential for
113+
maintaining regulatory compliance.
114+
- **Operational Efficiency:** Before the migration, engineers spent significant time maintaining outdated or unnecessary
115+
jobs. Considering, ~10 unused jobs required about 1 hour per week to manage, removing those jobs saved approximately
116+
10 hours of engineering effort each week. This freed up valuable time for engineers to focus on core operational
117+
tasks, accelerating product delivery and reducing maintenance overheads.
### **Key Migration Metrics**

| **Total Schemas** | **Tables Migrated** | **Data Migrated** | **Data Deleted** | **Job Optimizations**    |
|-------------------|---------------------|-------------------|------------------|--------------------------|
| 87                | 1,920-1,950         | ~22 TB            | ~75 TB           | Significant cost savings |
## **Unseen Challenges: Key Insights We Gained During the Migration**

As with any large-scale project, there were a few unexpected challenges along the way. One critical lesson was the importance of removing migrated tables from Hive immediately after migration. Initially, we delayed this step, and users continued writing to the old tables in Hive, causing data divergence.

The takeaway? **Don't wait to delete migrated tables from Hive.** Doing so ensures data consistency and smooths the transition to Unity Catalog. This small adjustment made a huge difference in the overall process.
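
As an illustration of that takeaway, a verify-then-drop step run immediately after each table's migration closes the divergence window. This is a hedged sketch, not our exact script; the `prod` catalog name is a placeholder:

```python
def retire_hive_table(schema: str, table: str, dst_catalog: str = "prod"):
    """Drop the Hive copy as soon as the Unity Catalog copy is verified,
    so no job can keep writing to the old table. Sketch only."""
    src = f"hive_metastore.{schema}.{table}"
    dst = f"{dst_catalog}.{schema}.{table}"
    if spark.table(dst).count() != spark.table(src).count():
        raise RuntimeError(f"Row counts diverge for {src}; investigate before dropping")
    spark.sql(f"DROP TABLE {src}")
```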
## **Conclusion: Strategic Impact and Future Roadmap**

This large-scale migration for a regulated enterprise required meticulous planning and execution to ensure zero downtime and maintain data integrity. Every piece of data and every workload was critical to operations, so our top priority was safeguarding production workloads and preventing any data loss. This careful approach took significant effort, but it was crucial for ensuring operational continuity and preserving data accuracy. In the end, the migration strengthened our governance model, reduced costs, and sharpened operational focus.

Stay tuned for an upcoming series of blog posts where we'll dive deep into the technical processes and scripts used, including enabling Unity Catalog for an existing production Databricks setup. 🚀