---
title: "Efficient Data Migration: From Hive Metastore to Unity Catalog in Databricks"
authorId: "arihant"
date: 2024-11-09
draft: false
featured: true
weight: 1
---

<img src="/images/blog/hive-to-unity-catalog-data-migration-databricks/cover.png" alt="Hive Metastore to Unity Catalog Data Migration in Databricks">

## **TL;DR**

We had to migrate extensive data from Hive Metastore to Unity Catalog in a regulated, large-scale enterprise environment.
This task required precise planning, synchronization across teams, and expertise to ensure zero downtime and no data
loss. By removing outdated jobs, cutting costs, and streamlining data workflows, we optimized the data infrastructure
and achieved a robust governance framework. This migration gave us a deep understanding of complex data systems and the
nuances of managing dependencies, aligning stakeholders, and ensuring regulatory compliance.

## **Project Background: Migrating Enterprise-Scale Data**

Our team recently undertook the task of upgrading data management from Hive Metastore to Unity Catalog within Databricks.
The Hive Metastore posed challenges in data governance, access control, and lineage tracking. Unity Catalog, with
its advanced security and governance features, was the natural choice to address these limitations and to meet our
standards for compliance and operational efficiency.

## **Initial Setup and Core Challenges**

### **Hive Metastore Constraints**

Our existing reliance on the Hive Metastore introduced several constraints:

- **Limited Security and Governance:** The lack of native auditing, inadequate access controls, and missing lineage
  tracking created compliance challenges.
- **Environment Restrictions:** Hive did not allow sharing schemas across separate environments, complicating the setup
  of development, staging, and production. Unity Catalog, by contrast, supports schema sharing across environments with
  detailed access controls, which is essential for regulated enterprises.

### **Migration Requirements**

With around **87 schemas** and **1,920-1,950** tables totaling **~22 TB** to migrate, our objective was clear: migrate
active data to Unity Catalog while maintaining zero downtime and safeguarding against data divergence. This meant not
only handling the transfer but also ensuring no production jobs were disrupted.

## **Strategy and Planning**

### **Assessing Existing Data and Stakeholder Engagement**

Our first steps were a thorough data assessment and stakeholder outreach. We leveraged Databricks utility tools to
catalog and assess the tables in the Hive Metastore, which enabled us to filter out obsolete tables and prioritize
essential data.

We collaborated with multiple stakeholders, from schema owners to data engineers, to learn which tables were critical
for daily operations. This information-gathering stage identified the schemas to migrate and the non-essential ones to
remove, improving the organization’s overall data efficiency.
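
To make this concrete, here is a minimal PySpark sketch of the kind of inventory script we ran. It assumes Delta tables
registered in the Hive Metastore and a Databricks notebook where `spark` is already defined; the column choices are
illustrative, and views or non-Delta tables are simply skipped.

```python
# Minimal inventory sketch: list Hive Metastore tables with size and last-modified time.
# Assumes a Databricks notebook (`spark` predefined) and Delta tables; views and
# non-Delta tables are skipped here, although the real assessment also covered them.
from pyspark.sql import Row

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.tableType == "VIEW":
            continue
        full_name = f"{db.name}.{tbl.name}"
        try:
            detail = spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0]
        except Exception:
            continue  # non-Delta table; handled separately in the real assessment
        rows.append(Row(
            schema=db.name,
            table=tbl.name,
            size_gb=round((detail.sizeInBytes or 0) / 1024 ** 3, 2),
            last_modified=detail.lastModified,
        ))

inventory = spark.createDataFrame(rows)
inventory.orderBy("schema", "table").show(truncate=False)
```

This listing, combined with owner feedback, drove the keep/migrate/delete decisions described below.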

## **Execution Phase**

### **Detailed Migration Workflow**

- **Schema and Table Identification:**
  Using scripts along the lines of the inventory sketch above, we generated a comprehensive list of schemas and tables,
  detailing table size, last update time, and owner feedback on whether migration was needed. This kept our focus on
  essential data and avoided migrating dead jobs and outdated test schemas.

  To gather this information efficiently, we reached out to schema owners for their input. Here's an example of a Slack
  message we sent to one of the Hive schema owners requesting the details needed for the migration:

  <img src="/images/blog/hive-to-unity-catalog-data-migration-databricks/sample_slack_message.png" alt="Sample Slack message requesting schema owners to update the Hive schema details">

- **Scheduling and Timing:**
  Migrating tables with read-only operations was straightforward, as both Hive and Unity Catalog could be read
  simultaneously. Tables with active write operations presented a challenge: if a job continued to write to Hive after
  migration, the data in Hive and Unity Catalog would diverge, potentially requiring a costly and complex re-migration
  given the volume of tables. To avoid this, we scheduled migrations during job downtime to ensure data consistency.

- **Automation and Scalability:**
  We automated a significant part of the migration process by identifying windows when jobs were inactive and initiating
  migrations with minimal manual intervention. Our scripts handled different scenarios based on table types and data
  sizes, which let us manage both large and small datasets and scale resources as needed to maintain performance. A
  minimal sketch of the core copy step follows the note below.

**Note:** The jobs in this migration were primarily batch-oriented, which allowed us to perform migrations during
scheduled downtimes without impacting production workloads.
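
For Delta tables, the copy step itself was conceptually simple. Here is a minimal sketch of what a per-table migration
can look like, assuming managed Delta tables, a Unity Catalog-enabled workspace, and a hypothetical target catalog named
`main`; our production scripts wrapped this in scheduling, retries, and additional validation.

```python
# Minimal sketch of migrating one Delta table from the Hive Metastore to Unity Catalog.
# Assumes a UC-enabled workspace; "main" and the schema/table names are hypothetical.
def migrate_table(schema: str, table: str, target_catalog: str = "main") -> None:
    source = f"hive_metastore.{schema}.{table}"
    target = f"{target_catalog}.{schema}.{table}"

    # Ensure the target schema exists in Unity Catalog.
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{schema}")

    # DEEP CLONE copies both data and metadata into a new managed table.
    spark.sql(f"CREATE OR REPLACE TABLE {target} DEEP CLONE {source}")

    # Basic sanity check before repointing jobs at the new table.
    if spark.table(source).count() != spark.table(target).count():
        raise ValueError(f"Row count mismatch after cloning {schema}.{table}")

migrate_table("finance", "daily_transactions")  # hypothetical schema and table
```

Non-Delta and external tables followed different paths, which is why the scripts branched on table type as mentioned
above.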

## **Results and Key Improvements**

### **Data and Cost Efficiency Gains**

#### **Data Migration:**

Successfully transferred **~22 TB** to Unity Catalog, with **~75 TB** of deprecated data removed from Hive.

#### **Cost Savings:**

By removing 75 TB of deprecated data, we reduced both storage costs and data handling overheads. Here’s a rough
breakdown of the cost savings:

| **Cost Type**          | **Details**                                                       | **Annual Cost** |
|------------------------|-------------------------------------------------------------------|-----------------|
| **EC2 Compute**        | 20 x i3.xlarge instances, 1 hour/day at $0.252/hour per instance  | **$1,840.56**   |
| **Databricks Compute** | 20 instances, 1 hour/day at $0.55 per DBU-hour per instance       | **$4,015**      |
| **Storage Cost**       | 75 TB of data stored on AWS S3 at $0.025/GB/month                 | **$23,040**     |
| **Total Annual Cost**  | Sum of all costs                                                  | **$28,895.56**  |

**Note:** Additional costs are incurred for reading data from upstream sources like Redshift and writing to S3. These
read/write operations are not included in the above breakdown but contribute to the overall data processing costs.
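
For readers who want to sanity-check the table, the figures can be reproduced with quick arithmetic. The snippet below
mirrors the assumptions stated in the table (365 days per year, 1,024 GB per TB); the EC2 line comes out within about a
dollar of the published figure, which presumably used slightly different rounding.

```python
# Back-of-the-envelope reproduction of the annual cost table above.
DAYS = 365          # days per year assumed in the table
GB_PER_TB = 1024

ec2_compute = 20 * 1 * 0.252 * DAYS        # 20 instances x 1 hr/day x $0.252/hr   ~= $1,839.60
dbx_compute = 20 * 1 * 0.55 * DAYS         # 20 instances x 1 hr/day x $0.55/DBU-hr = $4,015.00
s3_storage  = 75 * GB_PER_TB * 0.025 * 12  # 75 TB x $0.025/GB/month x 12 months    = $23,040.00

total = ec2_compute + dbx_compute + s3_storage
print(f"${total:,.2f} per year")           # ~= $28,894.60, within ~$1 of the table's $28,895.56
```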

### **Enhanced Governance and Operational Efficiency**

- **Improved Data Governance:** Unity Catalog introduced clear data lineage and granular access control, essential for
  maintaining regulatory compliance.
- **Operational Efficiency:** Before the migration, engineers spent significant time maintaining outdated or unnecessary
  jobs. With roughly 10 unused jobs each requiring about an hour per week to manage, removing them saved approximately
  10 hours of engineering effort every week. This freed up time for engineers to focus on core operational tasks,
  accelerating product delivery and reducing maintenance overheads.

### **Key Migration Metrics**

| **Total Schemas** | **Tables Migrated** | **Data Migrated** | **Data Deleted** | **Job Optimizations**    |
|-------------------|---------------------|-------------------|------------------|--------------------------|
| 87                | 1,920-1,950         | ~22 TB            | ~75 TB           | Significant cost savings |

## **Unseen Challenges: Key Insights We Gained During the Migration**

As with any large-scale project, there were a few unexpected challenges along the way. One critical lesson we learned
was the importance of removing migrated tables from Hive immediately after the migration. Initially, we delayed this
step, which led to users continuing to write to the old tables in Hive, causing data divergence.

The takeaway? **Don’t wait to delete migrated tables from Hive.** Deleting them promptly ensures data consistency and
smooths the transition to Unity Catalog. This small adjustment made a huge difference in the overall process.
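
One way to bake this lesson into the process is to make the cleanup part of the migration script itself. A minimal
sketch, continuing the hypothetical `migrate_table` example from earlier, could look like this:

```python
# Minimal sketch of the post-migration cleanup step: once the Unity Catalog copy is
# verified, drop the Hive Metastore table so nothing keeps writing to the stale copy.
# Catalog, schema, and table names are hypothetical, as in the earlier sketch.
def retire_hive_table(schema: str, table: str, target_catalog: str = "main") -> None:
    source = f"hive_metastore.{schema}.{table}"
    target = f"{target_catalog}.{schema}.{table}"

    # Re-verify that the Unity Catalog copy matches before dropping anything.
    if spark.table(source).count() != spark.table(target).count():
        raise ValueError(f"{source} and {target} have diverged; investigate before cleanup")

    # Dropping a managed Hive table also removes its underlying data, which is the point:
    # it forces every remaining reader and writer onto the Unity Catalog table.
    spark.sql(f"DROP TABLE IF EXISTS {source}")

retire_hive_table("finance", "daily_transactions")  # hypothetical schema and table
```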

## **Conclusion: Strategic Impact and Future Roadmap**

This large-scale migration for a regulated enterprise required meticulous planning and execution to ensure zero downtime
and maintain data integrity. Every piece of data and workload was critical to operations. Our top priority was
safeguarding production workloads and preventing any data loss. This careful approach required significant effort, but
it was crucial for ensuring operational continuity and preserving data accuracy. The migration optimized the governance
model, reduced costs, and enhanced operational focus.

Stay tuned for an upcoming series of blog posts where we’ll dive deep into the technical processes and scripts used,
including enabling Unity Catalog for an existing production Databricks setup. 🚀