Commit 10abd0c

Define table migration process (#307)
Added a table migration doc. Let's discuss the migration process.
1 parent 035320c commit 10abd0c

docs/table_upgrade.md

Lines changed: 102 additions & 0 deletions
# Table Upgrade logic and data structures

The Hive Metastore migration process will upgrade the following assets:

- Tables on DBFS root
- External Tables
- Views

We don't expect this to be a "one and done" process. It is typically iterative and may require a few runs.

We suggest keeping track of the migration and giving the user continuous feedback on the progress and status of the upgrade.

The migration process will be set up as a job that can be invoked multiple times. Each run will upgrade the tables it can and report the ones it can't.

## Common considerations
1. One view per workspace summarizing the whole table inventory and various counters.
1. By default we create a single catalog per HMS (`<prefix (optional)>_<workspace_name>`); this happens at the account level.
1. The workspace name will be set up as part of the installation at the account level.
1. Consider other mappings of environments/databases to catalogs/databases.
1. The user will be able to specify a default catalog for the workspace.
1. We have to annotate the status of assets that were migrated.
1. We will roll up the migration status to the workspace/account level, showing the migration state.
1. Aggregation of migration failures.
1. View of object migration, for example:

   | Object Type | Object ID                             | Migrated | Migration Failures               |
   |-------------|---------------------------------------|----------|----------------------------------|
   | View        | hive_metastore.finance.transaction_vw | 1        | []                               |
   | Table       | hive_metastore.finance.transactions   | 0        | ["Table uses SERDE: csv"]        |
   | Table       | hive_metastore.finance.accounts       | 0        | []                               |
   | Cluster     | klasd-kladef-01265                    | 0        | ["Uses Passthru authentication"] |

1. By default the target is the target_catalog/database_name.
1. The assessment will generate a mapping file/table. The file will be in CSV format, for example:

   | Source Database | Target Catalog | Target Database |
   |-----------------|----------------|-----------------|
   | finance         | de_dev         | finance         |
   | hr              | de_dev         | human_resources |
   | sales           | ucx-dev_ws     | sales           |

1. The user can download the mapping file, override the targets, and upload it to the workspace .csx folder (see the sketch after this list).
1. By default we copy the table content (CTAS).
1. Allow skipping individual tables/databases.
1. Explore sizing tables or another threshold (recursively count bytes).
1. By default we copy the table into a managed table/managed location.
1. Allow overriding the target to an external table.
1. We should migrate ACLs for the tables (where applicable) and highlight cases where we can't (no direct translation, conflicts).
1. We should consider automating ACLs based on Instance Profiles / Service Principals and other legacy security mechanisms.

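To make the mapping file concrete, here is a minimal sketch of how the job could load the user-edited CSV and resolve a target for each table. The column names match the example above; the function names and the default-catalog fallback are illustrative assumptions, not the final design.

```python
import csv
from dataclasses import dataclass


@dataclass
class MappingRule:
    """One row of the mapping CSV: a source database and its target catalog/database."""
    source_database: str
    target_catalog: str
    target_database: str


def load_mapping(path: str) -> dict[str, MappingRule]:
    """Load the user-edited mapping file, keyed by source database name."""
    with open(path, newline="") as f:
        return {
            row["Source Database"].strip(): MappingRule(
                row["Source Database"].strip(),
                row["Target Catalog"].strip(),
                row["Target Database"].strip(),
            )
            for row in csv.DictReader(f)
        }


def resolve_target(mapping: dict[str, MappingRule], database: str, table: str,
                   default_catalog: str) -> str:
    """Return the fully qualified target name, falling back to the workspace default."""
    rule = mapping.get(database)
    if rule is None:
        return f"{default_catalog}.{database}.{table}"
    return f"{rule.target_catalog}.{rule.target_database}.{table}"
```
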
## Tables (Parquet/Delta) on DBFS root
1. By default we copy the table content (CTAS); a minimal sketch follows this list.
1. Allow skipping individual tables/databases.
1. Explore sizing tables or another threshold (recursively count bytes).
1. By default we copy the table into a managed table/managed location.
1. Allow overriding the target to an external table.
1. Allow an exception list in case we want to skip certain tables.

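A minimal sketch of the CTAS copy, assuming a live `SparkSession` named `spark` on the workspace; the exception list, size threshold, and helper name are illustrative assumptions rather than the final design.

```python
SKIP_TABLES = {"hive_metastore.finance.accounts"}  # hypothetical exception list
MAX_BYTES = 1 << 40  # hypothetical size threshold (1 TiB)


def ctas_copy(spark, source: str, target: str, size_in_bytes: int) -> bool:
    """Copy one DBFS-root table into a managed UC table; return False when skipped."""
    if source in SKIP_TABLES or size_in_bytes > MAX_BYTES:
        return False
    # Managed target by default; an external target would add a LOCATION clause.
    spark.sql(f"CREATE TABLE IF NOT EXISTS {target} AS SELECT * FROM {source}")
    # Stamp the source so reruns and the view upgrade can find the new table.
    spark.sql(f"ALTER TABLE {source} SET TBLPROPERTIES ('upgraded_to' = '{target}')")
    return True
```
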
## Tables (Parquet/Delta) on Cloud Storage
1. Verify that we have the external locations for these tables.
1. Automate the creation of external locations (future).
1. Use SYNC to upgrade these tables "in place", using the default or overridden catalog.database destination (see the sketch after this list).
1. Update the source table with the "upgraded_to" property.

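A minimal sketch of the in-place upgrade, again assuming a live `SparkSession` named `spark`; `SYNC TABLE` registers the existing external table in Unity Catalog without copying data and reports one result row per table.

```python
def sync_external_table(spark, source: str, target: str) -> str:
    """Upgrade one external table in place and return the SYNC status code."""
    # SYNC TABLE registers the cloud-storage table in Unity Catalog
    # without moving any data.
    result = spark.sql(f"SYNC TABLE {target} FROM {source}").collect()[0]
    # Mirror the "upgraded_to" bookkeeping explicitly, in case the runtime
    # does not stamp the source table itself.
    spark.sql(f"ALTER TABLE {source} SET TBLPROPERTIES ('upgraded_to' = '{target}')")
    return result.status_code
```
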
## Tables (Non-Parquet/Delta)
1. Copy the table using CTAS or DEEP CLONE (consider bringing the history along).
1. Copy the metadata.
1. Skip tables as needed, based on a size threshold or an exception list.
1. Update the source table with the "upgraded_to" property.

## Views
1. Make a "best effort" attempt to upgrade each view.
1. Create the view in the new location.
1. Upgrade table references to the new tables (based on the "upgraded_to" table property); see the sketch after this list.
1. Handle nested views.
1. Handle or highlight other cases (functions, storage references, etc.).
1. Create an exception list with view failures.

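A best-effort sketch of the view upgrade, assuming a live `SparkSession` named `spark` and a pre-built map from old table names to their "upgraded_to" targets; the plain-text substitution is only for brevity, since a real implementation would need to parse the view SQL.

```python
def upgrade_view(spark, view: str, target_view: str, upgraded: dict[str, str]) -> None:
    """Re-create `view` as `target_view` with upgraded table references."""
    # SHOW CREATE TABLE on a view returns its CREATE VIEW statement.
    definition = spark.sql(f"SHOW CREATE TABLE {view}").collect()[0][0]
    body = definition.split(" AS ", 1)[1]  # keep only the SELECT body (simplistic)
    for old_name, new_name in upgraded.items():  # built from 'upgraded_to' properties
        body = body.replace(old_name, new_name)
    spark.sql(f"CREATE OR REPLACE VIEW {target_view} AS {body}")
```
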
## Functions
1. We should migrate functions where possible.
1. Address incompatibilities.

## Account Considerations

1. HMS instances on multiple workspaces may point to the same assets. We need to dedupe upgrades (see the sketch after this list).
1. Allow running the assessment on all of the account's workspaces or on a group of workspaces.
1. We have to test on Glue and other external metastores.
1. Create an exception list at the account level. The list should contain:
    1. Tables that show up on more than one workspace (pointing to the same cloud storage location)
    1. Tables that show up on more than one workspace with different metadata
    1. Tables that show up on more than one workspace with different ACLs
1. Addressing table conflicts/duplications requires special processing. We have the following options:
    1. Define a "master" and create the derivative objects as views
    1. Flag and skip the dupes
    1. Duplicate the data and create dupes
1. Consider upgrading one workspace at a time and highlighting conflicts with prior upgrades.
1. Allow workspace admins to upgrade more than one workspace.

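A minimal sketch of the account-level dedupe, assuming we already hold a combined table inventory as `(workspace, table, location)` tuples; any cloud storage location claimed by more than one workspace goes on the exception list instead of being upgraded twice. The tuple layout is an assumption for illustration.

```python
from collections import defaultdict


def find_conflicts(inventory: list[tuple[str, str, str]]) -> dict[str, list[tuple[str, str]]]:
    """Map each storage location to the (workspace, table) pairs that point at it."""
    by_location: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for workspace, table, location in inventory:
        by_location[location].append((workspace, table))
    # Only locations referenced from more than one workspace need special handling.
    return {
        location: refs
        for location, refs in by_location.items()
        if len({workspace for workspace, _ in refs}) > 1
    }
```
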
## Open Questions
1. How do we manage/surface the potential cost of the assessment run in the case of many workspaces?
1. How do we handle conflicts between workspaces?
1. What mechanism do we use to map source databases to target databases?
1. How do we list workspaces in Azure/AWS?
