# Table upgrade logic and data structures

The Hive Metastore migration process will upgrade the following assets:

- Tables on the DBFS root
- External tables
- Views

We don't expect this to be a "one and done" process; it is typically iterative and may require a few runs.

We suggest keeping track of the migration and giving the user continuous feedback on the progress and status of the upgrade.

The migration process will be packaged as a job that can be invoked multiple times.
Each run will upgrade the tables it can and report the ones it cannot.
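That loop can be sketched as follows; `run_migration`, `upgrade_table`, and the error handling are illustrative assumptions, not the actual job implementation:

```python
# Sketch of an idempotent "upgrade what you can, report the rest" run.
# upgrade_table is a hypothetical callable that raises on failure.

def run_migration(tables, upgrade_table):
    """Attempt every remaining table; return (upgraded, failures).

    Safe to re-invoke: tables upgraded on a previous run are simply
    not in the input list the next time around.
    """
    upgraded, failures = [], {}
    for table in tables:
        try:
            upgrade_table(table)
            upgraded.append(table)
        except Exception as exc:  # record and continue; don't abort the run
            failures[table] = str(exc)
    return upgraded, failures
```

Collecting failures per table, rather than stopping at the first error, is what makes the repeated runs useful for progress reporting.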

## Common considerations

1. One view per workspace summarizing the full table inventory and various counters.
1. By default we create a single catalog per HMS, named `<prefix (optional)>_<workspace_name>`; this happens at the account level.
1. The workspace name is set up as part of the installation at the account level.
1. Consider other mappings of environments/databases to catalogs/databases.
   1. The user will be able to specify a default catalog for the workspace.
1. We have to annotate the status of assets that were migrated.
1. We will roll up the migration status to the workspace/account level, showing the migration state.
1. Aggregation of migration failures.
   1. View of object migration:

      | Object Type | Object ID | Migrated | Migration Failures |
      |----|----|----|----|
      | View | hive_metastore.finance.transaction_vw | 1 | [] |
      | Table | hive_metastore.finance.transactions | 0 | ["Table uses SERDE: csv"] |
      | Table | hive_metastore.finance.accounts | 0 | [] |
      | Cluster | klasd-kladef-01265 | 0 | ["Uses Passthru authentication"] |

1. By default the target is the `target_catalog`/`database_name`.
1. The assessment will generate a mapping file/table in CSV format.

   | Source Database | Target Catalog | Target Database |
   |----|----|----|
   | finance | de_dev | finance |
   | hr | de_dev | human_resources |
   | sales | ucx-dev_ws | sales |

1. The user can download the mapping file, override the targets, and upload it to the workspace .csx folder.
1. By default we copy the table content (CTAS).
1. Allow skipping individual tables/databases.
1. Explore sizing tables or another threshold (recursively counting bytes).
1. By default we copy the table into a managed table/managed location.
1. Allow overriding the target to an external table.
1. We should migrate ACLs for the tables where applicable, and highlight the cases where we can't (no direct translation, or conflicts).
1. We should consider automating ACLs based on Instance Profiles, Service Principals, and other legacy security mechanisms.
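The mapping CSV above could be consumed with a small loader like this sketch; the column names follow the table, while the function name and dict shape are assumptions:

```python
import csv
import io


def load_mapping(csv_text):
    """Parse the source-to-target mapping CSV into a lookup dict,
    e.g. {"hr": ("de_dev", "human_resources")}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Source Database"].strip(): (
            row["Target Catalog"].strip(),
            row["Target Database"].strip(),
        )
        for row in reader
    }


# Example content matching the mapping table above.
EXAMPLE_CSV = """Source Database,Target Catalog,Target Database
finance,de_dev,finance
hr,de_dev,human_resources
sales,ucx-dev_ws,sales
"""
```

Keying the dict on the source database name makes the user's overrides (a re-uploaded CSV) the single source of truth for target resolution.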

## Tables (Parquet/Delta) on DBFS root

1. By default we copy the table content (CTAS).
1. Allow skipping individual tables/databases.
1. Explore sizing tables or another threshold (recursively counting bytes).
1. By default we copy the table into a managed table/managed location.
1. Allow overriding the target to an external table.
1. Allow an exception list in case we want to skip certain tables.
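For the DBFS-root case, the default CTAS copy could be generated like this; the helper and its naming are a sketch, not the actual tool code:

```python
def ctas_statement(source_table, target_catalog, target_database):
    """Build the CREATE TABLE ... AS SELECT statement that copies a
    DBFS-root table into a managed table under the target catalog.

    source_table is a fully qualified HMS name such as
    'hive_metastore.finance.accounts'.
    """
    table_name = source_table.split(".")[-1]
    target = f"{target_catalog}.{target_database}.{table_name}"
    return f"CREATE TABLE IF NOT EXISTS {target} AS SELECT * FROM {source_table}"
```

`IF NOT EXISTS` keeps the statement safe across the repeated job runs described earlier.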

## Tables (Parquet/Delta) on Cloud Storage

1. Verify that we have the external locations for these tables.
1. Automate creation of external locations (future).
1. Use `SYNC` to upgrade these tables "in place", using the default or overridden catalog.database destination.
1. Update the source table with an `upgraded_to` property.
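The in-place upgrade comes down to two SQL statements: Databricks' `SYNC TABLE` command and an `ALTER TABLE ... SET TBLPROPERTIES` that stamps the source. A sketch that only builds the statements (executing them via `spark.sql` is omitted, and the helper name is an assumption):

```python
def sync_statements(source_table, target_table):
    """Return the SQL for upgrading an external table in place:
    SYNC the table into Unity Catalog, then mark the HMS source
    with the 'upgraded_to' property pointing at its replacement."""
    return [
        f"SYNC TABLE {target_table} FROM {source_table}",
        f"ALTER TABLE {source_table} "
        f"SET TBLPROPERTIES ('upgraded_to' = '{target_table}')",
    ]
```

The `upgraded_to` property is what later steps (view rewriting, repeated runs) use to tell migrated tables apart from pending ones.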

## Tables (non-Parquet/Delta)

1. Copy the table using CTAS or Deep Clone (consider bringing the history along).
1. Copy the metadata.
1. Skip tables as needed, based on a size threshold or an exception list.
1. Update the source table with an `upgraded_to` property.
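The skip logic for these tables might look like the following sketch; the threshold handling, names, and CTAS fallback are assumptions:

```python
def plan_copy(source_table, target_table, size_bytes, max_bytes, skip_list):
    """Decide how to migrate a non-Parquet/Delta table: skip it when it
    is on the exception list or over the size threshold, otherwise
    copy data and metadata with a CTAS."""
    if source_table in skip_list:
        return ("skip", "on exception list")
    if size_bytes > max_bytes:
        return ("skip", f"size {size_bytes} exceeds threshold {max_bytes}")
    return ("ctas", f"CREATE TABLE {target_table} AS SELECT * FROM {source_table}")
```

Returning a (decision, detail) pair keeps the skip reasons reportable in the per-run failure/skip summary.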

## Views

1. Make a "best effort" attempt to upgrade views.
1. Create the view in the new location.
1. Upgrade table references to the new tables (based on the `upgraded_to` table property).
1. Handle nested views.
1. Handle or highlight other cases (functions, storage references, etc.).
1. Create an exception list with view failures.

## Functions

1. We should migrate functions where possible.
1. Address incompatibilities.

## Account Considerations

1. HMS instances on multiple workspaces may point to the same assets. We need to dedupe upgrades.
1. Allow running the assessment on all the account's workspaces, or on a group of workspaces.
1. We have to test on Glue and other external metastores.
1. Create an exception list at the account level; it should contain:
   1. Tables that show up on more than one workspace (pointing to the same cloud storage location)
   1. Tables that show up on more than one workspace with different metadata
   1. Tables that show up on more than one workspace with different ACLs
1. Addressing table conflicts/duplications requires special processing. We have the following options:
   1. Define a "master" and create the derivative objects as views
   1. Flag and skip the dupes
   1. Duplicate the data, creating dupes
1. Consider upgrading one workspace at a time, highlighting conflicts with prior upgrades.
1. Allow workspace admins to upgrade more than one workspace.
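Building the cross-workspace exception list can key on the cloud storage location, as in this sketch (the input tuple shape is an assumption):

```python
from collections import defaultdict


def find_duplicates(tables):
    """Group tables from multiple workspaces by cloud storage location
    and return the locations referenced by more than one workspace --
    candidates for the account-level exception list.

    tables is an iterable of (workspace, table_name, location) tuples.
    """
    by_location = defaultdict(set)
    for workspace, _name, location in tables:
        by_location[location].add(workspace)
    return {
        loc: sorted(ws) for loc, ws in by_location.items() if len(ws) > 1
    }
```

The same grouping, keyed additionally on metadata or ACL fingerprints, would produce the other two exception-list categories above.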

## Open Questions

1. How do we manage/surface the potential cost of the assessment run in the case of many workspaces?
1. How do we handle conflicts between workspaces?
1. What mechanism do we use to map source databases to target databases?
1. How do we list workspaces in Azure/AWS?