Commit 8d0aa5d

asnare and nfx authored
Update migration-progress-experimental workflow to crawl tables from the main cluster (#3269)
## Changes
This PR updates the `migration-progress-experimental` workflow so that it crawls for tables on the `main` cluster instead of the `tacl` one. Crawling from the `tacl` cluster fails because the Py4j bridge, which the crawler relies on, isn't available there.

### Linked issues
Resolves #3268

### Functionality
- modified existing workflow: `migration-progress-experimental`

### Tests
- manually tested
- existing integration tests

---------

Co-authored-by: Serge Smertin <[email protected]>
1 parent fe68f4f commit 8d0aa5d

1 file changed: +8 -11 lines changed

src/databricks/labs/ucx/progress/workflows.py

Lines changed: 8 additions & 11 deletions
@@ -31,34 +31,31 @@ def verify_prerequisites(self, ctx: RuntimeContext) -> None:
         """
         ctx.verify_progress_tracking.verify(timeout=dt.timedelta(hours=1))
 
-    @job_task(job_cluster="tacl")
-    def setup_tacl(self, ctx: RuntimeContext):
-        """(Optimization) Allow the TACL job cluster to be started while we're verifying the prerequisites for
-        refreshing everything."""
-
-    @job_task(depends_on=[verify_prerequisites, setup_tacl], job_cluster="tacl")
+    @job_task(depends_on=[verify_prerequisites])
     def crawl_tables(self, ctx: RuntimeContext) -> None:
         """Iterates over all tables in the Hive Metastore of the current workspace and persists their metadata, such
         as _database name_, _table name_, _table type_, _table location_, etc., in the table named
         `$inventory_database.tables`. The metadata stored is then used in the subsequent tasks and workflows to, for
         example, find all Hive Metastore tables that cannot easily be migrated to Unity Catalog."""
-        # The TACL cluster is not UC-enabled, so the snapshot cannot be written immediately to the history log.
-        # Step 1 of 2: Just refresh the inventory.
+        # The table inventory cannot be (quickly) crawled from the table_migration cluster, and the main cluster is not
+        # UC-enabled, so we cannot both snapshot and update the history log from the same location.
+        # Step 1 of 3: Just refresh the inventory.
         ctx.tables_crawler.snapshot(force_refresh=True)
 
     @job_task(depends_on=[verify_prerequisites, crawl_tables], job_cluster="table_migration")
     def refresh_table_migration_status(self, ctx: RuntimeContext) -> None:
         """Scan the tables (and views) in the inventory and record whether each has been migrated or not."""
+        # Step 2 of 3: Refresh the migration status of all the tables (updated in the previous step on the main cluster.)
         ctx.migration_status_refresher.snapshot(force_refresh=True)
 
     @job_task(
         depends_on=[verify_prerequisites, crawl_tables, refresh_table_migration_status], job_cluster="table_migration"
     )
     def update_tables_history_log(self, ctx: RuntimeContext) -> None:
         """Update the history log with the latest tables inventory snapshot."""
-        # The table migration cluster is not legacy-ACL enabled, so we can't crawl from here.
-        # Step 2 of 2: Assuming (due to depends-on) the inventory was refreshed, capture into the history log.
-        # WARNING: this will fail if the inventory is empty, because it will then try to perform a crawl.
+        # Step 3 of 3: Assuming (due to depends-on) the inventory and migration status were refreshed, capture into the
+        # history log.
+        # TODO: Avoid triggering implicit refresh here if either the table or migration-status inventory is empty.
         history_log = ctx.tables_progress
         tables_snapshot = ctx.tables_crawler.snapshot()
         history_log.append_inventory_snapshot(tables_snapshot)
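
The three-step ordering in the diff above depends entirely on the `depends_on` lists, so it may help to see how such declarations resolve into a run order. The following is a minimal sketch only, not the UCX implementation: the `job_task` decorator here is a hypothetical stand-in that merely records metadata, `ProgressWorkflowSketch` and `execution_plan` are illustrative names, and the default cluster name `"main"` (including for `verify_prerequisites`, whose decorator is not shown in the diff) is an assumption based on the commit description.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def job_task(fn=None, *, depends_on=None, job_cluster="main"):
    """Hypothetical stand-in for the real decorator: it only records metadata."""

    def register(func):
        func.depends_on = [dep.__name__ for dep in (depends_on or [])]
        func.job_cluster = job_cluster  # assumption: tasks default to the main cluster
        return func

    # Support both the bare form (@job_task) and the called form (@job_task(...)).
    return register(fn) if fn is not None else register


class ProgressWorkflowSketch:
    """Mirrors the task declarations after this commit; the bodies are omitted."""

    @job_task
    def verify_prerequisites(self): ...

    @job_task(depends_on=[verify_prerequisites])
    def crawl_tables(self): ...  # step 1 of 3

    @job_task(depends_on=[verify_prerequisites, crawl_tables], job_cluster="table_migration")
    def refresh_table_migration_status(self): ...  # step 2 of 3

    @job_task(
        depends_on=[verify_prerequisites, crawl_tables, refresh_table_migration_status],
        job_cluster="table_migration",
    )
    def update_tables_history_log(self): ...  # step 3 of 3


def execution_plan(workflow_cls):
    """Return (task name, cluster) pairs in an order that respects depends_on."""
    tasks = {name: fn for name, fn in vars(workflow_cls).items() if hasattr(fn, "depends_on")}
    graph = {name: set(fn.depends_on) for name, fn in tasks.items()}
    return [(name, tasks[name].job_cluster) for name in TopologicalSorter(graph).static_order()]


if __name__ == "__main__":
    for name, cluster in execution_plan(ProgressWorkflowSketch):
        print(f"{name} -> {cluster}")
```

Because each task here depends on every task before it, only one valid order exists: the crawl runs under the default cluster, and both the migration-status refresh and the history-log update follow on `table_migration`.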
