v0.37.0
- Added ability to run `create-missing-principals` command as collection (#2675). This release introduces the capability to run the `create-missing-principals` command as a collection in the UCX tool with the new optional flag `run-as-collection`. This allows for more control and flexibility when managing cloud resources, particularly in handling multiple workspaces. The existing `create-missing-principals` command has been modified to accept a new `run_as_collection` parameter, enabling the command to run on multiple workspaces when set to `True`. The function has been updated to handle a list of `WorkspaceContext` objects, allowing it to iterate over each object and execute the necessary actions for each workspace. Additionally, a new `AccountClient` parameter has been added to facilitate the retrieval of all workspaces associated with a specific account. New test functions have been added to `test_cli.py` to test this new functionality on AWS and Azure cloud providers. The `acc_client` argument has been added to the test functions to enable running the tests with an authenticated AWS or Azure client, and the `MockPrompts` object is used to simulate user responses to the prompts displayed during the execution of the command.
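  The `run-as-collection` pattern described above recurs in several commands in this release. A minimal sketch of the idea, assuming a hypothetical `get_workspace_contexts` helper and a simplified `WorkspaceContext` (this is illustrative only, not UCX's actual `cli.py` implementation):

  ```python
  # Illustrative sketch only -- not the actual UCX implementation.
  from dataclasses import dataclass


  @dataclass
  class WorkspaceContext:
      workspace_id: int

      def create_missing_principals(self) -> None:
          print(f"creating missing principals in workspace {self.workspace_id}")


  def get_workspace_contexts(account_client) -> list[WorkspaceContext]:
      # In UCX this would use the AccountClient to enumerate workspaces with UCX installed.
      return [WorkspaceContext(w.workspace_id) for w in account_client.workspaces.list()]


  def create_missing_principals(ctx: WorkspaceContext, *, run_as_collection: bool = False, account_client=None):
      # When run as a collection, iterate over every workspace context;
      # otherwise operate only on the single current workspace.
      contexts = get_workspace_contexts(account_client) if run_as_collection else [ctx]
      for workspace_ctx in contexts:
          workspace_ctx.create_missing_principals()
  ```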
- Added storage for direct filesystem references in code (#2526). The open-source library has been updated with a new table `directfs_in_paths` to store Direct File System Access (DFSA) records, extending support for managing and collecting DFSAs as part of addressing issues #2350 and #2526. The changes include a new class `DirectFsAccessCrawlers` and methods for handling DFSAs, as well as linting, testing, and a manually verified schema upgrade. Additionally, a new SQL query deprecates the use of direct filesystem references. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
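  To make the notion of a DFSA record concrete, here is a purely illustrative sketch of what such a record and its persistence step could look like; the field names and the storage helper are assumptions, not the actual `directfs_in_paths` schema or UCX API:

  ```python
  # Purely illustrative: field names and the storage step are assumptions,
  # not UCX's actual directfs_in_paths schema.
  from dataclasses import dataclass, asdict


  @dataclass
  class DirectFsAccess:
      path: str            # e.g. "dbfs:/mnt/raw/events" found in source code
      is_write: bool       # whether the code writes to the path
      source_id: str       # file or notebook where the reference was found
      source_lineage: str  # how the crawler reached that file (job/task/notebook chain)


  def store_directfs_records(records: list[DirectFsAccess]) -> list[dict]:
      # In UCX these rows end up in an inventory table; here we just
      # materialize them as plain dictionaries.
      return [asdict(record) for record in records]


  print(store_directfs_records([DirectFsAccess("dbfs:/mnt/raw/events", False, "etl.py", "job/123")]))
  ```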
- Added task for linting queries (#2630). This commit introduces a new `QueryLinter` class for linting SQL queries in the workspace, similar to the existing `WorkflowLinter` for jobs. The `QueryLinter` checks for any issues in dashboard queries and reports them in a new `query_problems` table. The commit also includes the addition of unit tests, integration tests, and manual testing of the schema upgrade. The `QueryLinter` has been updated to include a `TableMigrationIndex` object, which is currently set to an empty list and will be updated in a future commit. This change improves the quality of the codebase by ensuring that all SQL queries are properly linted and any issues are reported, allowing for better maintenance and development of the system. The commit is co-authored by multiple developers, including Eric Vergnaud, Serge Smertin, Andrew Snare, and Cor. Additionally, a new linting rule, "direct-filesystem-access", has been introduced to deprecate the use of direct filesystem references in favor of more abstracted file access methods in the project's codebase.
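  As a rough illustration of what a query linter for direct filesystem references can do, the sketch below uses `sqlglot` (already a project dependency) to flag string literals that look like filesystem paths; this is not the actual `QueryLinter` logic:

  ```python
  # Sketch only: flags string literals that look like direct filesystem paths
  # in a SQL query. UCX's real QueryLinter is more involved.
  import sqlglot
  from sqlglot import expressions as exp

  FS_PREFIXES = ("dbfs:/", "s3://", "s3a://", "abfss://", "gs://", "/dbfs/", "/mnt/")


  def find_direct_fs_references(sql: str) -> list[str]:
      problems = []
      for literal in sqlglot.parse_one(sql).find_all(exp.Literal):
          if literal.is_string and literal.this.startswith(FS_PREFIXES):
              problems.append(f"direct-filesystem-access: {literal.this}")
      return problems


  print(find_direct_fs_references("SELECT * FROM read_files('s3://bucket/raw/events/')"))
  ```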
- Adopt `databricks-labs-pytester` PyPI package (#2663). In this release, we have updated the `pyproject.toml` file, bumping the `pytest` package from version 8.1.0 to 8.3.3 and adding the `databricks-labs-pytester` package with a minimum version of 0.2.1. This update includes the adoption of the `databricks-labs-pytester` PyPI package, which moves fixture usage from `mixins.fixtures` into its own top-level library. This affects various test files, including `test_jobs.py`, by replacing the `get_purge_suffix` fixture with `watchdog_purge_suffix` to standardize the approach to creating and managing temporary directories and files used in tests. Additionally, new fixtures have been introduced in a separate PR for testing the `databricks.labs.ucx` package, including `debug_env_name`, `product_info`, `inventory_schema`, `make_lakeview_dashboard`, `make_dashboard`, `make_dbfs_data_copy`, `make_mounted_location`, `make_storage_dir`, `sql_exec`, and `migrated_group`. These fixtures simplify the testing process by providing preconfigured resources that can be used in the tests. The `redash.py` file has been removed from the `databricks/labs/ucx/mixins` directory as the Redash API is being deprecated and replaced with a new library.
- Assessment: crawl UDFs as a task in parallel to tables instead of implicitly during grants (#2642). This release introduces changes to the assessment workflow, specifically in how User Defined Functions (UDFs) are crawled. Previously, UDFs were crawled implicitly by the `GrantsCrawler`, which requested a snapshot from the `UDFSCrawler` that hadn't executed yet. With this update, UDFs are now crawled as their own task, running in parallel with tables before grants crawling begins. This modification addresses issue #2574, which requires grants and UDFs to be refreshable, but only once within a given workflow run. A new method, `crawl_udfs`, has been introduced to iterate over all UDFs in the Hive Metastore of the current workspace and persist their metadata in a table named `$inventory_database.udfs`. This inventory is utilized when scanning securable objects for issues with grants that cannot be migrated to Unity Catalog. The `crawl_grants` task now depends on `crawl_udfs`, `crawl_tables`, and `setup_tacl`, ensuring that UDFs are crawled before grants are.
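  The dependency ordering described above (UDFs and tables crawled in parallel, grants only afterwards) can be sketched with plain `concurrent.futures`; this illustrates the scheduling idea only, not UCX's actual workflow-task machinery:

  ```python
  # Sketch of the ordering only: crawl UDFs and tables concurrently, then crawl grants.
  from concurrent.futures import ThreadPoolExecutor


  def crawl_udfs() -> list[str]:
      # In UCX this persists UDF metadata into $inventory_database.udfs.
      return ["hive_metastore.default.my_udf"]


  def crawl_tables() -> list[str]:
      return ["hive_metastore.default.my_table"]


  def crawl_grants(udfs: list[str], tables: list[str]) -> None:
      # Grants crawling only starts once both snapshots exist, mirroring the
      # crawl_grants dependency on crawl_udfs and crawl_tables.
      print(f"crawling grants for {len(udfs)} UDFs and {len(tables)} tables")


  with ThreadPoolExecutor() as pool:
      udfs_future = pool.submit(crawl_udfs)
      tables_future = pool.submit(crawl_tables)
      crawl_grants(udfs_future.result(), tables_future.result())
  ```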
- Collect direct filesystem access from queries (#2599). This commit introduces support for extracting Direct File System Access (DirectFsAccess) records from workspace queries, adding a new table `directfs_in_queries` and a new view `directfs` that unions `directfs_in_paths` with the new table. The `DirectFsAccessCrawlers` class has been refactored into two factory constructors, `DirectFsAccessCrawler.for_paths` and `DirectFsAccessCrawler.for_queries`, and a new `QueryLinter` class has been introduced to check queries for DirectFsAccess records. Unit tests and manual tests have been conducted to ensure the correct functioning of the schema upgrade. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
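  The `for_paths` / `for_queries` split is a classic factory-classmethod pattern; as a hedged sketch (the constructor arguments below are assumptions, not the real signatures, while the table names come from the changelog):

  ```python
  # Sketch of the factory-classmethod pattern behind for_paths / for_queries.
  class DirectFsAccessCrawler:
      def __init__(self, schema: str, table: str):
          self.full_name = f"{schema}.{table}"  # where collected DFSA records are stored

      @classmethod
      def for_paths(cls, schema: str) -> "DirectFsAccessCrawler":
          # Records found while linting source files and notebooks.
          return cls(schema, "directfs_in_paths")

      @classmethod
      def for_queries(cls, schema: str) -> "DirectFsAccessCrawler":
          # Records found while linting workspace (dashboard) queries.
          return cls(schema, "directfs_in_queries")


  print(DirectFsAccessCrawler.for_queries("ucx").full_name)  # ucx.directfs_in_queries
  ```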
- Fixed failing integration test: `test_reflect_account_groups_on_workspace_skips_groups_that_already_exists_in_the_workspace` (#2624). In this release, we have made updates to the group migration workflow, addressing an issue (#2623) where the integration test `test_reflect_account_groups_on_workspace_skips_groups_that_already_exists_in_the_workspace` failed due to unhandled scenarios where a workspace group already existed with the same name as an account group to be reflected. The changes include the addition of a new method, `_workspace_groups_in_workspace()`, which checks for the existence of workspace groups, as well as modifications to the `group-migration` workflow and the integration test `test_reflect_account_groups_on_workspace_skips_account_groups_when_a_workspace_group_has_same_name`. To enhance consistency and robustness, two new tests covering the `GroupManager` class have been added: `test_reflect_account_groups_on_workspace_warns_skipping_when_a_workspace_group_has_same_name` and `test_reflect_account_groups_on_workspace_logs_skipping_groups_when_already_reflected_on_workspace`. These tests verify that a group is skipped, with a warning logged, when a workspace group with the same name exists, and that groups already reflected on the workspace are logged as skipped. These improvements ensure that the system behaves as expected during the group migration process, handling cases where workspace groups and account groups share the same name.
- Fixed failing solution accelerator verification tests (#2648). This release includes a fix for an issue in the `LocalCodeLinter` class that was unable to normalize Python code at the notebook-cell level. The solution involved modifying the `LocalCodeLinter` constructor to include a notebook loader, as well as adding a conditional block to the `lint_path` method to determine the correct loader to use based on whether the path is a notebook or not. These changes allow the linter to handle Python code within Jupyter notebook cells more effectively. The fix was manually verified using `make solacc` on the files that failed in CI. This commit has been co-authored by Eric Vergnaud. The functionality of the linter otherwise remains unchanged, and there is no impact on the overall software functionality.
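  Choosing a loader based on whether a path is a notebook is typically done by checking for the Databricks notebook header in exported source files; a minimal sketch of that dispatch (stand-in loader names, not the actual `LocalCodeLinter` code):

  ```python
  # Sketch of loader dispatch based on whether a local file is an exported
  # Databricks notebook; the loader names here are stand-ins, not UCX's.
  from pathlib import Path

  NOTEBOOK_HEADER = "# Databricks notebook source"


  def is_notebook(path: Path) -> bool:
      with path.open("r", encoding="utf-8") as f:
          return f.readline().strip() == NOTEBOOK_HEADER


  def pick_loader(path: Path) -> str:
      # A notebook needs cell-aware loading; plain Python files do not.
      return "notebook-loader" if is_notebook(path) else "file-loader"
  ```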
- Fixed handling of potentially corrupt `state.json` of UCX workflows (#2673). This commit introduces a fix for potential corruption of `state.json` files in UCX workflows, addressing issue #2673 and resolving #2667. It updates the import statement in `install.py`, introduces a new `with_extra` function, and centralizes the deletion of jobs, improving code maintainability. Two new methods are added to check if a job is managed by UCX. Additionally, the commit removes deprecation warnings for direct filesystem references in pytester fixtures and adjusts the `known.json` file to accurately reflect the project's state. A new `Task` method is added for defining UCX workflow tasks, and several test cases are updated to ensure the correct handling of jobs during the uninstallation process. Overall, these changes enhance the reliability and user-friendliness of the UCX workflow installation process.
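  Handling a potentially corrupt `state.json` boils down to treating parse failures as "no usable state" instead of crashing; a generic, hedged sketch (not the UCX installer code):

  ```python
  # Generic defensive parsing: a corrupt or missing state.json yields an empty
  # state instead of aborting the uninstall/upgrade flow.
  import json
  from pathlib import Path


  def load_state(path: Path) -> dict:
      try:
          return json.loads(path.read_text(encoding="utf-8"))
      except FileNotFoundError:
          return {}
      except json.JSONDecodeError as err:
          print(f"warning: {path} is corrupt ({err}); ignoring stored state")
          return {}
  ```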
- Let `create-catalog-schemas` command run as collection (#2653). The `create-catalog-schemas` and `validate-external-locations` commands in the `databricks labs ucx` package have been updated to operate as collections, allowing for simultaneous execution on multiple workspaces. These changes, which resolve issue #2609, include the addition of new parameters and flags to the command and method signatures, as well as updates to the existing functionality for creating catalogs and schemas. The changes have been manually tested and accompanied by unit tests, with integration tests to be added in a future update. The `create-catalog-schemas` command now accepts a list of workspace clients and a `run_as_collection` parameter, and skips existing catalogs and schemas while logging a message. The `validate-external-locations` command also operates as a collection, though specific details about this change are not provided.
- Let `create-uber-principal` command run on collection of workspaces (#2640). The `create-uber-principal` command has been updated to support running on a collection of workspaces, allowing for more efficient management of service principals across multiple workspaces. This change includes the addition of a new flag, `run-as-collection`, which, when set to true, allows the command to run on a collection of workspaces with UCX installed. The command continues to grant STORAGE_BLOB_READER access to Azure storage accounts and identify S3 buckets used in AWS workspaces. The changes also include updates to the testing strategy, with manual testing and unit tests added; integration tests will be added in a future PR. In terms of implementation, the `create_uber_principal` method in the `access.py` and `cli.py` files has been updated to support running on a collection of workspaces. The modification includes the addition of a new parameter, `run_as_collection`, which, when set to `True`, allows the method to retrieve a collection of workspace contexts and execute the necessary operations for each context. The changes also include updates to the underlying methods, such as the `aws_profile` method, to ensure the correct cloud provider is used. The behavior of the command has been isolated from the underlying `ucx` functionality by introducing mock values for the uber service principal ID and policy ID. The tests have been updated to reflect these modifications, with new tests added to ensure that the command behaves correctly when run on a collection of workspaces and to test the error handling for unsupported cloud providers and missing subscription IDs.
- Let `migrate-acls` command run as collection (#2664). The `migrate-acls` command in the `labs.yml` file has been updated to facilitate the migration of access control lists (ACLs) from a legacy metastore to a UC metastore for a collection of workspaces with Unity Catalog (UC) installed. This command now supports running as a collection, enabled by a new optional flag `run-as-collection`. When set to true, the command will run for all workspaces with UC installed, enhancing efficiency and ease of use. The new functionality has been manually tested and verified with added unit tests; integration tests are yet to be added. The command is part of the `databricks/labs/ucx` module and is implemented in the `cli.py` file. This update addresses issue #2611.
- Let `migrate-dbsql-dashboards` command run as collection (#2656). The `migrate-dbsql-dashboards` command in the `databricks labs ucx` command group has been updated to support running as a collection, allowing it to migrate queries for all dashboards in one or more workspaces. This new feature is achieved by adding an optional flag `run-as-collection` to the command. If set to true, the command will be executed for all workspaces with UCX installed, resolving issue #2612. The `migrate-dbsql-dashboards` function has been updated to take additional parameters `ctx`, `run_as_collection`, and `a`. The `ctx` parameter is an optional `WorkspaceContext` object, which can be used to specify the context for a single workspace; if not provided, the function will retrieve a list of `WorkspaceContext` objects for all workspaces. The `run_as_collection` parameter is a boolean flag indicating whether the command should run as a collection: if set to `True`, the function will iterate over all workspaces and migrate queries for all dashboards in each workspace. The `a` parameter is an optional `AccountClient` object for authentication. Unit tests have been added to ensure that the new functionality works as expected. This feature will be useful for users who need to migrate many dashboards at once. Integration tests will be added in a future update after issue #2507 is addressed.
- Let `migrate-locations` command run as collection (#2652). The `migrate-locations` command in the `databricks labs ucx` library for AWS and Azure has been enhanced to support running against a collection of workspaces, allowing for more efficient management of external locations. This has been achieved by modifying the existing `databricks labs ucx migrate-locations` command and adding a `run_as_collection` flag to specify that the command should run for a collection of workspaces. The changes include updates to the `run` method in `locations.py` to return a list of strings containing the URLs of missing external locations, and the addition of the `_filter_unsupported_location` method to filter out unsupported locations. A new `_get_workspace_contexts` function has been added to return a list of `WorkspaceContext` objects based on the provided `WorkspaceClient`, `AccountClient`, and named parameters. The commit also includes new test cases for handling unsupported cloud providers and for testing the run-as-collection functionality with multiple workspaces, as well as manual and unit tests. Note that due to current limitations in unit testing, the run-as-collection tests for both Azure and AWS raise exceptions.
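  Filtering unsupported locations usually comes down to checking the URL scheme against what the target cloud supports; a hedged, illustrative sketch (the scheme list below is an assumption, not the exact rule UCX applies):

  ```python
  # Illustrative only: keep external locations whose scheme the target cloud supports.
  SUPPORTED_SCHEMES = {
      "aws": ("s3://", "s3a://"),
      "azure": ("abfss://",),
  }


  def filter_unsupported_locations(locations: list[str], cloud: str) -> list[str]:
      prefixes = SUPPORTED_SCHEMES.get(cloud, ())
      return [loc for loc in locations if loc.startswith(prefixes)]


  missing = ["s3://bucket/data", "wasbs://legacy@acc.blob.core.windows.net/x"]
  print(filter_unsupported_locations(missing, "aws"))  # ['s3://bucket/data']
  ```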
- Let `migrate-tables` command run as collection (#2654). The `migrate-tables` command in the `labs.yml` configuration file has been updated to support running against a collection of workspaces with UCX installed. This change includes adding a new flag `run_as_collection` that, when set to `True`, allows the command to run on all workspaces in the collection, and modifying the existing command to accept an `AccountClient` object and `WorkspaceContext` objects. The function `_get_workspace_contexts` is used to retrieve the `WorkspaceContext` objects for each workspace in the collection. Additionally, the `migrate_tables` command now checks for the presence of hiveserde and external tables and prompts the user to run the `migrate-external-hiveserde-tables-in-place-experimental` and `migrate-external-tables-ctas` workflows, respectively. The command's documentation and tests have also been updated to reflect this new functionality, and integration tests will be added in a future update. These changes improve the scalability and efficiency of the `migrate-tables` command, allowing for easier and more streamlined execution across multiple workspaces.
- Let `validate-external-locations` command run as collection (#2649). In this release, the `validate-external-locations` command has been updated to support running as a collection, allowing it to operate on multiple workspaces simultaneously. This change includes the addition of new parameters `ctx`, `run_as_collection`, and `a` to the `validate-external-locations` command in the `cli.py` file. The `ctx` parameter determines the current workspace context; when `run_as_collection` is set to `True`, the contexts are obtained through the `_get_workspace_contexts` function, which queries for all available workspaces associated with the given account client `a`. The `save_as_terraform_definitions_on_workspace` method is then called to save the external locations as Terraform definitions on each workspace. The `run_as_collection` parameter controls whether the command is executed as a collection, ensuring sequential execution of each statement within the command, and the unit tests have been updated with a test case that verifies this behavior. The `validate_external_locations` function has also been updated to include a `ctx` parameter, which is used to specify the workspace context. These changes improve the validation process for external locations across multiple workspaces.
- Let `validate-groups-membership` command run as collection (#2657). The latest commit introduces an optional `run-as-collection` flag to the `validate-groups-membership` command in the `labs.yml` configuration file. This flag, when set to true, enables the command to run for a collection of workspaces with UCX installed. The updated `validate-groups-membership` command in `databricks/labs/ucx/cli.py` now accepts new arguments: `ctx`, `run_as_collection`, and `a`. This change resolves issue #2613 and includes updated unit and manual tests, ensuring thorough functionality verification. The new feature allows software engineers to validate group memberships across multiple workspaces simultaneously, enhancing efficiency and ease of use. When run as a collection, the command validates groups at both the account and workspace levels, comparing memberships for each specified workspace context.
- Removed installing on workspace log message in `_get_installer` (#2641). The installation flow around the `_get_installer` function in the `install.py` file has been adjusted to improve the clarity of the installation process for users. Specifically, a confusing log message that incorrectly indicated that UCX was being installed when it was not has been removed and relocated to a more accurate position in the codebase. The behavior of `_get_installer` itself is otherwise unchanged; only the log message has moved. This change eliminates confusion about the installation of UCX, thus enhancing the overall user experience.
- Support multiple subscription ids for command line commands (#2647). The `databricks labs ucx` tool now supports multiple subscription IDs for the `create-uber-principal`, `guess-external-locations`, `migrate-credentials`, and `migrate-locations` commands. This change allows users to specify multiple subscriptions for scanning storage accounts, improving management for users who handle multiple subscriptions simultaneously. Relevant flags in the `labs.yml` configuration file have been updated, and unit tests as well as manual testing have been conducted to ensure proper functionality. In the `cli.py` file, the `create_uber_principal` and `principal_prefix_access` functions have been updated to accept a list of subscription IDs, affecting the corresponding commands. The `azure_subscription_id` property has been renamed to `azure_subscription_ids`, modifying the `AzureResources` constructor and ensuring correct handling of the subscription IDs.
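  Moving from a single subscription ID to many is mostly a matter of accepting a repeatable or comma-separated flag value and normalizing it into a list; a generic sketch (the flag handling below is illustrative, not the exact UCX CLI code):

  ```python
  # Illustrative: normalize a comma-separated subscription-id flag into a list
  # before deciding which Azure subscriptions to scan.
  def parse_subscription_ids(raw: str | None) -> list[str]:
      if not raw:
          return []
      return [sub_id.strip() for sub_id in raw.split(",") if sub_id.strip()]


  def scan_storage_accounts(azure_subscription_ids: list[str]) -> None:
      for sub_id in azure_subscription_ids:
          print(f"scanning storage accounts in subscription {sub_id}")


  scan_storage_accounts(parse_subscription_ids("sub-111, sub-222"))
  ```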
- Updated databricks-labs-lsql requirement from <0.11,>=0.5 to >=0.5,<0.12 (#2666). In this release, we have updated the version requirement for the `databricks-labs-lsql` library in the `pyproject.toml` file from greater than or equal to 0.5 and less than 0.11 to greater than or equal to 0.5 and less than 0.12. This change allows us to use the latest version of the `databricks-labs-lsql` library while still maintaining a version range constraint. This library provides functionality for managing and querying data in Databricks, and the update ensures compatibility with the project's existing dependencies. No other changes are included in this commit.
- Updated sqlglot requirement from <25.21,>=25.5.0 to >=25.5.0,<25.22 (#2633). In this pull request, we have updated the `sqlglot` dependency requirement in the `pyproject.toml` file. The previous requirement allowed a minimum version of 25.5.0 and less than 25.21; this has now been changed to a minimum version of 25.5.0 and less than 25.22. This update allows us to utilize the latest version of `sqlglot`, up to but not including version 25.22. While the changelog and commits for the latest version of `sqlglot` are provided for reference, the specific changes made to the project as a result of this update are not detailed in the pull request description; reviewers should verify the compatibility of the updated `sqlglot` version with the project and ensure that any necessary modifications have been made to accommodate the new version.
- fix `test_running_real_remove_backup_groups_job` timeout (#2651). In this release, the `test_running_real_remove_backup_groups_job` test case has been adjusted by increasing the timeout of an inner task from 90 seconds to 3 minutes. This change addresses the timeout issue reported in issue #2639. Integration tests have been incorporated to ensure the code continues to function correctly; the functionality of the code itself remains unaffected. This enhancement aims to provide a more reliable and efficient testing process, thereby improving the overall quality of the open-source library.
Dependency updates:
- Updated sqlglot requirement from <25.21,>=25.5.0 to >=25.5.0,<25.22 (#2633).
- Updated databricks-labs-lsql requirement from <0.11,>=0.5 to >=0.5,<0.12 (#2666).
Contributors: @JCZuurmond, @ericvergnaud, @asnare, @dependabot[bot], @nfx, @HariGS-DB