v0.38.0
- Added Py4j implementation of tables crawler to retrieve a list of HMS tables in the assessment workflow (#2579). In this release, we have added a Py4j implementation of a tables crawler to retrieve a list of Hive Metastore tables in the assessment workflow. A new
`FasterTableScanCrawler` class has been introduced, which can be used in the Assessment Job based on a feature flag to replace the old Scala code, allowing for better logging during table scans. The existing `assessment.crawl_tables` workflow now utilizes the new Py4j crawler instead of the Scala one. Integration tests have been added to ensure the functionality works correctly. The commit also includes a new method for listing table names in the specified database and improvements to error handling and logging mechanisms. The new Py4j tables crawler enhances the functionality of the assessment workflow by improving error handling, resulting in better logging and faster table scanning during the assessment process. This change is part of addressing issue #2190 and was co-authored by Serge Smertin.
- Added
`create-ucx-catalog` cli command (#2694). A new CLI command, `create-ucx-catalog`, has been added to create a catalog for migration tracking that can be used across multiple workspaces. The command creates a UCX catalog for tracking migration status and artifacts, and is invoked by running `databricks labs ucx create-ucx-catalog` and specifying the storage location for the catalog. Relevant user documentation, unit tests, and integration tests have been added for this command. The `assign-metastore` command has also been updated to allow for the selection of a metastore when multiple metastores are available in the workspace region. This change improves the migration tracking feature and enhances the user experience.
- Added experimental
`migration-progress-experimental` workflow (#2658). This commit introduces an experimental workflow, `migration-progress-experimental`, which refreshes the inventory for various resources such as clusters, grants, jobs, pipelines, policies, tables, TableMigrationStatus, and UDFs. The workflow can be triggered using the `databricks labs ucx migration-progress` CLI command and uses a new implementation of a Scala-based crawler, `TablesCrawler`, which will eventually replace the current implementation. The new workflow is a duplicate of most of the `assessment` pipeline's functionality but with some differences, such as the use of `TablesCrawler`. Relevant user documentation has been added, along with unit tests, integration tests, and a screenshot of a successful staging environment run. The new workflow is expected to run on a schedule in the future. This change resolves #2574 and progresses #2074.
- Added handling for
`InternalError` in `Listing.__iter__` (#2697). This release introduces improved error handling in the `Listing.__iter__` method of the `Generic` class, located in the `workspace_access/generic.py` file. Previously, only `NotFound` exceptions were handled, but now both `InternalError` and `NotFound` exceptions are caught and logged appropriately. This change enhances the robustness of the method, which is responsible for listing objects of a specific type and returning them as `GenericPermissionsInfo` objects. The change is covered by new unit tests and was also tested manually. The logging of the `InternalError` exception is properly handled in the `GenericPermissionsSupport` class when listing serving endpoints. This behavior is verified by the newly added test function `test_internal_error_in_serving_endpoints_raises_warning` and the updated `test_serving_endpoints_not_enabled_raises_warning`.
- Added handling for
`PermissionDenied` when listing accessible workspaces (#2733). A new `can_administer` method has been added to the `Workspaces` class in the `workspaces.py` file, which allows for more fine-grained control over which users can administer workspaces. This method checks if the user has access to a given workspace and is a member of the workspace's `admins` group, indicating that the user has administrative privileges for that workspace. If the user does not have access to the workspace or is not a member of the `admins` group, the method returns `False`. Additionally, error handling in the `get_accessible_workspaces` method has been improved by adding a `PermissionDenied` exception to the list of exceptions that are caught and logged. New unit tests have been added for the `AccountWorkspaces` class of the `databricks.labs.blueprint.account` module to ensure that the new method is functioning as intended, specifically checking if a user is a workspace administrator based on whether they belong to the `admins` group. The linked issue #2732 is resolved by this change. All changes have been manually and unit tested.
- Added static code analysis results to assessment dashboard (#2696). This commit introduces two new tasks,
`assess_dashboards` and `assess_workflows`, to the existing assessment dashboard for identifying migration problems in dashboards and workflows. These tasks analyze embedded queries and notebooks for migration issues and collect direct filesystem access patterns requiring attention. Upon completion, the results are stored in the inventory database and displayed on the Migration dashboard. Additionally, two new widgets, job/query problem widgets and directfs access widgets, have been added to enhance the dashboard's functionality by providing additional information related to code compatibility and access control. Integration tests using mock data have been added and manually tested to ensure the proper functionality of these new features. This update improves the overall assessment and compatibility checking capabilities of the dashboard, making it easier for users to identify and address issues related to Unity Catalog compatibility in their workflows and dashboards.
- Added unskip CLI command to undo a skip on schema or a table (#2727). This pull request introduces a new CLI command, `unskip`, which allows users to reverse a previously applied
`skip` on a schema or table. The `unskip` command accepts a required `--schema` parameter and an optional `--table` parameter. A new function, also named `unskip`, has been added, which takes the same parameters as the `skip` command. The function checks for the required `--schema` parameter and creates a new `WorkspaceContext` object to call the appropriate method on the `table_mapping` object. Two new methods, `unskip_schema` and `unskip_table_or_view`, have been added to the `HiveMapping` class. These methods remove the skip mark from a schema or table, respectively, and handle exceptions such as `NotFound` and `BadRequest`. The `get_tables_to_migrate` method has been updated to consider the unskipped tables or schemas. Currently, the feature is tested manually and has not been added to the user documentation.
- Added unskip CLI command to undo a skip on schema or a table (#2734). A new
`unskip` CLI command has been added to the project, which allows users to remove the `skip` mark set by the existing `skip` command on a specified schema or table. This command takes an optional `--table` flag, and if not provided, it will unskip the entire schema. The new functionality is accompanied by a unit test and relevant user documentation, and addresses issue #1938. The implementation includes the addition of the `unskip_table_or_view` method, which generates the appropriate `ALTER TABLE/VIEW` statement to remove the skip marker, and updates to the `unskip_schema` method to include the schema name in the `ALTER SCHEMA` statement. Additionally, exception handling has been updated to include `NotFound` and `BadRequest` exceptions. This feature simplifies the process of undoing a skip on a schema, table, or view in the Hive metastore, which previously required manual editing of the Hive metastore properties.
- Assess source code as part of the assessment (#2678). This commit introduces enhancements to the assessment workflow, including the addition of two new tasks for evaluating source code from SQL queries in dashboards and from notebooks/files in jobs and tasks. The existing
`databricks labs install ucx` command has been modified to incorporate linting during the assessment. The `QueryLinter` class has been updated to accept an additional argument for linting source code. These changes have been thoroughly tested through integration tests to ensure proper functionality. Co-authored by Eric Vergnaud.
- Bump astroid version, pylint version and drop our f-string workaround (#2746). In this update, we have bumped the versions of astroid and pylint to 3.3.1 and removed workarounds related to f-string inference limitations in previous versions of astroid (< 3.3). These workarounds were necessary for handling issues such as uninferrable `sys.path` values and the lack of f-string inference in loops. We have also updated corresponding tests to reflect these changes and improve the overall code quality and maintainability of the project. These changes are part of a larger effort to update dependencies and simplify the codebase by leveraging the latest features of updated tools and removing obsolete workarounds.
- Delete temporary files when running solacc (#2750). This commit includes changes to the
`solacc.py` script to improve the linting process for the `solacc` repository, specifically targeting the issue of excessive temporary files that were exceeding CI storage capacity. The modifications include linting the repository on a per-top-level `solution` basis, where each solution resides within the top folders and is independent of others. Post-linting, temporary files and directories registered in `PathLookup` are deleted to enhance storage efficiency. Additionally, this commit prepares for improving false positive detection and introduces a new `SolaccContext` class that tracks various aspects of the linting process, providing more detailed feedback on the linting results. This change does not introduce new functionality or modify existing functionality, but rather optimizes the linting process for the `solacc` repository, maintaining CI storage capacity levels within acceptable limits.
- Don't report direct filesystem access for API calls (#2689). This release introduces enhancements to the Direct File System Access (DFSA) linter, resolving false positives in API call reporting. The
`ws.api_client.do` call previously triggered inaccurate direct filesystem access alerts, which have been addressed by adding new methods to identify HTTP call parameters and specific API calls. The linter now disregards DFSA patterns within known API calls, eliminating false positives with relative URLs and duplicate advice from `SparkSqlPyLinter`. Additionally, improvements in the `python_ast.py` and `python_infer.py` files include the addition of `is_instance_of` and `is_from_module` methods, along with safer inference methods to prevent infinite recursion and enhance value inference. These changes significantly improve the DFSA linter's accuracy and effectiveness when analyzing code containing API calls.
- Enables cli cmd
`databricks labs ucx create-catalog-schemas` to apply catalog/schema acl from legacy hive_metastore (#2676). The new release introduces a `databricks labs ucx create-catalog-schemas` command, which applies catalog/schema Access Control List (ACL) from a legacy hive_metastore. This command modifies the existing `table_mapping` method to include a new `grants_crawler` parameter in the `CatalogSchema` constructor, enabling the application of ACLs from the legacy hive_metastore. A corresponding unit test is included to ensure proper functionality. The `CatalogSchema` class in the `databricks.labs.ucx.hive_metastore.catalog_schema` module has been updated with a new argument `hive_acl` and the integration of the `GrantsCrawler` class. The `GrantsCrawler` class is responsible for crawling the Hive metastore and retrieving grants for catalogs, schemas, and tables. The `prepare_test` function has been updated to include the `hive_acl` argument and the `test_catalog_schema_acl` function has been updated to test the new functionality, ensuring that the correct grant statements are generated for a wider range of principals and catalogs/schemas. These changes improve the functionality and usability of the `databricks labs ucx create-catalog-schemas` command, allowing for a more seamless transition from a legacy hive metastore.
- Fail
`make test` on coverage below 90% (#2682). A new change has been introduced to the `pyproject.toml` file to enhance the codebase's quality and robustness by ensuring that the test coverage remains above 90%. This has been accomplished by adding the `--cov-fail-under=90` flag to the `test` and `coverage` scripts in the `[tool.hatch.envs.default.scripts]` section. This flag will cause the `make test` command to fail if the coverage percentage falls below the specified value of 90%, ensuring that all new changes are thoroughly tested and that the codebase maintains a minimum coverage threshold. This is a best practice for maintaining code coverage and improving the overall quality and reliability of the codebase.
- Fixed DFSA false positives from f-string fragments (#2679). This commit addresses false positive Direct File System Access (DFSA) reports in Python code, specifically in f-string fragments containing forward slashes and curly braces. The linter has been updated to accurately detect DFSA paths while avoiding false positives, and it now checks for
`JoinedStr` fragments in string constants. Additionally, the commit rectifies issues with duplicate advices reported by `SparkSqlPyLinter`. No new features or major functionality changes have been introduced; instead, the focus has been on improving the reliability and accuracy of DFSA detection. Co-authored by Eric Vergnaud, this commit includes new unit tests and refinements to the DFSA linter, specifically addressing false positive patterns like `f"/Repos/{thing1}/sdk-{thing2}-{thing3}"`. To review these changes, consult the updated tests in the `tests/unit/source_code/linters/test_directfs.py` file, such as the new test case for the f-string pattern causing false positives.
- Fixed failing integration tests that perform a real assessment (#2736). In this release, we have made significant improvements to the integration tests in the
`assessment` workflow, by reducing the scope of the assessment and improving efficiency and reliability. We have removed several object creation functions and added a new function `populate_for_linting` for linting purposes. The `populate_for_linting` function adds necessary information to the installation context, and is used to ensure that the integration tests still have the required data for linting. We have also added a pytest fixture `populate_for_linting` to set up a minimal amount of data in the workspace for linting purposes. These changes have been implemented in the `test_workflows.py` file in the integration/assessment directory. This will help to ensure that the tests are not unnecessarily extensive, and that they are able to accurately assess the functionality of the library.
- Fixed sqlglot crasher with 'drop schema ...' statement (#2758). In this release, we have addressed a crash issue in the
`sqlglot` library caused by the `drop schema` statement. A new method, `_unsafe_lint_expression`, has been introduced to prevent the crash by checking if the current expression is a `Use`, `Create`, or `Drop` statement and updating the `schema` attribute accordingly. The library now correctly handles the `drop schema` statement and returns a `Deprecation` warning if the table being processed is in the `hive_metastore` catalog and has been migrated to the Unity Catalog. Unit tests have been added to ensure the correct behavior of this code, and the linter for `from table` SQL has been updated to parse and handle the `drop schema` statement without raising any errors. These changes improve the library's overall reliability and stability, allowing it to operate smoothly with the `drop schema` statement.
- Fixed test failure:
`test_table_migration_job_refreshes_migration_status[regular-migrate-tables]` (#2625). In this release, we have addressed two issues (#2621 and #2537) and fixed a test failure in `test_table_migration_job_refreshes_migration_status[regular-migrate-tables]`. The `index` and `index_full_refresh` methods in `table_migrate.py` have been updated to accept a new `force_refresh` flag. When set to `True`, these methods will ensure that the migration status is up-to-date. This change also affects the `ViewsMigrationSequencer` class, which now passes `force_refresh=True` to the `index` method. Additionally, we have fixed a test failure by reusing the `force_refresh` flag to ensure the migration status is up-to-date. The `TableMigrationStatus` class in `table_migration_status.py` has been modified to accept an optional `force_refresh` parameter in the `index` method, and a unit test has been updated to assert the correct behavior when updating the migration status.
- Fixes error message (#2759). The
`load` method of the `mapping.py` file in the `databricks/labs/ucx/hive_metastore` package has been updated to correct an error message displayed when a `NotFound` exception is raised. The previous message suggested running an incorrect command, which has been updated to the correct one: "Please run: databricks labs ucx create-table-mapping". This change does not add any new methods or alter existing functionality, but instead focuses on improving the user experience by providing accurate information when an error occurs. The scope of this change is limited to updating the error message, and no other modifications have been made.
- Fixes issue of circular dependency of migrate-location ACL (#2741). In this release, we have resolved two issues (#274
- Fixes source table alias disappearance during migrate_views (#2726). This release introduces a fix to preserve the alias for the source table during the conversion of CREATE VIEW SQL from the legacy Hive metastore to the Unity Catalog. The issue was addressed by adding a new test case,
`test_migrate_view_alias_test`, to verify the correct handling of table aliases during migration. The changes also include a fix for the SQL conversion and new test cases to ensure the correct handling of table aliases, reflected in accurate SQL conversion. A new parameter, `alias`, has been added to the `Table` class, and the `apply` method in the `from_table.py` file has been updated. The migration process has been updated to retain the original alias of the table. Unit tests have been added to confirm the correctness of the changes, including handling potential intermittent failures caused by external dependencies.
- Py4j table crawler: suggestions/fixes for describing tables (#2684). This release introduces significant improvements and fixes to the Py4j-based table crawler, enhancing its capability to describe tables effectively. The code for fetching table properties over the bridge has been updated, and error tracing has been improved by fetching each table property individually and by providing a Python backtrace on JVM-side errors. Scala
`Option` values unboxing issues have been resolved, and a small optimization has been implemented to detect partitioned tables without materializing the collection. The table's `.viewText()` property is now properly handled as a Scala `Option`. The `catalog` argument is now explicitly verified to be `hive_metastore`, and a new static method `_option_as_python` has been introduced for safely extracting values from Scala `Option`. The `_describe` method has been refactored to handle exceptions more gracefully and improved code readability. These changes result in better functionality, error handling, logging, and performance when describing tables within a specified catalog and database. The linked issues #2658 and #2579 are progressed through these updates, and appropriate testing has been conducted to ensure the improvements' effectiveness.
- Speedup assessment workflow by making DBFS root table size calculation parallel (#2745). In this release, the assessment workflow for calculating DBFS root table size has been optimized through the parallelization of the calculation process, resulting in improved performance. This has been achieved by updating the
`pipelines_crawler` function in `src/databricks/labs/ucx/contexts/workflow_task.py`, specifically the `cached_property table_size_crawler`, to include an additional argument `self.config.include_databases`. The `TablesCrawler` class has also been modified to include a generic type parameter `Table`, enabling type hinting and more robust type checking. Furthermore, the unit test file `test_table_size.py` in the `hive_metastore` directory has been updated to handle corrupt tables and invalid delta format errors more effectively. Additionally, a new entry `databricks-pydabs` has been added to the `known.json` file, potentially enabling better integration with the `databricks-pydabs` library or providing necessary configuration information for parallel processing. Overall, these changes improve the efficiency and scalability of the codebase and optimize the assessment workflow for calculating DBFS root table size.
- Updated databricks-labs-blueprint requirement from <0.9,>=0.8 to >=0.8,<0.10 (#2747). In this update, the requirement for
`databricks-labs-blueprint` has been updated to version `>=0.8,<0.10` in the `pyproject.toml` file. This change allows the project to utilize the latest features and bug fixes included in version 0.9.0 of the `databricks-labs-blueprint` library. Notable updates in version 0.9.0 consist of the addition of Databricks CLI version as part of routed command telemetry and support for Unicode Byte Order Mark (BOM) in file upload and download operations. Additionally, various bug fixes and improvements have been implemented for the `WorkspacePath` class, including the addition of `stat()` methods and improved compatibility with different versions of Python.
- Updated databricks-labs-lsql requirement from <0.12,>=0.5 to >=0.5,<0.13 (#2688). In this update, the version requirement of the
`databricks-labs-lsql` library has been changed from a version greater than or equal to 0.5 and less than 0.12 to a version greater than or equal to 0.5 and less than 0.13. This allows the project to utilize the latest version of `databricks-labs-lsql`, which includes new methods for differentiating between a table that has never been written to and one with zero rows in the `MockBackend` class. Additionally, the update adds support for various filter types and improves testing coverage and reliability. The release notes and changelog for the updated library are provided in the commit message for reference.
- Updated documentation to explain the usage of collections and eligible commands (#2738). The latest update to the Databricks Labs UCX tool introduces the
`join-collection` command, which enables users to join two or more workspaces into a collection, allowing for streamlined and consolidated command execution across multiple workspaces. This feature is available to Account admins on the Databricks account, Workspace admins on the workspaces to be joined, and requires UCX installation on the workspace. To run collection-eligible commands, users can simply pass the `--run-as-collection=True` flag. This change makes it easier to manage and execute commands on multiple workspaces.
- Updated sqlglot requirement from <25.22,>=25.5.0 to >=25.5.0,<25.23 (#2687). In this pull request, we have updated the version requirement for the
`sqlglot` library in the `pyproject.toml` file. The previous requirement specified a version greater than or equal to 25.5.0 and less than 25.22, but we have updated it to allow for versions greater than or equal to 25.5.0 and less than 25.23. This change allows us to use the latest version of `sqlglot`, while still ensuring compatibility with other dependencies. Additionally, this pull request includes a detailed changelog from the `sqlglot` repository, which provides information on the features, bug fixes, and changes included in each version. This can help us understand the scope of the update and how it may impact our project.
- [DOCUMENTATION] Improve documentation on using account profile for
`sync-workspace-info` cli command (#2683). The `sync-workspace-info` CLI command has been added to the Databricks Labs UCX package, which uploads the workspace configuration to all workspaces in the Databricks account where the `ucx` tool is installed. This feature requires Databricks Account Administrator privileges and is necessary to create an immutable default catalog mapping for the table migration process. It also serves as a prerequisite for the `create-table-mapping` command. To utilize this command, users must configure the Databricks CLI profile with access to the Databricks account console, available at "accounts.cloud.databricks.com" or "accounts.azuredatabricks.net". Additionally, the documentation for using the account profile with the `sync-workspace-info` command has been enhanced, addressing issue #1762.
- [DOCUMENTATION] Improve documentation when installing UCX from a machine with restricted internet access (#2690). A new section has been added to the
`ADVANCED` installation section of the UCX library documentation, providing detailed instructions for installing UCX with a company-hosted PyPI mirror. This feature is intended for environments with restricted internet access, allowing users to bypass the public PyPI index and use a company-controlled mirror instead. Users will need to add all UCX dependencies to the company-hosted PyPI mirror and set the `PIP_INDEX_URL` environment variable to the mirror URL during installation. The solution also includes a prompt asking the user if their workspace blocks internet access. Additionally, the documentation has been updated to clarify that UCX requires internet access to connect to GitHub for downloading the tool, specifying the necessary URLs that need to be accessible. This update aims to improve the installation process for users with restricted internet access and provide clear instructions and prompts for installing UCX on machines with limited internet connectivity.
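
To illustrate the f-string false positives described in the DFSA entry (#2679) above, here is a minimal sketch. It is not the UCX linter: it uses the standard library `ast` module instead of astroid, the helper name `path_like_constants` is hypothetical, and "starts with `/`" is a deliberately naive stand-in for real path detection. It only shows why literal fragments of a `JoinedStr` need to be skipped when scanning string constants.

```python
import ast


def path_like_constants(source: str, skip_fstring_fragments: bool = True) -> list[str]:
    """Return string constants starting with "/" (a naive, hypothetical path heuristic)."""
    tree = ast.parse(source)
    # Remember which constant nodes are literal fragments of an f-string (JoinedStr).
    fragment_ids = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.JoinedStr):
            for value in node.values:
                if isinstance(value, ast.Constant):
                    fragment_ids.add(id(value))
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Constant) or not isinstance(node.value, str):
            continue
        if skip_fstring_fragments and id(node) in fragment_ids:
            continue  # a fragment is only part of a string, not a path on its own
        if node.value.startswith("/"):
            found.append(node.value)
    return found


code = 'a = "/mnt/data"\nb = f"/Repos/{thing1}/sdk-{thing2}-{thing3}"\n'
naive = path_like_constants(code, skip_fstring_fragments=False)
fixed = path_like_constants(code)
```

With the fragments skipped, only the plain constant `"/mnt/data"` is reported; without the skip, the f-string pieces `"/Repos/"` and `"/sdk-"` are also reported as if they were standalone paths.
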
Dependency updates:
- Updated sqlglot requirement from <25.22,>=25.5.0 to >=25.5.0,<25.23 (#2687).
- Updated databricks-labs-lsql requirement from <0.12,>=0.5 to >=0.5,<0.13 (#2688).
- Updated databricks-labs-blueprint requirement from <0.9,>=0.8 to >=0.8,<0.10 (#2747).
Contributors: @ericvergnaud, @JCZuurmond, @asnare, @pritishpai, @dependabot[bot], @aminmovahed-db, @HariGS-DB, @nfx
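
The Py4j crawler entry (#2684) above mentions safely extracting values from a Scala `Option`. The sketch below shows the general idea only, under stated assumptions: `FakeScalaOption` is a hypothetical stand-in for the Py4j proxy object, and `option_as_python` is an illustrative helper, not the `_option_as_python` method from UCX. The only Scala API assumed is that `Option` exposes `isDefined()` and `get()`.

```python
from typing import Any, Optional


class FakeScalaOption:
    """Hypothetical stand-in for a Py4j proxy of scala.Option (not part of UCX or Py4j)."""

    def __init__(self, value: Any = None, defined: bool = True) -> None:
        self._value = value
        self._defined = defined

    def isDefined(self) -> bool:
        return self._defined

    def get(self) -> Any:
        # Like scala.None.get, calling get() on an empty option is an error.
        if not self._defined:
            raise ValueError("None.get")
        return self._value


def option_as_python(scala_option: Any) -> Optional[Any]:
    """Map Some(x) to x and None (the Scala object) to Python's None."""
    return scala_option.get() if scala_option.isDefined() else None


view_text = option_as_python(FakeScalaOption("SELECT 1"))
missing = option_as_python(FakeScalaOption(defined=False))
```

Checking `isDefined()` before calling `get()` avoids raising on an empty option, so a table without view text simply yields `None` on the Python side.
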