v0.40.0
- Added `google-cloud-core` to known list (#2826). In this release, we have incorporated the `google-cloud-core` library into our project's known-libraries configuration file, specifying several modules from this library. This change is part of the resolution of issue #1931, which pertains to working with Google Cloud services. The `google-cloud-core` library offers core functionality for Google Cloud client libraries, including helper functions, HTTP-related utilities, testing utilities, client classes, environment variable handling, exceptions, obsolete features, operation tracking, and version management. By adding these modules to the known list in the configuration file, we can now recognize them in our project as needed, enhancing our ability to work with Google Cloud services.
- Added `gviz-api` to known list (#2831). In this release, we have added the `gviz-api` library to our known-library list, specifically registering the `gviz_api` package within it. This addition enables proper handling and recognition of components from the `gviz-api` library, addressing a portion of issue #1931. While the specifics of the library's implementation and usage are not described in the commit message, it is expected to provide functionality related to data visualization.
- Added export CLI functionality for assessment results (#2553). A new `export` command-line interface (CLI) command has been added to the open-source library to export assessment results. This feature includes a new `AssessmentExporter` class in the `export.py` module, which is responsible for exporting assessment results to CSV files inside a ZIP archive. Users can specify the destination path and the type of report for the exported results. A notebook utility is also included to run the export from the workspace environment, with a default location, plus unit and integration tests for the notebook utility. The `acl_migrator` method has been optimized for better performance. This new functionality provides more flexibility in exporting assessment results and improves the overall assessment functionality of the library.
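  A minimal sketch of the CSV-in-ZIP export mechanics (the `rows` structure and file names here are illustrative, not the actual `AssessmentExporter` API):

  ```python
  import csv
  import io
  import zipfile
  from pathlib import Path

  def export_to_zip(rows: list[dict], destination: Path) -> Path:
      """Write assessment rows to a CSV file packed inside a ZIP archive."""
      buffer = io.StringIO()
      writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
      writer.writeheader()
      writer.writerows(rows)
      zip_path = destination / "assessment_results.zip"
      with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
          archive.writestr("assessment.csv", buffer.getvalue())
      return zip_path
  ```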
- Added functional test related to bug #2850 (#2880). A new functional test has been added to cover the bug fix for issue #2850, which involves reading data from a CSV file located in a volume using Spark's `readStream` function. The test specifies various options, including file format, schema location, header, and compression. The CSV file is loaded from `/Volumes/playground/test/demo_data/` and the schema location is set to `/Volumes/playground/test/schemas/`. Additionally, a unit test has been added and is referenced in the commit. This functional test will help ensure that the bug fix for issue #2850 works as expected.
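  For illustration, a streaming CSV read of this shape, written against Databricks Auto Loader conventions (the exact options pinned by the test may differ):

  ```python
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Stream CSV files from a volume, tracking the inferred schema under a separate volume path.
  df = (
      spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/Volumes/playground/test/schemas/")
      .option("header", "true")
      .option("compression", "gzip")  # assumed codec; the test fixes a specific one
      .load("/Volumes/playground/test/demo_data/")
  )
  ```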
- Added handling for `PermissionDenied` when retrieving `WorkspaceClient`s from account (#2877). In this release, the `workspace_clients` method of the `Account` class in `workspaces.py` has been updated to handle `PermissionDenied` exceptions when retrieving `WorkspaceClient`s. This change introduces a try-except block around the call that retrieves the workspace client; it catches the `PermissionDenied` exception and logs a warning message if access to a workspace is denied. If no exception is raised, the workspace client is added to the list of clients as before. The commit also includes a new unit test to verify this functionality. This update addresses issue #2874 and enhances the robustness of the `databricks labs ucx sync-workspace-info` command by ensuring it gracefully handles permission errors during workspace retrieval.
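  The shape of the change, sketched (the `client_for` helper is hypothetical; only the try-except pattern reflects the entry above):

  ```python
  import logging

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import PermissionDenied

  logger = logging.getLogger(__name__)

  def workspace_clients(workspaces) -> list[WorkspaceClient]:
      """Collect clients for all workspaces, skipping those we may not access."""
      clients = []
      for workspace in workspaces:
          try:
              client = client_for(workspace)  # hypothetical helper building a WorkspaceClient
          except PermissionDenied:
              logger.warning(f"Cannot access workspace: {workspace.deployment_name}")
              continue
          clients.append(client)
      return clients
  ```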
- Added testing with Python 3.13 (#2878). The project has been updated to include testing with Python 3.13, in addition to the previously supported Python 3.10, 3.11, and 3.12. This update is reflected in the `.github/workflows/push.yml` file, which now includes '3.13' in the `pyVersion` matrix for the jobs. This expands the range of Python versions the project can be tested and run on, providing increased flexibility and compatibility for users and ensuring continued support for the latest versions of the Python programming language.
- Added used tables in assessment dashboard (#2836). In this update, we introduce a new widget to the assessment dashboard for displaying used tables, enhancing visibility into how tables are utilized within the Databricks environment. This change includes the addition of the `UsedTable` class in the `databricks.labs.ucx.source_code.base` module, which tracks table usage details in the inventory database. Two new methods, `collect_dfsas_from_query` and `collect_used_tables_from_query`, have been implemented to collect data source access and used-table information from a query, with lineage information added to the table details. Additionally, a test function, `test_dashboard_with_prepopulated_data`, has been introduced to prepopulate data for use in the dashboard, ensuring proper functionality of the new feature.
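  A rough sketch of what such a record might carry (the field names are assumptions, not the actual `UsedTable` definition):

  ```python
  from dataclasses import dataclass, field

  @dataclass
  class UsedTable:
      """One table reference found in source code or a query."""
      catalog_name: str
      schema_name: str
      table_name: str
      is_read: bool = True
      is_write: bool = False
      source_lineage: list[str] = field(default_factory=list)  # e.g. query or notebook path

  record = UsedTable("hive_metastore", "sales", "orders", source_lineage=["dashboards/usage.sql"])
  ```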
- Avoid resource conflicts in integration tests by using a random dir name (#2865). In this release, we have addressed resource conflicts in integration tests by introducing random directory names. The `save_locations` method in `conftest.py` has been updated to generate random directory names using the `tempfile.mkdtemp` function, based on the value of the new `make_random` parameter. Additionally, in the `test_migrate.py` file located in the `tests/integration/hive_metastore` directory, the hard-coded directory name has been replaced with a random one generated by the `make_random` function, which is used when creating external tables and specifying the external delta location. Lastly, the `test_move_tables_table_properties_mismatch_preserves_original` function in `test_table_move.py` has been updated to include a randomly generated directory name in the table's external delta and storage locations, ensuring that tests can run concurrently without conflicting with each other. These changes resolve the issue described in #2797 and improve the reliability of integration tests.
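  Both techniques mentioned above, in minimal form (the names and paths are illustrative):

  ```python
  import random
  import string
  import tempfile

  # Local scratch space: mkdtemp already guarantees a unique directory per call.
  local_dir = tempfile.mkdtemp(prefix="ucx-test-")

  def make_random(k: int = 8) -> str:
      """Random suffix for remote paths, so concurrent test runs never collide."""
      return "".join(random.choices(string.ascii_lowercase, k=k))

  external_location = f"s3://test-bucket/delta/{make_random()}"  # bucket is a placeholder
  ```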
- Exclude dfsas from used tables (#2841). In this release, we've improved the accuracy of table identification and handling. Certain direct filesystem access patterns were previously treated as tables in error; the `collect_tables` method has been updated to exclude table names matching defined direct filesystem access patterns. Additionally, we've added `TableInfoNode` to wrap used tables together with the nodes that use them. These changes ensure that the DataFrame API's `spark.table()` function is identified as a table access, while `spark.read.parquet()`, which represents direct filesystem access, is now ignored. New unit tests support these changes to ensure correctness and reliability.
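  The distinction at work, reduced to a toy classifier (the patterns are illustrative, not the ones defined in the codebase):

  ```python
  import re

  # Names that look like filesystem paths are direct filesystem accesses (dfsas), not tables.
  DIRECT_FS_ACCESS = re.compile(r"^(dbfs:/|s3://|abfss://|/mnt/|/Volumes/)")

  def is_table_reference(name: str) -> bool:
      """spark.table('catalog.schema.t') passes; spark.read.parquet('s3://...') does not."""
      return not DIRECT_FS_ACCESS.match(name)

  assert is_table_reference("catalog.schema.orders")
  assert not is_table_reference("s3://bucket/path/data.parquet")
  ```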
- Fixed known-matches false positives for libraries starting with the same name as a library in the known.json (#2860). This commit addresses false positives in known matches for libraries whose names start with the name of a library listed in the known.json file. The `module_compatibility` function in the `known.py` file was updated to look for exact matches or parent-module matches, rather than just matches at the beginning of the name. This more nuanced approach ensures that libraries with similar names are not incorrectly flagged as having compatibility issues. Additionally, the known.json file is now sorted when constructing module problems, indicating that the order of entries in this file may have been relevant to the issue being resolved. To ensure the accuracy of the changes, new unit tests were added: the test suite was expanded to include tests for known and unknown compatibility, and a new load test was added for the known.json file. These changes improve the reliability of the known-matches feature, which is critical for correctly identifying compatibility issues.
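  The matching rule, sketched (the `KNOWN` set stands in for modules listed in known.json):

  ```python
  KNOWN = {"pandas", "pandas.io"}

  def is_known_module(module: str) -> bool:
      """Match exactly or on a parent module: 'pandas.io.sql' resolves to 'pandas.io',
      while 'pandasql' no longer matches 'pandas' as plain prefix matching allowed."""
      while module:
          if module in KNOWN:
              return True
          module, _, _ = module.rpartition(".")
      return False

  assert is_known_module("pandas.io.sql")
  assert not is_known_module("pandasql")
  ```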
- Make delta format case sensitive (#2861). In this commit, handling of the delta table format is made robust to case differences. The `TableInMount` class has been updated with a `__post_init__` method that converts the `format` attribute to uppercase, and the `Table` class in the `tables.py` file now includes a `__post_init__` method that converts the `table_format` attribute to uppercase during object creation, so that format comparisons are unaffected by the casing of the input. New properties, `is_delta` and `is_hive`, have been added to the `Table` class to check whether the table format is DELTA or HIVE, respectively. These changes affect the `what` method of the `AclMigrationWhat` enum class, which now checks `is_delta` and `is_hive` instead of comparing `table_format` with "DELTA" and "HIVE". Relevant issues #2858 and #2840 have been addressed, and unit tests have been included to verify the behavior. However, the changes have not yet been verified in the staging environment.
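  The normalization pattern described above, as a self-contained sketch (the real `Table` class carries more fields):

  ```python
  from dataclasses import dataclass

  @dataclass
  class Table:
      name: str
      table_format: str

      def __post_init__(self) -> None:
          # Normalize once at construction; every later comparison is against uppercase.
          self.table_format = self.table_format.upper()

      @property
      def is_delta(self) -> bool:
          return self.table_format == "DELTA"

      @property
      def is_hive(self) -> bool:
          return self.table_format == "HIVE"

  assert Table("orders", "delta").is_delta
  ```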
- Make delta format case sensitive (#2862). This update, which follows the resolution of issue #2861, extends the case handling of the delta format across the library, improving the precision of delta table tracking. The change impacts all table-format-related code and is accompanied by additional tests for robustness. A new `location` column has been incorporated into the `table_estimates` view, facilitating determination of a delta table's location, and a new method has been implemented to extract the `location` column from that view.
- Verify UCX catalog is accessible at start of
`migration-progress-experimental` workflow (#2851). In this release, we have introduced a new `verify_has_ucx_catalog` method in the `Application` class of the `databricks.labs.ucx.contexts` module, which checks for the presence of a UCX catalog in the workspace and returns an instance of the `VerifyHasCatalog` class. This method is used in the `migration-progress-experimental` workflow to verify UCX catalog accessibility, addressing issues #2577 and #2848 and progressing work on #2816. The `verify_has_ucx_catalog` method is decorated with `@cached_property` and takes `workspace_client` and `ucx_catalog` as arguments. Additionally, we have added a new `VerifyHasCatalog` class that checks whether a specified Unity Catalog (UC) catalog exists in the workspace, and updated the import statement to include the `NotFound` exception. We have also added a timeout parameter to the `validate_step` function in the `workflows.py` file, modified the `migration-progress-experimental` workflow to include a new step, `verify_prerequisites`, in the `table_migration` job cluster, and added unit tests to ensure the proper functioning of these changes. These updates improve the application's ability to verify the presence and accessibility of UCX catalogs during workflow execution and enhance the robustness and reliability of the `migration-progress-experimental` workflow.
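  The shape of such a check, sketched with the Databricks SDK (this mirrors the description above, not the exact ucx implementation):

  ```python
  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import NotFound

  class VerifyHasCatalog:
      """Verify that a Unity Catalog catalog exists before a workflow proceeds."""

      def __init__(self, workspace_client: WorkspaceClient, catalog: str) -> None:
          self._ws = workspace_client
          self._catalog = catalog

      def verify(self) -> bool:
          try:
              self._ws.catalogs.get(self._catalog)
          except NotFound:
              return False
          return True
  ```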
Contributors: @ericvergnaud, @JCZuurmond, @asnare, @pritishpai, @nfx, @rportilla-databricks