Releases · databrickslabs/ucx
v0.44.0
- Added `imbalanced-learn` to known list (#2943). A new open-source library, `imbalanced-learn`, has been added to the project's known list of libraries, providing various functionalities for handling imbalanced datasets. The addition includes modules such as `imblearn`, `imblearn._config`, `imblearn._min_dependencies`, `imblearn._version`, and `imblearn.base`, among many others, enabling features such as over-sampling, under-sampling, combining sampling techniques, and creating ensembles. This change partially resolves issue #1931 and enhances the project's ability to manage imbalanced datasets.
- Added `importlib_resources` to known list (#2944). In this update, we've added the `importlib_resources` package to the known list in the `known.json` file. This package offers a consistent and straightforward interface for accessing resources such as data files and directories in Python packages. It includes several modules, including `importlib_resources`, `importlib_resources._adapters`, `importlib_resources._common`, `importlib_resources._functional`, `importlib_resources._itertools`, `importlib_resources.abc`, `importlib_resources.compat`, `importlib_resources.compat.py38`, `importlib_resources.compat.py39`, `importlib_resources.future`, `importlib_resources.future.adapters`, `importlib_resources.readers`, and `importlib_resources.simple`. This change partially addresses issue #1931, improving the management and accessibility of resources within our Python packages.
- Dependency update: ensure we install with at least version 0.9.1 of `databricks-labs-blueprint` (#2950). In the updated `pyproject.toml`, the version constraint for the `databricks-labs-blueprint` dependency has been revised to range between 0.9.1 and 0.10. This ensures the incorporation of an upstream fix (databrickslabs/blueprint#157) that was released in 0.9.1. The update was triggered by a preceding change (#2920) that standardized notebook paths, thereby addressing issue #2882, which depended on this upstream correction.
- Fixed an issue with source table deleted after migration (#2927). We have addressed an issue where a source table was marked as migrated even after it was deleted following migration. An exception handling mechanism has been added to the `is_migrated` method to return `True` and log a warning message if the source table does not exist, indicating that it has been migrated. A new test function, `test_migration_index_deleted_source`, verifies the migration index behavior when the source table no longer exists: it creates a source and destination table, sets the destination table's `upgraded_from` property to the source table, drops the source table, and checks that the migration index contains the source table and that an error message was recorded. The `get_seen_tables` method remains unchanged.
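  The fix boils down to treating a missing source table as already migrated. A minimal sketch of the idea, assuming hypothetical class and method shapes (the actual `is_migrated` in UCX consults its migration-status inventory and may differ):

  ```python
  import logging

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import NotFound

  logger = logging.getLogger(__name__)


  class MigrationIndex:
      """Sketch: answers whether a Hive table has been migrated to Unity Catalog."""

      def __init__(self, ws: WorkspaceClient):
          self._ws = ws

      def is_migrated(self, schema: str, table: str) -> bool:
          try:
              info = self._ws.tables.get(f"hive_metastore.{schema}.{table}")
          except NotFound:
              # The source table was deleted after migration: report it as migrated
              # instead of failing, and leave a trace in the logs.
              logger.warning(f"Table hive_metastore.{schema}.{table} no longer exists; assuming migrated")
              return True
          return "upgraded_to" in (info.properties or {})
  ```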
- Improve robustness of `sqlglot` failure handling (#2952). This PR improves error handling around the `sqlglot` parser, specifically targeting issues with inadequate parsing quality. The `collect_table_infos` method has been updated and renamed to `collect_used_tables` to accurately gather information about tables used in a SQL expression. The `lint_expression` and `collect_tables` methods now use the new `collect_used_tables` method for better accuracy. Additionally, methods such as `find_all` and `walk_expressions`, along with the SQL parser's test suite, have been enhanced to handle potential failures and unsupported SQL syntax more gracefully, returning empty lists or logging warnings instead of raising errors. These changes enable the linter to handle unexpected input more reliably.
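  The recurring pattern here is to degrade gracefully when `sqlglot` cannot cope with a query: log a warning and yield nothing instead of propagating the error. A simplified sketch (only the `sqlglot` calls are real API; the function itself is illustrative):

  ```python
  import logging
  from collections.abc import Iterable

  import sqlglot
  from sqlglot.errors import SqlglotError
  from sqlglot.expressions import Table

  logger = logging.getLogger(__name__)


  def collect_used_tables(sql: str) -> Iterable[Table]:
      """Yield table expressions used in a SQL string, tolerating unparseable input."""
      try:
          statements = sqlglot.parse(sql, read="databricks")
      except SqlglotError as e:
          # Tokenization or parsing failed: warn and yield nothing rather than raise.
          logger.warning(f"Failed to parse SQL: {sql}", exc_info=e)
          return
      for statement in statements:
          if statement is None:  # sqlglot yields None for empty statements
              continue
          yield from statement.find_all(Table)
  ```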
- Log warnings when mounts are discovered on incorrect cluster type (#2929). The `migrate-tables` command in the ucx CLI now includes a verification step to ensure the prerequisite assessment workflow completed successfully before execution; if it has not, a warning is logged and the command is not executed. A new exception handling mechanism has been implemented for the `dbutils.fs.mounts()` method, which logs a warning and skips mount point discovery if an exception is raised (see the sketch after this list). A new unit test verifies that a warning is logged when attempting to discover mounts on an incompatible cluster type. The diff also includes a new `VerifyProgressTracking` method for verifying progress tracking, and updates to existing test methods to verify successful runs and error handling before assessment. These changes improve the handling of edge cases in the mount point discovery process and increase test coverage.
- `create-uber-principal` fixes and improvements (#2941). This change introduces fixes and improvements to the `create-uber-principal` functionality, specifically targeting the Azure access module. The main enhancements include addressing an issue with the Databricks warehouses API by adding the `set_workspace_warehouse_config_wrapper` function, modifying the command to request the uber principal name only when necessary, improving storage account crawl logic, and introducing new methods to manage workspace-level configurations. Error handling has been strengthened through added and modified try-except blocks. Additionally, several unit and integration tests have been implemented and verified. These changes improve the overall robustness and versatility of the `create-uber-principal` command, directly addressing issues #2764 and #2771, and progressing on #2949.
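  For the mount-discovery change above, the exception handling is essentially a guard around `dbutils.fs.mounts()`; a sketch, with the surrounding crawler elided:

  ```python
  import logging

  logger = logging.getLogger(__name__)


  def discover_mounts(dbutils) -> list:
      """Sketch: list mount points, skipping discovery when the cluster cannot serve them."""
      try:
          return list(dbutils.fs.mounts())
      except Exception as e:
          # e.g. running on a cluster type that does not expose mounts
          logger.warning(f"Failed to list mounts, skipping mount discovery: {e}")
          return []
  ```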
Contributors: @pritishpai, @FastLee, @asnare, @JCZuurmond, @nfx
v0.43.0
- Added `imageio` to known list (#2942). In this release, we have added `imageio` to our library's known list, including all its modules, sub-modules, testing, and typing packages. This change partially addresses issue #1931. The `imageio` library offers I/O functionality for scientific imaging data, and its addition expands the formats and functionality the project recognizes when handling such data.
- Added `ipyflow-core` to known list (#2945). In this release, the project has expanded its coverage by adding two open-source libraries to the known list JSON file. The first library, `ipyflow-core`, brings a range of modules for the data model, experimental features, frontend, kernel, patches, shell, slicing, tracing, types, and utils. The second library, `pyccolo`, offers fast and adaptable code transformation using abstract syntax trees, with functionalities including code rewriting, import hooks, syntax augmentation, and tracing, along with various utility functions. Incorporating these libraries broadens the set of tools and capabilities the project recognizes.
- Added `isodate` to known list (#2946). In this release, we have added the `isodate` package to our library's known package list, which resolves part of issue #1931. The `isodate` package provides several modules for parsing and manipulating ISO 8601 date strings, including `isodate`, `isodate.duration`, `isodate.isodates`, `isodate.isodatetime`, `isodate.isoduration`, `isodate.isoerror`, `isodate.isostrf`, `isodate.isotime`, `isodate.isotzinfo`, and `isodate.tzinfo`. This addition enables users to utilize the full functionality of the `isodate` package in their applications.
- Experimental command for enabling HMS federation (#2939). In this release, we have introduced an experimental feature for enabling HMS (Hive Metastore) federation through a new `enable-hms-federation` command in the `labs.yml` file. This command, when enabled, creates a federated HMS catalog synced with the workspace HMS, facilitating migration and integration of HMS data. Additionally, we have added an optional `enable_hms_federation` constructor argument to the `Locations` class in the `locations.py` file; setting this flag to `True` enables a fallback mode for AWS resources to use HMS for data access. The `HiveMetastoreFederationEnabler` class is introduced with an `enable()` method to modify the workspace configuration and enable HMS federation. Careful testing and feedback are encouraged on this experimental feature.
- Experimental support for HMS federation (#2283). In this release, we introduce experimental support for Hive Metastore (HMS) federation. A new `HiveMetastoreFederation` class has been implemented, enabling the registration of an internal HMS as a federated catalog. This class utilizes the `WorkspaceClient` object from the `databricks.sdk` library to create the necessary connections and handles permissions for successful federation. Additionally, a new file, `test_federation.py`, contains unit tests demonstrating the functionality of HMS federation, including the creation of federated catalogs and the handling of existing connections. As this is an experimental feature, users should expect potential issues and are encouraged to provide feedback.
- Fixed `InvalidParameterValue` failure for scanning jobs running on interactive clusters that got deleted (#2935). We have addressed an issue where an `InvalidParameterValue` error was not being handled properly when scanning jobs run on interactive clusters that were deleted. This error has now been added to the exceptions handled in the `_register_existing_cluster_id` and `_register_cluster_info` methods. These methods retrieve information about an existing cluster or its ID; if the cluster is not found or an invalid parameter value is provided, they now yield a `DependencyProblem` object with an appropriate error message, indicating a problem with the dependencies required for the job. Handling this error lets the job fail gracefully with a clear, informative message instead of unexpected behavior.
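  A minimal sketch of the yield-a-problem pattern described above; the `DependencyProblem` fields and the method body are simplified relative to the actual UCX implementation:

  ```python
  import logging
  from collections.abc import Iterable
  from dataclasses import dataclass

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import InvalidParameterValue, NotFound

  logger = logging.getLogger(__name__)


  @dataclass
  class DependencyProblem:
      code: str
      message: str


  def register_existing_cluster_id(ws: WorkspaceClient, cluster_id: str) -> Iterable[DependencyProblem]:
      """Sketch: yield a problem instead of raising when a job's cluster is gone."""
      try:
          cluster = ws.clusters.get(cluster_id)
      except (NotFound, InvalidParameterValue) as e:
          yield DependencyProblem("cluster-not-found", f"Could not find cluster: {cluster_id} ({e})")
          return
      logger.debug(f"Registering libraries from cluster {cluster.cluster_id}")
  ```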
- Improve logging when skipping legacy grant in `create-catalogs-schemas` (#2933). The `create-catalogs-schemas` process now logs more detail when legacy grants are skipped. This change follows up on issue #2917 and progresses issue #2932. The `_apply_from_legacy_table_acls` and `_update_principal_acl` methods now include more descriptive logging when a legacy grant is skipped, stating the type of grant being skipped and clarifying that it is not supported in Unity Catalog. Additionally, a new method, `get_interactive_cluster_grants`, has been added to the `principal_acl` object, returning a list of grants specific to the interactive cluster; the `hive_acl` object is now autospec'd after the `principal_acl.get_interactive_cluster_grants` call. The `test_catalog_schema_acl` function has been updated to reflect these changes: new grants have been added to the `hive_grants` list, including grants for `user1` with the `USE` action type on the `hive_metastore` catalog and for `user2` with the `USAGE` action type on the `schema3` database. A new grant for `user4` with the `DENY` action type on the `schema3` database has also been added, but it is skipped (with a log entry) because `DENY` is not supported in UC; skipped legacy `DENY` grants on the `catalog2` catalog and `catalog2.schema2` database are also covered. These updates make it easier to understand what happens during the migration of grants to UC and ensure that unsupported grants are not inadvertently carried over.
- Notebook linting: ensure path-type is preserved during linting (#2923). We have enhanced the type safety of the `NotebookResolver` class in the `loaders.py` module by introducing a new type variable, `PathT`. This change includes an update to the `_adjust_path` method, which ensures the original path type is preserved when adding the `.py` suffix for Python notebooks. This addresses a potential issue where a `WorkspacePath` instance could be incorrectly converted to a generic `Path` instance, causing downstream errors, and may resolve issue #2888 (reproduction steps for that issue were not provided). Note that while this change has been manually tested, it does not include new unit tests, integration tests, or staging environment verification.
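  The underlying trick is to parameterize the helper over the concrete path type, so a `WorkspacePath` in yields a `WorkspacePath` out. A sketch, assuming `_adjust_path` roughly follows this shape:

  ```python
  from pathlib import Path, PurePath
  from typing import TypeVar

  # Bound to PurePath so the helper accepts Path, WorkspacePath, and friends,
  # and is declared to return the same concrete type it was given.
  PathT = TypeVar("PathT", bound=PurePath)


  def adjust_path(path: PathT) -> PathT:
      """Append .py for Python notebooks without losing the path subclass."""
      if path.suffix == ".py":
          return path
      # with_suffix() is implemented on PurePath and constructs an instance of
      # the same path flavour, so the subclass survives the round trip.
      return path.with_suffix(".py")


  print(type(adjust_path(Path("notebook"))))  # e.g. <class 'pathlib.PosixPath'>
  ```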
Contributors: @nfx, @pritishpai, @ericvergnaud, @asnare, @JCZuurmond
v0.42.0
- Added `google-cloud-storage` to known list (#2827). In this release, we have added the `google-cloud-storage` library, along with its various modules and sub-modules, to the project's known list JSON file, together with the `google-crc32c` and `google-resumable-media` libraries. These libraries provide functionalities such as content-addressable storage, checksum calculation, and resumable media upload and download. This change is a partial resolution of issue #1931.
- Added `google-crc32c` to known list (#2828). With this commit, we have added the `google-crc32c` library to the known list, addressing part of issue #1931. The `google-crc32c` library provides efficient, high-speed CRC32C computation and is known for its performance and reliability; users can expect faster and more reliable CRC32C computations in code that depends on it.
- Added `holidays` to known list (#2906). In this release, we have expanded the known list to include the `holidays` package, aimed at supporting the tracking of holidays for different countries, religions, and financial institutions. The entry includes several submodules, such as calendars, countries, deprecation, financial holidays, groups, helpers, holiday base, mixins, observed holiday base, registry, and utils, each mapped to an empty list of known problems. This change partially resolves issue #1931.
- Added `htmlmin` to known list (#2907). In this update, we have added the `htmlmin` library to the `known.json` configuration file's list of known libraries. This covers `htmlmin` and its components, including `htmlmin.command`, `htmlmin.decorator`, `htmlmin.escape`, `htmlmin.main`, `htmlmin.middleware`, `htmlmin.parser`, `htmlmin.python3html`, and `htmlmin.python3html.parser`. This change partially addresses issue #1931.
- Document preparing external locations when creating catalogs (#2915). Databricks Labs' UCX tool has been updated to incorporate the preparation of external locations when creating catalogs during the upgrade to Unity Catalog (UC). This enhancement adds new documentation outlining how to physically separate data in storage within UC, adhering to Databricks' best practices.
The `create-catalogs-schemas` command has been updated to create UC catalogs and schemas based on a mapping file, allowing users to reuse previously created external locations or establish new ones outside of UCX. For data separation, users can leverage external locations when using subpaths, providing flexibility in data management during the upgrade process.
- Fixed `KeyError` from `assess_workflows` task (#2919). In this release, we have improved error handling during code inference: we fixed a `KeyError` in the `assess_workflows` task and modified the `_safe_infer_internal` and `_unsafe_infer_internal` methods to handle both `InferenceError` and `KeyError` during inference. When an error occurs, we now log the error message with the node and yield an `Uninferable` object. Additionally, the `do_infer_values` method of the `_LocalInferredValue` class now yields an iterator of iterables of `NodeNG` objects, and multiple unit tests for inferring values in Python code have been added, including cases for externally defined values and their absence. These changes let the linter degrade gracefully and provide more informative feedback during inference.
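  A minimal sketch of the catch-and-yield-`Uninferable` pattern, with simplified types (the real code works on `astroid` nodes during linting):

  ```python
  import logging
  from collections.abc import Iterable

  from astroid import NodeNG
  from astroid.exceptions import InferenceError

  logger = logging.getLogger(__name__)


  class Uninferable:
      """Sketch: marker for a value the linter could not infer."""


  def safe_infer_internal(node: NodeNG) -> Iterable[object]:
      try:
          # astroid's inference raises InferenceError, and in some edge cases a
          # KeyError escapes from its internals; treat both the same way.
          yield from node.infer()
      except (InferenceError, KeyError) as e:
          logger.debug(f"Failed to infer value of {node}: {e}")
          yield Uninferable()
  ```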
- Fixed `OSError: [Errno 95]` bug in `assess_workflows` task by skipping GIT-sourced workflows from static code analysis (#2924). In this release, we have resolved the `OSError: [Errno 95]` bug in the `assess_workflows` task that occurred while performing static code analysis on GIT-sourced workflows. The `Source` attribute from the `jobs` module of the `databricks.sdk.service` package is used to identify the source of a notebook task. If the notebook task source is GIT, a new `DependencyProblem` is raised, indicating that notebooks in GIT should be analyzed using the `databricks labs ucx lint-local-code` CLI command. The `_register_notebook` method has been updated to perform this check and return an appropriate `DependencyProblem` message. This avoids the aforementioned bug and provides a more informative message when notebooks are sourced from GIT.
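  A sketch of the source check, assuming the SDK's `jobs.Source` enum and the simplified `DependencyProblem` from the earlier sketch:

  ```python
  from dataclasses import dataclass

  from databricks.sdk.service import jobs


  @dataclass
  class DependencyProblem:  # simplified, as before
      code: str
      message: str


  def check_notebook_task(task: jobs.Task) -> DependencyProblem | None:
      """Sketch: flag GIT-sourced notebooks instead of crawling their paths."""
      if task.notebook_task is None:
          return None
      if task.notebook_task.source == jobs.Source.GIT:
          # Workspace file APIs cannot read GIT-sourced notebooks (OSError: [Errno 95]),
          # so direct users to lint the repository locally instead.
          return DependencyProblem(
              "not-supported",
              "Notebook is in GIT. Use `databricks labs ucx lint-local-code` to lint it",
          )
      return None
  ```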
- Fixed absolute path normalisation in source code analysis (#2920). The Workspace API does not support relative subpaths such as `/a/b/../c`; this has been resolved by resolving workspace paths before calling the API. The fix is backward compatible and ensures correct behavior of the source code analysis. Integration tests were added, and the commit was co-authored by Eric Vergnaud and Serge Smertin. Furthermore, a new test case exercises relative grand-parent paths in dependency graph construction, utilizing the `NotebookLoader` class, which loads a notebook's content and metadata given a path; the test covers path resolution when a notebook depends on another notebook located two levels up in the directory hierarchy. These changes improve the robustness of source code analysis in the presence of relative paths.
- Fixed downloading wheel libraries from DBFS on mounted Azure Storage failing with access denied (#2918). In this release, we have enhanced the handling of registering and downloading wheel libraries from DBFS on mounted Azure Storage, addressing an issue that resulted in access-denied errors.
The changes include improved error handling, with a `try-except` block to handle potential `BadRequest` exceptions, and three new methods to register different types of libraries. The `_register_requirements_txt` method reads requirements files and registers each library specified in the file, logging a warning for references to other requirements or constraints files. The `_register_whl` method creates a temporary copy of the given wheel file in the local file system and registers it, while the `_register_egg` method checks the runtime version and yields a `DependencyProblem` if the version is greater than (14, 0). These changes simplify the code and enhance error handling. They are implemented in the `jobs.py` file located in the `databricks/labs/ucx/source_code` directory, which now imports the `BadRequest` exception class from `databricks.sdk.errors`.
- Fixed issue with migrating MANAGED hive_metastore table to UC (#2892). In this release, we have implemented changes to address the issue of migrating HMS (Hive Metastore) MANAGED tables to UC (Unity Catalog) as EXTERNAL. Historically, deleting a managed table also removed the underlying data, leading to potential data loss and making the UC table unusable. The new approach provides options to mitigate these issues, including migrating as EXTERNAL or cloning the data to maintain integrity. These changes aim to prevent accidental data deletion, ensure data recoverability, and avoid inconsistencies when new data is added to either HMS or UC. We have introduced new class attributes, methods, and parameters in relevant modules such as `WorkspaceConfig`, `Table`, `migrate_tables`, and `install.py`. These modifications support the new migration strategies and allow more flexibility in managing how tables are migrated and how data is handled. The upgrade process can be triggered using the `migrate-tables` UCX command or by running the table migration workflows deployed to the workspace. Thorough testing and documentation have been performed to minimize risks of data inconsis...
v0.41.0
- Added UCX history schema and table for storing UCX's artifacts (#2744). In this release, we have introduced a new dataclass, `Historical`, to store UCX artifacts for migration progress tracking, including attributes such as workspace identifier, job run identifier, object type, object identifier, data, failures, owner, and UCX version. The `ProgressTrackingInstallation` class has been updated with a new method for deploying a table for historical records using the `Historical` dataclass. Additionally, we have modified the `databricks labs ucx create-ucx-catalog` command and updated the integration test file `test_install.py` to include a parametrized test checking that the `workflow_runs` and `historical` tables are created by the UCX installation. The function `test_progress_tracking_installation_run_creates_workflow_runs_table` has been renamed to `test_progress_tracking_installation_run_creates_tables` to reflect the additional table. These changes improve UCX's progress tracking functionality and resolve issue #2572.
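  A sketch of what such a record could look like as a Python dataclass; the field names and types below are inferred from the attribute list above, not copied from the UCX source:

  ```python
  from dataclasses import dataclass


  @dataclass(frozen=True, kw_only=True)
  class Historical:
      """Sketch: one historical snapshot of a UCX artifact."""

      workspace_id: int      # workspace identifier
      job_run_id: int        # identifier of the job run that produced the record
      object_type: str       # e.g. "Table", "Grant", "Job"
      object_id: list[str]   # composite identifier of the object
      data: dict[str, str]   # serialized artifact fields
      failures: list[str]    # reasons the object is not yet migrated
      owner: str             # principal owning the object
      ucx_version: str       # UCX release that wrote the snapshot
  ```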
- Added `hjson` to known list (#2899). In this release, we have added support for the Hjson library, partially addressing issue #1931. This change integrates the following Hjson modules: `hjson`, `hjson.compat`, `hjson.decoder`, `hjson.encoder`, `hjson.encoderH`, `hjson.ordered_dict`, `hjson.scanner`, and `hjson.tool`. Hjson extends JSON with comments and multi-line strings; adding it to the known list lets users leverage those features in code that UCX analyzes.
- Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 (#2894). In this version bump of the databrickslabs/sandbox library, several enhancements and bug fixes have been implemented.
These include updates to the README with instructions on using the library with the `databricks labs sandbox` command, fixes for the `unsupported protocol scheme` error, and the addition of more git-related libraries. Additionally, golang.org/x/crypto has been bumped from version 0.16.0 to 0.17.0 in the `/go-libs` and `/runtime-packages` directories. This version also introduces commits that allow larger logs from acceptance tests and implement experimental OIDC refresh token rotation. The tests using this library have been updated to the new version to ensure compatibility.
- Fixed `AttributeError: UsedTable has no attribute 'table'` by adding more type checks (#2895). In this release, we have improved type safety and robustness in handling `UsedTable` objects. We fixed an `AttributeError` related to the `UsedTable` class not having a `table` attribute by adding more type checks in the `collect_tables` method of the `TablePyCollector` and `CollectTablesVisit` classes. We also introduced `AstroidSyntaxError` exception handling and logging. Additionally, we renamed the `table_infos` variable to `used_tables` and changed its type to `list[JobProblem]` in the `collect_tables_from_tree` and `_SparkSqlAnalyzer.collect_tables` functions, and added conditional statements to check for the presence of required attributes before yielding a new `TableInfoNode`. A new unit test file, `test_context.py`, exercises the `tables_collector` method, which extracts table references from a given code snippet, improving the linter's table-reference extraction capabilities.
- Fixed `TokenError` in assessment workflow (#2896). In this update, we've implemented a bug fix to improve the robustness of the assessment workflow. Previously, the code only caught parse errors during execution of the workflow, but parse errors were not the only cause of failures. This commit changes the exception being caught from `ParseError` to the more general `SqlglotError`, the common ancestor of both `ParseError` and `TokenError`, so the code now handles both parse and tokenization errors. The `walk_expressions` method has been updated to catch `SqlglotError` instead of `ParseError`, and the `SqlglotError` class is imported from the `sqlglot.errors` module. This allows the assessment workflow to handle a wider range of issues that may arise while processing SQL code, making it more reliable.
- Fixed `assessment` workflow failure for jobs running tasks on existing interactive clusters (#2889). In this release, we have addressed a failure in the `assessment` workflow when jobs run on existing interactive clusters (issue #2886). The fix modifies `jobs.py` by adding a try-except block when loading libraries for an existing cluster, using the `ResourceDoesNotExist` exception type to handle cases where the cluster does not exist. Furthermore, the `_register_cluster_info` function has been enhanced to manage situations where the existing cluster is not found, raising a `DependencyProblem` with the message `cluster-not-found`. This ensures the workflow can continue running jobs on other clusters or with other configurations, gracefully handling edge cases and preventing workflow failure due to non-existent clusters.
- Ignore UCX inventory database in HMS while scanning tables (#2897). Changes in the `tables.py` file of the `databricks/labs/ucx/hive_metastore` directory stop the UCX inventory database from being mistakenly scanned during table scanning. The `_all_databases` method has been updated to exclude the UCX inventory database by checking whether the database name matches the inventory schema name and skipping it if so. This change affects the `_crawl` and `_get_table_names` methods, which no longer process the UCX inventory schema when scanning for tables. A TODO comment has been added to the `_get_table_names` method, suggesting potential removal of the UCX inventory schema check in future releases. This ensures accurate and efficient table scanning, avoiding mistaking the UCX inventory schema for a database to be scanned.
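  The exclusion itself is a one-line filter; a sketch with the crawler reduced to a generator (names are illustrative):

  ```python
  from collections.abc import Iterable


  def all_databases(databases: Iterable[str], inventory_schema: str) -> Iterable[str]:
      """Sketch: yield HMS databases to crawl, skipping UCX's own inventory schema."""
      for name in databases:
          if name == inventory_schema:
              # Don't scan the schema where UCX stores its own results.
              continue
          yield name


  print(list(all_databases(["default", "sales", "ucx"], "ucx")))  # ['default', 'sales']
  ```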
- Tech debt: fix situations where `next()` isn't being used properly (#2885). In this commit, technical debt related to the proper usage of Python's built-in `next()` function has been addressed in several areas of the codebase. Previously, there was an assumption that `None` would be returned if there is no next value, which is incorrect: without a default argument, `next()` raises `StopIteration` on an exhausted iterator. The `get_dbutils_notebook_run_path_arg` function, the `of_language` class method in the `CellLanguage` class, and certain methods in the `test_table_migrate.py` file have been updated to correctly handle situations where there is no next value. The `has_path()` method has been removed, and the `prepend_path()` method has been updated to insert the given path at the beginning of the list of system paths. Additionally, a test case for checking table-in-mount mapping with a table owner has been included. These changes ensure the code handles edge cases around `next()` and paths correctly.
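  The distinction that bites here is standard Python; a short demonstration of the two calling conventions of the built-in `next()`:

  ```python
  it = iter([])

  # Wrong assumption: next() does NOT return None on an exhausted iterator...
  try:
      next(it)
  except StopIteration:
      print("one-argument next() raises StopIteration")

  # ...unless a default is passed explicitly:
  print(next(iter([]), None))  # None

  # Idiomatic "first match or None" over a generator:
  print(next((n for n in [1, 3, 4, 5] if n % 2 == 0), None))  # 4
  ```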
- [chore] apply `make fmt` (#2883). In this release, the `make_random` parameter has been removed from the `save_locations` method in the `conftest.py` file for the integration tests. This method saves a list of `ExternalLocation` objects to the `external_locations` table in the inventory database and no longer requires the `make_random` parameter. In the updated implementation, `save_locations` creates a single `ExternalLocation` object with a specific string and priority based on the workspace environment (Azure or AWS), and then uses the SQL backend to save the list of `ExternalLocation` objects to the database. This change simplifies the `save_locations` method and makes it more reusable throughout the test suite.
Dependency updates:
- Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 (#2894).
Contributors: @nfx, @asnare, @dependabot[bot], @JCZuurmond, @pritishpai
v0.40.0
- Added `google-cloud-core` to known list (#2826). In this release, we have added the `google-cloud-core` library and several of its modules to the project's known list. This change is part of the resolution of issue #1931. The `google-cloud-core` library offers core functionality for Google Cloud client libraries, including helper functions, HTTP-related functionality, testing utilities, client classes, environment variable handling, exceptions, obsolete features, operation tracking, and version management.
- Added `gviz-api` to known list (#2831). In this release, we have added the `gviz-api` library to the known list, specifically the `gviz_api` package within it. This enables proper handling and recognition of components from the `gviz-api` library, addressing a portion of issue #1931. The library is expected to provide functionality related to data visualization.
- Added export CLI functionality for assessment results (#2553). A new `export` command-line interface (CLI) function has been added to export assessment results. This feature includes a new `AssessmentExporter` class in the `export.py` module, which is responsible for exporting assessment results to CSV files inside a ZIP archive. Users can specify the destination path and type of report for the exported results. A notebook utility is also included to run the export from the workspace environment, with a default location, unit tests, and integration tests for the notebook utility. The `acl_migrator` method has been optimized for better performance. This new functionality provides more flexibility in exporting assessment results and improves the overall assessment functionality of the library.
- Added functional test related to bug #2850 (#2880). A new functional test has been added to address a bug fix related to issue #2850, which involves reading data from a CSV file located in a volume using Spark's `readStream` function. The test specifies various options including file format, schema location, header, and compression. The CSV file is loaded from `/Volumes/playground/test/demo_data/` and the schema location is set to `/Volumes/playground/test/schemas/`. Additionally, a unit test has been added and is referenced in the commit. This functional test will help ensure that the bug fix for issue #2850 works as expected.
- Added handling for `PermissionDenied` when retrieving `WorkspaceClient`s from account (#2877). In this release, the `workspace_clients` method of the `Account` class in `workspaces.py` has been updated to handle `PermissionDenied` exceptions when retrieving `WorkspaceClient`s. This change introduces a try-except block around the command retrieving the workspace client; it catches the `PermissionDenied` exception and logs a warning message if access to a workspace is denied. If no exception is raised, the workspace client is added to the list of clients as before. The commit also includes a new unit test for this behavior. This update addresses issue #2874 and makes the `databricks labs ucx sync-workspace-info` command robust to permission errors during workspace retrieval.
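  A sketch of the skip-on-permission-denied loop; the surrounding `Account` class is elided and the function shape is illustrative:

  ```python
  import logging

  from databricks.sdk import AccountClient, WorkspaceClient
  from databricks.sdk.errors import PermissionDenied

  logger = logging.getLogger(__name__)


  def workspace_clients(acc: AccountClient) -> list[WorkspaceClient]:
      """Sketch: collect clients for every accessible workspace, skipping denied ones."""
      clients = []
      for workspace in acc.workspaces.list():
          try:
              clients.append(acc.get_workspace_client(workspace))
          except PermissionDenied:
              logger.warning(f"Permission denied for workspace {workspace.workspace_name}, skipping")
      return clients
  ```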
- Added testing with Python 3.13 (#2878). The project has been updated to include testing with Python 3.13, in addition to the previously supported versions 3.10, 3.11, and 3.12. This is reflected in the `.github/workflows/push.yml` file, which now includes `'3.13'` in the `pyVersion` matrix for the jobs, ensuring continued support for the latest version of the Python programming language.
- Added used tables in assessment dashboard (#2836). In this update, we introduce a new widget to the assessment dashboard for displaying used tables, enhancing visibility into how tables are utilized within the Databricks environment.
This change includes the `UsedTable` class in the `databricks.labs.ucx.source_code.base` module, which tracks table usage details in the inventory database. Two new methods, `collect_dfsas_from_query` and `collect_used_tables_from_query`, collect data source access and used-table information from a query, with lineage information added to the table details. Additionally, a test function, `test_dashboard_with_prepopulated_data`, prepopulates data for the dashboard, ensuring the new feature functions properly.
- Avoid resource conflicts in integration tests by using a random dir name (#2865). In this release, we have addressed resource conflicts in integration tests by introducing random directory names. The `save_locations` method in `conftest.py` has been updated to generate random directory names using the `tempfile.mkdtemp` function, based on the value of the new `make_random` parameter. Additionally, in the `test_migrate.py` file located in the `tests/integration/hive_metastore` directory, the hard-coded directory name has been replaced with a random one generated by the `make_random` function, used when creating external tables and specifying the external delta location. Lastly, the `test_move_tables_table_properties_mismatch_preserves_original` function in `test_table_move.py` now includes a randomly generated directory name in the table's external delta and storage location, ensuring that tests can run concurrently without conflicting with each other. These changes resolve the issue described in #2797 and improve the reliability of integration tests.
- Exclude dfsas from used tables (#2841). In this release, we've improved the accuracy of table identification and handling. Direct filesystem access patterns are no longer treated as tables, correcting a previous error. The `collect_tables` method has been updated to exclude table names matching defined direct filesystem access patterns, and a new `TableInfoNode` wraps used tables together with the nodes that use them. The DataFrame API's `spark.table()` calls are still identified as table usage, while `spark.read.parquet()` calls, which represent direct filesystem access, are now ignored. These changes are supported by new unit tests to ensure correctness and reliability.
- Fixed known-matches false positives for libraries starting with the same name as a library in known.json (#2860). This commit addresses false positives in known matches for libraries whose names share a prefix with a library in the known.json file. The `module_compatibility` function in the `known.py` file was updated to look for exact matches or parent-module matches, rather than matches at the beginning of the name. This more nuanced approach ensures that libraries with similar names are not incorrectly flagged as having compatibility issues. Additionally, the known.json file is now sorted when constructing module problems, indicating that the order of entries in this file was relevant to the issue being resolved. New unit tests were added, the test suite was expanded to include tests for known and unknown compatibility, and a new load test was added for the known.json file. These changes improve the reliability of the known-matches feature, which is critical for correctly identifying compatibility issues.
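  The difference between prefix matching and parent-module matching, sketched with an in-memory dict standing in for known.json:

  ```python
  KNOWN = {"dbt": [], "dbt.adapters": []}  # module -> list of known problems


  def module_compatibility_prefix(module: str):
      """Buggy: 'dbtlike_lib' matches 'dbt' by prefix -> false positive."""
      for known, problems in KNOWN.items():
          if module.startswith(known):
              return problems
      return None


  def module_compatibility(module: str):
      """Fixed: exact match, else walk up the parent modules only."""
      parts = module.split(".")
      while parts:
          candidate = ".".join(parts)
          if candidate in KNOWN:
              return KNOWN[candidate]
          parts.pop()  # dbt.adapters.spark -> dbt.adapters -> dbt
      return None


  print(module_compatibility_prefix("dbtlike_lib"))  # [] -- wrongly treated as known
  print(module_compatibility("dbtlike_lib"))         # None -- correctly unknown
  ```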
- Make delta format case sensitive (#2861). In this commit, handling of the delta format is normalized for case to enhance the robustness and reliability of the code. The `TableInMount` class has been updated with a `__post_init__` method that converts the `format` attribute to uppercase, and the `Table` class in the `tables.py` file has likewise gained a `__post_init__` method that converts the `table_format` attribute to uppercase during object creation, making format comparisons case insensitive. New properties, `is_delta` and `is_hive`, have been added to the `Table` class to check whether the table format is delta or hive, respectively. Thes...
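  The normalization described above is a plain dataclass hook; a sketch under the assumption that `Table` is a dataclass (field list abbreviated):

  ```python
  from dataclasses import dataclass


  @dataclass
  class Table:
      """Sketch: abbreviated table record with a normalized format."""

      catalog: str
      database: str
      name: str
      table_format: str

      def __post_init__(self):
          # Normalize once at construction so "delta", "Delta" and "DELTA" compare equal.
          self.table_format = self.table_format.upper()

      @property
      def is_delta(self) -> bool:
          return self.table_format == "DELTA"


  print(Table("hive_metastore", "sales", "orders", "delta").is_delta)  # True
  ```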
v0.39.0
- Added `Farama-Notifications` to known list (#2822). A new configuration has been implemented in this release to integrate Farama-Notifications into the existing known list, partially addressing issue #1931.
- Added `aiohttp-cors` library to known list (#2775). In this release, we have added the `aiohttp-cors` library, which provides asynchronous Cross-Origin Resource Sharing (CORS) handling for the `aiohttp` library. The entry includes modules such as `aiohttp_cors`, `aiohttp_cors.abc`, `aiohttp_cors.cors_config`, `aiohttp_cors.mixin`, `aiohttp_cors.preflight_handler`, `aiohttp_cors.resource_options`, and `aiohttp_cors.urldispatcher_router_adapter`, which offer functionality for configuring and handling CORS in `aiohttp` applications. This change partially resolves issue #1931.
- Added `category-encoders` library to known list (#2781). In this release, we've added the `category-encoders` library, which provides a variety of methods for encoding categorical variables as numerical data, including one-hot encoding and target encoding. This addition resolves part of issue #1931. The library has been integrated by adding a new entry for `category-encoders` in the known.json file, covering the modules and classes corresponding to the various encoding methods the library provides.
- Added `cmdstanpy` to known list (#2786). In this release, we have added the `cmdstanpy` and `stanio` libraries to the known list. `cmdstanpy` is a Python library for interfacing with the Stan probabilistic programming language, covering loading, inspecting, and manipulating Stan model objects, as well as running MCMC simulations. The `stanio` library provides functionality for reading and writing Stan data and model files.
- Added `confection` library to known list (#2787). In this release, the `confection` library, a lightweight, pure-Python configuration system, has been added to the known list of libraries and is now recognized by the project. Additionally, several modules from the `srsly` library, a collection of serialization utilities for Python with support for JSON, MessagePack, cloudpickle, and Ruamel YAML, have been added to the known list, increasing the project's coverage of serialized-data handling. This partially resolves issue #1931.
- Added `configparser` library to known list (#2796). In this release, we have added support for the `configparser` library, addressing issue #1931. `configparser` is a standard Python library for parsing configuration files. This change whitelists the library and also includes the `backports.configparser` and `backports.configparser.compat` modules, which provide backward compatibility for older Python versions, so the library is recognized regardless of the Python version in use.
- Added `diskcache` library to known list (#2790). The `diskcache` library has been added to the known list, including the `diskcache`, `diskcache.cli`, `diskcache.core`, `diskcache.djangocache`, `diskcache.persistent`, and `diskcache.recipes` modules. The `diskcache` library is a high-performance caching system useful for caching database queries, API responses, or any large data that needs frequent access. This partially addresses issue #1931.
- Added `dm-tree` library to known list (#2789). In this release, we have added the `dm-tree` library to the project's known list, enabling its recognition within our software. The `dm-tree` library provides functionality for creating and manipulating tree data structures, with support for sequences and tree benchmarking. This partially resolves issue #1931.
- Added `evaluate` to known list (#2821). In this release, we have added the `evaluate` package and its dependent libraries to the known list. The `evaluate` package is a tool for evaluating and analyzing machine learning models, providing a consistent interface to various evaluation tasks. Its dependent libraries include `colorful`, `cmdstanpy`, `comm`, `eradicate`, `multiprocess`, and `xxhash`: `colorful` colorizes terminal output, `cmdstanpy` provides Python infrastructure for Stan (a platform for statistical modeling and high-performance statistical computation), `comm` is used for creating and managing IPython comms, `eradicate` removes commented-out code, `multiprocess` is used for spawning processes, and `xxhash` implements the fast XXHash hashing algorithms. This partly resolves issue #1931.
- Added `future` to known list (#2823). In this commit, we have added the `future` module, a compatibility layer for Python 2 and Python 3, to the project's known list. This module provides a wide range of backward-compatible tools and fixers to smooth over the differences between the two major versions of Python. It includes numerous sub-modules such as `future.backports`, `future.builtins`, `future.moves`, and `future.standard_library`, which offer backward-compatible features for various parts of the Python standard library. The commit also includes related modules like `libfuturize`, `libpasteurize`, and `past`, with their respective sub-modules, which provide tools for automatically converting Python 2 code to Python 3 syntax.
- Added `google-api-core` to known list (#2824). In this commit, we have added the `google-api-core` and `proto-plus` packages to the known list. The `google-api-core` package brings in a collection of modules for low-level support of Google Cloud services, such as client options, gRPC helpers, and retry mechanisms, enabling access to a wide range of functionality for interacting with Google Cloud services. The `proto-plus` package includes protobuf-related modules that simplify handling and manipulating protobuf messages, covering datetime helpers, enums, fields, marshaling utilities, message definitions, and more.
- Added `google-auth-oauthlib` and dependent libraries to known list (#2825). In this release, we have added the `google-auth-oauthlib` and `requests-oauthlib` libraries and their dependencies to enhance OAuth2 authentication flow support. The `google-auth-oauthlib` library is utilized for Google's OAuth2 client authentication and authorization flows, while `requests-oauthlib` provi...
v0.38.0
- Added Py4j implementation of tables crawler to retrieve a list of HMS tables in the assessment workflow (#2579). In this release, we have added a Py4j-based tables crawler to retrieve the list of Hive Metastore tables in the assessment workflow. A new `FasterTableScanCrawler` class has been introduced, which can be used in the assessment job behind a feature flag to replace the old Scala code, allowing better logging during table scans. The existing `assessment.crawl_tables` workflow now utilizes the new Py4j crawler instead of the Scala one. Integration tests have been added to ensure the functionality works correctly. The commit also includes a new method for listing table names in a specified database and improvements to error handling and logging, resulting in faster table scanning during the assessment process. This change is part of addressing issue #2190 and was co-authored by Serge Smertin.
- Added `create-ucx-catalog` CLI command (#2694). A new CLI command, `create-ucx-catalog`, has been added to create a catalog for migration tracking that can be used across multiple workspaces. The catalog tracks migration status and artifacts and is created by running `databricks labs ucx create-ucx-catalog` and specifying the storage location for the catalog. Relevant user documentation, unit tests, and integration tests have been added for this command. The `assign-metastore` command has also been updated to allow selecting a metastore when multiple metastores are available in the workspace region. These changes improve the migration tracking feature and enhance the user experience.
- Added experimental `migration-progress-experimental` workflow (#2658). This commit introduces an experimental workflow, `migration-progress-experimental`, which refreshes the inventory for resources such as clusters, grants, jobs, pipelines, policies, tables, `TableMigrationStatus`, and UDFs. The workflow can be triggered using the `databricks labs ucx migration-progress` CLI command and uses a new implementation of a Scala-based crawler, `TablesCrawler`, which will eventually replace the current implementation. The new workflow duplicates most of the `assessment` pipeline's functionality, with some differences such as the use of `TablesCrawler`, and is expected to run on a schedule in the future. Relevant user documentation has been added, along with unit tests, integration tests, and a screenshot of a successful staging-environment run. This change resolves #2574 and progresses #2074.
- Added handling for `InternalError` in `Listing.__iter__` (#2697). This release improves error handling in the `Listing.__iter__` method of the `Generic` class, located in the `workspace_access/generic.py` file. Previously, only `NotFound` exceptions were handled; now both `InternalError` and `NotFound` exceptions are caught and logged appropriately. This enhances the robustness of the method, which lists objects of a specific type and returns them as `GenericPermissionsInfo` objects. New unit tests and manual testing cover this behavior, including proper logging of the `InternalError` exception in the `GenericPermissionsSupport` class when listing serving endpoints, verified by the newly added `test_internal_error_in_serving_endpoints_raises_warning` test and the updated `test_serving_endpoints_not_enabled_raises_warning`.
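  A sketch of the tolerant iterator; `GenericPermissionsInfo` and the listing callable are reduced to illustrative stand-ins:

  ```python
  import logging
  from collections.abc import Iterator
  from dataclasses import dataclass

  from databricks.sdk.errors import InternalError, NotFound

  logger = logging.getLogger(__name__)


  @dataclass
  class GenericPermissionsInfo:
      object_id: str
      request_type: str


  class Listing:
      """Sketch: lists objects of one type, tolerating backend errors."""

      def __init__(self, func, id_attribute: str, object_type: str):
          self._func = func
          self._id_attribute = id_attribute
          self._object_type = object_type

      def __iter__(self) -> Iterator[GenericPermissionsInfo]:
          try:
              for item in self._func():
                  yield GenericPermissionsInfo(getattr(item, self._id_attribute), self._object_type)
          except (InternalError, NotFound) as e:
              # e.g. serving endpoints not enabled in this workspace: log, yield nothing.
              logger.warning(f"Listing {self._object_type} failed: {e}")
  ```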
- Added handling for `PermissionDenied` when listing accessible workspaces (#2733). A new `can_administer` method has been added to the `Workspaces` class in the `workspaces.py` file, allowing more fine-grained control over which users can administer workspaces. This method checks whether the user has access to a given workspace and is a member of the workspace's `admins` group, indicating administrative privileges for that workspace; otherwise it returns `False`. Additionally, error handling in the `get_accessible_workspaces` method has been improved by adding `PermissionDenied` to the list of exceptions that are caught and logged. New unit tests have been added for the `AccountWorkspaces` class of the `databricks.labs.blueprint.account` module to check whether a user is a workspace administrator based on membership of the `admins` group. The linked issue #2732 is resolved by this change. All changes have been manually and unit tested.
- Added static code analysis results to assessment dashboard (#2696). This commit introduces two new tasks, `assess_dashboards` and `assess_workflows`, for identifying migration problems in dashboards and workflows. These tasks analyze embedded queries and notebooks for migration issues and collect direct filesystem access patterns requiring attention. Upon completion, the results are stored in the inventory database and displayed on the migration dashboard. Additionally, two new widgets, job/query problem widgets and direct-filesystem-access widgets, provide additional information on code compatibility and access control. Integration tests using mock data have been added and manually tested to ensure proper functionality. This update makes it easier for users to identify and address Unity Catalog compatibility issues in their workflows and dashboards.
- Added unskip CLI command to undo a skip on schema or a table (#2727). This pull request introduces a new CLI command, `unskip`, which allows users to reverse a previously applied `skip` on a schema or table. The `unskip` command accepts a required `--schema` parameter and an optional `--table` parameter. A new function, also named `unskip`, takes the same parameters as the `skip` command, checks for the required `--schema` parameter, and creates a new `WorkspaceContext` object to call the appropriate method on the `table_mapping` object. Two new methods, `unskip_schema` and `unskip_table_or_view`, have been added to the `HiveMapping` class; they remove the skip mark from a schema or table, respectively, and handle exceptions such as `NotFound` and `BadRequest`. The `get_tables_to_migrate` method has been updated to take the unskipped tables and schemas into account. Currently, the feature is tested manually and has not been added to the user documentation.
- Added unskip CLI command to undo a skip on schema or a table (#2734). A new `unskip` CLI command has been added, which removes the `skip` mark set by the existing `skip` command on a specified schema or table. This command takes an optional `--table` flag; if it is not provided, the entire schema is unskipped. The new functionality is accompanied by a unit test and relevant user documentation, and addresses issue #1938. The implementation adds the `unskip_table_or_view` method, which generates the appropriate `ALTER TABLE/VIEW` statement to remove the skip marker, and updates the `unskip_schema` method to include the schema name in the `ALTER SCHEMA` statement. Additionally, exception handling has been updated to include `NotFound` and `BadRequest` exceptions. This feature simplifies undoing a skip on a schema, table, or view in the Hive metastore, which previously required manual editing of Hive metastore properties.
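  The release notes describe the skip marker as a Hive metastore property removed via `ALTER` statements. A sketch of the unskip operation, assuming the marker is a table property named `databricks.labs.ucx.skip` and a generic `sql_backend.execute()` (both assumptions; the actual property key and backend API live in UCX):

  ```python
  import logging

  from databricks.sdk.errors import BadRequest, NotFound

  logger = logging.getLogger(__name__)

  UCX_SKIP_PROPERTY = "databricks.labs.ucx.skip"  # assumed property key


  def unskip_table_or_view(sql_backend, schema: str, table: str) -> None:
      """Sketch: drop the skip marker from a table or view."""
      try:
          sql_backend.execute(
              f"ALTER TABLE hive_metastore.{schema}.{table} "
              f"UNSET TBLPROPERTIES IF EXISTS ('{UCX_SKIP_PROPERTY}')"
          )
      except (NotFound, BadRequest) as e:
          logger.error(f"Failed to remove skip marker from {schema}.{table}: {e}")
  ```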
- Assess source code as part of the assessment (#2678). This commit introduces two new assessment tasks for evaluating source code: one for SQL queries in dashboards and one for notebooks/files in jobs and tasks. The existing `databricks labs install ucx` command has been modified to incorporate linting during the assessment, and the `QueryLinter` class has been updated to accept an additional argument for linting source code. These changes have been thoroughly tested through integration tests. Co-authored by Eric Vergnaud.
- Bump astroid version, pylint version and drop our f-string workaround (#2746). In this update, we have bumped astroid and pylint to 3.3.1 and removed workarounds related to f-string inference limitations in previous versions of astroid (< 3.3). These workarounds were needed to handle issues such as uninferrable `sys.path` values and the lack of f-string inference in loops. Corresponding tests have been updated to reflect these changes, improving the overall code quality and maintainability of the project. These changes are part of a larger effort to update dependencies and simplify the codebase by leveraging the latest features of up...
v0.37.0
- Added ability to run create-missing-principals command as collection (#2675). This release introduces the capability to run the `create-missing-principals` command as a collection in the UCX tool with the new optional flag `run-as-collection`. This allows for more control and flexibility when managing cloud resources, particularly when handling multiple workspaces. The existing `create-missing-principals` command has been modified to accept a new `run_as_collection` parameter, enabling the command to run on multiple workspaces when set to True. The function has been updated to handle a list of `WorkspaceContext` objects, allowing it to iterate over each object and execute the necessary actions for each workspace. Additionally, a new `AccountClient` parameter has been added to facilitate the retrieval of all workspaces associated with a specific account. New test functions have been added to `test_cli.py` to test this functionality on the AWS and Azure cloud providers. The `acc_client` argument has been added to the test functions to enable running the tests with an authenticated AWS or Azure client, and the `MockPrompts` object is used to simulate user responses to the prompts displayed during the execution of the command.
- Added storage for direct filesystem references in code (#2526). The open-source library has been updated with a new table `directfs_in_paths` to store Direct File System Access (DFSA) records, extending support for managing and collecting DFSAs as part of addressing issues #2350 and #2526. The changes include a new class `DirectFsAccessCrawlers` and methods for handling DFSAs, as well as linting, testing, and a manually verified schema upgrade. Additionally, a new SQL query deprecates the use of direct filesystem references. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
- Added task for linting queries (#2630). This commit introduces a new `QueryLinter` class for linting SQL queries in the workspace, similar to the existing `WorkflowLinter` for jobs. The `QueryLinter` checks for any issues in dashboard queries and reports them in a new `query_problems` table. The commit also includes the addition of unit tests, integration tests, and manual testing of the schema upgrade. The `QueryLinter` has been updated to include a `TableMigrationIndex` object, which is currently set to an empty list and will be updated in a future commit. This change improves the quality of the codebase by ensuring that all SQL queries are properly linted and any issues are reported, allowing for better maintenance and development of the system. The commit is co-authored by multiple developers, including Eric Vergnaud, Serge Smertin, Andrew Snare, and Cor. Additionally, a new linting rule, "direct-filesystem-access", has been introduced to deprecate the use of direct filesystem references in favor of more abstracted file access methods in the project's codebase.
- Adopt `databricks-labs-pytester` PyPI package (#2663). In this release, we have updated the `pyproject.toml` file, bumping the `pytest` package from version 8.1.0 to 8.3.3 and adding the `databricks-labs-pytester` package with a minimum version of 0.2.1. This update also includes the adoption of the `databricks-labs-pytester` PyPI package, which moves fixture usage from `mixins.fixtures` into its own top-level library. This affects various test files, including `test_jobs.py`, by replacing the `get_purge_suffix` fixture with `watchdog_purge_suffix` to standardize the approach to creating and managing temporary directories and files used in tests. Additionally, new fixtures have been introduced in a separate PR for testing the `databricks.labs.ucx` package, including `debug_env_name`, `product_info`, `inventory_schema`, `make_lakeview_dashboard`, `make_dashboard`, `make_dbfs_data_copy`, `make_mounted_location`, `make_storage_dir`, `sql_exec`, and `migrated_group`. These fixtures simplify the testing process by providing preconfigured resources that can be used in the tests. The `redash.py` file has been removed from the `databricks/labs/ucx/mixins` directory as the Redash API is being deprecated and replaced with a new library.
- Assessment: crawl UDFs as a task in parallel to tables instead of implicitly during grants (#2642). This release introduces changes to the assessment workflow, specifically in how User Defined Functions (UDFs) are crawled. Previously, UDFs were crawled implicitly by the GrantsCrawler, which requested a snapshot from the UDFSCrawler that hadn't executed yet. With this update, UDFs are now crawled as their own task, running in parallel with tables before grants crawling begins. This modification addresses issue #2574, which requires grants and UDFs to be refreshable, but only once within a given workflow run. A new method, `crawl_udfs`, has been introduced to iterate over all UDFs in the Hive Metastore of the current workspace and persist their metadata in a table named `$inventory_database.udfs`. This inventory is utilized when scanning securable objects for issues with grants that cannot be migrated to Unity Catalog. The `crawl_grants` task now depends on `crawl_udfs`, `crawl_tables`, and `setup_tacl`, ensuring that UDFs are crawled before grants are (a sketch of this task-ordering pattern follows below).
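The task-ordering described above can be pictured with a schematic stand-in for a task framework: each task declares what must complete first, so `crawl_grants` only runs once tables and UDFs are inventoried. The decorator and names below are simplified assumptions, not the actual UCX framework:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    depends_on: list[str] = field(default_factory=list)


TASKS: dict[str, Task] = {}


def task(name: str, depends_on: list[str] | None = None):
    """Register a workflow task and the tasks it depends on."""
    def register(fn):
        TASKS[name] = Task(name, depends_on or [])
        return fn
    return register


@task("crawl_tables")
def crawl_tables():
    ...  # snapshot Hive Metastore tables into $inventory_database.tables


@task("crawl_udfs")
def crawl_udfs():
    ...  # snapshot Hive Metastore UDFs into $inventory_database.udfs


@task("setup_tacl")
def setup_tacl():
    ...  # prepare the table-ACL cluster


@task("crawl_grants", depends_on=["crawl_udfs", "crawl_tables", "setup_tacl"])
def crawl_grants():
    ...  # can rely on tables and UDFs already being inventoried
```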
- Collect direct filesystem access from queries (#2599). This commit introduces support for extracting Direct File System Access (DirectFsAccess) records from workspace queries, adding a new table `directfs_in_queries` and a new view `directfs` that unions `directfs_in_paths` with the new table. The `DirectFsAccessCrawlers` class has been split into two factory methods, `DirectFsAccessCrawler.for_paths` and `DirectFsAccessCrawler.for_queries`, and a new `QueryLinter` class has been introduced to check queries for DirectFsAccess records. Unit tests and manual tests have been conducted to ensure the correct functioning of the schema upgrade. The commit is co-authored by Eric Vergnaud, Serge Smertin, and Andrew Snare.
- Fixed failing integration test: `test_reflect_account_groups_on_workspace_skips_groups_that_already_exists_in_the_workspace` (#2624). In this release, we have updated the group migration workflow, addressing an issue (#2623) where this integration test failed due to unhandled scenarios in which a workspace group already existed with the same name as an account group to be reflected. The changes include the addition of a new method, `_workspace_groups_in_workspace()`, which checks for the existence of workspace groups. We have also modified the `group-migration` workflow and the integration test `test_reflect_account_groups_on_workspace_skips_account_groups_when_a_workspace_group_has_same_name`. To enhance consistency and robustness, two new tests have been added for the `GroupManager` class: `test_reflect_account_groups_on_workspace_warns_skipping_when_a_workspace_group_has_same_name` and `test_reflect_account_groups_on_workspace_logs_skipping_groups_when_already_reflected_on_workspace`. These tests verify that a group is skipped, with a warning logged, when a workspace group with the same name exists, and that groups already reflected on the workspace are logged as skipped. These improvements ensure that the system behaves as expected during the group migration process, handling cases where workspace groups and account groups share the same name.
- Fixed failing solution accelerator verification tests (#2648). This release includes a fix for an issue in the LocalCodeLinter class that was unable to normalize Python code at the notebook cell level. The solution involved modifying the LocalCodeLinter constructor to include a notebook loader, as well as adding a conditional block to the `lint_path` method to determine the correct loader to use based on whether the path is a notebook or not. These changes allow the linter to handle Python code more effectively within Jupyter notebook cells. The tests for this change were manually verified using `make solacc` on the files that failed in CI. This commit has been co-authored by Eric Vergnaud. The functionality of the linter remains unchanged, and there is no impact on the overall software functionality.
- Fixed handling of potentially corrupt `state.json` of UCX workflows (#2673). This commit introduces a fix for potential corruption of `state.json` files in UCX workflows, addressing issue #2673 and resolving #2667. It updates the import statement in `install.py`, introduces a new `with_extra` function, and centralizes the deletion of jobs, improving code maintainability. Two new methods are added to check if a job is managed by UCX. Additionally, the commit removes deprecation warnings for direct filesystem references in pytester fixtures and adjusts the `known.json` file to accurately reflect the project's state. A new `Task` method is added for defining UCX workflow tasks, and several test cases are updated to ensure the correct handli...
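The defensive pattern behind the `state.json` fix above can be sketched as follows. This is illustrative only, assuming a simple `{"resources": {...}}` layout; it is not UCX's actual installer code:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_install_state(path: Path) -> dict:
    """Load an install state file, tolerating a missing or corrupt state.json."""
    try:
        text = path.read_text()
    except FileNotFoundError:
        logger.warning(f"No state file at {path}, starting with empty state")
        return {"resources": {}}
    try:
        state = json.loads(text)
    except json.JSONDecodeError as e:
        logger.warning(f"Corrupt state file at {path} ({e}), ignoring its contents")
        return {"resources": {}}
    if not isinstance(state, dict) or "resources" not in state:
        logger.warning(f"Unexpected state layout at {path}, ignoring its contents")
        return {"resources": {}}
    return state
```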
v0.36.0
- Added `upload` and `download` CLI commands to upload and download a file to/from a collection of workspaces (#2508). In this release, the Databricks Labs UCX command-line interface has been updated with new `upload` and `download` commands. The `upload` command allows users to upload a file to a single workspace or a collection of workspaces, while the `download` command enables users to download a CSV file from a single workspace or a collection of workspaces. This enhances the efficiency of uploading or downloading the same file to multiple workspaces. Both commands display a warning or information message upon completion, and ensure the file schema is correct before uploading CSV files. This feature includes new methods for uploading and downloading files for multiple workspaces, as well as new unit and integration tests. Users can refer to the contributing instructions to help improve the project.
- Added ability to run `create-table-mapping` command as collection (#2602). This PR introduces the capability to run the `create-table-mapping` command as a collection in the `databricks labs ucx` CLI, providing increased flexibility and automation for workflows. A new optional boolean flag, `run-as-collection`, has been added to the `create-table-mapping` command, allowing users to indicate whether they want to run it as a collection, with a default value of False. The updated `create_table_mapping` function now accepts additional arguments, enabling efficient creation of table mappings for multiple workspaces. Users are encouraged to test this feature in various scenarios and provide feedback for further improvements.
- Added comment on the source tables to capture that they have been deprecated (#2548). A new method, `_sql_add_migrated_comment(self, table: Table, target_table_key: str)`, has been added to the `table_migrate.py` file to mark deprecated source tables with a comment indicating their deprecated status and directing users to the new table. This method is currently used in three existing methods within the same file to add comments to deprecated tables as part of the migration process. In addition, a new SQL query has been added to set a comment on the source table `hive_metastore.db1_src.managed_dbfs`, indicating that it is deprecated and directing users to the new table `ucx_default.db1_dst.managed_dbfs`. A unit test has also been updated to ensure that the migration process correctly adds the deprecation comment to the source table. This change is part of a larger effort to deprecate and migrate data from old tables to new tables, and it provides guidance for users to migrate to the new table.
- Added documentation for PrincipalACL migration and delete-missing-principal cmd (#2552). In this release, the UCX project has added a new command, `delete-missing-principals`, applicable only to AWS, to delete IAM roles created by UCX. This command lists all IAM roles generated by the `principal-prefix-access` command and allows for the selection of multiple roles to delete. It checks whether the selected roles are mapped to any storage credentials and seeks confirmation before deleting the role and its associated inline policy. Additionally, updates have been made to the `create-uber-principal` and `migrate-locations` commands to apply location ACLs from existing clusters and grant necessary permissions to users. The `create-catalogs-schemas` command has been updated to apply catalog and schema ACLs from existing clusters for both Azure and AWS. The `migrate-tables` command has also been updated to apply table and view ACLs from existing clusters for both Azure and AWS. The documentation of commands that require admin privileges in the UCX project has also been updated.
- Added linting for `spark.sql(...)` calls (#2558). This commit introduces linting for `spark.sql(...)` calls to enhance code quality and consistency. Previously, `SparkSqlPyLinter` only applied the table-migration checks, not other SQL linters such as the DirectFsAccess linters. This has been rectified by incorporating the additional SQL linters for `spark.sql(...)` calls, improving the overall linting functionality of the system. The commit also introduces an abstract base class called Fixer, which enforces the inclusion of a `name` property for all derived classes. Additionally, minor improvements and changes have been made to the codebase. The commit resolves issue #2551, and updates the testing process in `test_functional.py` to test `spark-sql-directfs.py`, ensuring the proper functioning of the linted `spark.sql(...)` calls.
- Document: clarify that the `assessment` job is not intended to be re-run (#2560). In this release, we have clarified the behavior of the `assessment` job for Databricks Labs UCX to address confusion around its re-run functionality. The `assessment` job should only be executed once during the initial setup of UCX and should not be re-run to refresh the inventory or findings. If a re-assessment is necessary, UCX will need to be reinstalled first. This change aligns the documentation with the actual functionality of the `assessment` job and will not affect the daily job that updates parts of the inventory. The `assessment` workflow is designed to detect incompatible entities and provide information for the migration process. It can be executed in parallel or sequentially, and its output is stored in Delta tables for further analysis and decision-making through the assessment report.
- Enabled `migrate-credentials` command to run as collection (#2532). In this pull request, the `migrate-credentials` command in the UCX project's CLI has been updated with a new optional flag, `run_as_collection`, which allows the command to operate on multiple workspaces as a collection. This change introduces the `get_contexts` function and modifies the `delete_missing_principals` function to support the new functionality. The `migrate-credentials` command's behavior for Azure and AWS has been updated to accept an additional `acc_client` argument in its tests. Comprehensive tests and documentation have been added to ensure the reliability and robustness of the new functionality. It is recommended to review the attached testing evidence and ensure the new functionality works as intended without introducing any unintended side effects.
- Escape column names in target tables of the table migration (#2563). In this release, the `escape_sql_identifier` function in the `utils.py` file has been enhanced with a new `maxsplit` parameter, providing more control over the maximum number of splits performed on the input string. This addresses issue #2544 and is part of the existing "-migration-ones" workflow. The `tables.py` file in the `databricks/labs/ucx/hive_metastore` directory has been updated to escape column names in target tables, preventing SQL injection attacks. Additionally, a new `ColumnInfo` class and several utility functions have been added to the `fixtures.py` file in the `databricks.labs.ucx` project for generating SQL schemas and column casting. The integration tests for migrating Hive Metastore tables have been updated with new tests to handle column names that require escaping. Lastly, the `test_manager.py` file in the `tests/unit/workspace_access` directory has been refactored by removing the `mock_backend` fixture and adding the `test_inventory_permission_manager_init` method to test the initialization of the `PermissionManager` class. These changes improve security, functionality, and test coverage for software engineers utilizing these libraries in their projects.
- Explain why metastore is checked to exist in group migration workflow in docstring (#2614). In the updated `workflows.py` file, the docstring for the `verify_metastore_attached` method has been revised to explain the necessity of checking whether a metastore is attached to the workspace. The reason for this check is that account-level groups are only available when a metastore is attached, which is crucial for the group migration workflow to function properly. The method itself remains the same, only verifying the presence of a metastore attached to the workspace and causing the workflow to fail if no metastore is found. This modification enhances the clarity of the metastore check's importance in the context of the group migration workflow.
- Fixed infinite recursion when visiting a dependency graph (#2562). This change addresses an issue of infinite recursion that can occur when visiting a dependency graph, particularly when many files in a package import the package itself. The `visit` method has been modified to only visit each parent/child pair once, preventing the recursion that can occur in such cases (see the sketch below). The `dependencies` property has been added to the DependencyGraph class, and the `DependencyGraphVisitor` class has been introduced to handle visiting nodes and tracking visited pairs. These modifications improve the robustness of the library by preventing infinite recursion during dependency resolution. The change includes added unit tests to ensure correct behavior and addresses a blocker for a previous pull request. The functionality of the code remains unchanged...
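The visiting pattern behind the recursion fix can be sketched in a few lines: record each (parent, child) pair as it is walked, and skip pairs already seen, so a module that imports its own package cannot recurse forever. The class names here are illustrative, not the UCX implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


def visit(node: Node, callback, _visited: set[tuple[str, str]] | None = None) -> None:
    """Walk every parent/child edge exactly once, so cyclic imports terminate."""
    visited = _visited if _visited is not None else set()
    callback(node)
    for child in node.children:
        edge = (node.name, child.name)
        if edge in visited:
            continue  # this parent/child pair was already walked
        visited.add(edge)
        visit(child, callback, visited)


# A package whose module imports the package itself (a cycle):
pkg = Node("pkg")
mod = Node("pkg.mod", children=[pkg])
pkg.children.append(mod)
visit(pkg, lambda n: print(n.name))  # terminates instead of recursing forever
```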
v0.35.0
- Added `databricks labs ucx delete-credential` cmd to delete the UC roles created by UCX (#2504). In this release, we've added several new commands to the `labs.yml` file for managing Unity Catalog (UC) roles in Databricks, specifically for AWS. The new commands include `databricks labs ucx delete-missing-principals` and `databricks labs ucx delete-credential`. The `databricks labs ucx delete-missing-principals` command helps manage UC roles created through the `create-missing-principals` command by listing all the UC roles in AWS and allowing users to select roles to delete. It also checks for unused roles before deletion. The `databricks labs ucx delete-credential` command deletes UC roles created by UCX and is equipped with an optional `aws-profile` flag for authentication purposes. Additionally, we've added a new method `delete_uc_role` in the `access.py` file for deleting UC roles and introduced new test cases to ensure correct behavior. These changes resolve issue #2359, improving the overall management of UC roles in AWS.
- Added basic documentation for linter message codes (#2536). A new section, "Linter Message Codes," has been added to the README file, providing detailed explanations, examples, and resolution instructions for various linter message codes related to Unity Catalog (UC) migration. To help users familiarize themselves with the different message codes that may appear during linting, a new command, `python tests/integration/source_code/message_codes.py`, has been implemented. Running this command will display a list of message codes, including `cannot-autofix-table-reference`, `catalog-api-in-shared-clusters`, `changed-result-format-in-uc`, `dbfs-read-from-sql-query`, `dbfs-usage`, `dependency-not-found`, `direct-filesystem-access`, `implicit-dbfs-usage`, `jvm-access-in-shared-clusters`, `legacy-context-in-shared-clusters`, `not-supported`, `notebook-run-cannot-compute-value`, `python-udf-in-shared-clusters`, `rdd-in-shared-clusters`, `spark-logging-in-shared-clusters`, `sql-parse-error`, `sys-path-cannot-compute-value`, `table-migrated-to-uc`, `to-json-in-shared-clusters`, and `unsupported-magic-line`. Users are encouraged to review these message codes and their corresponding explanations to ensure a smooth migration to Unity Catalog.
- Added linters for direct filesystem access in Python and SQL code (#2519). In this release, linters have been added for detecting direct filesystem access (DFSA) in Python and SQL code, which is deprecated in Unity Catalog. Initially, the linters only detect DBFS, but the plan is to expand detection to all DFSAs. This change includes new unit tests. The linters flag code that accesses the file system directly, which is not allowed in Unity Catalog, including SQL queries that read from DBFS and Python code that reads from or displays data using DBFS paths (a simplified sketch of such a check follows below). Developers are required to modify such code to use Unity Catalog tables or volumes instead, ensuring that their code is compatible with Unity Catalog's deprecation of direct filesystem access and DBFS, ultimately resulting in better project integration and performance.
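As a rough illustration of what such a linter checks, the sketch below scans Python source for string literals that look like DBFS paths. It is a simplified stand-in, not the UCX linter; the message code `direct-filesystem-access` is the only name taken from the list above:

```python
import ast

DBFS_PREFIXES = ("dbfs:/", "/dbfs/")


def lint_dbfs_usage(source: str) -> list[tuple[int, str]]:
    """Return (line, message) pairs for string literals that point at DBFS."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(DBFS_PREFIXES):
                problems.append(
                    (node.lineno, f"direct-filesystem-access: {node.value!r} uses DBFS")
                )
    return problems


code = 'df = spark.read.csv("dbfs:/mnt/raw/data.csv")\n'
for line, message in lint_dbfs_usage(code):
    print(f"line {line}: {message}")
```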
- Clean up left over uber principal resources for AWS (#2449). In this release, we have made significant improvements to the management of AWS resources and permissions. We have added new methods to handle the creation and deletion of uber instance profiles and external locations in AWS. The `create_uber_principal` method in the `AWSResourcePermissions` class has been updated to allow more fine-grained control over the migration process and to ensure proper creation and configuration of all necessary resources. Additionally, we have introduced a new `AWSResources` class and an `aws_resource_permissions` property in the `aws_cli_ctx` fixture to improve the management of AWS resources and permissions. We have also added new unit tests to ensure proper error handling when creating AWS IAM roles for Unity Catalog migration (UCX) in specific scenarios. These changes enhance the functionality, test coverage, and overall quality of the library.
- Improve log warning about skipped grants (#2517). In this release, we have improved the warning messages logged during Access Control List (ACL) migration and User-Defined Function (UDF) grant migration. Previously, a generic warning was logged when specific Hive metastore grants could not be mapped. This release enhances the warning messages by providing more specific information about the skipped grants, including the action type of the Hive metastore grant that failed to be mapped (see the sketch below). Additionally, `unittest.mock` has been utilized to create a mock object for the GroupManager class, and a new `MigrateGrants` class has been introduced, which applies a list of grant loaders to a specific table. These changes improve logging and error handling, ensuring that software engineers have a clear understanding of any skipped grants during the migration.
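The improved warning can be pictured with a small sketch: when a Hive metastore action has no Unity Catalog equivalent, log the specific action type and object instead of a generic message. The mapping table below is an illustrative subset, not UCX's actual mapping:

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset of Hive-to-UC privilege mappings; anything missing is skipped.
HIVE_TO_UC = {
    "SELECT": "SELECT",
    "MODIFY": "MODIFY",
    "OWN": "OWNERSHIP",
}


def map_grant(action_type: str, object_key: str) -> str | None:
    """Return the UC privilege for a Hive grant, logging a specific warning if unmapped."""
    privilege = HIVE_TO_UC.get(action_type.upper())
    if privilege is None:
        # Name the exact action type and object, not just "skipped a grant".
        logger.warning(
            f"Hive metastore grant '{action_type}' cannot be mapped "
            f"to Unity Catalog for {object_key}; skipping"
        )
    return privilege
```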
- Support sql notebooks in functional tests (#2513). This pull request introduces support for SQL notebooks in the functional testing framework, expanding its capabilities beyond Python notebooks. The changes include migrating relevant tests from `test_notebook_linter` to functional tests, as well as introducing new classes and methods to support SQL notebook testing. These changes improve the flexibility and scope of the testing framework, enabling developers to test SQL notebooks and ensuring that they meet quality standards. The commit also includes the addition of a new SQL notebook for demonstrating Unity Catalog table migrations, as well as modifications to various tests and regular expressions to accommodate SQL notebooks. Note that the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are currently hardcoded as `any`, requiring further updates based on the specific testing environment.
- Updated sqlglot requirement from <25.19,>=25.5.0 to >=25.5.0,<25.20 (#2533). In this update, we have relaxed the `sqlglot` dependency to a version greater than or equal to 25.5.0 and less than 25.20. The previous requirement only allowed versions below 25.19; the new range admits the 25.19 release, which includes new features and bug fixes. The changelog and commit history for version 25.19 of `sqlglot` are provided in the pull request, highlighting the breaking changes, new features, and bug fixes. By updating to this version, we benefit from the latest improvements and bug fixes in the `sqlglot` library. We encourage users to review the changelog and test their code to ensure compatibility with the new version.
- [chore] fixed `make fmt` warnings related to sdk upgrade (#2534). In this change, warnings related to a recent SDK upgrade have been addressed by modifying the `create` function in the `fixtures.py` file. The `create` function is responsible for generating a `Wait[ServingEndpointDetailed]` object, which contains information about the endpoint name and its core configuration. The `ServedModelInput` instance was previously created using positional arguments, but has been updated to use named arguments instead (`model_name`, `model_version`, `workload_size`, `scale_to_zero_enabled`). This modification enhances code readability and maintainability, making it easier for software engineers to understand and modify the codebase; the sketch below shows the difference.
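A minimal before/after of the kind of change described, assuming the `databricks-sdk` `ServedModelInput` dataclass with the four fields named above; the field values are placeholders:

```python
from databricks.sdk.service.serving import ServedModelInput

# Before: positional arguments work, but each value's meaning is opaque at the
# call site and silently breaks if the field order ever changes.
served_model = ServedModelInput("my-model", "1", "Small", True)

# After: named arguments make each field explicit and order-independent.
served_model = ServedModelInput(
    model_name="my-model",
    model_version="1",
    workload_size="Small",
    scale_to_zero_enabled=True,
)
```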
Dependency updates:
- Updated sqlglot requirement from <25.19,>=25.5.0 to >=25.5.0,<25.20 (#2533).
Contributors: @asnare, @ericvergnaud, @nfx, @FastLee, @dependabot[bot], @HariGS-DB, @JCZuurmond