v0.25.0
- Added handling for legacy ACL `DENY` permission in group migration (#1815). Handling of `DENY` permissions during group migration of the legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not applied correctly during migrations. This has been resolved by adding a condition in the `_apply_grant_sql` method that checks for the presence of `DENIED` in the `action_type`, removes the prefix, and encloses the action type in backticks to prevent syntax errors. A new test function, `test_hive_deny_sql()`, covers the `DENY` permission behavior. These changes have been verified through manual testing, unit tests, integration tests, and the staging environment, and resolve issue #1803.
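A minimal sketch of the prefix-stripping logic described above; the signature and the exact `DENIED_` prefix format are assumptions for illustration, not the actual UCX code:

```python
def _apply_grant_sql(action_type: str, object_type: str, object_key: str, principal: str) -> str:
    # legacy ACL rows mark denials with a DENIED prefix, e.g. "DENIED_SELECT"
    if "DENIED" in action_type:
        action = action_type.replace("DENIED_", "")
        # backticks around the action keep keyword-like action types from breaking the SQL
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT {action_type} ON {object_type} {object_key} TO `{principal}`"
```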
- Added handling for parsing corrupted log files (#1817). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code checks whether the logs match the expected format; if they don't, a warning message is logged and the function returns, preventing further processing and the production of incorrect results. A new test, `test_parse_logs_warns_for_corrupted_log_file`, verifies that the expected warning message and the corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements make log parsing more robust by introducing error handling for corrupted log files.
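The early-return guard might look roughly like this; the log-line regex is an assumption for illustration:

```python
import logging
import re

logger = logging.getLogger(__name__)

# assumed shape of an installer log line: "HH:MM:SS LEVEL [component] message"
_LOG_LINE = re.compile(r"\d{2}:\d{2}:\d{2}\s+\w+\s+\[.+?\]\s+")

def parse_logs(lines: list[str]) -> None:
    if lines and not _LOG_LINE.match(lines[0]):
        logger.warning(f"Logs do not match expected format, skipping: {lines[0].rstrip()}")
        return  # bail out rather than produce incorrect results
    # ... normal parsing continues here ...
```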
- Added known problems with `pyspark` package (#1813). The `src/databricks/labs/ucx/source_code/known.json` file has been updated to document known issues with the `pyspark` package when running on UC Shared Clusters, such as not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new `KnownProblem` dataclass has been added to the `known.py` file, including a method for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method now uses a `known_problems` set of `KnownProblem` objects, improving readability and the management of known problems within the application.
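A hedged sketch of such a dataclass; the field names and example problem code are assumptions based on the description:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class KnownProblem:
    code: str
    message: str

    def as_dict(self) -> dict[str, str]:
        return {"code": self.code, "message": self.message}

# a set of hashable dataclass instances deduplicates repeated findings per module
known_problems: set[KnownProblem] = set()
known_problems.add(KnownProblem("rdd-in-shared-clusters", "RDD APIs are not supported on UC Shared Clusters"))
```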
- Added library linting for jobs launched on shared clusters (#1689). This release adds library linting for jobs launched on shared clusters, addressing issue #1637. A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, retrieves the libraries installed on a specified existing cluster and registers them in the dependency graph; if the task has no existing cluster ID, the function returns early. The change also updates the `test_jobs.py` file in the `tests/integration/source_code` directory with new methods for linting jobs and handling libraries, and imports the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method builds a dependency graph for job tasks. These changes make linting more reliable by checking for and handling libraries installed on shared clusters, reducing errors caused by missing libraries.
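Conceptually, the lookup could be sketched like this, assuming a `databricks-sdk` version in which `libraries.cluster_status` yields per-library statuses:

```python
from databricks.sdk import WorkspaceClient

def libraries_on_existing_cluster(ws: WorkspaceClient, cluster_id: str | None) -> list:
    # mirror the early return: tasks without an existing cluster bring their own libraries
    if not cluster_id:
        return []
    # each status carries the Library spec (pypi, whl, egg, jar, ...) to register on the graph
    return [status.library for status in ws.libraries.cluster_status(cluster_id)]
```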
- Added linters to check for spark logging and configuration access (#1808). This commit introduces new linters that check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. The changes address one issue and enhance three others related to RDDs on shared clusters and the use of deprecated code. New tests have been added for the linters, and existing ones have been updated. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. The commit also includes documentation for the new functionality, and manual and unit tests confirm that no existing functionality is affected.
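For illustration, these are the kinds of constructs the new linters flag on UC Shared Clusters (a hypothetical snippet; several of these lines would also fail at runtime on such clusters, which is exactly why they are flagged):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

sc.setLogLevel("INFO")                    # spark logging: use Python's logging module instead
log4j = sc._jvm.org.apache.log4j          # driver JVM access is unavailable on UC Shared Clusters
app_name = sc.conf.get("spark.app.name")  # configuration access via sc.conf (pattern the linter matches)
rdd = sc.parallelize([1, 2, 3])
pairs = rdd.mapPartitions(lambda it: it)  # RDD APIs are not supported on UC Shared Clusters
```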
- Added list of known dependency compatibilities and regeneration infrastructure for it (#1747). This change introduces an automated system for regenerating the list of known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. Contribution guidelines covering debugging, fixtures, and IDE setup accompany the change, which prioritizes the use of standard libraries; the target audience is engineers contributing to the open-source library.
- Added more known libraries from Databricks Runtime (#1812). The known-libraries file now covers many more libraries shipped with the Databricks Runtime, including `absl-py`, `aiohttp`, and `grpcio` for networking, `aiosignal`, `anyio`, and `appdirs` among data-processing helpers, and cloud libraries such as `google-auth`, `google-cloud-bigquery`, and `google-cloud-storage`. Recording these libraries in the known-libraries JSON file improves linting coverage for networking, data processing, and cloud computing scenarios.
- Added more known packages from Databricks Runtime (#1814). This release adds a significant number of packages to the known packages file for the Databricks Runtime, including `astor`, `audioread`, `azure-core`, and many others, along with new modules and sub-packages of some already-listed packages. Recording these packages substantially expands the coverage of the known-packages list and improves compatibility checks against code that imports them.
- Added support for `.egg` Python libraries in jobs (#1789). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies on the `DependencyGraph` for linting, addressing issue #1643. A new `PythonLibraryResolver` replaces the old `PipResolver` and is used to register egg library dependencies in the `DependencyGraph`. The changes also add user documentation, a new CLI command, and a new workflow, and modify an existing workflow and table; testing covers manual, unit, and integration tests. The diff updates the imports in the `test_dependencies.py` file, replacing `PipResolver` with `PythonLibraryResolver` from the `databricks.labs.ucx.source_code.python_libraries` package. These changes improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files.
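Since an `.egg` is essentially a zip archive, the registration step can be sketched as unpacking it locally and handing the extracted sources to the graph; the `register_library` call is a hypothetical stand-in for the resolver's API:

```python
import tempfile
import zipfile
from pathlib import Path

def register_egg(graph, egg_path: Path) -> None:
    # extract the egg so its modules exist as plain files the linter can walk
    tmp_dir = Path(tempfile.mkdtemp(prefix="ucx-egg-"))
    with zipfile.ZipFile(egg_path) as egg:
        egg.extractall(tmp_dir)
    graph.register_library(str(tmp_dir))  # hypothetical DependencyGraph API
```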
- Added table migration workflow guide (#1607). UCX is an open-source library that simplifies upgrading Databricks workspaces to Unity Catalog. After installation, users can trigger the assessment workflow, which identifies incompatible entities and provides the information needed to plan the migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various workspace assets, including legacy table ACLs, entitlements, AWS instance profiles, clusters, cluster policies, instance pools, Databricks SQL warehouses, Delta Live Tables, jobs, MLflow experiments and registry, SQL dashboards and queries, SQL alerts, token and password usage permissions set at the workspace level, secret scopes, notebooks, directories, repos, and files. The group migration workflow also creates a debug notebook and logs for debugging purposes.
- Added workflow linter for spark python tasks (#1810). A linter for Spark Python workflow tasks has been implemented, ensuring such tasks are linted properly while tasks that are not yet supported do not raise errors. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file: if the task is not a Spark Python task, an empty list is returned; if it is, the entrypoint is logged and the notebook is registered. Two new tests demonstrate the linter: `test_job_spark_python_task_linter_happy_path` checks a valid job configuration where all required libraries are specified, while `test_job_spark_python_task_linter_unhappy_path` checks an invalid configuration where required libraries are missing.
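A hedged sketch of the described control flow as a standalone helper; the registration step itself is elided, and only the `databricks-sdk` task types are real:

```python
import logging
from databricks.sdk.service import jobs

logger = logging.getLogger(__name__)

def spark_python_entrypoint(task: jobs.Task) -> list[str]:
    """Mirror of the described check: return an empty list for non-Spark-Python
    tasks; otherwise log the entrypoint and hand it back for registration."""
    if not task.spark_python_task:
        return []
    logger.info(f"Registering spark_python_task entrypoint: {task.spark_python_task.python_file}")
    return [task.spark_python_task.python_file]
```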
- Connect all linters to `LinterContext` and add functional testing framework (#1811). This commit connects all linters, including those related to the JVM, to the critical linting path, and introduces a functional testing framework that simplifies writing code-linting verification tests. The `pyproject.toml` file has been updated with a new `ignore-paths` configuration that uses a regular expression to exclude certain files or directories from linting. The framework is particularly useful for verifying that linters behave correctly, reducing the risk of errors and making it easier to write and maintain high-quality linting rules.
- Deduplicate errors emitted by Spark Connect linter (#1824). This pull request introduces error deduplication for the Spark Connect linter and adds new functional tests using the updated framework, along with user documentation and unit tests. The `verify` method in the `test_functional.py` file now sorts the actual problems list before comparing it to the expected problems list, ensuring a consistent ordering of results. Testing has been conducted manually and through the new unit tests. These changes improve the usability of the Spark Connect linter's output.
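The dedup-plus-ordering idea is simple enough to sketch; the tuple shape of a problem here is an assumption, not the linter's actual advice type:

```python
def normalize(problems: list[tuple[str, str, int]]) -> list[tuple[str, str, int]]:
    # drop duplicate advices and impose a stable order for test comparisons
    return sorted(set(problems))

advices = [
    ("spark-logging-in-shared-clusters", "Cannot access Spark Driver JVM logger", 4),
    ("spark-logging-in-shared-clusters", "Cannot access Spark Driver JVM logger", 4),
]
assert normalize(advices) == advices[:1]
```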
- Download wheel dependency locally to register it to the dependency graph (#1704). Dependency management for wheel files has been enhanced. Previously, when a library was of type wheel, a `not-yet-implemented` `DependencyProblem` would be yielded. Now, the system downloads the wheel file from its remote location, saves it to a temporary directory, and registers the local file to the dependency graph, so wheel dependencies are handled fully instead of merely being flagged as unimplemented. New helper functions for creating jobs, making notebooks, and generating random values enable more comprehensive testing of the workflow linter, and new tests check the linter's behavior when a library dependency is missing and verify that wheel dependencies are handled correctly. A new test method, `test_workflow_task_container_builds_dependency_graph_for_python_wheel`, ensures the dependency graph is built correctly for Python wheels and improves test coverage.
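A hedged sketch of the download-then-register flow; the `register_library` call and the use of a plain URL are illustrative assumptions (in practice the wheel may come from DBFS or a workspace path):

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

def register_remote_wheel(graph, url: str) -> Path:
    tmp_dir = Path(tempfile.mkdtemp(prefix="ucx-whl-"))
    local_wheel = tmp_dir / url.rsplit("/", maxsplit=1)[-1]
    urlretrieve(url, local_wheel)             # fetch the wheel into a temporary directory
    graph.register_library(str(local_wheel))  # hypothetical DependencyGraph API
    return local_wheel
```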
- Drop pyspark `register` lint matcher (#1818). The `register` lint matcher has been removed from the pyspark linting rules, as the specific usage pattern for the `register` method on `UDTFRegistration` no longer needs to be matched. This change affects linting output during code reviews but does not impact code functionality. The matchers for `DataFrame`, `DataFrameReader`, `DataFrameWriter`, and direct filesystem access remain unchanged. This update improves the quality and consistency of pyspark linting by removing an outdated matcher.
- Enabled joining an existing installation to a collection (#1799). This change introduces several new features and modifications aimed at improving the management and organization of workspaces within a collection. A new command, `join-collection`, allows a workspace to join a collection using its workspace ID. The `report-account-compatibility` command gains a new `--workspace-ids` flag, and the `alias` command has an updated description. Two new commands, `principal-prefix-access` and `create-missing-principals`, have been introduced for AWS, and a new `create-uber-principal` command has been introduced for Azure to create service principals with STORAGE BLOB READER access to the storage accounts used by tables in the workspace. Readability and maintainability have been improved by renaming `_can_administer` to `can_administer` and `_load_workspace_info` to `load_workspace_info` in the `workspaces.py` file. A new `join_collection` command has been registered on the `ucx` application instance, and `install.py` and `test_installation.py` have been modified to support integrating existing installations into a collection, with updated tests covering the joining process in various scenarios. Overall, these changes provide more flexibility for users and improve the interoperability of the system.
- Fixed `migrate-credentials` CLI command on AWS (#1732). In this release, the `migrate-credentials` CLI command for AWS has been improved and fixed, with changes to the `access.py` file in the `databricks/labs/ucx/aws` directory. Notable updates include refactoring the `role_name` method into an `AWSCredentialCandidate` dataclass, adding the `_aws_role_trust_doc` method, and removing the `_databricks_trust_statement` method. The `_aws_s3_policy` method now includes `s3:PutObjectAcl` in the allowed actions, and the `_create_role` and `_get_role_access_task` methods use `arn` instead of `role_name`. Additionally, the `create_uc_role` and `update_uc_trust_role` methods have been combined into a single `update_uc_role` method. The `migrate-credentials` command in the `cli.py` file now also supports migrating AWS instance profiles to UC storage credentials. These improvements resolve issue #1726.
- Fixed crasher when running migrate-local-code (#1794). This release addresses a crash that occurred when running the `migrate-local-code` command. The `local_file_migrator` property in the `LocalCheckoutContext` class now passes a lambda function instead of `self.languages` directly, so the languages are loaded only when the property is actually used, preventing unnecessary loading and potential crashes. The change introduces no new functionality; manual testing and unit tests confirm the fix works as expected without affecting other parts of the system.
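A simplified illustration of the lazy-loading pattern described above; the class shapes and constructor parameter are assumptions, not the exact UCX code:

```python
class LocalFileMigrator:
    def __init__(self, languages_factory):
        # store the factory; nothing heavy happens at construction time
        self._languages_factory = languages_factory

    def migrate(self, path: str) -> None:
        languages = self._languages_factory()  # loaded on first use, not eagerly
        ...


class LocalCheckoutContext:
    @property
    def languages(self):
        ...  # expensive to build; may raise if the environment is incomplete

    @property
    def local_file_migrator(self) -> LocalFileMigrator:
        # passing a lambda instead of self.languages defers (potentially failing) initialization
        return LocalFileMigrator(lambda: self.languages)
```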
- Fixed inconsistent behavior in `%pip` cell handling (#1785). This PR addresses inconsistent behavior in `%pip` cell handling by installing Python libraries into a designated path-lookup directory rather than deep within the library tree. The `PipResolver` class no longer requires a `FileLoader` instance as an argument and now takes a `Whitelist` instance directly. Tests such as `test_detect_s3fs_import` and `test_detect_s3fs_import_in_dependencies` have been updated accordingly. Overall, these changes streamline the `%pip` feature, improving the efficiency and consistency of library installation.
- Fixed issue when creating view using `WITH` clause (#1809). This release fixes an issue where creating a view with a `WITH` clause could produce errors or incorrect results due to improper handling of aliases. A new method, `_read_aliases`, reads and stores the aliases from the `WITH` clause as a set; during view dependency analysis, an old table whose name matches an alias is now skipped, preventing double-counting. The commit also adjusts import statements, adds unit tests, and introduces a `TableView` class in the `databricks.labs.ucx.hive_metastore.view_migrate` module to test that a view referencing a local dataset is skipped, along with a test for migrating a view with columns. The fix resolves issue #1798.
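UCX relies on `sqlglot` for SQL analysis (see the dependency update below), so alias collection can be sketched along these lines; `read_aliases` is an illustrative stand-in for the private `_read_aliases`:

```python
import sqlglot
from sqlglot import expressions

def read_aliases(view_ddl: str) -> set[str]:
    # collect CTE names so they are not mistaken for real table dependencies
    statement = sqlglot.parse_one(view_ddl)
    return {cte.alias_or_name for cte in statement.find_all(expressions.CTE)}

ddl = "CREATE VIEW v AS WITH t AS (SELECT 1 AS x) SELECT x FROM t"
assert read_aliases(ddl) == {"t"}  # "t" is an alias, not a dependency of the view
```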
- Fixed linting for non-UTF8 encoded files (#1804). This commit addresses linting issues for files that are not UTF-8 encoded, improving compatibility with such files in the databricks labs ucx project. Previously, the linter and fixer tools could not process non-UTF-8 encoded files and failed on them. A file-encoding check has been added during linting: the default encoding is determined via the standard `locale.getpreferredencoding(False)`, and when a file still cannot be decoded, the linter returns a failure message instead of crashing. A new test method, `test_file_linter_lints_non_ascii_encoded_file`, checks the linter's behavior on non-ASCII encoded files. The enhancement is supported by manual testing and unit tests.
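A minimal sketch of such a guard, assuming the failure is surfaced by returning `None` for the caller to report:

```python
from locale import getpreferredencoding
from pathlib import Path

def read_source(path: Path) -> str | None:
    try:
        # fall back to the platform default rather than assuming UTF-8
        return path.read_text(encoding=getpreferredencoding(False))
    except UnicodeDecodeError:
        return None  # caller reports a failure message for this file instead of crashing
```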
- Further fix for DENY permissions (#1834). This commit implements a further fix for handling `DENY` permissions in the legacy TACL migration logic. Previously, all permissions were grouped into a single `GRANT` statement; they are now split into separate `GRANT` and `DENY` statements. A new test function, `test_tacl_applier_deny_and_grant()`, demonstrates the updated logic, and the resulting SQL queries now include both `GRANT` and `DENY` statements. These changes ensure that `DENY` permissions are applied correctly and increase unit and integration test coverage, improving clarity, maintainability, and confidence in the code.
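Splitting one permission set into separate statements might look like this; a hypothetical helper mirroring the description, not the actual applier code:

```python
def grant_and_deny_sql(object_type: str, object_key: str, principal: str,
                       granted: list[str], denied: list[str]) -> list[str]:
    statements = []
    if granted:  # e.g. ["SELECT", "MODIFY"]
        statements.append(f"GRANT {', '.join(granted)} ON {object_type} {object_key} TO `{principal}`")
    if denied:   # denials can no longer ride along inside the GRANT statement
        statements.append(f"DENY {', '.join(denied)} ON {object_type} {object_key} TO `{principal}`")
    return statements
```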
- Removed false warning on DataFrame.insertInto() about the default format changing from parquet to delta (#1823). This pull request removes a false warning for `DataFrameWriter.insertInto()`, which had incorrectly flagged a potential issue due to the default format change from Parquet to Delta. The warning is no longer relevant because the operation ignores any specified format and uses the existing format of the underlying table. An unnecessary linting suppression has also been removed. These changes improve the accuracy of the warning system and eliminate confusion for users, with no impact on functionality, usability, or performance; they have been manually tested and require no new unit or integration tests, CLI commands, workflows, or tables.
- Support linting python wheel tasks (#1821). This release introduces support for linting python wheel tasks, addressing issue #1
- Updated linting checks for Spark table methods (#1816). This commit updates the linting checks for PySpark's table methods, improving the handling of migrated tables and deprecating direct filesystem references in favor of the Unity Catalog. New tests and examples cover literal and variable references to known and unknown tables, as well as calls with extra or out-of-position arguments, and guard against false positives from trivial references in unrelated contexts. These changes help ensure proper usage of Spark table methods, improve codebase consistency, and minimize potential issues related to migrations and format changes.
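For illustration, the kinds of call sites these checks distinguish (hypothetical table names; the comments describe the expected linter behavior):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.table("old_db.old_table")       # literal reference to a known table: advised/rewritten
name = "old_db.old_table"
spark.table(name)                     # variable reference: flagged for manual review
spark.table("some_db.unknown_table")  # unknown table: left alone
spark.range(10).count()               # unrelated method: must not trigger a false positive
```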
Dependency updates:
- Updated sqlglot requirement from <24.1,>=23.9 to >=23.9,<24.2 (#1819).
Contributors: @nfx, @asnare, @JCZuurmond, @ericvergnaud, @nkvuong, @HariGS-DB, @vsevolodstep-db, @FastLee, @pritishpai, @dependabot[bot]