
Releases: databrickslabs/ucx

v0.34.0

30 Aug 14:26 · @nfx · 622ed83

  • Added a check for No isolation shared clusters and MLR (#2484). This commit introduces a check for No isolation shared clusters running the Machine Learning Runtime (MLR) as part of the assessment workflow and cluster crawler, addressing issue #846. A new function, is_mlr, determines whether the Spark version corresponds to an MLR cluster; if a cluster has no isolation and uses MLR, an appropriate error message is appended to the assessment failure list. The change has been verified with unit tests, including a new test covering MLR clusters without isolation, and by manual testing; no user documentation or new CLI commands, workflows, or tables have been added. A minimal sketch of the check appears after this list.
  • Added a section in migration dashboard to list the failed tables, etc (#2406). In this release, we have introduced a new logging message format for failed table migrations in the TableMigrate class, specifically impacting the _migrate_external_table, _migrate_external_table_hiveserde_in_place, _migrate_dbfs_root_table, _migrate_table_create_ctas, _migrate_table_in_mount, and _migrate_acl methods within the table_migrate.py file. This update employs the failed-to-migrate prefix in log messages for improved failure reason identification during table migrations, enhancing debugging capabilities. As part of this release, we have also developed a new SQL file, 05_1_failed_table_migration.sql, which retrieves a list of failed table migrations by extracting messages with the 'failed-to-migrate:' prefix from the inventory.logs table and returning the corresponding message text. While this release does not include new methods or user documentation, it resolves issue #1754 and has been manually tested with positive results in the staging environment, demonstrating its functionality.
  • Added clean up activities when migrate-credentials cmd fails intermittently (#2479). This pull request enhances the robustness of the migrate-credentials command for Azure in the event of intermittent failures during the creation of access connectors and storage credentials. It introduces new methods, delete_storage_credential and delete_access_connectors, which are responsible for removing incomplete resources when errors occur. The _migrate_service_principals and _create_storage_credentials_for_storage_accounts methods now handle PermissionDenied, NotFound, and BadRequest exceptions, deleting created storage credentials and access connectors if exceptions occur. Additionally, error messages have been updated to guide users in resolving issues before attempting the operation again. The PR also modifies the sp_migration fixture in the tests/unit/azure/test_credentials.py file, simplifying the deletion process for access connectors and improving the testing of the ServicePrincipalMigration class. These changes address issue #2362, ensuring clean-up activities in case of intermittent failures and improving the overall reliability of the system. The clean-up pattern is sketched after this list.
  • Added standalone migrate ACLs (#2284). A new migrate-acls command has been introduced to facilitate the migration of Access Control Lists (ACLs) from a legacy metastore to a Unity Catalog (UC) metastore. The command, designed to work with HMS federation and other table migration scenarios, can be executed with optional flags target-catalog and hms-fed to specify the target catalog and migrate HMS-FED ACLs, respectively. The release also includes modifications to the labs.yml file, adding the new command and its details to the commands section. In addition, a new ACLMigrator class has been added to the databricks.labs.ucx.contexts.application module to handle ACL migration for tables in a standalone manner. A new test file, test_migrate_acls.py, contains unit tests for ACL migration in a Hive metastore, covering various scenarios and ensuring proper query generation. These features streamline and improve the functionality of ACL migration, offering better access control management for users.
  • Appends metastore_id or location_name to roles for uniqueness (#2471). A new method, _generate_role_name, has been added to the Access class in the aws/access.py file of the databricks/labs/ucx module to generate unique names for AWS roles using a consistent naming convention. The list_uc_roles method has been updated to utilize this new method for creating role names. In response to issue #2336, the create_missing_principals change enforces role uniqueness on AWS by modifying the ExternalLocation table to include metastore_id or location_name for uniqueness. To ensure proper cleanup, the create_uber_principal method has been updated to delete the instance profile if creating the cluster policy fails due to a PermissionError. Unit tests have been added to verify these changes, including tests for the new role name generation method and the updated ExternalLocation table. The MetastoreAssignment class is also imported in this diff, although its usage is not immediately clear. These changes aim to improve the creation of unique AWS roles for Databricks Labs UCX and enforce role uniqueness on AWS.
  • Cache workspace content (#2497). In this release, we have implemented a caching mechanism for workspace content to improve load times and bypass rate limits. The WorkspaceCache class handles caching of workspace content, with the _CachedIO and _PathLruCache classes managing IO operation caching and LRU caching, respectively. The _CachedPath class, a subclass of WorkspacePath, handles caching of workspace paths. The open and unlink methods of _CachedPath have been overridden to cache results and remove corresponding cache entries. The guess_encoding function is used to determine the encoding of downloaded content. Unit tests have been added to ensure the proper functioning of the caching mechanism, including tests for cache reuse, invalidation, and encoding determination. This feature aims to enhance the performance of file operations, making the overall system more efficient for users. A condensed sketch of the caching idea appears after this list.
  • Changes the security mode for assessment cluster (#2472). In this release, the security mode of the main assessment cluster has been updated from LEGACY_SINGLE_USER to LEGACY_SINGLE_USER_STANDARD in the workflows.py file. This change disables passthrough and addresses issue #1717. The new data security mode is set via the data_security_mode attribute of the compute.ClusterSpec object for the main job cluster. While no new methods have been introduced, existing functionality related to the cluster's security mode has been modified. Software engineers adopting this project should be aware of the security implications of this change and ensure the appropriate data protection measures are in place. Manual testing has been conducted to verify this update. The one-field change is illustrated after this list.
  • Do not normalize cases when reformatting SQL queries in CI check (#2495). In this release, the CI workflow for pushing changes to the repository has been updated to improve the behavior of the SQL query reformatting step. Previously, case normalization of SQL queries was causing issues with case-sensitive columns, resulting in blocked CI checks. This release addresses the issue by adding the --normalize-case false flag to the databricks labs lsql fmt command, which disables case normalization. This modification allows the CI workflow to pass and ensures correct SQL query formatting, regardless of case sensitivity. The change impacts the assessment/interactive directory, specifically a cluster summary query for interactive assessments. This query involves a change in the ORDER BY clause, replacing a normalized case with the original case. Despite these changes, no new methods have been added, and existing functionality has been modified solely to improve CI efficiency and SQL query compatibility.
  • Drop source table after successful table move not before (#2430). In this release, we have addressed an issue where the source table was being dropped before a new table was created, which could cause the creation process to fail and leave the source table unavailable. This problem has been resolved by modifying the _recreate_table method of the TableMove class in the hive_metastore package to drop the source table after the new table creation (the ordering is sketched after this list). The updated implementation ensures that the source table remains intact during the creation process, even in case of any issues. This change comes with integration tests and does not involve any modifications to user documentation, CLI commands, workflows, tables, or existing functionality. Additionally, a new test function test_move_tables_table_properties_mismatch_preserves_original has been added to test_table_move.py, which checks if the original table is preserved when there is a mismatch in table properties during the move operation. The changes also include adding the pytest library and the BadRequest exception from the databricks.sdk.errors package for the new test function. The imports section has been updated accordingly with...
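
A minimal sketch of the #2484 check, assuming that ML Runtime version strings embed an "-ml-" marker (for example 13.3.x-ml-scala2.12) and that No isolation shared clusters report DataSecurityMode.NONE; only the is_mlr name comes from the note, the rest is illustrative:

```python
from databricks.sdk.service.compute import ClusterDetails, DataSecurityMode


def is_mlr(spark_version: str | None) -> bool:
    """True when the Spark version string denotes a Machine Learning Runtime."""
    return spark_version is not None and "-ml-" in spark_version


def check_no_isolation_mlr(cluster: ClusterDetails, failures: list[str]) -> None:
    # "No isolation shared" clusters report DataSecurityMode.NONE
    if cluster.data_security_mode == DataSecurityMode.NONE and is_mlr(cluster.spark_version):
        failures.append("Unsupported cluster type for UC: no-isolation shared cluster on MLR")
```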
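
A sketch of the #2479 clean-up pattern; the exception types come from databricks-sdk and the delete_* method names from the note above, while the surrounding function shape is an assumption:

```python
from databricks.sdk.errors import BadRequest, NotFound, PermissionDenied


def create_storage_credentials_safely(migration, service_principals: list) -> list:
    created = []
    try:
        for sp in service_principals:
            created.append(migration.create_storage_credential(sp))
    except (PermissionDenied, NotFound, BadRequest) as err:
        # Roll back the partially created resources before surfacing the error.
        for credential in created:
            migration.delete_storage_credential(credential)
        migration.delete_access_connectors()
        msg = f"Resolve the underlying issue, then re-run migrate-credentials: {err}"
        raise RuntimeError(msg) from err
    return created
```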
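
The #2497 cache, condensed to its core idea: an LRU keyed by workspace path. The real _PathLruCache/_CachedPath split is more involved; class and method names here are illustrative:

```python
from collections import OrderedDict

from databricks.sdk import WorkspaceClient


class SimpleWorkspaceCache:
    def __init__(self, ws: WorkspaceClient, max_entries: int = 128):
        self._ws = ws
        self._cache: OrderedDict[str, bytes] = OrderedDict()
        self._max_entries = max_entries

    def read(self, path: str) -> bytes:
        if path in self._cache:
            self._cache.move_to_end(path)  # refresh the LRU position
            return self._cache[path]
        with self._ws.workspace.download(path) as f:
            data = f.read()
        self._cache[path] = data
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data

    def invalidate(self, path: str) -> None:
        # Mirrors _CachedPath.unlink: drop the now-stale entry.
        self._cache.pop(path, None)
```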
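
The #2472 change amounts to a single field on the job cluster spec, shown here with databricks-sdk types; the runtime and node type are placeholders:

```python
from databricks.sdk.service import compute

cluster_spec = compute.ClusterSpec(
    spark_version="15.4.x-scala2.12",  # placeholder runtime
    node_type_id="i3.xlarge",          # placeholder node type
    num_workers=1,
    # Previously LEGACY_SINGLE_USER; the _STANDARD variant disables passthrough.
    data_security_mode=compute.DataSecurityMode.LEGACY_SINGLE_USER_STANDARD,
)
```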
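
The #2430 ordering fix, reduced to its essence; sql_backend.execute and the function shape are assumptions, only the create-before-drop order comes from the note:

```python
def recreate_table(sql_backend, src_table: str, create_dst_ddl: str) -> None:
    # Create the destination first, so a failure here leaves the source intact.
    sql_backend.execute(create_dst_ddl)
    # Drop the source only once the new table definitely exists.
    sql_backend.execute(f"DROP TABLE IF EXISTS {src_table}")
```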

v0.33.0

15 Aug 11:33 · @nfx · a583899

  • Added validate-table-locations command for checking overlapping tables across workspaces (#2341). A new command, validate-table-locations, has been added to check for overlapping table locations across workspaces before migrating tables. This command is intended to ensure that tables can be migrated across workspaces without issues. The new command is part of the table migration workflows and uses a LocationTrie data structure to efficiently search for overlapping table locations. If any overlaps are found, the command logs a warning message and adds the conflicting tables to a list of all conflicts, which is returned at the end of the command. The command includes a workspace-ids flag, which allows users to specify a list of workspace IDs to include in the validation; if this flag is not provided, the command includes all workspaces present in the account. This new command resolves issue #673. The validate_table_locations method is added to the AccountAggregate class and the ExternalLocations class has been updated to use the new LocationTrie class. The import section has also been updated to include new modules such as LocationTrie and Table from databricks.labs.ucx.hive_metastore.locations and databricks.labs.ucx.hive_metastore.tables respectively. Additionally, test cases have been added to ensure the correct functioning of the LocationTrie class. A minimal trie sketch appears after this list.
  • Added references to hive_metastore catalog in all table references an… (#2419). In this release, we have updated various methods and functions across multiple files to include explicit references to the hive_metastore catalog in table references. This change aims to improve the accuracy and consistency of table references in the codebase, enhancing reliability and maintainability. Affected files include azure.py, init_scripts.py, pipelines.py, and others in the databricks/labs/ucx/assessment module, as well as test files in the tests/unit/assessment and tests/unit/azure directories. The _try_fetch method has been updated to include the catalog name in table references in all instances, ensuring the correct catalog is referenced in all queries. Additionally, various test functions in affected files have been updated to reference the hive_metastore catalog in SQL queries. This update is part of the resolution of issue #2207 and promotes robust handling of catalog, schema, and table naming scenarios in hive metastore migration status management.
  • Added support for skipping views when migrating tables and views (#2343). In this release, we've added support for skipping both tables and views during the migration process in the databricks labs ucx command, addressing issue #1937. The skip command has been enhanced to support skipping views, and new functions skip_table_or_view and load_one have been introduced to the Table class. Appropriate error handling and tests, including unit tests and integration tests, have been implemented to ensure the functionality works as expected. With these changes, users can now skip views during migration and have more flexibility when working with tables in the Unity Catalog.
  • Avoid false positives when linting for pyspark patterns (#2381). This release includes enhancements to the PySpark linter aimed at reducing false positives during linting. The linter has been updated to check the originating module when detecting PySpark calls, ensuring that warnings are triggered only for relevant nodes from the pyspark or dbutils modules. Specifically, the ReturnValueMatcher and DirectFilesystemAccessMatcher classes have been modified to include this new check. These changes improve the overall accuracy of the PySpark linter, ensuring that only pertinent warnings are surfaced during linting. Additionally, the commit includes updated unit tests to verify the correct behavior of the modified linter. Specific improvements have been made to avoid false positives when detecting the listTables function in the PySpark catalog, ensuring that the warning is only triggered for the actual PySpark listTables method call.
  • Bug: Generate custom warning when doing table size check and encountering DELTA_INVALID_FORMAT exception (#2426). A modification has been implemented in the _safe_get_table_size method within the table_size.py file of the hive_metastore package. This change addresses an issue (#1913) concerning the occurrence of a DELTA_INVALID_FORMAT exception while determining the size of a Delta table. Instead of raising an error, the exception is now converted into a warning, and the function proceeds to process the rest of the tables. A corresponding warning message has been added to inform users about the issue and suggest checking the table structure. No new methods have been introduced, and existing functionality has been updated to handle this specific exception more gracefully. The changes have been thoroughly tested with unit tests for the table size check when encountering a DELTA_INVALID_FORMAT error, employing a mock backend and a mock Spark session to simulate the error conversion. This change does not affect user documentation, CLI commands, workflows, or tables, and is solely intended for software engineers adopting the project. A sketch of the softened failure mode appears after this list.
  • Clean up left over uber principal resources for Azure (#2370). This commit includes modifications to the Azure access module of the UCX project to clean up resources if the creation of the uber principal fails midway. It addresses issues #2360 (Azure part) and #2363, and modifies the command databricks labs ucx create-uber-principal to include this functionality. The changes include adding new methods and modifying existing ones for working with Azure resources, such as StorageAccount, AccessConnector, and AzureRoleAssignment. Additionally, new unit and integration tests have been added and manually tested to ensure that the changes work as intended. The commit also includes new fixtures for testing storage accounts and access connectors, and a test case for getting, applying, and deleting storage permissions. The azure_api_client function has been updated to handle different input argument lengths and methods such as "get", "put", and "post". A new managed identity, "appIduser1", has been added to the Azure mappings file, and the corresponding role assignments have been updated. The changes include error handling mechanisms for certain scenarios that may arise during the creation of the uber service principal.
  • Crawlers: Use TRUNCATE TABLE instead of DELETE FROM when resetting crawler tables (#2392). In this release, the .reset() method for crawlers has been updated to use TRUNCATE TABLE instead of DELETE FROM when clearing out crawler tables, resulting in more efficient and idiomatic code. This change affects the existing migrate-data-reconciliation workflow and is accompanied by updated unit and integration tests to ensure correct functionality. The reset() method now accepts a table name argument, which is passed to the newly introduced escape_sql_identifier() utility function from the databricks.labs.ucx.framework.utils module for added safety. The migration status is now refreshed using the TRUNCATE TABLE command, which removes all records from the table in a single metadata operation, providing improved performance compared to the previous implementation. The SHOW DATABASES and TRUNCATE TABLE queries are validated in the refresh_migration_status workflow test, which now checks that the TRUNCATE TABLE query is used instead of DELETE FROM when resetting crawler tables. The reset is shown in miniature after this list.
  • Detect tables that are not present in the mapping file (#2205). In this release, we have introduced a new method get_remaining_tables() that returns a list of tables in the Hive metastore that have not been processed by the migration tool. This method performs a full refresh of the index and checks each table in the Hive metastore against the index to determine if it has been migrated. We have also added a new private method _is_migrated() to check if a given table has already been migrated. Additionally, we have replaced the refresh_migration_status method with update_migration_status in several workflows to present a more accurate representation of the migration process in the dashboard. A new SQL script, 04_1_remaining_hms_tables.sql, has been added to list the remaining tables in Hive Metastore which are not present in the mapping file. We have also added a new test for the table migration job that verifies that tables not present in the mapping file are detected and reported. A new test function test_refresh_migration_status_published_remained_tables has been added to ensure that the migration process correctly handles the case where tables have been published to the target metadata store but still remain in the source metadata store. These changes are intended to improve the functionality of the migration tool for Hive metastore tables and resolve issue #1221. A sketch of the check appears after this list.
  • Fixed ConcurrentDeleteReadExcepti...
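
A minimal trie over storage-location path segments, in the spirit of the #2341 LocationTrie; the real class tracks Table objects per node and aggregates conflicts across workspaces:

```python
class LocationTrie:
    """Minimal trie over storage-location path segments."""

    def __init__(self):
        self._children: dict[str, "LocationTrie"] = {}
        self._tables: list[str] = []  # fully-qualified names of tables rooted here

    @staticmethod
    def _parts(location: str) -> list[str]:
        # "abfss://container@account/path/x" -> ["abfss:", "container@account", "path", "x"]
        return [p for p in location.rstrip("/").split("/") if p]

    def insert(self, location: str, table: str) -> None:
        node = self
        for part in self._parts(location):
            node = node._children.setdefault(part, LocationTrie())
        node._tables.append(table)

    def find_overlaps(self, location: str) -> list[str]:
        # Any table registered at a prefix of this location overlaps with it.
        node, overlaps = self, []
        for part in self._parts(location):
            if part not in node._children:
                return overlaps
            node = node._children[part]
            overlaps.extend(node._tables)
        return overlaps
```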
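
A sketch of the #2426 behavior; using DESCRIBE DETAIL for the size and matching the error text are assumptions about how the condition surfaces:

```python
import logging

logger = logging.getLogger(__name__)


def safe_get_table_size(spark, full_name: str) -> int | None:
    try:
        # DESCRIBE DETAIL reports sizeInBytes for Delta tables.
        return spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0]["sizeInBytes"]
    except Exception as err:  # the real code matches the specific Spark exception
        if "DELTA_INVALID_FORMAT" in str(err):
            logger.warning(f"Unable to compute size of {full_name}; check the table structure")
            return None
        raise
```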
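
The #2392 reset in miniature; escape_sql_identifier is the utility named above, while the surrounding function shape is assumed:

```python
from databricks.labs.ucx.framework.utils import escape_sql_identifier


def reset(sql_backend, full_table_name: str) -> None:
    # TRUNCATE TABLE clears the table in one metadata operation, which is
    # cheaper and more idiomatic than a row-by-row DELETE FROM.
    sql_backend.execute(f"TRUNCATE TABLE {escape_sql_identifier(full_table_name)}")
```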
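
A sketch of the #2205 check; refresh, snapshot and is_migrated mirror the described behavior, though their exact signatures are assumptions:

```python
def get_remaining_tables(tables_crawler, migration_index) -> list:
    migration_index.refresh()  # full refresh, so the check is not stale
    remaining = []
    for table in tables_crawler.snapshot():
        if not migration_index.is_migrated(table.database, table.name):
            remaining.append(table)  # present in HMS, absent from the mapping/index
    return remaining
```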

v0.32.0

02 Aug 16:58 · @nfx · 642ecec

  • Added troubleshooting guide for self-signed SSL cert related error (#2346). In this release, we have added a troubleshooting guide to the README file to address a specific error that may occur when connecting from a local machine to a Databricks Account and Workspace using a web proxy and self-signed SSL certificate. This error, SSLCertVerificationError, can prevent UCX from connecting to the Account and Workspace. To resolve this issue, users can now set the REQUESTS_CA_BUNDLE and CURL_CA_BUNDLE environment variables to force the requests library to set verify=False, and set the SSL_CERT_DIR env var pointing to the proxy CA cert for the urllib3 library. This guide will help users understand and resolve this error, making it easier to connect to Databricks Accounts and Workspaces using a web proxy and self-signed SSL certificate.
  • Code Compatibility Dashboard: Fix broken links (#2347). In this release, we have addressed and resolved two issues in the Code Compatibility Dashboard of the UCX Migration (Main) project, enhancing its overall usability. Previously, the Markdown panel contained a broken link to the workflow due to an incorrect anchor, and the links in the table widget to the workflow and task definitions did not render correctly. These problems have been rectified, and the dashboard has been manually tested and verified in a staging environment. Additionally, we have updated the invisibleColumns section in the SQL file by changing the fieldName attribute to 'name', which will now display the workflow_id as a link. Before and after screenshots have been provided for visual reference. The corresponding workflow is now referred to as "Jobs Static Code Analysis Workflow".
  • Filter out missing import problems for imports within a try-except clause with ImportError (#2332). This release introduces changes to handle missing import problems within a try-except clause that catches ImportError. A new method, _filter_import_problem_in_try_except, has been added to filter out import-not-found issues when they occur in such a clause, preventing unnecessary build failures. The _register_import method now returns an Iterable[DependencyProblem] instead of yielding problems directly. Supporting classes and methods, including Dependency, DependencyGraph, and DependencyProblem from the databricks.labs.ucx.source_code.graph module, as well as FileLoader and PythonCodeAnalyzer from the databricks.labs.ucx.source_code.notebooks.cells module, have been added. The ImportSource.extract_from_tree method has been updated to accept a DependencyProblem object as an argument. Additionally, a new test case has been included for the scenario where a failing import in a try-except clause goes unreported. Issue #1705 has been resolved, and unit tests have been added to ensure proper functionality. The now-tolerated pattern is shown after this list.
  • Fixed report-account-compatibility cli command docstring (#2340). In this release, we have updated the report-account-compatibility CLI command's docstring to accurately reflect its functionality, addressing a previous issue where it inadvertently duplicated the sync-workspace-info command's description. This command now provides a clear and concise explanation of its purpose: "Report compatibility of all workspaces available in the account." Upon execution, it generates a readiness report for the account, specifically focusing on workspaces where ucx is installed. This enhancement improves the clarity of the CLI's functionality for software engineers, enabling them to understand and effectively utilize the report-account-compatibility command.
  • Fixed broken table migration workflow links in README (#2286). In this release, we have made significant improvements to the README file of our open-source library, including fixing broken links and adding a mermaid flowchart to demonstrate the table migration workflows. The table migration workflow has been renamed to the table migration process, which includes migrating Delta tables, non-Delta tables, external tables, and views. Two optional workflows have been added for migrating HiveSerDe tables in place and for migrating external tables using CTAS. Additionally, the commands related to table migration have been updated, with the table migration workflow being renamed to the table migration process. These changes are aimed at providing a more comprehensive understanding of the table migration process and enhancing the overall user experience.
  • Fixed dashboard queries fail when default catalog is not hive_metastore (#2278). In this release, we have addressed an issue where dashboard queries fail when the default catalog is not set to hive_metastore. This has been achieved by modifying the existing databricks labs ucx install command to always include the hive_metastore namespace in dashboard queries. Additionally, the code has been updated to add the hive_metastore namespace to the DashboardMetadata object used in creating a dashboard from SQL queries in a folder, ensuring queries are executed in the correct database. The commit also includes modifications to the test_install.py unit test file to ensure the installation process correctly handles specific configurations related to the ucx namespace for managing data storage and retrieval. The changes have been manually tested and verified on a staging environment.
  • Improve group migration error reporting (#2344). This PR introduces enhancements to the group migration dashboard, focusing on improved error reporting and a more informative user experience. The documentation widgets have been fine-tuned, and the failed-migration widget now provides formatted failure information with a link to the failed job run. The dashboard will display only failures from the latest workflow run, complete with logs. A new link to the job list has been added in the workflows section of the documentation to assist users in identifying and troubleshooting issues. Additionally, the SQL query for retrieving group migration failure information has been refactored, improving readability and extracting relevant data using regular expressions. The changes have been tested and verified on the staging environment, providing clearer and more actionable insights during group migrations. The PR is related to previous work in #2333 and #1914, with updates to the UCX Migration (Groups) dashboard, but no new methods have been added.
  • Improve type checking in cli command (#2335). This release introduces enhanced type checking in the command line interface (CLI) of our open-source library, specifically in the lint_local_code function of the cli.py file. By utilizing a newly developed local code linter object, the function now performs more rigorous and accurate type checking for potential issues in the local code. While the functionality remains consistent, this improvement is expected to prevent similar occurrences like issue #2221, ensuring more robust and reliable code. This change underscores our commitment to delivering a high-quality, efficient, and developer-friendly library.
  • Lint dependencies in context (#2236). The InheritedContext class has been introduced to gather code fragments from parent files or notebooks during linting of child files or notebooks, addressing issues #2155, #2156, and #2221. This new feature includes the addition of the InheritedContext class, with methods for building instances from a route of dependencies, appending other InheritedContext instances, and finalizing them for use with linting. The DependencyGraph class has been updated to support the new functionality, and various classes, methods, and functions for handling the linter context have been added or updated. Unit, functional, and integration tests have been added to ensure the correct functioning of the changes, which improve the linting functionality by allowing it to consider the broader context of the codebase.
  • Make ucx pylsp plugin configurable (#2280). This commit introduces the ability to configure the ucx pylsp plugin with cluster information, which can be provided either in a file or by a client and is managed by the pylsp infrastructure. The Spark Connect linter is now only applied to UC Shared clusters, as Single-User clusters run in Spark Classic mode. A new entry point pylsp_ucx has been added to the pylsp configuration file. The changes affect the pylsp plugin configuration and the application of the Spark Connect linter. Unit tests and manual testing have been conducted, but integration tests and verification on a staging environment are not included in this release.
  • New dashboard: group migration, showing groups that failed to migrate (#2333). In this release, we have developed a new dashboard for monitoring group migration in the UCX Migration (Groups) workspace. This dashboard includes a widget displaying messages related to groups that failed to migrate during the migrate-groups-experimental workflow, aiding users in identifying and addressing ...
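
The user-code pattern that #2332 teaches the linter to tolerate: a guarded optional import (the module name below is invented) is no longer reported as import-not-found:

```python
try:
    import fancy_optional_lib  # hypothetical module, may be absent in some workspaces
except ImportError:
    fancy_optional_lib = None


def feature_enabled() -> bool:
    return fancy_optional_lib is not None
```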

v0.31.0

30 Jul 13:28 · @nfx · 436e300

  • Added handling for corrupted dashboard state in installation process (#2262). This commit introduces a new method, _handle_existing_dashboard, to manage various scenarios related to an existing dashboard during the installation process of UCX, including updating, upgrading from Redash to Lakeview, handling trashed dashboards, and recovering corrupted dashboard references. The _create_dashboard method now requires a non-null parent_path and has updated its docstring. New unit and integration tests have been added to verify the changes, with a particular focus on handling corrupted dashboard state during the installation process.
  • Added support for migrating Table ACL for SQL Warehouse cluster in AWS using Instance Profile and Azure using SPN (#2258). This pull request introduces support for migrating Table Access Control (ACL) for SQL Warehouse clusters in Amazon Web Services (AWS) using Instance Profiles and Azure using Service Principal Names (SPNs). It includes identifying SQL Warehouse instances, associated Instance Profiles/SPNs, and principals with access to the warehouse, as well as retrieving their permissions and external locations. The code has been enhanced to represent compute and associated permissions more flexibly, improving the project's handling of both clusters and warehouses. New test cases and a retry mechanism for transient errors have been implemented to ensure robustness. The AzureServicePrincipalCrawler class has been added for managing SQL Warehouses using SPNs in Azure and Instance Profiles in AWS. These changes resolve issue #2238 and enhance the project's ability to handle ACL migration for SQL Warehouse clusters in AWS and Azure.
  • Consistently have db-temp- as the backup prefix for renaming groups (#2266). In this release, we are implementing a consistent backup prefix for renaming workspace groups during migration, ensuring codebase and documentation consistency. The db-temp- prefix is now used for renaming groups, replacing the previous ucx-renamed- prefix in the rename_workspace_local_groups task. This change mitigates potential conflicts with account-level groups of the same name. The group migration workflow remains unaltered, except for the rename_workspace_local_groups task that now uses the new prefix. Affected files include 'config.py', 'groups.py', and 'test_groups.py', with corresponding changes in the WorkspaceConfig class, Groups class, and test cases. This feature is experimental, subject to further development, and may change in the future.
  • Fixed astroid in the upload_dependencies (#2267). In this update, we have added the astroid library as a dependent library for UCX in the upload_wheel_dependencies function to resolve the reported issue #2257. Previously, the absence of astroid from the dependencies caused problems in certain workspaces. To address this, we modified the _upload_wheel function to include astroid in the list of libraries uploaded as wheel dependencies. This change has been manually tested and confirmed to work in a blocked workspace. No new methods have been added, and existing functionality has been updated within the _upload_wheel function to include astroid in the uploaded dependencies.
  • Group migration: improve robustness when renaming groups (#2263). This pull request introduces changes to the group migration functionality to improve its robustness when renaming groups. Instead of assuming that a group rename has taken effect immediately after renaming it, the code now double-checks to ensure that the rename has taken place. This change affects the migrate-groups and migrate-groups-experimental workflows, which have been modified accordingly. Additionally, unit tests and existing integration tests have been updated to account for these changes. The test_rename_groups_should_patch_eligible_groups and test_rename_groups_should_wait_for_renames_to_complete tests have been updated to include a mock sleep function, allowing for more thorough testing of the rename process. The list and get methods of the workspace_client are mocked to return different values at different times, simulating the various stages of the rename process and ensuring that both failure and success cases are handled correctly. The methods _rename_group, _wait_for_group_rename, and _wait_for_renamed_groups have been added or modified to support this functionality. The _wait_for_workspace_group_deletion and _check_workspace_group_deletion methods have also been updated to support the deletion of original workspace groups, and the delete_original_workspace_groups method has been modified to use these new and updated methods for deleting groups and confirming that the deletion has taken effect. A sketch of the double-check appears after this list.
  • Install state misses dashboards fields (#2275). In this release, we have resolved a bug related to the installation process in the databricks/labs/blueprint project that resulted in the omission of the dashboards field from the installation state. This bug was introduced in a previous update (#2229) which parallelized the installation process. This commit addresses the issue by saving the installation state at the end of the WorkspaceInstallation.run method, ensuring that the dashboards field is included in the state. Additionally, a new method _install_state.save() has been added to save the installation state. The changes also include adding a new method InstallState.from_installation() and a new test case test_installation_stores_install_state_keys() to retrieve the installation state and check for the presence of specific keys (jobs and dashboards). The test_uninstallation() test case has been updated to ensure that the installation and uninstallation processes work correctly. These changes enhance the installation and uninstallation functionality for the databricks/labs/blueprint project by ensuring that the installation state is saved correctly and that the jobs and dashboards keys are stored as expected, providing improved coverage and increased confidence in the functionality. The changes affect the existing command databricks labs install ucx.
  • Use deterministic names to create AWS external locations (#2271). In this release, we have introduced deterministic naming for AWS external locations in our open-source library, addressing issue #2270. The run method in the locations.py file has been updated to generate deterministic names for external locations using the new _generate_external_location_name method. This method generates names based on the lowercase parts of a file path, joined by underscores, instead of using a prefix and a counter. Additionally, test cases for creating external locations in AWS have been updated to use the new naming convention, improving the predictability and consistency of the external location names. These changes simplify the management and validation of external locations, making it easier for software engineers to maintain and control their names. The naming scheme is sketched after this list.
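
A sketch of the #2263 double-check: poll until the directory reflects the rename instead of trusting the PATCH call. The retried decorator is from databricks-sdk; the exception class and timeout are illustrative:

```python
import logging
from datetime import timedelta

from databricks.sdk import WorkspaceClient
from databricks.sdk.retries import retried

logger = logging.getLogger(__name__)


class GroupNotRenamed(RuntimeError):
    """The rename has not propagated yet (eventual consistency)."""


@retried(on=[GroupNotRenamed], timeout=timedelta(minutes=2))
def wait_for_group_rename(ws: WorkspaceClient, group_id: str, expected_name: str) -> None:
    group = ws.groups.get(group_id)
    if group.display_name != expected_name:
        logger.debug(f"group {group_id} still reports {group.display_name}")
        raise GroupNotRenamed(group_id)
```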
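
The #2271 naming scheme as a sketch: lowercase path parts joined by underscores; the exact trimming rules are assumptions:

```python
def _generate_external_location_name(url: str) -> str:
    # "s3://Bucket/Landing/Zone" -> "bucket_landing_zone"
    without_scheme = url.split("://", 1)[-1]
    parts = [p.lower() for p in without_scheme.strip("/").split("/") if p]
    return "_".join(parts)
```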

Contributors: @asnare, @HariGS-DB, @JCZuurmond, @pritishpai, @nfx, @FastLee

v0.30.0

26 Jul 19:03 · @nfx · 3c783f7

  • Fixed codec error in md (#2234). In this release, we have addressed a codec error in the md file that caused issues on Windows machines due to the presence of curly quotes, which have been replaced with straight quotes. The affected code pertains to the .setJobGroup pattern in the SparkContext, where spark.addTag() is used to attach a tag, and getTags() and interruptTag(tag) are used to act upon the presence or absence of a tag; these APIs are specific to Spark Connect (Shared Compute Mode) and will not work in Assigned access mode. The release also updates the README.md file with solutions for various issues related to UCX installation and configuration, improving the user experience and the compatibility and reliability of the code across operating systems. Co-authored by Cor.
  • Group manager optimisation: during group enumeration only request the attributes that are needed (#2240). In this optimization update to the groups.py file, the _list_workspace_groups function has been modified to request only the minimum set of attributes necessary during group enumeration; in particular, the members attribute is no longer requested at this stage. For each group returned by self._ws.groups.list, the function now checks if the group is out of scope and, if not, retrieves the group with all its attributes using the _get_group function. The new scan_attributes variable limits the attributes requested during the initial enumeration to "id", "displayName", and "meta". This optimization reduces the risk of timeouts caused by large attributes and improves the performance of group enumeration, which previously requested members during enumeration and could run into API issues. A sketch of the trimmed enumeration appears after this list.
  • Group migration: additional logging (#2239). In this release, we have implemented logging improvements for group migration within the group manager. These enhancements include the addition of new informational and debug logs aimed at helping to understand potential issues during group migration. The affected functionality includes the existing workflow group-migration. New logging statements have been added to numerous methods, such as rename_groups, _rename_group, _wait_for_rename, _wait_for_renamed_groups, reflect_account_groups_on_workspace, delete_original_workspace_groups, and validate_group_membership, as well as data retrieval methods including _workspace_groups_in_workspace, _account_groups_in_workspace, and _account_groups_in_account. These changes will provide increased visibility into the group migration process, including starting to rename/reflect groups, checking for renamed groups, and validating group membership.
  • Group migration: improve robustness while deleting workspace groups (#2247). This pull request introduces changes to the group manager aimed at enhancing the reliability of deleting workspace groups, addressing an issue where deletion was being skipped for groups that had recently been renamed due to eventual consistency concerns. The changes involve double-checking the deletion of groups by ensuring they can no longer be directly retrieved from the API and are no longer present in the list of groups during enumeration. Additionally, logging has been improved, and the renaming of groups will be updated in a subsequent pull request. The remove-workspace-local-backup-groups workflow and related tests have been modified, and new classes indicating incomplete deletion or rename operations have been implemented. These changes improve the robustness of deleting workspace groups, reducing the likelihood of issues arising post-deletion and enhancing overall system consistency.
  • Improve error messages in case of connection errors (#2210). In this release, we've made significant improvements to error messages for connection errors in the databricks labs ucx (un)install command, addressing part of issue #1323. The changes include the addition of a new import, RequestsConnectionError from the requests package, and updates to the error handling in the run method to provide clearer and more informative messages during connection problems. A new except block has been added to handle TimeoutError exceptions caused by RequestsConnectionError, logging a warning message with information on troubleshooting network connectivity issues. The configure method has also been updated with a docstring noting that connection errors are not handled within it. To ensure the improvements work as expected, we've added new manual and integration tests, including a test for a simulated workspace with no internet connection, and a new function to configure such a workspace; the test checks for the presence of a specific warning message in the log output. The changes also include new type annotations and imports. Software engineers adopting the project will benefit from clearer error messages and guidance when troubleshooting connection problems. A sketch of the failure path appears after this list.
  • Increase timeout for sequence of slow preliminary jobs (#2222). In this enhancement, the timeout duration for a series of slow preliminary jobs has been increased from 4 minutes to 6 minutes, addressing issue #2219. The modification is implemented in the test_running_real_remove_backup_groups_job function in the tests/integration/install/test_installation.py file, where the get_group function's retried decorator timeout is updated from 4 minutes to 6 minutes. This change improves the system's handling of slow preliminary jobs by allowing more time for the API to delete a group and minimizing errors resulting from insufficient deletion time. The overall functionality and tests of the system remain unaffected.
  • Init RuntimeContext from debug notebook to simplify interactive debugging flows (#2253). In this release, we have implemented a change to simplify interactive debugging flows in UCX workflows. We have introduced a new feature that initializes the RuntimeContext object from a debug notebook. The RuntimeContext is a subclass of GlobalContext that manages all object dependencies. Previously, all UCX workflows used a RuntimeContext instance for any object lookup, which could be complex during debugging. This change pre-initializes the RuntimeContext object correctly, making it easier to perform interactive debugging. Additionally, we have replaced the use of Installation.load_local and WorkspaceClient with the newly initialized RuntimeContext object. This reduces the complexity of object lookup and simplifies the code for debugging purposes. Overall, this change will make it easier to debug UCX workflows by pre-initializing the RuntimeContext object with the necessary configurations.
  • Lint child dependencies recursively (#2226). In this release, we've implemented significant changes to our linting process for enhanced context awareness, particularly in the context of parent-child file relationships. The DependencyGraph class in the graph.py module has been updated with new methods, including parent, root_dependencies, root_paths, and root_relative_names, and an improved _relative_names method. These changes allow for more accurate linting of child dependencies. The lint function in the files.py module has also been modified to accept new parameters and utilize a recursive linting approach for child dependencies. The databricks labs ucx lint-local-code command has been updated to include a paths parameter and lint child dependencies recursively, improving the linting process by considering parent-child relationships and resulting in better contextual code analysis. The release contains integration tests to ensure the functionality of these changes, addressing issues #2155 and #2156.
  • Removed deprecated install.sh script (#2217). In this release, we have removed the deprecated install.sh script from the codebase, which was previously used to install and set up the environment for the project. This script would check for the presence of Python binaries, identify the latest version, create a virtual environment, and install project dependencies. Going forward, developers will need to utilize an alternative method for installing and setting up the project environment, as the use of this script is now obsolete. We recommend consulting the updated documentation for guidance on the new installation process.
  • Tentatively fix failure when running asses...
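
A sketch of the #2240 enumeration: request only the cheap attributes up front, then fetch full details for in-scope groups only; the function shape is illustrative:

```python
from collections.abc import Iterator

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.iam import Group


def list_workspace_groups(ws: WorkspaceClient, group_names: set[str]) -> Iterator[Group]:
    scan_attributes = "id,displayName,meta"  # cheap attributes only
    for group in ws.groups.list(attributes=scan_attributes):
        if group.display_name not in group_names:
            continue  # out of scope, skip the expensive per-group fetch
        yield ws.groups.get(group.id)  # full attributes, including members
```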
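
A sketch of the #2210 failure path; the note says the TimeoutError is caused by a requests ConnectionError, and the warning text here is an assumption:

```python
import logging

from requests.exceptions import ConnectionError as RequestsConnectionError

logger = logging.getLogger(__name__)


def run_with_connection_hint(install) -> None:
    try:
        install.run()
    except TimeoutError as err:
        if isinstance(err.__cause__, RequestsConnectionError):
            logger.warning(
                "Cannot connect to the workspace; check VPN, proxy and firewall settings, then retry"
            )
            return
        raise
```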

v0.29.0

19 Jul 16:09 · @nfx · 4c9c7a8

  • Added lsql lakeview dashboard-as-code implementation (#1920). The open-source library has been updated with new features in its dashboard creation functionality. The assessment_report and estimates_report jobs, along with their corresponding tasks, have been removed. The crawl_groups task has been modified to accept a new parameter, group_manager. These changes are part of a larger implementation of the lsql Lakeview dashboard-as-code system for creating dashboards. The new implementation has been tested through manual testing, existing unit tests, integration tests, and verification on a staging environment, and is expected to improve the functionality and maintainability of the dashboards. The removal of the assessment_report and estimates_report jobs and tasks may indicate that their functionality has been incorporated into the new lsql implementation or is no longer necessary. The new crawl_groups task parameter may be used in conjunction with the new lsql implementation to enhance the assessment and estimation of groups.
  • Added new widget to get table count (#2202). A new widget has been introduced that presents a table count summary, categorized by type (external or managed), location (DBFS root, mount, cloud), and format (delta, parquet, etc.). This enhancement is complemented by an additional SQL file, responsible for generating necessary count statistics. The script discerns the table type and location through location string analysis and subsequent categorization. The output is structured and ordered by table type. It's important to note that no existing functionality has been altered, and the new feature is self-contained within the added SQL file. To ensure the correct functioning of this addition, relevant documentation and manual tests have been incorporated.
  • Added support for DBFS when building the dependency graph for tasks (#2199). In this update, we have added support for the Databricks File System (DBFS) when building the dependency graph for tasks during workflow assessment. This enhancement allows for the use of wheels, eggs, requirements.txt files, and PySpark jobs located in DBFS when assessing workflows. The DependencyGraph object's register_library method has been updated to handle paths in both Workspace and DBFS formats. Additionally, we have introduced the _as_path method and the _temporary_copy context manager to manage file copying and path determination. This development resolves issue #1558 and includes modifications to the existing assessment workflow and new unit tests. The copy step is sketched after this list.
  • Applied databricks labs lsql fmt for SQL files (#2184). The engineering team has developed and applied formatting to several SQL files using the databricks labs lsql fmt tool from various pull requests, including databrickslabs/lsql#221. These changes improve code readability and consistency without affecting functionality. The formatting includes adding comment delimiters, converting subqueries to nested SELECT statements, renaming columns for clarity, updating comments, modifying conditional statements, and improving indentation. The impacted SQL files include queries related to data migration complexity, assessing data modeling complexity, generating table estimates, and calculating data migration effort. Manual testing has been performed to ensure that the update does not introduce any issues in the installed dashboards.
  • Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#2182). In this release, the version of sigstore/gh-action-sigstore-python is bumped to 3.0.0 from 2.1.1 in the project's GitHub Actions workflow. This new version brings several changes, additions, and removals, such as the removal of certain settings like fulcio-url, rekor-url, ctfe, and rekor-root-pubkey, and output settings like signature, certificate, and bundle. The inputs field is now parsed according to POSIX shell lexing rules and is optional if release-signing-artifacts is true and the action's event is a release event. The default suffix has changed from .sigstore to .sigstore.json. Additionally, various deprecations present in sigstore-python's 2.x series have been resolved.
  • Consistently cleanup linter codes (#2194). This commit introduces changes to the linting functionality of PySpark, focusing on enhancing code consistency and accuracy. New checks have been added for detecting code incompatibilities with UC Shared Clusters, targeting Python UDF unsupported eval types, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added. The commit also resolves false linting advice for homonymous method names and updates the code for static analysis message codes, improving self-documentation and maintainability. These changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
  • Disable the builtin pip version check when running pip commands (#2214). In this release, we have introduced a modification to disable the built-in pip version check when using pip to install dependencies. This change alters the existing workflow of the _install_pip method to include the --disable-pip-version-check flag in the pip install command, reducing noise in pip-related errors and messages and enhancing the user experience. We have conducted manual and unit testing to ensure that the changes do not introduce any regressions and that existing functionality remains unaffected. The error message has been updated to reflect the new pip behavior, including the --disable-pip-version-check flag in the message. The flag in effect is shown after this list.
  • Document principal-prefix-access for azure will only list abfss storage accounts (#2212). In this release, we have updated the documentation for the principal-prefix-access CLI command in the context of Azure. This command now exclusively lists Azure Storage Blob Gen2 accounts and disregards unsupported storage formats such as wasb:// or adl://. This change is significant as these unsupported storage formats are not compatible with Unity Catalog (UC) and will be disregarded during the migration process. This update clarifies the behavior of the command, ensuring that only relevant storage accounts are displayed. This modification is crucial for users who are migrating credentials to UC, as it prevents the incorporation of unsupported storage accounts, resulting in a more streamlined and efficient migration process.
  • Group migration: change error logging format (#2215). In this release, we have updated the error logging format for failed permissions migrations during the experimental group migration workflow to enhance readability and debugging capabilities. Previously, the logs only stated that a migration failure occurred without further details. Now, the new format includes both the source and destination account names, as well as a description of the simulated failure during the migration process. This improves the transparency and usefulness of the error logs for debugging and troubleshooting purposes. Additionally, we have added unit tests to ensure the proper logging of failed migrations, ensuring the reliability of the group migration process for our users. This update demonstrates our commitment to providing clear and informative error messages to make the software engineering experience better.
  • Improve error handling as already exists error occurs (#2077). The recent change enhances error handling for the create-catalogs-schemas CLI command, addressing an issue where the command would fail if the catalog or schema already existed. The modification introduces the _get_missing_catalogs_schemas method to avoid recreating existing ones. The create_all_catalogs_schemas method has been updated to include try-except blocks for the _create_catalog_validate and _create_schema methods, skipping creation if a BadRequest error occurs with the message "already exists." This ensures that no overwriting of existing catalogs and schemas takes place. A new test case, "test_create_catalogs_schemas_handles_existing," has been added to verify the command's handling of existing catalogs and schemas. This change resolves issue #1939 and has been manually tested. A sketch of the skip-if-exists behavior appears after this list.
  • Support run assessment as a collection (#1925). This commit introduces the capability to run eligible CLI commands as a collection, with an initial implementation for the assessment run command. A new parameter collection_workspace_id has been added to determine whether the current installation workflow is run or if an account context...
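
A sketch of the #2199 copy step: download a dbfs:/ artifact to a local temporary file so the dependency graph can parse it. The context-manager shape mirrors the described _temporary_copy, but its exact signature is assumed:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

from databricks.sdk import WorkspaceClient


@contextmanager
def temporary_copy(ws: WorkspaceClient, dbfs_path: str):
    with tempfile.TemporaryDirectory() as tmp_dir:
        local = Path(tmp_dir) / Path(dbfs_path).name
        with ws.dbfs.download(dbfs_path) as remote, local.open("wb") as out:
            shutil.copyfileobj(remote, out)
        yield local  # the graph can now treat the artifact as a local file
```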
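
The #2214 change in effect; the wrapper function is illustrative, while --disable-pip-version-check is a standard pip flag:

```python
import subprocess
import sys


def install_dependencies(*libraries: str) -> None:
    cmd = [sys.executable, "-m", "pip", "install", "--disable-pip-version-check", *libraries]
    subprocess.run(cmd, check=True)  # no "new pip release" noise in the output
```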
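
A sketch of the #2077 skip-if-exists behavior; only BadRequest and the catalogs API come from databricks-sdk, the message matching is an assumption:

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import BadRequest

logger = logging.getLogger(__name__)


def create_catalog_if_missing(ws: WorkspaceClient, name: str) -> None:
    try:
        ws.catalogs.create(name)
    except BadRequest as err:
        if "already exists" in str(err):
            logger.warning(f"Skipping {name}: catalog already exists")
            return
        raise
```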

v0.28.2

12 Jul 17:18 · @nfx · 85df593

  • Fixed Table Access Control is not enabled on this cluster error (#2167). A fix has been implemented to address the Table Access Control is not enabled on this cluster error, changing it to a warning when the exception is raised. This modification involves the introduction of a new constant CLUSTER_WITHOUT_ACL_FRAGMENT to represent the error message and updates to the snapshot and grants methods to conditionally log a warning instead of raising an error when the exception is caught. These changes improve the robustness of the integration test by handling exceptions when many test schemas are being created and deleted quickly, without introducing any new functionality. However, the change has not been thoroughly tested.
  • Fixed infinite recursion when checking module of expression (#2159). In this release, we have addressed an infinite recursion issue (#2159) that occurred when checking the module of an expression. The append_statements method has been updated to no longer overwrite existing statements for globals when appending trees, instead extending the existing list of statements for the global with new values. This modification ensures that the accuracy of module checks is improved and prevents the infinite recursion issue. Additionally, unit tests have been added to verify the correct behavior of the changes and confirm the resolution of both the infinite recursion issue and the appending behavior. This enhancement was a collaborative effort with Eric Vergnaud.
  • Fixed parsing unsupported magic syntax (#2157). In this update, we have addressed a crashing issue that occurred when parsing unsupported magic syntax in a notebook's source code. We accomplished this by modifying the _read_notebook_path function in the cells.py file: the start variable, which marks the position of the command in a line, is now obtained with the find() method instead of the index() method. This change resolves the crash and enhances the parser's robustness in handling various magic syntax types. The commit also includes a manual test to confirm the fix, which addresses one of the two reported issues. A two-line illustration appears after this list.
  • Infer values from child notebook in magic line (#2091). This commit introduces improvements to the notebook linter for enhanced value inference during linting. By utilizing values from child notebooks loaded via the %run magic line, the linter can now provide more accurate suggestions and error detection. The FileLinter class has been updated to include a session_state parameter, allowing it to access variables and objects defined in child notebooks. New methods such as append_tree(), append_nodes(), and append_globals() have been added to the BaseLinter class for better code tree manipulation, enabling more accurate linting of combined code trees. Additionally, unit tests have been added to ensure the correct behavior of this feature. This change addresses issue #1201 and progresses issue #1901.
  • Updated databricks-labs-lsql requirement from ~=0.5.0 to >=0.5,<0.7 (#2160). In this update, the version constraint for the databricks-labs-lsql library has been updated from ~=0.5.0 to >=0.5,<0.7, allowing the project to utilize the latest features and bug fixes available in the library while maintaining compatibility with the existing codebase. This change ensures that the project can take advantage of any improvements or additions made to databricks-labs-lsql version 0.6.0 and above. For reference, the release notes for databricks-labs-lsql version 0.6.0 have been included in the commit, detailing the new features and improvements that come with the updated library.
  • Whitelist phonetics (#2163). This release adds the phonetics library to the whitelist in the known.json configuration file, covering five modules: phonetics, phonetics.metaphone, phonetics.nysiis, phonetics.soundex, and phonetics.utils. These modules have been manually tested and are now recognized during source code analysis, progressing issue #1901. As an adopting engineer, this addition lets you use these phonetics modules without them being flagged as unknown libraries.
  • Whitelist pydantic (#2162). In this release, we have added the Pydantic library to the known.json file, which manages our project's third-party libraries. Pydantic is a data validation library for Python that allows developers to define data models and enforce type constraints, improving data consistency and correctness in the application. With this change, Pydantic and its submodules have been whitelisted and can be used in the project without being flagged as unknown libraries. This improvement enables us to utilize Pydantic's features for data validation and modeling, ensuring higher data quality and reducing the likelihood of errors in our application.
  • Whitelist statsmodels (#2161). In this change, the statsmodels library has been whitelisted for use in the project. Statsmodels is a comprehensive Python library for statistics and econometrics that offers a variety of tools for statistical modeling, testing, and visualization. With this update, the library has been added to the project's known.json configuration file, so its use is recognized during source code analysis rather than flagged as unknown. The modification does not affect the existing functionality of the project. Additionally, a test has been included to verify the successful integration of the library.
  • whitelist dbignite (#2132). A new commit whitelists the dbignite library and adds a set of codes and messages in the "known.json" file related to the use of RDD APIs on UC Shared Clusters and the change in the default format from Parquet to Delta in Databricks Runtime 8.0 (an illustrative known.json fragment follows this list). The affected components include dbignite.fhir_mapping_model, dbignite.fhir_resource, dbignite.hosp_feeds, dbignite.hosp_feeds.adt, dbignite.omop, dbignite.omop.data_model, dbignite.omop.schemas, dbignite.omop.utils, and dbignite.readers. These entries provide information and warnings regarding the use of the specified APIs on UC Shared Clusters and the change in default format. No new methods have been added and no existing functionality has changed; the commit is limited to the addition of the dbignite entries and their associated codes and messages.
  • whitelist duckdb (#2134). In this release, we have whitelisted the DuckDB library by adding it to the "known.json" file. DuckDB is an in-memory analytical database written in C++. The addition covers several modules, including adbc_driver_duckdb, duckdb.bytes_io_wrapper, duckdb.experimental, duckdb.filesystem, duckdb.functional, and duckdb.typing. Of particular note, the entry for the duckdb.experimental.spark.sql.session module carries the table-migrate code and message about the change of the default format from Parquet to Delta in Databricks Runtime 8.0. The commit includes tests that have been manually verified.
  • whitelist fs (#2136). In this release, we have added the fs package to the known.json file, allowing its use in our open-source library. The fs package contains a wide range of modules and sub-packages, including fs._bulk, fs.appfs, fs.base, fs.compress, fs.copy, fs.error_tools, fs.errors, fs.filesize, fs.ftpfs, fs.glob, fs.info, fs.iotools, fs.lrucache, fs.memoryfs, fs.mirror, fs.mode, fs.mountfs, fs.move, fs.multifs, fs.opener, fs.osfs, fs.path, fs.permissions, fs.subfs, fs.tarfs, fs.tempfs, fs.time, fs.tools, fs.tree, fs.walk, fs.wildcard, fs.wrap, fs.wrapfs, and fs.zipfs. These additions address issue #1901 and have been thoroughly manually tested to ensure proper functionality.
  • whitelist httpx (#2139). In this release, we have updated the "known.json" file to include the httpx library along with all its submodules. This change whitelists the library; it does not introduce any new functionality or impact existing behavior, and no new methods or functions are added. The changes have been manually tested and the project's behavior remains unaffected: the addition of httpx only influences the library whitelist.
  • whitelist jsonschema and jsonschema-specifications ([#2140...
Read more
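
A minimal sketch of the error-to-warning pattern from #2167. The constant name comes from the PR; the fetch_grants helper and the exception type are illustrative, not the actual UCX code:

    import logging

    from databricks.sdk.errors import DatabricksError

    logger = logging.getLogger(__name__)

    # Fragment of the error raised on clusters without table ACLs enabled.
    CLUSTER_WITHOUT_ACL_FRAGMENT = "Table Access Control is not enabled on this cluster"

    def fetch_grants(fetch):
        try:
            return list(fetch())
        except DatabricksError as e:
            if CLUSTER_WITHOUT_ACL_FRAGMENT in str(e):
                # Downgrade to a warning so the snapshot can proceed.
                logger.warning(f"grant crawl skipped: {e}")
                return []
            raise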
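
The index()-versus-find() difference behind the #2157 fix, in isolation: str.index() raises ValueError when the substring is absent, while str.find() returns -1, which the caller can test for. The "%run" command here is only an example:

    line = "%%unsupported-magic do-something"

    # Before: str.index() raises ValueError for lines without the command.
    try:
        start = line.index("%run")
    except ValueError:
        start = -1

    # After: str.find() returns -1, letting the parser skip the line safely.
    start = line.find("%run")
    if start < 0:
        pass  # not a %run command; nothing to parse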
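
For orientation across the whitelist entries above, a known.json entry maps each module of a package to a list of known compatibility findings, where an empty list means no known issues. The fragment below is illustrative only; the module-to-finding mapping and the code and message strings are not copied from the real known.json:

    {
      "dbignite": {
        "dbignite.fhir_resource": [],
        "dbignite.omop.utils": [
          {
            "code": "rdd-in-shared-clusters",
            "message": "RDD APIs are not supported on UC Shared Clusters"
          }
        ]
      }
    }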

v0.28.1

10 Jul 15:57
@nfx
e5d1bed


  • Added documentation for common challenges and solutions (#1940). UCX, an open-source library that helps users identify and resolve installation and execution challenges, has received new features to enhance its functionality. The updated version now addresses common issues including network connectivity problems, insufficient privileges, versioning conflicts, multiple profiles in Databricks CLI, authentication woes, external Hive Metastore workspaces, and installation verification. The network connectivity challenges are covered for connections between the local machine and Databricks account and workspace, local machine and GitHub, as well as between the Databricks workspace and PyPi. Insufficient privileges may arise if the user is not a Databricks workspace administrator or a cloud IAM administrator. Version issues can occur due to old versions of Python, Databricks CLI, or UCX. Authentication issues can arise at both workspace and account levels. Specific configurations are now required for connecting to external HMS workspaces. Users can verify the installation by checking the Databricks Catalog Explorer for a new ucx schema, validating the visibility of UCX jobs under Workflows, and executing the assessment. Ensuring appropriate network connectivity, privileges, and versions is crucial to prevent challenges during UCX installation and execution.
  • Added more checks for spark-connect linter (#2092). The commit enhances the spark-connect linter by adding checks that detect code incompatible with UC Shared Clusters, specifically the use of Python UDF eval types that are unsupported there, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext (an illustrative flagged snippet follows this list). A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added, including various examples of valid and invalid uses of Python UDFs and Pandas UDFs. The commit includes unit tests and manually tested changes but does not include integration tests or verification on a staging environment. The spark-logging.py file has been renamed and moved within the directory structure.
  • Fixed false advice when linting homonymous method names (#2114). This commit resolves false advice given when linting homonymous method names in the PySpark module, specifically false positives for the getTable and insertInto methods. The linter now checks that a method name in scope for linting actually belongs to the PySpark module, and the functional tests have been updated accordingly (a minimal illustration follows this list). The commit also progresses the resolution of issues #1864 and #1901, and adds new unit tests to ensure the correct behavior of the updated code. The changes are limited to the linting functionality for PySpark and do not affect any other functionality. Co-authored by Eric Vergnaud and Serge Smertin.
  • Improve catch-all handling and avoid some pylint suppressions (#1919).
  • Infer values from child notebook in run cell (#2075). This commit introduces the new process_child_cell method in the UCXLinter class, enabling the linter to process code from a child notebook referenced in a run cell (a two-notebook sketch follows this list). The changes include modifying the FileLinter and NotebookLinter classes to accept a new argument, _path_lookup, and updating the _lint_one function in the files.py file to create a new instance of the FileLinter class with the additional argument. These modifications improve inference from child notebooks in run cells, resolve issues #1901, #1205, and #1927, and reduce the number of not-computed advisories when running make solacc. Unit tests have been added to ensure proper functionality.
  • Mention migration dashboard under jobs static code analysis workflow in README (#2104). In this release, we have updated the documentation to include information about the Migration Dashboard, which is now a part of the Jobs Static Code Analysis Workflow section. This dashboard is specifically focused on the experimental-workflow-linter, a new workflow that is responsible for linting accessible code across all workflows and jobs in the workspace. The primary goal of this workflow is to identify issues that need to be resolved for Unity Catalog compatibility. Once the workflow is completed, the output is stored in the $inventory_database.workflow_problems table and displayed in the Migration Dashboard. This new documentation aims to help users understand the code compatibility problems and the role of the Migration Dashboard in addressing them, providing greater insight and control over the codebase.
  • raise warning instead of error to allow assessment in regions that do not support certain features (#2128). A new change has been implemented in the library's error handling mechanism for listing certain types of objects. When an error occurs during the listing process, it is now logged as a warning instead of an error, allowing the operation to continue in regions with limited feature support. This behavior resolves issue #2082 and has been implemented in the generic.py file without affecting any other functionality. Unit tests have been added to verify these changes. Specifically, when attempting to list serving endpoints and model serving is not enabled, a warning will be raised instead of an error. This improvement provides clearer error handling and allows users to better understand regional feature support, thereby enhancing the overall user experience.
  • whitelist bitsandbytes (#2048). The bitsandbytes library, which provides k-bit quantization primitives for PyTorch models, has been whitelisted by adding it and its many sub-modules to the "known.json" file's list of known libraries. This update does not introduce any new functionality or alter existing features; it only allows the library to be recognized during source code analysis. The tests for this change have been manually verified.
  • whitelist blessed (#2130). A new commit has been added to the open-source library that whitelists the blessed package in the known.json file, which is used for source code analysis. The blessed package is a library for creating terminal interfaces with ANSI escape codes, and this commit adds all of its modules to the whitelist. This change is related to issue #1901 and was manually tested to ensure its functionality. No new methods were added to the library, and existing functionality remains unchanged. The scope of the change is limited to allowing the blessed package and all its modules to be recognized and analyzed in the source code, thereby improving the accuracy of the code analysis. Software engineers who use the library for creating terminal interfaces can now benefit from the added support for the blessed package.
  • whitelist btyd (#2040). In this release, we have whitelisted the btyd library, which implements "Buy Till You Die" probabilistic models for customer lifetime value analysis, by adding its modules to the known.json file that manages third-party dependencies. This change allows btyd to be recognized during source code analysis and has been manually tested, with the results included in the tests section. No existing functionality has been altered and no new methods have been added as part of this update. This development is a step forward in resolving issue #1901.
  • whitelist chispa (#2054). In this release, the chispa library, which provides PySpark DataFrame comparison helpers for testing, has been whitelisted by adding its modules to the known.json file. No existing functionality has been altered and no new methods have been added; the change only allows chispa to be recognized during source code analysis. The tests for this change have been manually verified, and the update progresses issue #1901.
  • whitelist chronos (#2057). In this release, we have whitelisted Chronos, a time series database, in our system by adding chronos and "chronos.main" entries to the known.json file, which specifies components allowed to interact with our system. This change, related to issue #1901, was manually tested with no new methods added or existing functionality altered. Therefore, as a software engineer adopting this project, you should be aware that Chronos has been added to the list of approved ...
Read more
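
An illustrative snippet of the kind of code the extended spark-connect linter from #2092 targets; the session setup is boilerplate and the flagged API comes from the PR description:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The linter flags spark.catalog.X calls such as this one when the
    # cluster runs a DBR version earlier than 14.3 on UC Shared Clusters.
    for table in spark.catalog.listTables("default"):
        print(table.name)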
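
A minimal illustration of the false-positive class fixed in #2114: both objects below expose a getTable method, but only the PySpark call should receive migration advice. The MyRegistry class is hypothetical:

    from pyspark.sql import SparkSession

    class MyRegistry:
        def getTable(self, name: str) -> dict:
            # Same method name, but unrelated to PySpark.
            return {"name": name}

    MyRegistry().getTable("users")  # must NOT be flagged by the linter

    spark = SparkSession.builder.getOrCreate()
    spark.catalog.getTable("users")  # may legitimately receive advice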
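
The %run pattern whose values the linter can now follow (#2075), sketched as two hypothetical notebooks; the paths and the table name are placeholders:

    # Child notebook at /Repos/project/child defines a value:
    table_name = "hive_metastore.sales.orders"

    # Parent notebook loads the child through a magic line:
    # %run /Repos/project/child

    # When linting the parent, the linter can resolve table_name from the
    # child notebook and advise on the hive_metastore reference below.
    # (spark is predefined in Databricks notebooks.)
    df = spark.table(table_name)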

v0.28.0

05 Jul 10:48
@nfx
0276f34


  • Added handling for exceptions with no error_code attribute while crawling permissions (#2079). A new enhancement improves error handling during the assessment job's permission crawling process. Previously, exceptions that lacked an error_code attribute would cause an AttributeError. This release introduces a check for the existence of the error_code attribute before attempting to access it, logging an error and adding the exception to the list of acute errors when the attribute is absent (a sketch of the attribute check follows this list). The change includes a new unit test for verification, and the relevant functionality has been added to the inventorize_permissions function within the manager.py file. The new method, test_manager_inventorize_fail_with_error, tests the permission manager's behavior when encountering errors during the inventory process, raising DatabricksError and TimeoutError instances with and without error_code attributes. This update resolves issue #2078 and improves the overall robustness of the assessment job's permission crawling functionality.
  • Added handling for missing permission to read file (#1949). In this release, we've addressed an issue where missing permissions to read a file during linting were not being handled properly. The revised code now checks for NotFound and PermissionError exceptions when attempting to read a file's text content. If a NotFound exception occurs, the function returns None and logs a warning message. If a PermissionError exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue #1942 and partially resolves issue #1952, improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.
  • Added handling for unauthenticated exception while joining collection (#1958). A new exception type, Unauthenticated, has been added to the import statement, and new error messages have been implemented in the _sync_collection and _get_collection_workspace functions to notify users when they do not have admin access to the workspace. A try-except block has been added in the _get_collection_workspace function to handle the Unauthenticated exception, and a warning message is logged indicating that the user needs account admin and workspace admin credentials to enable collection joining and to run the join-collection command with account admin credentials. Additionally, a new CLI command has been added, and the existing databricks labs ucx ... command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.
  • Added tracking for UCX workflows and as-library usage (#1966). This commit introduces User-Agent tracking for UCX workflows and library usage, adding ucx/<version>, cmd/install, and cmd/<workflow> elements to relevant requests (a minimal sketch follows this list). These changes are exercised in the test_useragent.py file, which includes the new http_fixture_server context manager for testing User-Agent propagation in UCX workflows. The use of the with_user_agent_extra and with_product functions from databricks.sdk.core aims to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.
  • Analyse altair (#2005). In this release, the altair library has been whitelisted, addressing issue #1901. The changes add several modules and sub-modules under the altair package to the known.json file, including altair, altair._magics, altair.expr, and various others such as altair.utils, altair.utils._dfi_types, altair.utils._importers, and altair.utils._show. No new functionality has been introduced, and the changes have been manually verified. This release has been developed by Eric Vergnaud.
  • Analyse azure (#2016). In this release, we have made updates to the whitelist of several Azure libraries, including 'azure-common', 'azure-core', 'azure-mgmt-core', 'azure-mgmt-digitaltwins', and 'azure-storage-blob'. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. The azure-common library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.
  • Analyse causal-learn (#2012). In this release, we have added causal-learn to the whitelist in our JSON file, signifying that it is now a supported library. This update adds entries for the library's various modules, classes, and functions to the whitelist. There are no changes to existing functionality, nor have any new methods been added, and the release has been tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.
  • Analyse databricks-arc (#2004). This release introduces whitelisting for the databricks-arc library, which is used for data analytics and machine learning. The release updates the known.json file to include databricks-arc and its related modules such as arc.autolinker, arc.sql, arc.sql.enable_arc, arc.utils, and arc.utils.utils. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to the databricks-feature-engineering library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. The databricks.ml_features library has several updates, including changes to the _spark_client and publish_engine. The databricks.ml_features.entities module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.
  • Analyse dbldatagen (#1985). The dbldatagen package has been whitelisted in the known.json file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects within dbldatagen. This includes enhancements to dbldatagen._version, dbldatagen.column_generation_spec, dbldatagen.column_spec_options, dbldatagen.constraints, dbldatagen.data_analyzer, dbldatagen.data_generator, dbldatagen.datagen_constants, dbldatagen.datasets, and related classes. Additionally, dbldatagen.datasets.basic_geometries, dbldatagen.datasets.basic_process_historian, dbldatagen.datasets.basic_telematics, dbldatagen.datasets.basic_user, dbldatagen.datasets.benchmark_groupby, dbldatagen.datasets.dataset_provider, dbldatagen.datasets.multi_table_telephony_provider, and dbldatagen.datasets_object have been updated. The distribution methods, such as dbldatagen.distributions, dbldatagen.distributions.beta, dbldatagen.distributions.data_distribution, dbldatagen.distributions.exponential_distribution, dbldatagen.distributions.gamma, and dbldatagen.distributions.normal_distribution, have also seen improvements. Furthermore, dbldatagen.function_builder, dbldatagen.html_utils, dbldatagen.nrange, dbldatagen.schema_parser, dbldatagen.spark_singleton, dbldatagen.text_generator_plugins, and dbldatagen.text_generators have been updated. The dbldatagen.data_generator method now includes a warning about the deprecated sparkContext in shared clusters, and dbldatagen.schema_parser includes updates related to the table_name argument in various SQL statements. These changes ensure better compatibility and improved functionality of the dbldatagen package.
  • Analyse delta-spark (#1987). In this release, the delta-spark component of the delta project has been whitelisted via a new entry in the known.json configuration file. This entry covers several sub-components, including delta._typing, delta.exceptions, and delta.tables, each carrying a jvm-access-in-shared-clusters error code and message for unsupported environments. The changes have been tested and do not introduce new functionality or modify existing behavior, providing better stability and compatibility in handling the delta-spark component. Co-authored by Eric Vergnaud.
  • Analyse diffusers ([#2010](https://github.com/databrickslabs/uc...
Read more
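
The attribute probe behind #2079, in isolation; the record_failure helper and the acute_errors list are illustrative, not the actual manager.py code:

    import logging

    logger = logging.getLogger(__name__)

    def record_failure(acute_errors: list, error: Exception) -> None:
        # Not every exception carries an error_code attribute; probe for it
        # with getattr instead of accessing it directly, which previously
        # raised AttributeError and aborted the crawl.
        error_code = getattr(error, "error_code", None)
        if error_code is None:
            logger.error(f"unknown error during permission crawl: {error}")
        acute_errors.append(error)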
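
A minimal sketch of the User-Agent registration from #1966, using the two SDK helpers the PR names; the version and workflow strings are placeholders:

    from databricks.sdk.core import with_product, with_user_agent_extra

    # Identify UCX and the running workflow in subsequent API requests.
    with_product("ucx", "0.28.0")
    with_user_agent_extra("cmd", "migrate-tables")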

v0.27.1

12 Jun 23:35
@nfx
9e70b60


  • Fixed typo in known.json (#1899). A fix has been implemented to correct a typo in the known.json file, an essential configuration file that specifies dependencies for various components of the project. The typo was identified in the gast dependency, which was promptly rectified by modifying an incorrect character. This adjustment guarantees precise specification of dependencies, thereby ensuring the correct functioning of affected components and maintaining the overall reliability of the open-source library.

Contributors: @nfx