# Version changelog

## 0.11.1

* Updated log level for Spark Connect to suppress telemetry warnings in serverless.

## 0.11.0

* Generation of DQX rules from ODCS Data Contracts ([#932](https://github.com/databrickslabs/dqx/issues/932)). The Data Contract Quality Rules Generation feature has been introduced, enabling users to generate data quality rules directly from data contracts following the Open Data Contract Standard (ODCS). It supports three types of rule generation: predefined rules derived from schema properties and constraints, explicit DQX rules embedded in the contract, and text-based rules defined in natural language and processed by a Large Language Model (LLM) to generate appropriate checks. The feature provides rich metadata tracing generated rules back to the source contract for lineage and governance, and it can be used to implement federated data governance, standardize data contracts, and maintain version-controlled quality rules alongside schema definitions.
* AI-Assisted Primary Key Detection and Uniqueness Rules Generation ([#934](https://github.com/databrickslabs/dqx/issues/934)). Introduced AI-assisted primary key detection and uniqueness rules generation, leveraging Large Language Models (LLMs) to analyze table schemas and metadata. The feature intelligently detects single or composite primary keys and validates them by checking for duplicate values. The `DQProfiler` class now includes a `detect_primary_keys_with_llm` method, which returns a dictionary containing the primary key detection result, including the table name, success status, detected primary key columns, confidence level, reasoning, and error message if any (a usage sketch follows this list). The `DQGenerator` class has been extended to utilize uniqueness profiles from the profiler for AI-assisted uniqueness rules generation. Configuration options have been updated accordingly, including a new `llm_primary_key_detection` option that allows users to enable or disable AI-assisted primary key detection.
* AI-Assisted Rules Generation Improvements ([#925](https://github.com/databrickslabs/dqx/issues/925)). The AI-Assisted Rules Generation feature has been enhanced to handle input as a path in addition to a table, and to generate rules with a filter. The `generate_dq_rules_ai_assisted` method now accepts an `InputConfig` object, which allows users to specify the location and format of the input data, enabling more flexible input handling and filtering capabilities. The feature includes test cases to verify its functionality, including manual tests, unit tests, and integration tests, and the documentation has been updated with minor changes to reflect the new functionality. Additionally, the code has been modified to capitalize keywords to stabilize integration tests, and the `DQGenerator` class has been updated to accommodate the changes, allowing users to generate data quality rules from a variety of input sources. The `InputConfig` class provides a flexible way to configure the input data, including its location and format, and the `get_column_metadata` function has been introduced to retrieve column metadata from a given location. Overall, these updates aim to enhance the functionality and usability of the AI-assisted rules generation feature, providing more flexibility and accuracy in generating data quality rules.
* Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list checks ([#673](https://github.com/databrickslabs/dqx/issues/673)). The `is_in_list` and `is_not_null_and_is_in_list` check functions have been enhanced to support case-insensitive comparison, allowing users to choose between case-sensitive and case-insensitive comparisons via an optional `case_sensitive` boolean flag that defaults to `True` (see the sketch after this list). These checks verify if values in a specified column are present in a list of allowed values, with the `is_not_null_and_is_in_list` check also requiring the values to be non-null. The updated checks provide more flexibility in data validation, enabling users to configure the column to check, the list of allowed values, and the case sensitivity flag. However, it is recommended to use the `foreign_key` dataset-level check for large lists of allowed values or for columns of type `MapType` or `StructType`, as these checks are not suitable for such scenarios.
* Added documentation for using DQX in streaming scenarios with foreach batch ([#948](https://github.com/databrickslabs/dqx/issues/948)). Documentation and example code snippets were added to demonstrate how to apply checks in a `foreachBatch` structured streaming function (a sketch follows this list).
* Added telemetry to track count of input tables ([#954](https://github.com/databrickslabs/dqx/issues/954)). Added additional telemetry for better tracking of DQX usage to help improve the product.
* Added support for installing DQX from private PyPI repositories ([#930](https://github.com/databrickslabs/dqx/issues/930)). The DQX library has been enhanced with support for installing DQX from a company-hosted PyPI mirror, which is necessary for enterprises that block the public PyPI index. Documentation has been added to describe the feature. The tool installation code has been modified to automatically upload dependencies to the workspace when internet access is blocked.
* Support Custom Folder Installation for CLI Commands ([#942](https://github.com/databrickslabs/dqx/issues/942)). The command-line interface (CLI) has been enhanced to support custom installation folders, providing users with greater flexibility when working with the library. A new `--install-folder` argument has been introduced, allowing users to specify a custom installation folder when running various CLI commands, such as opening dashboards, workflows, logs, and profiles. This argument overrides the default installation location to support scenarios where the user installs DQX in a custom location. The library's dependency on `sqlalchemy` has also been updated to require a version greater than or equal to 2.0 and less than 3.0 to avoid dependency issues on older DBRs.
* Enhancement to end-to-end tests ([#921](https://github.com/databrickslabs/dqx/issues/921)). The e2e tests have been enhanced to test integration with the dbt transformation framework. Additionally, the documentation for contributing to the project and testing has been updated to simplify the setup process for running tests locally.
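
The sketch below illustrates the AI-assisted primary key detection entry above. The `DQProfiler` class and `detect_primary_keys_with_llm` method come from the release notes; the import path, the positional table argument, and the result-dictionary keys are assumptions made for illustration only.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler  # import path assumed

profiler = DQProfiler(WorkspaceClient())

# Ask the LLM-backed detector to propose a primary key for a table.
# The argument and the result keys below are assumptions; the release notes only
# state that the result contains the table name, success status, detected primary
# key columns, confidence level, reasoning, and an error message if any.
result = profiler.detect_primary_keys_with_llm("main.sales.orders")

if result.get("success"):
    print("Detected primary key:", result.get("primary_key_columns"))
    print("Confidence:", result.get("confidence"))
    print("Reasoning:", result.get("reasoning"))
else:
    print("Detection failed:", result.get("error"))
```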
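
A minimal sketch of the new `case_sensitive` flag using metadata-defined checks; the table and column names are placeholders, and the `allowed` argument name is assumed to match the existing check signature.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dq_engine = DQEngine(WorkspaceClient())

checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_not_null_and_is_in_list",
            "arguments": {
                "column": "country_code",       # placeholder column
                "allowed": ["US", "PL", "DE"],
                "case_sensitive": False,        # new optional flag; defaults to True
            },
        },
    },
]

# With case_sensitive=False, values such as "us" or "Us" now pass the check;
# nulls and unknown codes are still flagged and routed to the quarantine DataFrame.
input_df = spark.table("main.demo.customers")   # placeholder table
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```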
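
The streaming entry above points at new documentation; the sketch below shows the general `foreachBatch` pattern it describes, using a simple not-null check. The source, target, and checkpoint locations are placeholders, not values from the docs.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()
dq_engine = DQEngine(WorkspaceClient())

checks = [
    {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "id"}}},
]

def apply_dqx_checks(batch_df: DataFrame, batch_id: int) -> None:
    # Apply the checks to each micro-batch and append the annotated rows to a table.
    checked_df = dq_engine.apply_checks_by_metadata(batch_df, checks)
    checked_df.write.mode("append").saveAsTable("main.demo.events_checked")  # placeholder target

(
    spark.readStream.table("main.demo.events_raw")                           # placeholder source
    .writeStream
    .foreachBatch(apply_dqx_checks)
    .option("checkpointLocation", "/Volumes/main/demo/checkpoints/events")   # placeholder path
    .trigger(availableNow=True)
    .start()
)
```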

BREAKING CHANGES!

* Renamed `level` parameter to `criticality` in `generate_dq_rules` method of `DQGenerator` for consistency.
* Replaced `table: str` parameter with `input_config: InputConfig` in `profile_table` method of `DQProfiler` for greater flexibility.
* Replaced `table_name: str` parameter with `input_config: InputConfig` in `generate_dq_rules_ai_assisted` method of `DQGenerator` for greater flexibility (a migration sketch follows this list).
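
A rough before/after sketch of the renamed parameters listed above. The import paths, the `location` field on `InputConfig`, and the return shapes are assumptions for illustration; only the class, method, and parameter names come from this changelog.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.config import InputConfig               # import path assumed
from databricks.labs.dqx.profiler.profiler import DQProfiler     # import path assumed
from databricks.labs.dqx.profiler.generator import DQGenerator   # import path assumed

ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)

# Before 0.11.0 (illustrative):
#   summary_stats, profiles = profiler.profile_table(table="main.demo.orders")
#   checks = generator.generate_dq_rules(profiles, level="error")

# From 0.11.0: the input is described by an InputConfig, and `level` is now `criticality`.
input_config = InputConfig(location="main.demo.orders")           # a table or a path
summary_stats, profiles = profiler.profile_table(input_config=input_config)
checks = generator.generate_dq_rules(profiles, criticality="error")
```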

## 0.10.0

* Added Data Quality Summary Metrics ([#553](https://github.com/databrickslabs/dqx/issues/553)). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The `DQEngine` class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the `metrics_config` parameter, and a new `save_summary_metrics` method has been added to save data quality summary metrics to a table (an illustrative sketch follows this list). Additionally, the engine has been updated to include a unique `run_id` field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics. Overall, these enhancements provide a more comprehensive and flexible data quality checking capability, allowing users to track and analyze data quality issues more effectively.
* LLM assisted rules generation ([#577](https://github.com/databrickslabs/dqx/issues/577)). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and allows custom check functions to be used. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
* Added Lakebase checks storage backend ([#550](https://github.com/databrickslabs/dqx/issues/550)). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode (see the sketch after this list). This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
* Added runtime validation of sql expressions ([#625](https://github.com/databrickslabs/dqx/issues/625)). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message (a small example follows this list). Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
* Fixed docs ([#598](https://github.com/databrickslabs/dqx/issues/598)). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
* Improved Config Serialization ([#676](https://github.com/databrickslabs/dqx/issues/676)). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the `ConfigSerializer` class, which handles the serialization and deserialization of workspace and run configurations.
* Restore use of `hatch-fancy-pypi-readme` to fix images in PyPi ([#601](https://github.com/databrickslabs/dqx/issues/601)). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPI.
* Skip check evaluation if columns or filter cannot be resolved in the input DataFrame ([#609](https://github.com/databrickslabs/dqx/issues/609)). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
* Updated user guide docs ([#607](https://github.com/databrickslabs/dqx/issues/607)). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
* Improved build process ([#618](https://github.com/databrickslabs/dqx/issues/618)). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the `databricks-labs-pytester` version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
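
A rough sketch of how the summary-metrics pieces named above might fit together. Only the `DQMetricsObserver` class name, the `metrics_config` parameter, and the `save_summary_metrics` method come from the entry; the import path, constructor arguments, the `observer=` keyword, and the tuple return shape are assumptions for illustration.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.metrics_observer import DQMetricsObserver  # import path assumed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dq_engine = DQEngine(WorkspaceClient())
observer = DQMetricsObserver(name="dq_metrics")                      # constructor args assumed

checks = [
    {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "id"}}},
]

# Assumed call shape: attaching an observer makes the engine also return the Spark
# observation, whose metrics are populated once an action runs on the checked DataFrame.
checked_df, observation = dq_engine.apply_checks_by_metadata(
    spark.table("main.demo.orders"), checks, observer=observer       # placeholder table
)
checked_df.write.mode("overwrite").saveAsTable("main.demo.orders_checked")

# Persist the summary metrics; the keyword argument name is assumed.
dq_engine.save_summary_metrics(observation, metrics_table="main.demo.dq_summary_metrics")
```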
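
A sketch of saving and loading checks through the Lakebase backend described above. The `LakebaseChecksStorageConfig` class and its field list come from the entry, but the import path, exact field names, and the `save_checks`/`load_checks` call shapes are assumptions.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import LakebaseChecksStorageConfig  # import path assumed

dq_engine = DQEngine(WorkspaceClient())

checks = [
    {"criticality": "warn", "check": {"function": "is_not_null", "arguments": {"column": "order_id"}}},
]

# Field names mirror the fields listed in the entry (instance name, user, location,
# port, run config name, write mode); the exact names are assumptions.
storage_config = LakebaseChecksStorageConfig(
    instance_name="dqx-lakebase",
    user="dqx_app",
    location="dqx.config.checks",   # "database.schema.table"
    mode="overwrite",
)

dq_engine.save_checks(checks, config=storage_config)
loaded_checks = dq_engine.load_checks(config=storage_config)
```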
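
A small example of the runtime-validation behaviour described above, assuming the standard metadata check format: the expression references a column that does not exist in the input, so the check is skipped and reported as a failure in the result columns instead of raising an error. The column names and message are placeholders.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dq_engine = DQEngine(WorkspaceClient())

checks = [
    {
        "criticality": "error",
        "check": {
            "function": "sql_expression",
            "arguments": {
                "expression": "no_such_column > 0",   # cannot be resolved in the input
                "msg": "value must be positive",
            },
        },
    },
]

df = spark.createDataFrame([(1,), (2,)], "id int")

# Instead of failing the whole run, the invalid expression shows up as a check
# failure with a descriptive message in the result columns, and other checks still run.
checked_df = dq_engine.apply_checks_by_metadata(df, checks)
checked_df.show(truncate=False)
```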

## 0.9.3

* Added support for running checks on multiple tables ([#566](https://github.com/databrickslabs/dqx/issues/566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.