v0.12.0
What's Changed
- AI-Assisted rules generation from data profiles (#963). AI-assisted data quality rule generation was added, leveraging summary statistics from a profiler to create rules. The `DQGenerator` class includes a `generate_dq_rules_ai_assisted` method that can generate rules with or without user-provided input, using summary statistics to inform the rule creation process. This method offers flexibility in rule generation, allowing for both automated and user-guided creation of data quality rules.
- Added new checks for JSON validation (#616). DQX now includes three new quality checks for JSON data validation, especially useful for validating data coming from streaming systems such as Kafka:
`is_valid_json`, `has_json_keys`, and `has_valid_json_schema`. The `is_valid_json` check verifies whether values in a specified column are valid JSON strings, while the `has_json_keys` check confirms the presence of specific keys in the outermost JSON object, with an optional parameter to require all keys to be present. The `has_valid_json_schema` check ensures that JSON strings conform to an expected schema, ignoring extra fields not defined in the schema.
- Added geometry row-level checks (#636). New row-level checks for geometry columns cover area and number of points: `is_area_not_less_than`, `is_area_not_greater_than`, `is_area_equal_to`, `is_area_not_equal_to`, `is_num_points_not_less_than`, `is_num_points_not_greater_than`, `is_num_points_equal_to`, and `is_num_points_not_equal_to`. These checks allow users to validate geometric data based on specific criteria, with options to specify the spatial reference system (SRID) and use geodesic area calculations. They are supported on Databricks serverless compute and runtime versions 17.1 and later.
- Added support to write using delta table path (#594). Saving of quality check results has been enhanced to support Unity Catalog Volume paths, S3, ADLS, or GCS in addition to tables, providing more flexibility in storing and managing results. The
`save_results_in_table` method now accepts output configurations with volume paths, and the `OutputConfig` object has been updated to support table names with two- or three-level namespaces, storage paths (including Volume paths, S3, ADLS, and GCS), and optional trigger settings for streaming output. The code also supports saving DataFrames to both Delta tables and storage paths, with the `save_dataframe_as_table` function taking an `output_config` object that determines whether to save the DataFrame to a table or a path. The functionality includes batch and streaming writes, input validation, and error handling; saving to Delta tables behaves as before, with new support added for storage paths.
- Extended aggregation check function to support more aggregation types (#951). The aggregation check function now supports a wide range of aggregate functions, including 20 curated statistical and percentile-based functions as well as any Databricks built-in aggregate function, with runtime validation to ensure compatibility and warnings for non-curated functions. The function now accepts an `aggr_params` parameter to pass parameters to aggregate functions, such as percentile calculations, and supports two-stage aggregation for window-incompatible aggregates like `count_distinct`. It also includes improved error handling, human-readable violation messages, and performance benchmarks for various aggregation scenarios, enabling advanced data quality monitoring and validation for data engineers and analysts.
- Added new is_not_in_list check function (#969). A new check function,
`is_not_in_list`, has been added to verify that values in a specified column are not present in a given list of forbidden values, allowing null values and optional case-insensitive comparisons. The function is suitable for columns that are not of type `MapType` or `StructType`; for optimal performance with large lists of forbidden values, it is recommended to use the `foreign_key` dataset-level check with the `negate` argument set to `True`. The function's parameters include the column to check, the list of forbidden values, and optionally the case sensitivity of the comparison; its implementation includes input validation and custom error messages, with benchmark tests to measure performance.
- Improve Generator to emit temporal checks for min/max date & datetime (#624). The data quality generator now supports temporal checks for columns with datetime and date types, in addition to numeric types. The generator creates rules with the `is_in_range`, `is_not_less_than`, and `is_not_greater_than` functions based on the provided minimum and maximum limits, verifying that both limit values are of the same type to ensure correct comparison. This update preserves the existing numeric behavior and introduces support for timestamp and date checks, while continuing to handle Python numeric types without stringification.
- Improved sql query check function to make merge columns parameter optional (#945). The
`sql_query` check has been enhanced to support both row-level and dataset-level validation, allowing more flexible data validation scenarios. In row-level validation, the check joins query results back to the input data to mark specific rows, whereas in dataset-level validation the check result applies to all rows, making it suitable for aggregate validations with custom metrics. The `merge_columns` parameter is now optional; when it is not provided, the check performs dataset-level validation, offering a convenient way to validate entire datasets without requiring specific column mappings. The check has also been made more robust with input validation and error handling, preventing incorrect usage with informative error messages.
- Outlier detection for numerical values (#944). The `has_no_outliers` function has been introduced to detect outliers in numeric columns using the Median Absolute Deviation (MAD) method, which calculates the lower and upper limits as median - 3.5 * MAD and median + 3.5 * MAD, respectively, and considers values outside these limits as outliers. The function works with numeric columns of type int, float, long, and decimal, and raises an error if the specified column is not of numeric type.
- Library improvements (#966). The library has received updates to improve its functionality, performance, and documentation. The
`has_json_keys` function now treats NULL values as valid, ensuring consistent behavior across ANSI and non-ANSI modes. Saving DataFrames as tables has also been improved, with updated regular expression patterns for table names and enhanced handling of streaming and non-streaming DataFrames.
- Updated has_valid_schema check to accept a reference dataframe or table (#960). The `has_valid_schema` check now supports validation against a reference dataframe or table, in addition to the existing expected schema. Users can verify the schema of their input dataframe against a reference by specifying either the `ref_df_name` or `ref_table` parameter; exactly one of `expected_schema`, `ref_df_name`, or `ref_table` is required. The check can run in strict mode for exact schema matching or in non-strict mode, which permits extra columns, and users can restrict validation to specific columns via the `columns` parameter. The update includes improved parameter validation, ensuring that only one schema source is specified, and new test cases covering reference tables and dataframes as well as the parameter validation logic.
- Updated dashboards deployment to use standard lakeview dashboard definitions (#950). The dashboard installer has been updated to use standard Lakeview dashboard definitions.
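As a quick orientation, the new check functions above are declared like any other DQX rule. The sketch below shows metadata-style definitions in DQX's list-of-dicts (YAML-equivalent) check format; argument names such as `forbidden` and `case_sensitive` are illustrative assumptions, not confirmed API:

```python
# Hedged sketch: metadata-style definitions for two of the checks
# described above. Argument names are illustrative assumptions.
checks = [
    {  # verify values are valid JSON strings (is_valid_json, #616)
        "criticality": "error",
        "check": {
            "function": "is_valid_json",
            "arguments": {"column": "payload"},
        },
    },
    {  # verify values are not in a forbidden list (is_not_in_list, #969)
        "criticality": "warn",
        "check": {
            "function": "is_not_in_list",
            "arguments": {
                "column": "status",
                "forbidden": ["banned", "blocked"],
                "case_sensitive": False,
            },
        },
    },
]

# Such definitions would typically be stored as YAML alongside the
# pipeline or handed to the DQX engine for application to a DataFrame.
for c in checks:
    print(c["check"]["function"])
```

The same dict shape extends to the other new functions; only the `function` name and `arguments` change per check.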
- Added null island geometry check function (#613). A new quality check called `is_not_null_island` has been introduced to verify that values in a specified column are not NULL island geometries, such as POINT(0 0), POINTZ(0 0 0), or POINTZM(0 0 0 0). The `is_not_null_island` function requires Databricks serverless compute or runtime version 17.1 or higher.
- Added float support for range and compare functions (#962). The comparison and validation functions now support float values, in addition to the existing support for integers, dates, timestamps, and strings. The functions accept float values for limit parameters, and range checks remain inclusive of both boundaries, allowing users to specify minimum and maximum limits with decimal points for more precise data validation.
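The MAD rule behind `has_no_outliers` (#944) is easy to sanity-check outside Spark. The following is a minimal plain-Python sketch of the formula stated above (limits at median ± 3.5 × MAD); it mirrors the documented rule, not DQX's actual implementation:

```python
from statistics import median

def mad_limits(values, k=3.5):
    """Lower/upper limits per the MAD rule: median +/- k * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return med - k * mad, med + k * mad

def outliers(values, k=3.5):
    """Values falling outside the MAD limits."""
    lo, hi = mad_limits(values, k)
    return [v for v in values if v < lo or v > hi]

# For [10, 11, 12, 10, 11, 100]: median = 11, MAD = 1,
# so the limits are 7.5 and 14.5 and only 100 is flagged.
print(outliers([10, 11, 12, 10, 11, 100]))  # → [100]
```

MAD-based limits are robust to the very outliers being hunted, which is why they are preferred here over mean/standard-deviation bounds.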
Contributors
@mwojtyczka @ghanse @souravg-db2 @vb-dbrks @alfredzimmer @AdityaMandiwal @Escanor1996 @larsmoan @STEFANOVIVAS @tdikland @cornzyblack @bsr-the-mngrm
Full Changelog: v0.11.1...v0.12.0