v0.12.0
What's Changed
- AI-Assisted rules generation from data profiles (#963). AI-assisted data quality rule generation was added, leveraging summary statistics from a profiler to create rules. The `DQGenerator` class includes a `generate_dq_rules_ai_assisted` method that can generate rules with or without user-provided input, using summary statistics to inform the rule creation process. This method offers flexibility in rule generation, allowing for both automated and user-guided creation of data quality rules.
- Added new checks for JSON validation (#616). DQX now includes three new quality checks for JSON data validation, especially useful for validating data coming from streaming systems such as Kafka:
`is_valid_json`, `has_json_keys`, and `has_valid_json_schema`. The `is_valid_json` check verifies whether values in a specified column are valid JSON strings, while the `has_json_keys` check confirms the presence of specific keys in the outermost JSON object, with an optional parameter to require all keys to be present. The `has_valid_json_schema` check ensures that JSON strings conform to an expected schema, ignoring extra fields not defined in the schema.
- Added geometry row-level checks (#636). New row-level checks for geometry columns cover area and number of points: `is_area_not_less_than`, `is_area_not_greater_than`, `is_area_equal_to`, `is_area_not_equal_to`, `is_num_points_not_less_than`, `is_num_points_not_greater_than`, `is_num_points_equal_to`, and `is_num_points_not_equal_to`. These checks allow users to validate geometric data based on specific criteria, with options to specify the spatial reference system (SRID) and use geodesic area calculations. They are supported on Databricks serverless compute and runtime versions 17.1 and later.
- Added support to write using delta table path (#594). Saving of quality check results has been enhanced to support Unity Catalog Volume paths, S3, ADLS, or GCS in addition to tables, providing more flexibility in storing and managing results. The
`save_results_in_table` method now accepts output configurations with volume paths, and the `OutputConfig` object has been updated to support table names with two- or three-level namespaces, storage paths (including Volume paths, S3, ADLS, and GCS), and optional trigger settings for streaming output. The code also supports saving DataFrames to both Delta tables and storage paths, with the `save_dataframe_as_table` function taking an `output_config` object that determines whether to save the DataFrame to a table or a path. The functionality includes batch and streaming writes, input validation, and error handling; saving to Delta tables behaves as before, with new support added for storage paths.
- Extended aggregation check function to support more aggregation types (#951). The aggregation check function now supports a wide range of aggregate functions, including 20 curated statistical and percentile-based functions as well as any Databricks built-in aggregate function, with runtime validation to ensure compatibility and warnings for non-curated functions. The function now accepts an `aggr_params` parameter to pass parameters to aggregate functions, such as percentile calculations, and supports two-stage aggregation for window-incompatible aggregates like `count_distinct`. It also includes improved error handling, human-readable violation messages, and performance benchmarks for various aggregation scenarios, enabling advanced data quality monitoring and validation for data engineers and analysts.
- Added new is_not_in_list check function (#969). A new check function,
`is_not_in_list`, has been added to verify that values in a specified column are not present in a given list of forbidden values, allowing null values and optional case-insensitive comparisons. The function is suitable for columns that are not of type `MapType` or `StructType`; for optimal performance with large lists of forbidden values, it is recommended to use the `foreign_key` dataset-level check with the `negate` argument set to `True`. The function's parameters include the column to check, the list of forbidden values, and optionally the case sensitivity of the comparison; its implementation includes input validation and custom error messages, with benchmark tests to measure performance.
- Improve Generator to emit temporal checks for min/max date & datetime (#624). The data quality generator now supports temporal checks for columns with datetime and date types, in addition to numeric types. The generator creates rules with the `is_in_range`, `is_not_less_than`, and `is_not_greater_than` functions based on the provided minimum and maximum limits, verifying that both limit values are of the same type to ensure correct comparison. This update preserves the existing numeric behavior and introduces support for timestamp and date checks, while continuing to handle Python numeric types without stringification.
- Improved sql query check function to make merge columns parameter optional (#945). The
`sql_query` check has been enhanced to support both row-level and dataset-level validation, allowing more flexible data validation scenarios. In row-level validation, the check joins query results back to the input data to mark specific rows, whereas in dataset-level validation the check result applies to all rows, making it suitable for aggregate validations with custom metrics. The `merge_columns` parameter is now optional; when it is not provided, the check performs dataset-level validation, offering a convenient way to validate entire datasets without requiring specific column mappings. The check has also been made more robust with input validation and error handling, preventing incorrect usage with informative error messages.
- Outlier detection for numerical values (#944). The `has_no_outliers` function has been introduced to detect outliers in numeric columns using the Median Absolute Deviation (MAD) method, which calculates the lower and upper limits as median - 3.5 * MAD and median + 3.5 * MAD, respectively, and considers values outside these limits as outliers. The function works with numeric columns of type int, float, long, and decimal, and raises an error if the specified column is not of numeric type.
- Library improvements (#966). The library has received updates to improve its functionality, performance, and documentation. The
`has_json_keys` function now treats NULL values as valid, ensuring consistent behavior across ANSI and non-ANSI modes. Saving DataFrames as tables has also been improved, with updated regular expression patterns for table names and enhanced handling of streaming and non-streaming DataFrames.
- Updated has_valid_schema check to accept a reference dataframe or table (#960). The `has_valid_schema` check now supports validation against a reference dataframe or table, in addition to the existing expected schema. Users can verify the schema of their input dataframe against a reference by specifying either the `ref_df_name` or `ref_table` parameter; exactly one of `expected_schema`, `ref_df_name`, or `ref_table` is required. The check can run in strict mode for exact schema matching or in non-strict mode, which permits extra columns, and users can restrict validation to specific columns via the `columns` parameter. The update includes improved parameter validation, ensuring that only one schema source is specified, and new test cases covering reference tables and dataframes as well as the parameter validation logic.
- Updated dashboards deployment to use standard lakeview dashboard definitions (#950). The dashboard installer has been updated to use standard Lakeview dashboard definitions.
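As a quick orientation, the new check functions above are declared like any other DQX rule. The sketch below shows metadata-style definitions in DQX's list-of-dicts (YAML-equivalent) check format; argument names such as `forbidden` and `case_sensitive` are illustrative assumptions, not confirmed API:

```python
# Hedged sketch: metadata-style definitions for two of the checks
# described above. Argument names are illustrative assumptions.
checks = [
    {  # verify values are valid JSON strings (is_valid_json, #616)
        "criticality": "error",
        "check": {
            "function": "is_valid_json",
            "arguments": {"column": "payload"},
        },
    },
    {  # verify values are not in a forbidden list (is_not_in_list, #969)
        "criticality": "warn",
        "check": {
            "function": "is_not_in_list",
            "arguments": {
                "column": "status",
                "forbidden": ["banned", "blocked"],
                "case_sensitive": False,
            },
        },
    },
]

# Such definitions would typically be stored as YAML alongside the
# pipeline or handed to the DQX engine for application to a DataFrame.
for c in checks:
    print(c["check"]["function"])
```

The same dict shape extends to the other new functions; only the `function` name and `arguments` change per check.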
- Added null island geometry check function (#613). A new quality check called `is_not_null_island` has been introduced to verify that values in a specified column are not NULL island geometries, such as POINT(0 0), POINTZ(0 0 0), or POINTZM(0 0 0 0). The `is_not_null_island` function requires Databricks serverless compute or runtime version 17.1 or higher.
- Added float support for range and compare functions (#962). The comparison and validation functions now support float values, in addition to the existing support for integers, dates, timestamps, and strings. The functions accept float values for limit parameters, and range checks remain inclusive of both boundaries, allowing users to specify minimum and maximum limits with decimal points for more precise data validation.
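The MAD rule behind `has_no_outliers` (#944) is easy to sanity-check outside Spark. The following is a minimal plain-Python sketch of the formula stated above (limits at median ± 3.5 × MAD); it mirrors the documented rule, not DQX's actual implementation:

```python
from statistics import median

def mad_limits(values, k=3.5):
    """Lower/upper limits per the MAD rule: median +/- k * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return med - k * mad, med + k * mad

def outliers(values, k=3.5):
    """Values falling outside the MAD limits."""
    lo, hi = mad_limits(values, k)
    return [v for v in values if v < lo or v > hi]

# For [10, 11, 12, 10, 11, 100]: median = 11, MAD = 1,
# so the limits are 7.5 and 14.5 and only 100 is flagged.
print(outliers([10, 11, 12, 10, 11, 100]))  # → [100]
```

MAD-based limits are robust to the very outliers being hunted, which is why they are preferred here over mean/standard-deviation bounds.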
Contributors
@mwojtyczka @ghanse @souravg-db2 @vb-dbrks @alfredzimmer @AdityaMandiwal @Escanor1996 @larsmoan @STEFANOVIVAS @tdikland @cornzyblack @bsr-the-mngrm
Full Changelog: v0.11.1...v0.12.0