Update has_valid_schema check to accept a reference dataframe or table
#960
Conversation
✅ 482/482 passed, 1 flaky, 41 skipped, 3h37m9s total
Running from acceptance #3373
Pull request overview
This PR adds support for specifying a reference table as the source of the expected schema in the has_valid_schema check function. Instead of providing an explicit schema string or StructType, users can now pass a ref_table parameter pointing to a table in the catalog, and the check will load that table's schema as the expected schema.
Key changes:
- Added `ref_table` parameter to `has_valid_schema` function as an alternative to `expected_schema`
- Added validation to ensure exactly one of `expected_schema` or `ref_table` is specified
- Updated the apply method signature to accept a `spark` parameter for loading reference tables
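The mutually exclusive parameter validation described above can be sketched as follows. This is an illustrative stand-in, not the actual dqx implementation; only the parameter names (`expected_schema`, `ref_table`) and the `InvalidParameterError` exception are taken from the PR description.

```python
# Sketch of "exactly one of expected_schema or ref_table" validation.
# Names follow the PR description; the real dqx code may differ.

class InvalidParameterError(ValueError):
    """Raised when check parameters are inconsistent."""

def resolve_schema_source(expected_schema=None, ref_table=None):
    """Return (source_name, value) for the single schema source given."""
    sources = {"expected_schema": expected_schema, "ref_table": ref_table}
    provided = [name for name, value in sources.items() if value is not None]
    if len(provided) != 1:
        raise InvalidParameterError(
            f"Specify exactly one of {sorted(sources)}; got {provided or 'none'}"
        )
    name = provided[0]
    return name, sources[name]
```

A caller passing both parameters (or neither) fails fast instead of silently preferring one source.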
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Added ref_table parameter, validation logic, and schema loading from reference table |
| tests/unit/test_dataset_checks.py | Added unit tests for parameter validation |
| tests/integration/test_dataset_checks.py | Updated all test calls to pass spark parameter and added integration test for ref_table functionality |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation with examples of using ref_table parameter |
mwojtyczka left a comment
LGTM
Changes
This PR introduces reference datasets (either tables or dataframes) for the `has_valid_schema` check function. The behavior is as follows:
- If `ref_dfs` is created in-code and `ref_df_name` is specified, the valid schema will be determined from the reference dataframe
- If `ref_table` is specified, the valid schema will be determined by loading the reference table as a Spark dataframe

Specifying multiple valid schema sources (e.g. `expected_schema` and `ref_df_name` or `ref_table`) will raise an `InvalidParameterError`.
Linked issues
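The schema-resolution behavior described above can be sketched as a single dispatch over the three sources. This is a minimal illustration, not the actual dqx code; the `spark.table(...).schema` and `ref_dfs` lookup are assumed from the PR description, and the stub objects below exist only to make the flow concrete.

```python
# Illustrative sketch (not the actual dqx implementation) of resolving
# the expected schema from exactly one of the three sources the PR
# describes: an explicit schema, a registered reference dataframe, or
# a reference table loaded via Spark.

def resolve_expected_schema(spark, expected_schema=None, ref_table=None,
                            ref_df_name=None, ref_dfs=None):
    """Return the expected schema from whichever single source is set."""
    if ref_table is not None:
        # Load the reference table as a dataframe and take its schema
        return spark.table(ref_table).schema
    if ref_df_name is not None:
        # Look up an in-code dataframe registered under ref_df_name
        return ref_dfs[ref_df_name].schema
    # Fall back to the explicit schema (string or StructType)
    return expected_schema
```

With a real `SparkSession`, `spark.table(ref_table)` would read the catalog table, so the check always compares against the table's current schema rather than a copy that can drift.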
Resolves #959
Tests