@ghanse ghanse commented Dec 5, 2025

Changes

This PR introduces reference datasets (either tables or dataframes) for the has_valid_schema check function.

The behavior is as follows:

  • When ref_dfs is passed in code and ref_df_name is specified, the valid schema is taken from the named reference dataframe
  • When ref_table is specified, the valid schema is taken from the reference table, which is loaded as a Spark dataframe

Specifying multiple valid schema sources (e.g. expected_schema and ref_df_name or ref_table) will raise an InvalidParameterError.
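The mutual-exclusion rule described above can be sketched roughly as follows. This is a standalone illustration, not the code merged in this PR: `pick_schema_source` is a hypothetical helper, and `InvalidParameterError` is redefined locally as a stand-in for the DQX error type. Rejecting the case where no source is given is also an assumption.

```python
from typing import Any, Optional, Tuple


class InvalidParameterError(ValueError):
    """Local stand-in for the error type named in the PR description."""


def pick_schema_source(
    expected_schema: Optional[Any] = None,
    ref_df_name: Optional[str] = None,
    ref_table: Optional[str] = None,
) -> Tuple[str, Any]:
    """Return (source_name, value) for the single schema source that was set.

    Hypothetical helper: raises if zero or several sources are provided.
    """
    candidates = {
        "expected_schema": expected_schema,
        "ref_df_name": ref_df_name,
        "ref_table": ref_table,
    }
    # Keep only the sources the caller actually supplied.
    provided = {name: value for name, value in candidates.items() if value is not None}
    if len(provided) != 1:
        raise InvalidParameterError(
            "Specify exactly one of expected_schema, ref_df_name, or ref_table; "
            f"got: {sorted(provided) or 'none'}"
        )
    return next(iter(provided.items()))
```

With this sketch, `pick_schema_source(ref_table="main.ref.orders")` returns the table name tagged with its source, while passing both `expected_schema` and `ref_table` raises the error.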

Linked issues

Resolves #959

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

github-actions bot commented Dec 5, 2025

✅ 482/482 passed, 1 flaky, 41 skipped, 3h37m9s total

Flaky tests:

  • 🤪 test_e2e_workflow_serverless (9m44.999s)

Running from acceptance #3373

@ghanse ghanse requested a review from Copilot December 5, 2025 21:03
Copilot AI left a comment

Pull request overview

This PR adds support for specifying a reference table as the source of the expected schema in the has_valid_schema check function. Instead of providing an explicit schema string or StructType, users can now pass a ref_table parameter pointing to a table in the catalog, and the check will load that table's schema as the expected schema.

Key changes:

  • Added ref_table parameter to has_valid_schema function as an alternative to expected_schema
  • Added validation to ensure exactly one of expected_schema or ref_table is specified
  • Updated the apply method signature to accept a spark parameter for loading reference tables

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/check_funcs.py | Added ref_table parameter, validation logic, and schema loading from reference table |
| tests/unit/test_dataset_checks.py | Added unit tests for parameter validation |
| tests/integration/test_dataset_checks.py | Updated all test calls to pass spark parameter and added integration test for ref_table functionality |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation with examples of using ref_table parameter |

@ghanse ghanse requested a review from mwojtyczka December 8, 2025 22:48
@ghanse ghanse changed the title Update has_valid_schema check to accept a reference table Update has_valid_schema check to accept a reference DataFrame or table Dec 8, 2025
@ghanse ghanse changed the title Update has_valid_schema check to accept a reference DataFrame or table Update has_valid_schema check to accept a reference dataframe or table Dec 8, 2025
@ghanse ghanse changed the title Update has_valid_schema check to accept a reference dataframe or table Update has_valid_schema check to accept a reference dataframe or table Dec 8, 2025
@ghanse ghanse requested a review from Copilot December 8, 2025 22:59
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


@mwojtyczka mwojtyczka left a comment

LGTM

@mwojtyczka mwojtyczka merged commit 293c764 into main Dec 9, 2025
16 checks passed
@mwojtyczka mwojtyczka deleted the has_valid_schema branch December 9, 2025 21:59
Development

Successfully merging this pull request may close these issues.

[FEATURE]: Allow schema validation against a reference table

3 participants