
Commit 76aed0b

Authored by vb-dbrks, ghanse, mwojtyczka, and Copilot
Made merge columns optional in sql query check (#945)
### PR summary

1. Made `sql_query` support an optional/empty `merge_columns`, documenting and demoing the two modes (row-level with `merge_columns` vs. dataset-level). Added notebooks/docs/tests to prove the new behaviour, including row filters, metrics observers, ref DataFrame support, negation, and performance comparisons.
2. **Follow-up polish:** centralised the dataset-level logic into `_apply_dataset_level_sql_check`, reused the new path in integration tests, and tightened unit expectations so helpers (like `DQDatasetRule`) stay type-stable.
3. **Hardening + helper cleanup:** reintroduced merge-column validation to fail fast on bad inputs, forced dataset-level queries to error if they emit more than one row, and added the shared `build_quality_violation` helper so all integration expectations reuse the same metadata factory. Also refreshed formatting/linting.

**TL;DR:** `sql_query` now covers both dataset-level and row-level scenarios with clear validation and deterministic outputs, plus comprehensive tests and docs to back it up.

### Linked issues

Resolves #938

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests
- [ ] added end-to-end tests
- [ ] added performance tests

---------

Co-authored-by: Greg Hansen <[email protected]>
Co-authored-by: Marcin Wojtyczka <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: mwojtyczka <[email protected]>
1 parent 7c2253b commit 76aed0b
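The summary above describes two execution modes for `sql_query` plus new fail-fast validation. As a quick orientation before the diff below, here is a minimal sketch of both modes, reusing the `DQDatasetRule`/`sql_query` API from the demo notebook changed in this commit. The query bodies are illustrative, and the exact error raised for a multi-row dataset-level result is an assumption (the summary only states that such queries now error).

# Minimal sketch of the two sql_query modes (illustrative queries; see the demo diff below).
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx.check_funcs import sql_query

# Row-level mode: merge_columns joins the query result back to individual input rows.
# Per the summary, invalid merge_columns inputs now fail fast at validation time.
row_level_check = DQDatasetRule(
    criticality="warn",
    check_func=sql_query,
    check_func_kwargs={
        "query": "SELECT sensor_id, reading > 100 AS condition FROM {{ sensor }}",  # illustrative
        "merge_columns": ["sensor_id"],  # row-level: results joined back by sensor_id
        "condition_column": "condition",
        "input_placeholder": "sensor",
    },
)

# Dataset-level mode: omit merge_columns; the query must emit exactly one row, and
# that row's condition applies to every input row. A query returning more than one
# row is expected to raise at apply time (exception type assumed, not confirmed).
dataset_level_check = DQDatasetRule(
    criticality="warn",
    check_func=sql_query,
    check_func_kwargs={
        "query": "SELECT COUNT(*) = 0 AS condition FROM {{ sensor }}",  # illustrative
        "condition_column": "condition",
        "input_placeholder": "sensor",
    },
)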

File tree

9 files changed: +966 −172 lines changed

demos/dqx_demo_library.py

Lines changed: 66 additions & 3 deletions
@@ -943,11 +943,15 @@ def not_ends_with(column: str, suffix: str) -> Column:
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC #### Using `sql_query` check
+# MAGIC #### Using `sql_query` check - Row-level validation
+# MAGIC
+# MAGIC The `sql_query` check supports two modes:
+# MAGIC - **Row-level validation** (with `merge_columns`): Query results are joined back to mark specific rows
+# MAGIC - **Dataset-level validation** (without `merge_columns`): Check result applies to all rows
 
 # COMMAND ----------
 
-# using DQX classes
+# Row-level validation example: Check each sensor against its threshold
 from databricks.labs.dqx.rule import DQDatasetRule
 from databricks.labs.dqx.check_funcs import sql_query
 
@@ -973,7 +977,7 @@ def not_ends_with(column: str, suffix: str) -> Column:
         check_func=sql_query,
         check_func_kwargs={
             "query": query,
-            "merge_columns": ["sensor_id"],
+            "merge_columns": ["sensor_id"],  # Results joined back by sensor_id
             "condition_column": "condition",  # the check fails if this column evaluates to True
             "msg": "one of the sensor reading is greater than limit",
             "name": "sensor_reading_check",
@@ -990,6 +994,41 @@ def not_ends_with(column: str, suffix: str) -> Column:
 
 # COMMAND ----------
 
+# MAGIC %md
+# MAGIC #### Using `sql_query` check - Dataset-level validation
+# MAGIC
+# MAGIC When `merge_columns` is not provided, the check applies to all rows (all pass or all fail together).
+# MAGIC This is useful for dataset-level aggregate validations.
+
+# COMMAND ----------
+
+# Dataset-level validation example: Check total sensor count
+dataset_query = """
+SELECT COUNT(DISTINCT sensor_id) < 1 AS condition
+FROM {{ sensor }}
+"""
+
+checks = [
+    DQDatasetRule(
+        criticality="warn",
+        check_func=sql_query,
+        check_func_kwargs={
+            "query": dataset_query,
+            # No merge_columns = dataset-level check (all rows get same result)
+            "condition_column": "condition",
+            "msg": "Dataset has no sensors",
+            "name": "dataset_has_sensors",
+            "input_placeholder": "sensor",
+        },
+    ),
+]
+
+ref_dfs = {"sensor_specs": sensor_specs_df}
+valid_and_quarantine_df = dq_engine.apply_checks(sensor_df, checks, ref_dfs=ref_dfs)
+display(valid_and_quarantine_df)
+
+# COMMAND ----------
+
 # using YAML declarative approach
 checks = yaml.safe_load(
     """
@@ -1028,6 +1067,30 @@ def not_ends_with(column: str, suffix: str) -> Column:
 
 # COMMAND ----------
 
+# YAML example for dataset-level validation (without merge_columns)
+checks_dataset_level = yaml.safe_load(
+    """
+- criticality: warn
+  check:
+    function: sql_query
+    arguments:
+      # No merge_columns = dataset-level validation
+      condition_column: condition
+      msg: Dataset has no sensors
+      name: dataset_has_sensors
+      input_placeholder: sensor
+      query: |
+        SELECT COUNT(DISTINCT sensor_id) < 1 AS condition
+        FROM {{ sensor }}
+    """
+)
+
+ref_dfs = {"sensor_specs": sensor_specs_df}
+valid_and_quarantine_df = dq_engine.apply_checks_by_metadata(sensor_df, checks_dataset_level, ref_dfs=ref_dfs)
+display(valid_and_quarantine_df)
+
+# COMMAND ----------
+
 # MAGIC %md
 # MAGIC #### Defining custom python dataset-level check

docs/dqx/docs/reference/benchmarks.mdx

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@ sidebar_position: 13
 | test_benchmark_compare_datasets | 3.598445 | 3.556993 | 3.430710 | 3.793938 | 0.158157 | 0.280218 | 3.466942 | 3.747160 | 5 | 0 | 2 | 0.28 |
 | test_benchmark_foreach_compare_datasets[n_rows_100000000_n_columns_5] | 25.879615 | 25.919933 | 25.536855 | 26.071184 | 0.217230 | 0.307223 | 25.748681 | 26.055904 | 5 | 0 | 1 | 0.04 |
 | test_benchmark_foreach_foreign_key[n_rows_100000000_n_columns_5] | 24.264873 | 22.893218 | 20.587308 | 29.037093 | 4.062789 | 7.705522 | 20.652819 | 28.358341 | 5 | 0 | 1 | 0.04 |
+| test_benchmark_foreach_has_no_outliers[n_rows_100000000_n_columns_5] | 22.524313 | 22.347593 | 22.104944 | 22.924248 | 0.374170 | 0.646915 | 22.271984 | 22.918899 | 5 | 0 | 3 | 0.04 |
 | test_benchmark_foreach_has_valid_schema[n_rows_100000000_n_columns_5] | 1.068582 | 1.050490 | 0.979350 | 1.219259 | 0.092674 | 0.112164 | 1.003924 | 1.116088 | 5 | 0 | 1 | 0.94 |
 | test_benchmark_foreach_is_aggr_equal[n_rows_100000000_n_columns_5] | 1.239298 | 1.213153 | 1.192442 | 1.341836 | 0.060654 | 0.068928 | 1.200719 | 1.269646 | 5 | 0 | 1 | 0.81 |
 | test_benchmark_foreach_is_aggr_not_equal[n_rows_100000000_n_columns_5] | 1.264898 | 1.250273 | 1.218577 | 1.345211 | 0.051090 | 0.071957 | 1.225905 | 1.297862 | 5 | 0 | 1 | 0.79 |
@@ -54,6 +55,7 @@ sidebar_position: 13
 | test_benchmark_foreach_sql_query[n_rows_100000000_n_columns_5] | 4.578799 | 4.602143 | 4.442396 | 4.644892 | 0.083901 | 0.113694 | 4.530776 | 4.644470 | 5 | 0 | 1 | 0.22 |
 | test_benchmark_foreign_key | 31.784272 | 31.787610 | 31.414708 | 32.123221 | 0.269713 | 0.386951 | 31.597198 | 31.984149 | 5 | 0 | 2 | 0.03 |
 | test_benchmark_has_dimension | 0.215338 | 0.213285 | 0.210530 | 0.223131 | 0.005056 | 0.007086 | 0.211819 | 0.218905 | 5 | 0 | 1 | 4.64 |
+| test_benchmark_has_no_outliers | 0.234952 | 0.228169 | 0.224165 | 0.257274 | 0.013649 | 0.017354 | 0.225936 | 0.243290 | 5 | 0 | 1 | 4.26 |
 | test_benchmark_has_valid_schema | 0.172078 | 0.172141 | 0.163793 | 0.181081 | 0.006715 | 0.009295 | 0.167010 | 0.176305 | 6 | 0 | 2 | 5.81 |
 | test_benchmark_has_x_coordinate_between | 0.217192 | 0.213656 | 0.209310 | 0.236233 | 0.011150 | 0.012638 | 0.209410 | 0.222048 | 5 | 0 | 1 | 4.60 |
 | test_benchmark_has_y_coordinate_between | 0.218497 | 0.219630 | 0.209352 | 0.234111 | 0.010103 | 0.013743 | 0.209584 | 0.223327 | 5 | 0 | 1 | 4.58 |
