[MAINTENANCE] Optimize `ExpectColumnDistinctValuesToBeInSet` with database-pushed comparison #11614

NathanFarmer · 2026-01-23T20:21:00Z

Description

Optimizes ExpectColumnDistinctValuesToBeInSet to push comparison logic to the database instead of fetching all distinct values into memory.

Changes

Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
Refactor expectation to use new metrics instead of fetching all distinct values
Add Metrics API wrapper classes for new metrics
Respects result_format settings
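
As a rough illustration of what pushing the comparison to the database means, the sketch below uses the standard-library `sqlite3` module rather than the project's own metric providers. The function names, the `events` table, and the `status` column are hypothetical and are not taken from this PR; the point is only that the database returns a violation count and a small `LIMIT`ed sample instead of every distinct value.

```python
import sqlite3


def count_distinct_not_in_set(conn, table, column, value_set):
    """Count distinct non-null values of `column` that are absent from `value_set`."""
    # Identifiers are interpolated directly; this sketch assumes they are trusted.
    placeholders = ", ".join("?" for _ in value_set)
    query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders})"
        f")"
    )
    return conn.execute(query, list(value_set)).fetchone()[0]


def sample_distinct_not_in_set(conn, table, column, value_set, limit=20):
    """Fetch at most `limit` distinct violating values, ordered for repeatability."""
    placeholders = ", ".join("?" for _ in value_set)
    query = (
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders}) "
        f"ORDER BY {column} LIMIT ?"
    )
    rows = conn.execute(query, [*value_set, limit]).fetchall()
    return [row[0] for row in rows]


# Tiny in-memory demo: "c" and "d" are the only distinct values outside the set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("d",), (None,)],
)
violation_count = count_distinct_not_in_set(conn, "events", "status", ["a", "b"])
violation_sample = sample_distinct_not_in_set(conn, "events", "status", ["a", "b"], limit=1)
```

Only two small results cross the wire here, regardless of how many distinct values the column holds.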

…hed comparison
- Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
- Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
- Refactor expectation to use new metrics instead of fetching all distinct values
- Add Metrics API wrapper classes for new metrics
- Respects result_format and partial_unexpected_count settings

netlify · 2026-01-23T20:21:05Z

✅ Deploy Preview for niobium-lead-7998 canceled.

Name	Link
🔨 Latest commit	`c6e0f0a`
🔍 Latest deploy log	https://app.netlify.com/projects/niobium-lead-7998/deploys/697d396f26abaf000857e83d

codecov · 2026-01-23T20:22:26Z

Codecov Report

❌ Patch coverage is 89.89899% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.14%. Comparing base (8d8970b) to head (c6e0f0a).
⚠️ Report is 1 commits behind head on develop.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...column_aggregate_metrics/column_distinct_values.py	90.00%	7 Missing ⚠️
...core/expect_column_distinct_values_to_be_in_set.py	75.00%	2 Missing ⚠️
...re/expect_column_distinct_values_to_contain_set.py	75.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop   #11614      +/-   ##
===========================================
- Coverage    84.14%   84.14%   -0.01%     
===========================================
  Files          465      467       +2     
  Lines        39364    39463      +99     
===========================================
+ Hits         33124    33207      +83     
- Misses        6240     6256      +16
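
The headline figures in this report can be re-derived from the raw counts. The sketch below is only an arithmetic check; the figure of 99 coverable patch lines is inferred from "10 lines missing" at 89.89899% and is an assumption, not something the report states directly.

```python
def percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines over coverable lines."""
    return 100.0 * hits / lines


# Patch coverage: 10 missing lines out of an inferred 99 coverable patch lines.
patch_lines = 99
patch_missing = 10
patch_coverage = percent(patch_lines - patch_missing, patch_lines)

# Project coverage before and after, taken from the Hits and Lines rows above.
base_coverage = percent(33124, 39364)
head_coverage = percent(33207, 39463)
coverage_delta = head_coverage - base_coverage
```

The project-level change is far smaller than one hundredth of a percentage point, which is why base and head display the same value.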

Flag	Coverage Δ
3.10	`72.72% <52.52%> (-0.07%)`	⬇️
3.10 athena	`41.48% <38.38%> (-0.01%)`	⬇️
3.10 aws_deps	`45.85% <38.38%> (-0.02%)`	⬇️
3.10 big	`55.22% <44.44%> (-0.03%)`	⬇️
3.10 bigquery	`50.64% <77.77%> (+0.06%)`	⬆️
3.10 clickhouse	`41.49% <38.38%> (-0.01%)`	⬇️
3.10 databricks	`52.38% <67.67%> (+0.03%)`	⬆️
3.10 filesystem	`63.98% <52.52%> (-0.03%)`	⬇️
3.10 gx-redshift	`50.76% <67.67%> (+0.04%)`	⬆️
3.10 mssql	`50.87% <67.67%> (+0.04%)`	⬆️
3.10 mysql	`51.28% <67.67%> (+0.16%)`	⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds	`59.22% <67.67%> (+0.02%)`	⬆️
3.10 postgresql	`54.67% <67.67%> (+0.03%)`	⬆️
3.10 snowflake	`53.20% <67.67%> (+0.03%)`	⬆️
3.10 spark	`55.32% <54.54%> (-0.01%)`	⬇️
3.10 spark_connect	`46.36% <38.38%> (-0.03%)`	⬇️
3.10 trino	`48.18% <38.38%> (-0.03%)`	⬇️
3.11	`72.72% <52.52%> (-0.07%)`	⬇️
3.11 athena	`?`
3.11 aws_deps	`?`
3.11 big	`?`
3.11 clickhouse	`?`
3.11 filesystem	`?`
3.11 mssql	`?`
3.11 mysql	`?`
3.11 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.11 spark_connect	`?`
3.12	`72.73% <52.52%> (-0.07%)`	⬇️
3.12 athena	`?`
3.12 aws_deps	`?`
3.12 big	`?`
3.12 mssql	`?`
3.12 mysql	`?`
3.12 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.13	`72.73% <52.52%> (-0.07%)`	⬇️
3.13 athena	`41.48% <38.38%> (-0.01%)`	⬇️
3.13 aws_deps	`45.85% <38.38%> (-0.02%)`	⬇️
3.13 big	`55.22% <44.44%> (-0.03%)`	⬇️
3.13 bigquery	`50.65% <77.77%> (+0.06%)`	⬆️
3.13 clickhouse	`41.49% <38.38%> (-0.01%)`	⬇️
3.13 databricks	`52.38% <67.67%> (+0.03%)`	⬆️
3.13 filesystem	`63.98% <52.52%> (-0.03%)`	⬇️
3.13 gx-redshift	`50.76% <67.67%> (+0.04%)`	⬆️
3.13 mssql	`50.87% <67.67%> (+0.04%)`	⬆️
3.13 mysql	`51.28% <67.67%> (+0.16%)`	⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds	`59.22% <67.67%> (+0.02%)`	⬆️
3.13 postgresql	`54.68% <67.67%> (+0.03%)`	⬆️
3.13 snowflake	`53.21% <67.67%> (+0.03%)`	⬆️
3.13 spark	`55.32% <54.54%> (-0.01%)`	⬇️
3.13 spark_connect	`46.36% <38.38%> (-0.03%)`	⬇️
3.13 trino	`48.18% <38.38%> (-0.03%)`	⬇️
cloud	`0.00% <0.00%> (ø)`
docs-basic	`58.62% <38.38%> (-0.06%)`	⬇️
docs-creds-needed	`57.42% <38.38%> (-0.05%)`	⬇️
docs-spark	`56.72% <38.38%> (-0.05%)`	⬇️

…rcion
- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches

…unt on failure
- Filter out null/NaN values from observed_value_set before comparison
- Only include unexpected_count in result when there are violations
- Fixes test failures for null handling and result format
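
A minimal sketch of the null/NaN filtering described in this commit, assuming the observed values are hashable Python objects. The helper names `is_null_like` and `find_violations` are hypothetical and do not come from the PR.

```python
import math
from typing import Any, Iterable, Set


def is_null_like(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def find_violations(observed: Iterable[Any], value_set: Iterable[Any]) -> Set[Any]:
    """Return observed values outside value_set, ignoring null-like entries."""
    allowed = set(value_set)
    return {v for v in observed if not is_null_like(v) and v not in allowed}
```

NaN needs explicit handling because it never compares equal to itself, so a plain set-membership test would always flag it as a violation.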

- Add helper method to coerce string value_set to DATE objects for BigQuery DATE columns
- BigQuery doesn't support DATE NOT IN UNNEST(ARRAY<STRING>)
- Convert string date values to date objects before SQL query
- Use date.fromisoformat() instead of datetime.strptime() to avoid DTZ007 warning
- Fix line length issues
- Fixes test failure for test_dates_with_str_value_set on BigQuery
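
The coercion step might look roughly like the following. This is a sketch under stated assumptions, not the PR's `_coerce_value_set_for_bigquery_date` method: the function name is hypothetical, and passing through strings that are not ISO dates is a choice made here for illustration.

```python
from datetime import date
from typing import Any, List


def coerce_value_set_to_dates(value_set: List[Any]) -> List[Any]:
    """Convert ISO-formatted date strings to date objects; leave other values as-is."""
    coerced: List[Any] = []
    for value in value_set:
        if isinstance(value, str):
            try:
                coerced.append(date.fromisoformat(value))
            except ValueError:
                # Not an ISO date string: keep the original value (assumed behavior).
                coerced.append(value)
        else:
            coerced.append(value)
    return coerced
```

For example, `coerce_value_set_to_dates(["2024-01-15"])` yields a list holding `date(2024, 1, 15)`, which can then be bound as a DATE parameter instead of a string.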

- Add explicit List[Any] type annotation to coerced_value_set to fix type checker error
- Add _coerce_value_set_for_bigquery_date method to ColumnDistinctValuesNotInSet class
- Fix kwargs passing in _sqlalchemy method to include _metrics
- Fixes type checker errors for BigQuery DATE coercion

- Update _sqlalchemy method signature to include metrics and runtime_configuration parameters
- Extract dialect from execution_engine and pass to helper method along with metrics
- Fixes type checker error about missing _coerce_value_set_for_bigquery_date method

- Test ColumnDistinctValuesNotInSetCount for all data sources
- Test ColumnDistinctValuesNotInSet for all data sources
- Tests cover: all values in set, some values not in set, no values in set
- Tests verify limit parameter works correctly

- Add test_dates_all_in_set for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Tests both ColumnDistinctValuesNotInSet and ColumnDistinctValuesNotInSetCount
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery

…pected_list
- observed_value is now None (semantically correct - it's not observed values)
- Violations go in partial_unexpected_list (limited by partial_unexpected_count)
- unexpected_count always included when not BOOLEAN_ONLY
- Renderer returns '--' for observed value since it's None
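
The result shape this commit describes could be assembled along these lines. The `build_result` helper, its parameters, and the default of 20 are hypothetical; only the three result keys and the `BOOLEAN_ONLY` format name come from the commit message.

```python
from typing import Any, Dict, List


def build_result(
    violations_sample: List[Any],
    violation_count: int,
    result_format: str = "BASIC",
    partial_unexpected_count: int = 20,
) -> Dict[str, Any]:
    """Assemble a validation result in which violations live in partial_unexpected_list."""
    success = violation_count == 0
    if result_format == "BOOLEAN_ONLY":
        # Only the pass/fail flag is reported in this format.
        return {"success": success, "result": {}}
    return {
        "success": success,
        "result": {
            "observed_value": None,  # violations are not "observed values"
            "unexpected_count": violation_count,
            "partial_unexpected_list": violations_sample[:partial_unexpected_count],
        },
    }
```

Keeping the full count separate from the truncated list lets a caller see how many violations exist even when only a few are returned.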

NathanFarmer changed the title ~~Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison~~ [MAINTENANCE] Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison Jan 23, 2026

NathanFarmer changed the title ~~[MAINTENANCE] Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison~~ [MAINTENANCE] Optimize ExpectColumnDistinctValuesToBeInSet with database-pushed comparison Jan 23, 2026

NathanFarmer and others added 25 commits January 23, 2026 13:26

Fix circular import: move ValidationDependencies to TYPE_CHECKING block

01e0717

- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation
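
The general pattern behind this fix can be sketched as follows. `fractions.Fraction` stands in for `ValidationDependencies`, and `ExampleExpectation` is a hypothetical class; the sketch also assumes `from __future__ import annotations`, which is one way to keep the annotation unquoted and may not be what the PR does.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by the type checker; skipped at runtime, so it cannot
    # take part in an import cycle.
    from fractions import Fraction  # stand-in for ValidationDependencies


class ExampleExpectation:
    def get_dependencies(self) -> Fraction:  # unquoted annotation
        # Deferred import: resolved when the method is called, by which
        # time both modules have finished loading.
        from fractions import Fraction

        return Fraction(1, 2)
```

The module-level name is never bound at runtime, so the class can be defined even when importing the annotated type at the top of the file would fail.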

Remove unused type: ignore comment on execute_query

cbd1b01

Remove PR_ORGANIZATION.md from branch

6a681b7

Fix type error: handle partial_unexpected_count type safely

15b28c5

- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str
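
A type-safe conversion of this kind might look like the sketch below. The helper name, the fallback default of 20, and the clamping of negative values to zero are assumptions made for illustration, not details from the PR.

```python
from typing import Any

DEFAULT_PARTIAL_UNEXPECTED_COUNT = 20  # assumed default for this sketch


def resolve_partial_unexpected_count(
    raw: Any, default: int = DEFAULT_PARTIAL_UNEXPECTED_COUNT
) -> int:
    """Return a non-negative int, falling back to `default` for unusable input."""
    # bool is a subclass of int, so reject it before the int check.
    if isinstance(raw, bool):
        return default
    if isinstance(raw, int):
        return max(raw, 0)
    if isinstance(raw, str):
        try:
            return max(int(raw), 0)
        except ValueError:
            return default
    # Anything else (None, float, list, ...) is neither int nor str.
    return default
```

Narrowing with `isinstance` before calling `int()` is what lets a static type checker accept the conversion.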

Fix BigQuery dialect detection in _coerce_value_set_for_bigquery_date

d16c110

- Use hasattr(dialect, 'BigQueryDialect') pattern from util.py
- Previous code incorrectly called .get() on a type object
- Fixes AttributeError: 'PGDialect_psycopg2' has no attribute 'get'
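
To show why `hasattr` is the safer check, the sketch below uses stand-in objects only. It assumes the `dialect` being inspected is a module-like object that exposes a `BigQueryDialect` attribute when BigQuery is in use; the classes and namespaces here are placeholders, not the real SQLAlchemy dialects.

```python
from types import SimpleNamespace


class PGDialect_psycopg2:  # placeholder, not the real SQLAlchemy class
    pass


class BigQueryDialect:  # placeholder, not the real BigQuery dialect class
    pass


# Module-like stand-ins for whatever object the engine exposes as its dialect.
postgres_dialect = SimpleNamespace(PGDialect_psycopg2=PGDialect_psycopg2)
bigquery_dialect = SimpleNamespace(BigQueryDialect=BigQueryDialect)


def is_bigquery(dialect: object) -> bool:
    """hasattr never raises here; it just reports whether the attribute exists."""
    return hasattr(dialect, "BigQueryDialect")


# By contrast, treating a class like a dict fails with AttributeError.
try:
    PGDialect_psycopg2.get("name")  # type: ignore[attr-defined]
    get_call_failed = False
except AttributeError:
    get_call_failed = True
```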

Remove test_dates_with_str_value_set from metric tests

72918c7

Metrics don't do type coercion - that's an expectation-level feature. The test_dates_all_in_set tests with proper date objects are sufficient for metrics.

Add result format integration tests for ExpectColumnDistinctValuesToB…

1c5bb41

…eInSet

Fix type errors: use result.result instead of to_json_dict

05b7d3d

Merge branch 'develop' into m/gx-2374/distinct-values-be-in-set

34b2e93

Fix value_counts comparison: use to_json_dict for proper serialization

66afddc

Fix type errors: compare full result dict instead of nested access

f9cee69

Merge branch 'develop' into m/gx-2374/distinct-values-be-in-set

35361e0

Remove unnecessary fallback to column.value_counts metric

f2139ec

Remove explicit ColumnDistinctValuesNotInSetCount class

0687d51

The .count suffix should work automatically without explicit definition. Removed the explicit count metric class and updated the expectation to compute violation count directly from Python-side set comparison.

Remove tests for deleted ColumnDistinctValuesNotInSetCount metric

558a7c4

Restore Metrics API wrapper for ColumnDistinctValuesNotInSetCount

822bbe9

The wrapper class allows users to reference the .count metric programmatically, even though the underlying metric computation is handled automatically.

Restore ColumnDistinctValuesNotInSetCount metric provider class

8a6023a

The .count suffix does not work automatically - metrics must be explicitly registered. Restored the metric provider class to register the metric.

BREAKING: Remove column.value_counts and column.distinct_values

f9deb5e

observed_value now contains only violations (values not in set), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns.

NathanFarmer added 4 commits January 27, 2026 11:21

Update tests for breaking changes in expect_column_distinct_values_to…

4d49e92

…_be_in_set
- observed_value now contains only violations, not all distinct values
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns

Restore renderer to use partial_unexpected_list for rendering violations

7e05a82

Revert expectation and tests to original column.value_counts implemen…

151f258

…tation

NathanFarmer self-assigned this Jan 28, 2026

NathanFarmer and others added 4 commits January 28, 2026 16:16

Merge branch 'develop' into m/gx-2374/distinct-values-be-in-set

605771c

Merge branch 'develop' into m/gx-2374/distinct-values-be-in-set

4e13a14

Limit observed_value to 1000 values to prevent 413 payload errors

ec683e6

Limit value_counts to 1000 items to prevent 413 payload errors

c6e0f0a
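
The last two commits cap the size of the returned collections. A minimal sketch of such a cap, assuming plain Python lists and dicts; the helper names are hypothetical, and only the limit of 1000 comes from the commit titles.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

MAX_RESULT_ITEMS = 1000  # limit named in the commit titles


def cap_values(values: Iterable[Any], limit: int = MAX_RESULT_ITEMS) -> List[Any]:
    """Keep at most `limit` items without materializing the rest."""
    return list(islice(values, limit))


def cap_value_counts(
    counts: Dict[Any, int], limit: int = MAX_RESULT_ITEMS
) -> Dict[Any, int]:
    """Keep the first `limit` entries of a value -> count mapping."""
    return dict(islice(counts.items(), limit))
```

Bounding the result this way keeps the serialized payload small enough to avoid HTTP 413 responses, at the cost of reporting only part of a high-cardinality column.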
