@NathanFarmer (Contributor) commented Jan 23, 2026

Description

Optimizes ExpectColumnDistinctValuesToBeInSet to push comparison logic to the database instead of fetching all distinct values into memory.

Changes

  • Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
  • Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
  • Refactor expectation to use new metrics instead of fetching all distinct values
  • Add Metrics API wrapper classes for new metrics
  • Respects result_format settings
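The idea behind the two new metrics can be sketched roughly as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the code in this PR: the function names, the `events` table, and the `status` column are hypothetical, and the real metrics presumably go through the project's execution engines rather than raw SQL strings.

```python
import sqlite3


def _not_in_clause(column, value_set):
    """Build a parameterized NOT IN filter; an empty set needs no filter."""
    if not value_set:
        return ""
    placeholders = ", ".join("?" for _ in value_set)
    return f" AND {column} NOT IN ({placeholders})"


def count_distinct_not_in_set(conn, table, column, value_set):
    """Count distinct non-null values outside value_set, inside the database."""
    # Table and column names are interpolated, so they must be trusted identifiers.
    sql = (
        f"SELECT COUNT(DISTINCT {column}) FROM {table} "
        f"WHERE {column} IS NOT NULL" + _not_in_clause(column, value_set)
    )
    return conn.execute(sql, list(value_set)).fetchone()[0]


def sample_distinct_not_in_set(conn, table, column, value_set, limit):
    """Fetch at most `limit` distinct violating values instead of all of them."""
    sql = (
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL" + _not_in_clause(column, value_set)
        + f" ORDER BY {column} LIMIT ?"
    )
    rows = conn.execute(sql, [*value_set, limit]).fetchall()
    return [row[0] for row in rows]


# Hypothetical data: two allowed statuses, two violations, one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT)")
conn.executemany(
    "INSERT INTO events (status) VALUES (?)",
    [("active",), ("inactive",), ("active",), ("deleted",),
     ("archived",), (None,), ("deleted",)],
)

allowed = ["active", "inactive"]
violation_count = count_distinct_not_in_set(conn, "events", "status", allowed)
violation_sample = sample_distinct_not_in_set(
    conn, "events", "status", allowed, limit=1
)
```

Only a single integer and a bounded sample leave the database, so memory use on the client no longer grows with the column's cardinality.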

…hed comparison

- Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
- Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
- Refactor expectation to use new metrics instead of fetching all distinct values
- Add Metrics API wrapper classes for new metrics
- Respects result_format and partial_unexpected_count settings

netlify bot commented Jan 23, 2026

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit c6e0f0a
🔍 Latest deploy log https://app.netlify.com/projects/niobium-lead-7998/deploys/697d396f26abaf000857e83d

@NathanFarmer changed the title from "Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison" to "[MAINTENANCE] Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison" on Jan 23, 2026
@NathanFarmer changed the title from "[MAINTENANCE] Optimize expect_column_distinct_values_to_be_in_set with database-pushed comparison" to "[MAINTENANCE] Optimize ExpectColumnDistinctValuesToBeInSet with database-pushed comparison" on Jan 23, 2026

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 89.89899% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.14%. Comparing base (8d8970b) to head (c6e0f0a).
⚠️ Report is 1 commits behind head on develop.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...column_aggregate_metrics/column_distinct_values.py 90.00% 7 Missing ⚠️
...core/expect_column_distinct_values_to_be_in_set.py 75.00% 2 Missing ⚠️
...re/expect_column_distinct_values_to_contain_set.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #11614      +/-   ##
===========================================
- Coverage    84.14%   84.14%   -0.01%     
===========================================
  Files          465      467       +2     
  Lines        39364    39463      +99     
===========================================
+ Hits         33124    33207      +83     
- Misses        6240     6256      +16     
Flag Coverage Δ
3.10 72.72% <52.52%> (-0.07%) ⬇️
3.10 athena 41.48% <38.38%> (-0.01%) ⬇️
3.10 aws_deps 45.85% <38.38%> (-0.02%) ⬇️
3.10 big 55.22% <44.44%> (-0.03%) ⬇️
3.10 bigquery 50.64% <77.77%> (+0.06%) ⬆️
3.10 clickhouse 41.49% <38.38%> (-0.01%) ⬇️
3.10 databricks 52.38% <67.67%> (+0.03%) ⬆️
3.10 filesystem 63.98% <52.52%> (-0.03%) ⬇️
3.10 gx-redshift 50.76% <67.67%> (+0.04%) ⬆️
3.10 mssql 50.87% <67.67%> (+0.04%) ⬆️
3.10 mysql 51.28% <67.67%> (+0.16%) ⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds 59.22% <67.67%> (+0.02%) ⬆️
3.10 postgresql 54.67% <67.67%> (+0.03%) ⬆️
3.10 snowflake 53.20% <67.67%> (+0.03%) ⬆️
3.10 spark 55.32% <54.54%> (-0.01%) ⬇️
3.10 spark_connect 46.36% <38.38%> (-0.03%) ⬇️
3.10 trino 48.18% <38.38%> (-0.03%) ⬇️
3.11 72.72% <52.52%> (-0.07%) ⬇️
3.11 athena ?
3.11 aws_deps ?
3.11 big ?
3.11 clickhouse ?
3.11 filesystem ?
3.11 mssql ?
3.11 mysql ?
3.11 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.11 spark_connect ?
3.12 72.73% <52.52%> (-0.07%) ⬇️
3.12 athena ?
3.12 aws_deps ?
3.12 big ?
3.12 mssql ?
3.12 mysql ?
3.12 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.13 72.73% <52.52%> (-0.07%) ⬇️
3.13 athena 41.48% <38.38%> (-0.01%) ⬇️
3.13 aws_deps 45.85% <38.38%> (-0.02%) ⬇️
3.13 big 55.22% <44.44%> (-0.03%) ⬇️
3.13 bigquery 50.65% <77.77%> (+0.06%) ⬆️
3.13 clickhouse 41.49% <38.38%> (-0.01%) ⬇️
3.13 databricks 52.38% <67.67%> (+0.03%) ⬆️
3.13 filesystem 63.98% <52.52%> (-0.03%) ⬇️
3.13 gx-redshift 50.76% <67.67%> (+0.04%) ⬆️
3.13 mssql 50.87% <67.67%> (+0.04%) ⬆️
3.13 mysql 51.28% <67.67%> (+0.16%) ⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds 59.22% <67.67%> (+0.02%) ⬆️
3.13 postgresql 54.68% <67.67%> (+0.03%) ⬆️
3.13 snowflake 53.21% <67.67%> (+0.03%) ⬆️
3.13 spark 55.32% <54.54%> (-0.01%) ⬇️
3.13 spark_connect 46.36% <38.38%> (-0.03%) ⬇️
3.13 trino 48.18% <38.38%> (-0.03%) ⬇️
cloud 0.00% <0.00%> (ø)
docs-basic 58.62% <38.38%> (-0.06%) ⬇️
docs-creds-needed 57.42% <38.38%> (-0.05%) ⬇️
docs-spark 56.72% <38.38%> (-0.05%) ⬇️

NathanFarmer and others added 25 commits January 23, 2026 13:26
- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation
…rcion

- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches
…unt on failure

- Filter out null/NaN values from observed_value_set before comparison
- Only include unexpected_count in result when there are violations
- Fixes test failures for null handling and result format
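The null/NaN filtering described in this commit could look something like the sketch below. The helper name `drop_missing` is hypothetical, and the sketch assumes missing values appear either as `None` or as a float NaN; it is not taken from the PR's diff.

```python
import math


def drop_missing(values):
    """Return the set of values with None and float NaN removed."""
    return {
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    }


observed = drop_missing(["a", None, float("nan"), "b", "c"])
unexpected = observed - {"a", "b"}  # Python-side set comparison
```

NaN needs explicit handling because it never compares equal to itself, so it cannot be removed by an ordinary set difference.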
- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str
- Add helper method to coerce string value_set to DATE objects for BigQuery DATE columns
- BigQuery doesn't support DATE NOT IN UNNEST(ARRAY<STRING>)
- Convert string date values to date objects before SQL query
- Use date.fromisoformat() instead of datetime.strptime() to avoid DTZ007 warning
- Fix line length issues
- Fixes test failure for test_dates_with_str_value_set on BigQuery
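A rough sketch of the string-to-date coercion these commit notes describe is shown below. The function name is hypothetical and the PR's actual helper may differ in signature and behavior; the only detail taken from the notes is the use of `date.fromisoformat()`. Note that a later commit in this PR states that string values are no longer auto-coerced, so this illustrates an intermediate state.

```python
from datetime import date
from typing import Any, List


def coerce_iso_strings_to_dates(value_set: List[Any]) -> List[Any]:
    """Convert ISO-format date strings to date objects; leave other values as-is."""
    coerced: List[Any] = []
    for value in value_set:
        if isinstance(value, str):
            try:
                coerced.append(date.fromisoformat(value))
            except ValueError:
                # Assumption: non-date strings pass through unchanged.
                coerced.append(value)
        else:
            coerced.append(value)
    return coerced


coerced_value_set = coerce_iso_strings_to_dates(
    ["2024-01-15", date(2024, 2, 1), "not-a-date"]
)
```

Passing date objects rather than strings lets the database compare like with like, which is the problem the notes report for BigQuery DATE columns.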
- Add explicit List[Any] type annotation to coerced_value_set to fix type checker error
- Add _coerce_value_set_for_bigquery_date method to ColumnDistinctValuesNotInSet class
- Fix kwargs passing in _sqlalchemy method to include _metrics
- Fixes type checker errors for BigQuery DATE coercion
- Update _sqlalchemy method signature to include metrics and runtime_configuration parameters
- Extract dialect from execution_engine and pass to helper method along with metrics
- Fixes type checker error about missing _coerce_value_set_for_bigquery_date method
- Use hasattr(dialect, 'BigQueryDialect') pattern from util.py
- Previous code incorrectly called .get() on a type object
- Fixes AttributeError: 'PGDialect_psycopg2' has no attribute 'get'
- Test ColumnDistinctValuesNotInSetCount for all data sources
- Test ColumnDistinctValuesNotInSet for all data sources
- Tests cover: all values in set, some values not in set, no values in set
- Tests verify limit parameter works correctly
- Add test_dates_all_in_set for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Tests both ColumnDistinctValuesNotInSet and ColumnDistinctValuesNotInSetCount
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature.
The test_dates_all_in_set tests with proper date objects are sufficient for metrics.
The .count suffix should work automatically without explicit definition.
Removed the explicit count metric class and updated the expectation to
compute violation count directly from Python-side set comparison.
The wrapper class allows users to reference the .count metric programmatically,
even though the underlying metric computation is handled automatically.
The .count suffix does not work automatically - metrics must be explicitly
registered. Restored the metric provider class to register the metric.
observed_value now contains only violations (values not in set), not all
distinct values. This is a breaking change that avoids fetching all distinct
values into memory for high-cardinality columns.
…_be_in_set

- observed_value now contains only violations, not all distinct values
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns
…pected_list

- observed_value is now None (semantically correct - it's not observed values)
- Violations go in partial_unexpected_list (limited by partial_unexpected_count)
- unexpected_count always included when not BOOLEAN_ONLY
- Renderer returns '--' for observed value since it's None
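The final result shape described in the last commit can be illustrated with the sketch below. The function name and the exact dictionary layout are assumptions made for illustration; only the field names `observed_value`, `unexpected_count`, and `partial_unexpected_list`, and the `BOOLEAN_ONLY` behavior, come from the commit notes.

```python
def build_result(unexpected_count, sampled_violations, result_format,
                 partial_unexpected_count):
    """Assemble a validation result from a violation count and a bounded sample."""
    success = unexpected_count == 0
    if result_format == "BOOLEAN_ONLY":
        # Only the pass/fail flag is reported in this format.
        return {"success": success}
    return {
        "success": success,
        "result": {
            # No full list of distinct values is fetched, so nothing is "observed".
            "observed_value": None,
            "unexpected_count": unexpected_count,
            "partial_unexpected_list": list(sampled_violations)[
                :partial_unexpected_count
            ],
        },
    }


failing = build_result(3, ["x", "y", "z"], "BASIC", partial_unexpected_count=2)
passing = build_result(0, [], "BOOLEAN_ONLY", partial_unexpected_count=2)
```

Here `unexpected_count` reports the full number of violations, while `partial_unexpected_list` is capped, so the two can legitimately disagree in length.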
@NathanFarmer NathanFarmer self-assigned this Jan 28, 2026