[MAINTENANCE] Optimize ExpectColumnDistinctValuesToBeInSet with database-pushed comparison
#11614
base: develop
…hed comparison
- Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
- Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
- Refactor expectation to use new metrics instead of fetching all distinct values
- Add Metrics API wrapper classes for new metrics
- Respects result_format and partial_unexpected_count settings
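The idea of pushing the comparison into the database can be sketched with the standard-library `sqlite3` module. This is only an illustration, not the PR's SQLAlchemy implementation: the function names, the table, and the column are hypothetical, and identifiers are interpolated directly on the assumption that they are trusted.

```python
import sqlite3


def count_distinct_not_in_set(conn, table, column, value_set):
    """Count distinct non-null values outside value_set, entirely in SQL."""
    placeholders = ", ".join("?" for _ in value_set)
    query = (
        f"SELECT COUNT(DISTINCT {column}) FROM {table} "
        f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders})"
    )
    return conn.execute(query, list(value_set)).fetchone()[0]


def sample_distinct_not_in_set(conn, table, column, value_set, limit):
    """Fetch at most `limit` distinct violating values, ordered for stability."""
    placeholders = ", ".join("?" for _ in value_set)
    query = (
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders}) "
        f"ORDER BY {column} LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, [*value_set, limit])]


# Hypothetical data: only the violations ever leave the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (color TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("red",), ("blue",), ("green",), ("red",), ("teal",), (None,)],
)
violation_count = count_distinct_not_in_set(conn, "t", "color", ["red", "blue"])
violation_sample = sample_distinct_not_in_set(conn, "t", "color", ["red", "blue"], 1)
```

With this shape, the memory cost on the Python side is bounded by the LIMIT rather than by the column's cardinality.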
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #11614 +/- ##
===========================================
- Coverage 84.14% 84.14% -0.01%
===========================================
Files 465 467 +2
Lines 39364 39463 +99
===========================================
+ Hits 33124 33207 +83
- Misses 6240 6256 +16
- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation

…rcion
- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches

…unt on failure
- Filter out null/NaN values from observed_value_set before comparison
- Only include unexpected_count in result when there are violations
- Fixes test failures for null handling and result format
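A minimal sketch of the null/NaN filtering step, assuming the observed values arrive as a plain Python iterable. The helper name `drop_missing` is hypothetical. NaN needs explicit handling because it never compares equal to itself, so ordinary set membership cannot be relied on to match it.

```python
import math


def drop_missing(values):
    """Return the set of values with None and float NaN removed."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {value for value in values if not is_missing(value)}


observed = drop_missing([1, None, float("nan"), "a", 2.5])
```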
- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str

- Add helper method to coerce string value_set to DATE objects for BigQuery DATE columns
- BigQuery doesn't support DATE NOT IN UNNEST(ARRAY<STRING>)
- Convert string date values to date objects before SQL query
- Use date.fromisoformat() instead of datetime.strptime() to avoid DTZ007 warning
- Fix line length issues
- Fixes test failure for test_dates_with_str_value_set on BigQuery
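The string-to-date coercion described in this commit might look roughly like the following. It is a sketch, not the PR's `_coerce_value_set_for_bigquery_date` method: the function name is hypothetical, and leaving unparseable strings unchanged is an assumption about the desired behavior.

```python
from datetime import date
from typing import Any, List


def coerce_iso_strings_to_dates(value_set):
    """Convert ISO-format date strings to date objects; leave other values as-is."""
    coerced: List[Any] = []
    for value in value_set:
        if isinstance(value, str):
            try:
                coerced.append(date.fromisoformat(value))
                continue
            except ValueError:
                # Not an ISO date string: fall through and keep the original value.
                pass
        coerced.append(value)
    return coerced


coerced_example = coerce_iso_strings_to_dates(["2024-01-01", date(2024, 1, 2), "n/a"])
```

`date.fromisoformat()` returns a plain `date`, so no naive `datetime` is created along the way, which is what the DTZ007 lint rule flags for `datetime.strptime()`.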

- Add explicit List[Any] type annotation to coerced_value_set to fix type checker error
- Add _coerce_value_set_for_bigquery_date method to ColumnDistinctValuesNotInSet class
- Fix kwargs passing in _sqlalchemy method to include _metrics
- Fixes type checker errors for BigQuery DATE coercion

- Update _sqlalchemy method signature to include metrics and runtime_configuration parameters
- Extract dialect from execution_engine and pass to helper method along with metrics
- Fixes type checker error about missing _coerce_value_set_for_bigquery_date method

- Use hasattr(dialect, 'BigQueryDialect') pattern from util.py
- Previous code incorrectly called .get() on a type object
- Fixes AttributeError: 'PGDialect_psycopg2' has no attribute 'get'

- Test ColumnDistinctValuesNotInSetCount for all data sources
- Test ColumnDistinctValuesNotInSet for all data sources
- Tests cover: all values in set, some values not in set, no values in set
- Tests verify limit parameter works correctly

- Add test_dates_all_in_set for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Tests both ColumnDistinctValuesNotInSet and ColumnDistinctValuesNotInSetCount
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature. The test_dates_all_in_set tests with proper date objects are sufficient for metrics.
The .count suffix should work automatically without explicit definition. Removed the explicit count metric class and updated the expectation to compute violation count directly from Python-side set comparison.
The wrapper class allows users to reference the .count metric programmatically, even though the underlying metric computation is handled automatically.
The .count suffix does not work automatically - metrics must be explicitly registered. Restored the metric provider class to register the metric.
observed_value now contains only violations (values not in set), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns.
…_be_in_set
- observed_value now contains only violations, not all distinct values
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns

…pected_list
- observed_value is now None (semantically correct - it's not observed values)
- Violations go in partial_unexpected_list (limited by partial_unexpected_count)
- unexpected_count always included when not BOOLEAN_ONLY
- Renderer returns '--' for observed value since it's None
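The result shape described in the last commit can be illustrated as below. This is a hypothetical sketch: `build_result`, its default of 20 for `partial_unexpected_count`, and the exact dictionary layout are assumptions for illustration, not the library's actual return structure.

```python
def build_result(
    unexpected_count,
    sample_violations,
    result_format="BASIC",
    partial_unexpected_count=20,
):
    """Assemble a result dict from a violation count and a sample of violations."""
    success = unexpected_count == 0
    if result_format == "BOOLEAN_ONLY":
        # Only the pass/fail flag is reported in this format.
        return {"success": success}
    return {
        "success": success,
        "result": {
            # No full list of distinct values is fetched, so nothing is "observed".
            "observed_value": None,
            "unexpected_count": unexpected_count,
            "partial_unexpected_list": list(sample_violations)[:partial_unexpected_count],
        },
    }


failing = build_result(3, ["green", "teal", "pink"], partial_unexpected_count=2)
passing = build_result(0, [], result_format="BOOLEAN_ONLY")
```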
Description
Optimizes ExpectColumnDistinctValuesToBeInSet to push comparison logic to the database instead of fetching all distinct values into memory.
Changes
- Add ColumnDistinctValuesNotInSetCount metric to count violations in DB
- Add ColumnDistinctValuesNotInSet metric to fetch sample violations with LIMIT
- Respects result_format settings