Fix: WHERE clause with zero matching rows incorrectly fails for hasMi…#677
Fix: WHERE clause with zero matching rows incorrectly fails for hasMi…#677
Conversation
…n/hasMax/isComplete/satisfies When a WHERE clause filters out all rows, several analyzers produce metrics that cause both aggregate check failure and incorrect row-level results (false instead of true/null for filtered rows). Root cause per analyzer: - Minimum/Maximum: min(null)/max(null) returns null, fromAggregationResult returns None, metric loses fullColumn, row-level falls back to lit(false) - Completeness: sum returns 0, count returns 0, state is (0,0), metric value is NaN (0/0), assertion fails - Compliance: sum(criterion) returns null, fromAggregationResult returns None, same as Minimum/Maximum Fix (per analyzer): - Minimum/Maximum: override computeMetricFrom to preserve fullColumn (criterion) when state is None and WHERE clause is present - Completeness: override computeMetricFrom to detect count=0 with WHERE clause and return Failure metric with fullColumn (rowLevelResults) - Compliance: override computeMetricFrom to preserve fullColumn (rowLevelResults) when state is None and WHERE clause is present - AnalysisBasedConstraint: treat EmptyStateException as Success when analyzer has a filter condition -- there are no matching rows to violate the constraint Note: 21 filterable analyzers exist but only the 4 used by ColumnValues rules are fixed here. A generic fix would require each analyzer to expose its row-level column and empty-state detection, which is a larger refactor.
| // When WHERE clause filters all rows, max(null) returns null, so fromAggregationResult | ||
| // returns None. We still need the fullColumn (criterion) for correct row-level results. | ||
| // Note: criterion is the raw augmented column ["FilteredData", null]. The FilteredRowOutcome | ||
| // treatment (TRUE vs NULL) is applied downstream by the assertion UDF in |
There was a problem hiding this comment.
not completely relevant to this logic but the UDF will have to be replaced. FGAC does not support UDF and this will cause issues in Glue.
There was a problem hiding this comment.
I barely followed this, but i take it that you're talking about some upstream UDF that exists irrespective of this, right?
There was a problem hiding this comment.
In the code comment it mentions treatment (TRUE vs NULL) is applied downstream by the assertion UDF - we would have to replace this for Glue. UDFs don't work when FGAC is enabled. So doesn't have to happen in this PR but we need to remove it if there's any such UDF.
| override def computeMetricFrom(state: Option[MaxState]): DoubleMetric = { | ||
| state match { | ||
| case None if where.isDefined => | ||
| metricFromEmpty(this, "Maximum", column).copy(fullColumn = Some(criterion)) |
There was a problem hiding this comment.
can we create helper for this? We have done this multiple times.
There was a problem hiding this comment.
Fixed in new commit.
| state match { | ||
| case None if where.isDefined => | ||
| metricFromEmpty(this, "Minimum", column).copy(fullColumn = Some(criterion)) | ||
| case _ => super.computeMetricFrom(state) |
There was a problem hiding this comment.
Just wanted to highlight something - maybe we should document in the PR or a comment in the code :
Any code that relies on successMetricsAsDataFrame or successMetricsAsJson to populate metric dashboards, history tables, or anomaly detection will start seeing missing data points for these analyzers in the "all rows filtered" case.
This is probably okay! Don't think there's meaningful value to report when zero rows match but I think worth documenting as a known behavior change, especially for consumers that distinguish between "metric not computed" and "metric = 0" or "metric = NaN".
Also in DQETL we will not report Minimum and Maximum statistics for a column when all rows are filtered - which is probably expected? But documentation will help.
There was a problem hiding this comment.
Added a code comment - idk how we typically add documentation, let me know if you'd prefer smth else.
There was a problem hiding this comment.
Yes code comment for here is fine. Also a note in PR description would help so consumers can track down what changed when they see different data
…ness behavior change - Add metricFromEmptyWithColumn() to Analyzers object to deduplicate the metricFromEmpty(...).copy(fullColumn = Some(...)) pattern across Maximum, Minimum, Completeness, and Compliance - Document that Completeness now returns Failure instead of Success(NaN) when WHERE clause filters all rows, consistent with the other analyzers and avoiding NaN in success metrics consumers
…n/hasMax/isComplete/satisfies
When a WHERE clause filters out all rows, several analyzers produce metrics that cause both aggregate check failure and incorrect row-level results (false instead of true/null for filtered rows).
Root cause per analyzer:
Fix (per analyzer):
Note: 21 filterable analyzers exist but only the 4 used by ColumnValues rules are fixed here. A generic fix would require each analyzer to expose its row-level column and empty-state detection, which is a larger refactor.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.