Skip to content

Fix: WHERE clause with zero matching rows incorrectly fails for hasMi…#677

Merged
sohama4 merged 2 commits intomasterfrom
sa_bug_fix
Mar 30, 2026
Merged

Fix: WHERE clause with zero matching rows incorrectly fails for hasMi…#677
sohama4 merged 2 commits intomasterfrom
sa_bug_fix

Conversation

@sohama4
Copy link
Copy Markdown
Contributor

@sohama4 sohama4 commented Mar 20, 2026

…n/hasMax/isComplete/satisfies

When a WHERE clause filters out all rows, several analyzers produce metrics that cause both aggregate check failure and incorrect row-level results (false instead of true/null for filtered rows).

Root cause per analyzer:

  • Minimum/Maximum: min(null)/max(null) returns null, fromAggregationResult returns None, metric loses fullColumn, row-level falls back to lit(false)
  • Completeness: sum returns 0, count returns 0, state is (0,0), metric value is NaN (0/0), assertion fails
  • Compliance: sum(criterion) returns null, fromAggregationResult returns None, same as Minimum/Maximum

Fix (per analyzer):

  • Minimum/Maximum: override computeMetricFrom to preserve fullColumn (criterion) when state is None and WHERE clause is present
  • Completeness: override computeMetricFrom to detect count=0 with WHERE clause and return Failure metric with fullColumn (rowLevelResults)
  • Compliance: override computeMetricFrom to preserve fullColumn (rowLevelResults) when state is None and WHERE clause is present
  • AnalysisBasedConstraint: treat EmptyStateException as Success when analyzer has a filter condition -- there are no matching rows to violate the constraint

Note: 21 filterable analyzers exist but only the 4 used by ColumnValues rules are fixed here. A generic fix would require each analyzer to expose its row-level column and empty-state detection, which is a larger refactor.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…n/hasMax/isComplete/satisfies

When a WHERE clause filters out all rows, several analyzers produce metrics
that cause both aggregate check failure and incorrect row-level results (false
instead of true/null for filtered rows).

Root cause per analyzer:
- Minimum/Maximum: min(null)/max(null) returns null, fromAggregationResult
  returns None, metric loses fullColumn, row-level falls back to lit(false)
- Completeness: sum returns 0, count returns 0, state is (0,0), metric value
  is NaN (0/0), assertion fails
- Compliance: sum(criterion) returns null, fromAggregationResult returns None,
  same as Minimum/Maximum

Fix (per analyzer):
- Minimum/Maximum: override computeMetricFrom to preserve fullColumn (criterion)
  when state is None and WHERE clause is present
- Completeness: override computeMetricFrom to detect count=0 with WHERE clause
  and return Failure metric with fullColumn (rowLevelResults)
- Compliance: override computeMetricFrom to preserve fullColumn (rowLevelResults)
  when state is None and WHERE clause is present
- AnalysisBasedConstraint: treat EmptyStateException as Success when analyzer
  has a filter condition -- there are no matching rows to violate the constraint

Note: 21 filterable analyzers exist but only the 4 used by ColumnValues rules
are fixed here. A generic fix would require each analyzer to expose its
row-level column and empty-state detection, which is a larger refactor.
// When WHERE clause filters all rows, max(null) returns null, so fromAggregationResult
// returns None. We still need the fullColumn (criterion) for correct row-level results.
// Note: criterion is the raw augmented column ["FilteredData", null]. The FilteredRowOutcome
// treatment (TRUE vs NULL) is applied downstream by the assertion UDF in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not completely relevant to this logic but the UDF will have to be replaced. FGAC does not support UDF and this will cause issues in Glue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I barely followed this, but i take it that you're talking about some upstream UDF that exists irrespective of this, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the code comment it mentions treatment (TRUE vs NULL) is applied downstream by the assertion UDF - we would have to replace this for Glue. UDFs don't work when FGAC is enabled. So doesn't have to happen in this PR but we need to remove it if there's any such UDF.

override def computeMetricFrom(state: Option[MaxState]): DoubleMetric = {
state match {
case None if where.isDefined =>
metricFromEmpty(this, "Maximum", column).copy(fullColumn = Some(criterion))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we create helper for this? We have done this multiple times.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in new commit.

state match {
case None if where.isDefined =>
metricFromEmpty(this, "Minimum", column).copy(fullColumn = Some(criterion))
case _ => super.computeMetricFrom(state)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to highlight something - maybe we should document in the PR or a comment in the code :
Any code that relies on successMetricsAsDataFrame or successMetricsAsJson to populate metric dashboards, history tables, or anomaly detection will start seeing missing data points for these analyzers in the "all rows filtered" case.

This is probably okay! Don't think there's meaningful value to report when zero rows match but I think worth documenting as a known behavior change, especially for consumers that distinguish between "metric not computed" and "metric = 0" or "metric = NaN".

Also in DQETL we will not report Minimum and Maximum statistics for a column when all rows are filtered - which is probably expected? But documentation will help.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a code comment - idk how we typically add documentation, let me know if you'd prefer smth else.

Copy link
Copy Markdown
Contributor

@shriyavanvari shriyavanvari Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes code comment for here is fine. Also a note in PR description would help so consumers can track down what changed when they see different data

…ness behavior change

- Add metricFromEmptyWithColumn() to Analyzers object to deduplicate the
  metricFromEmpty(...).copy(fullColumn = Some(...)) pattern across
  Maximum, Minimum, Completeness, and Compliance
- Document that Completeness now returns Failure instead of Success(NaN)
  when WHERE clause filters all rows, consistent with the other analyzers
  and avoiding NaN in success metrics consumers
@sohama4 sohama4 merged commit ae670ce into master Mar 30, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants