Add support for RowCountMatch rule #652

Merged
joshuazexter merged 1 commit into awslabs:master from joshuazexter:master
Jan 21, 2026

@joshuazexter
Contributor

Description of changes:
Added RowCountMatch

  • Comparison utility
  • DQDL Rule / Executor / Translator
  • Unit Tests

RowCountMatch compares row counts between primary and reference datasets as a ratio!

Rules=[RowCountMatch "ref" >= 0.9]

Dataset.ref.RowCountMatch -> 0.857
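
As a rough illustration of the rule and metric above, the sketch below computes the ratio from two plain row counts and applies the rule's assertion. The counts 6 and 7 are assumed values, chosen only because 6/7 is approximately 0.857; `rowCountRatio` is a hypothetical helper and not the function added in this PR.

```scala
object RowCountRatioSketch {
  // Hypothetical helper: ratio of primary row count to reference row count.
  def rowCountRatio(primaryCount: Long, referenceCount: Long): Double =
    primaryCount.toDouble / referenceCount.toDouble

  def main(args: Array[String]): Unit = {
    // Assumed counts, picked so that the ratio is roughly 0.857.
    val ratio = rowCountRatio(6L, 7L)

    // Assertion corresponding to the rule `RowCountMatch "ref" >= 0.9`.
    val assertion: Double => Boolean = _ >= 0.9

    println(s"ratio = $ratio")
    // 6/7 is below 0.9, so under these assumed counts the rule would fail.
    println(s"rule passes: ${assertion(ratio)}")
  }
}
```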

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor

@SamPom100 left a comment

had some comments

assertion: Double => Boolean): ComparisonResult = {
  val primaryCount = primary.count()
  val referenceCount = reference.count()
  val ratio = primaryCount.toDouble / referenceCount.toDouble
Contributor

hello divide by zero 😮

Contributor Author

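This matches AWS Glue Data Quality's RowCountMatch behavior, including the division-by-zero edge case when the reference dataset is empty. Because the division is done on doubles, nothing is thrown: the assertion receives Infinity when the primary dataset is non-empty, or NaN when both datasets are empty.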
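
For readers unfamiliar with floating-point division, here is a minimal sketch of the edge case being discussed. It repeats the same `toDouble` division on plain counts; the `ratio` helper, the counts, and the `>= 0.9` assertion are assumptions for illustration, not code from this PR.

```scala
object ZeroReferenceSketch {
  // Same division as in the snippet under review, applied to plain counts.
  def ratio(primaryCount: Long, referenceCount: Long): Double =
    primaryCount.toDouble / referenceCount.toDouble

  def main(args: Array[String]): Unit = {
    // Assumed assertion, mirroring `RowCountMatch "ref" >= 0.9`.
    val assertion: Double => Boolean = _ >= 0.9

    // Non-empty primary, empty reference: a positive double divided by 0.0.
    val nonEmptyOverEmpty = ratio(5L, 0L)
    // Both datasets empty: 0.0 / 0.0.
    val emptyOverEmpty = ratio(0L, 0L)

    println(nonEmptyOverEmpty.isPosInfinity) // true: the result is +Infinity
    println(emptyOverEmpty.isNaN)            // true: the result is NaN
    println(assertion(nonEmptyOverEmpty))    // true: Infinity >= 0.9 holds
    println(assertion(emptyOverEmpty))       // false: ordered comparisons with NaN are false
  }
}
```

Under this assertion, an empty reference with a non-empty primary would pass, while two empty datasets would fail; other assertions (for example `= 1.0`) would behave differently.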

assertion: Double => Boolean): ComparisonResult = {
  val primaryCount = primary.count()
  val referenceCount = reference.count()
  val ratio = primaryCount.toDouble / referenceCount.toDouble
Contributor

You should round this to a couple of decimal places. Can you check other analyzers for precedent?

Contributor Author

Checked other analyzers/comparisons and none round their ratios (DataSynchronization, ReferentialIntegrity, DatasetMatchAnalyzer all return raw doubles).

IMO, rounding would also lose precision for assertions, so I'm keeping this consistent with existing behavior.
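
To make the precision concern concrete, here is a small sketch of how rounding could flip an assertion's result. The counts (896 and 1000), the `roundTo2` helper, and the `>= 0.9` threshold are all assumptions for illustration and do not come from the PR.

```scala
object RoundingSketch {
  // Hypothetical helper: round half-up to two decimal places.
  def roundTo2(value: Double): Double =
    BigDecimal(value).setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble

  def main(args: Array[String]): Unit = {
    // Assumed assertion, mirroring `RowCountMatch "ref" >= 0.9`.
    val assertion: Double => Boolean = _ >= 0.9

    // Assumed counts: 896 primary rows against 1000 reference rows.
    val raw = 896.0 / 1000.0
    val rounded = roundTo2(raw)

    println(s"raw = $raw, passes: ${assertion(raw)}")             // raw ratio is below the threshold
    println(s"rounded = $rounded, passes: ${assertion(rounded)}") // rounded ratio meets the threshold
  }
}
```

With these assumed numbers the raw ratio fails the rule while the rounded ratio passes it, which is the kind of discrepancy that returning the raw double avoids.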

import com.amazon.deequ.SparkContextSpec
import org.scalatest.wordspec.AnyWordSpec

class RowCountMatchTest extends AnyWordSpec with SparkContextSpec {
Contributor

Can you add unit tests for:

  • divide by zero
  • both datasets have 0 rows

Contributor Author

Yup, added.

rules.map { rule =>
  val outcome = additionalDataSources.get(rule.referenceDatasetAlias) match {
    case Some(referenceDF) =>
      val result = RowCountMatch.matchRowCounts(df, referenceDF, rule.assertion)
Contributor

Why do you need val result here?

Contributor Author

Addressed, good point!

@joshuazexter joshuazexter merged commit fac5c11 into awslabs:master Jan 21, 2026
1 check passed