Add support for RowCountMatch rule #652
    assertion: Double => Boolean): ComparisonResult = {
  val primaryCount = primary.count()
  val referenceCount = reference.count()
  val ratio = primaryCount.toDouble / referenceCount.toDouble
This matches AWS Glue Data Quality's RowCountMatch behavior, including the division-by-zero edge case when the reference dataset is empty: the assertion would receive Infinity (primary non-empty) or NaN (both datasets empty).
    assertion: Double => Boolean): ComparisonResult = {
  val primaryCount = primary.count()
  val referenceCount = reference.count()
  val ratio = primaryCount.toDouble / referenceCount.toDouble
You should round this to a couple of decimal places. Can you check other analyzers for precedent?
Checked the other analyzers/comparisons and none of them round their ratios (DataSynchronization, ReferentialIntegrity, DatasetMatchAnalyzer all return raw doubles).
IMO, rounding would also lose precision for assertions, so I'm keeping this consistent with existing behavior.
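To make the precision concern concrete, here is a minimal sketch in plain Scala, with no Spark involved. The `roundTo2` helper and the counts 996 and 1000 are made up for illustration and are not part of the PR. It shows an exact-match assertion that fails on the raw ratio but would pass once the ratio is rounded to two decimal places.

```scala
object RoundingDemo {
  // Hypothetical helper, not part of the PR: round to two decimal places.
  def roundTo2(x: Double): Double = math.round(x * 100) / 100.0

  def main(args: Array[String]): Unit = {
    // Made-up counts: 996 primary rows against 1000 reference rows.
    val ratio = 996.0 / 1000.0

    // An assertion that demands the row counts match exactly.
    val assertion: Double => Boolean = _ == 1.0

    println(assertion(ratio))           // raw ratio: the mismatch is detected
    println(assertion(roundTo2(ratio))) // rounded ratio: the mismatch is hidden
  }
}
```

Returning the raw double leaves it to the caller to decide on a tolerance, for example `ratio => math.abs(ratio - 1.0) < 0.01`.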
import com.amazon.deequ.SparkContextSpec
import org.scalatest.wordspec.AnyWordSpec

class RowCountMatchTest extends AnyWordSpec with SparkContextSpec {
Can you add unit tests for:
- divide by zero
- both datasets have 0 rows
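A rough sketch of what those two cases could look like, assuming only ScalaTest. `RowCountRatio.evaluate` is a hypothetical stand-in that repeats the ratio arithmetic from the diff on plain counts, so the edge cases can be checked without a SparkSession. The real tests would presumably build empty DataFrames and call `RowCountMatch.matchRowCounts` instead.

```scala
import org.scalatest.wordspec.AnyWordSpec

// Hypothetical stand-in for the ratio logic in the diff; it takes plain
// counts instead of DataFrames. Not part of the PR.
object RowCountRatio {
  def evaluate(primaryCount: Long,
               referenceCount: Long,
               assertion: Double => Boolean): (Double, Boolean) = {
    val ratio = primaryCount.toDouble / referenceCount.toDouble
    (ratio, assertion(ratio))
  }
}

class RowCountRatioEdgeCaseTest extends AnyWordSpec {

  "row count ratio" should {

    "be positive infinity when only the reference dataset is empty" in {
      val (ratio, passed) = RowCountRatio.evaluate(5L, 0L, _ >= 0.9)
      assert(ratio.isPosInfinity)
      assert(passed) // Infinity >= 0.9 holds, so a lower-bound assertion passes
    }

    "be NaN when both datasets are empty" in {
      val (ratio, passed) = RowCountRatio.evaluate(0L, 0L, _ == 1.0)
      assert(ratio.isNaN)
      assert(!passed) // NaN == 1.0 is false, so an exact-match assertion fails
    }
  }
}
```

Note the asymmetry: with an empty reference, a lower-bound assertion such as `_ >= 0.9` passes on Infinity, which may or may not be the desired outcome.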
rules.map { rule =>
  val outcome = additionalDataSources.get(rule.referenceDatasetAlias) match {
    case Some(referenceDF) =>
      val result = RowCountMatch.matchRowCounts(df, referenceDF, rule.assertion)
Why do you need `val result` here?
Addressed, good point!
Description of changes:
Added RowCountMatch
From
new
RowCountMatch compares row counts between primary and reference datasets as a ratio!
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.