Commit ae5d20e

Copilot and seanthegeek authored
Fix duplicate detection for normalized aggregate reports in Elasticsearch/OpenSearch (#666)
Change date_begin/date_end queries from exact match to range queries (gte/lte) so that previously saved normalized time buckets are correctly detected as duplicates within the original report's date range.

Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>
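
The effect of switching from exact match to gte/lte can be illustrated with a minimal sketch in plain Python. This is not the parsedmarc code: the report range, the two one-day buckets, and the helper names exact_match and range_match are all hypothetical, chosen only to show why an exact comparison misses normalized buckets while a range comparison finds them.

```python
from datetime import datetime

# Hypothetical report range and the normalized buckets saved from it.
report_begin = datetime(2024, 1, 1)
report_end = datetime(2024, 1, 3)
saved_buckets = [
    (datetime(2024, 1, 1), datetime(2024, 1, 2)),
    (datetime(2024, 1, 2), datetime(2024, 1, 3)),
]


def exact_match(buckets, begin, end):
    """Old behaviour: both dates must equal the report's dates."""
    return [b for b in buckets if b[0] == begin and b[1] == end]


def range_match(buckets, begin, end):
    """New behaviour: date_begin >= begin and date_end <= end."""
    return [b for b in buckets if b[0] >= begin and b[1] <= end]


exact_hits = exact_match(saved_buckets, report_begin, report_end)
range_hits = range_match(saved_buckets, report_begin, report_end)
print(len(exact_hits), len(range_hits))  # prints: 0 2
```

Under these assumed values the exact comparison finds no previously saved bucket, so the report would be saved again, while the range comparison finds both buckets and the report is treated as a duplicate.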
1 parent e98fdfa commit ae5d20e

2 files changed: +4 -4 lines changed

parsedmarc/elastic.py

Lines changed: 2 additions & 2 deletions
@@ -413,8 +413,8 @@ def save_aggregate_report_to_elasticsearch(
     org_name_query = Q(dict(match_phrase=dict(org_name=org_name)))  # type: ignore
     report_id_query = Q(dict(match_phrase=dict(report_id=report_id)))  # pyright: ignore[reportArgumentType]
     domain_query = Q(dict(match_phrase={"published_policy.domain": domain}))  # pyright: ignore[reportArgumentType]
-    begin_date_query = Q(dict(match=dict(date_begin=begin_date)))  # pyright: ignore[reportArgumentType]
-    end_date_query = Q(dict(match=dict(date_end=end_date)))  # pyright: ignore[reportArgumentType]
+    begin_date_query = Q(dict(range=dict(date_begin=dict(gte=begin_date))))  # pyright: ignore[reportArgumentType]
+    end_date_query = Q(dict(range=dict(date_end=dict(lte=end_date))))  # pyright: ignore[reportArgumentType]
 
     if index_suffix is not None:
         search_index = "dmarc_aggregate_{0}*".format(index_suffix)

parsedmarc/opensearch.py

Lines changed: 2 additions & 2 deletions
@@ -413,8 +413,8 @@ def save_aggregate_report_to_opensearch(
     org_name_query = Q(dict(match_phrase=dict(org_name=org_name)))
     report_id_query = Q(dict(match_phrase=dict(report_id=report_id)))
     domain_query = Q(dict(match_phrase={"published_policy.domain": domain}))
-    begin_date_query = Q(dict(match=dict(date_begin=begin_date)))
-    end_date_query = Q(dict(match=dict(date_end=end_date)))
+    begin_date_query = Q(dict(range=dict(date_begin=dict(gte=begin_date))))
+    end_date_query = Q(dict(range=dict(date_end=dict(lte=end_date))))
 
     if index_suffix is not None:
         search_index = "dmarc_aggregate_{0}*".format(index_suffix)
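
For readers less familiar with the Q(dict(...)) form, the five clauses in the hunks above correspond to ordinary query DSL bodies. The sketch below builds the equivalent request body as a plain dict, with no Elasticsearch or OpenSearch client required. The function name build_duplicate_query and the sample values are hypothetical, and combining the clauses under bool/must is an assumption, since the diff does not show how the individual queries are joined.

```python
import json


def build_duplicate_query(org_name, report_id, domain, begin_date, end_date):
    """Return a query body with the five clauses from the diff.

    Assumption: the clauses are ANDed together (bool/must); the diff
    itself does not show how they are combined.
    """
    return {
        "bool": {
            "must": [
                {"match_phrase": {"org_name": org_name}},
                {"match_phrase": {"report_id": report_id}},
                {"match_phrase": {"published_policy.domain": domain}},
                # Changed in this commit: range clauses instead of match.
                {"range": {"date_begin": {"gte": begin_date}}},
                {"range": {"date_end": {"lte": end_date}}},
            ]
        }
    }


# Hypothetical sample values, for illustration only.
query = build_duplicate_query(
    "example.org",
    "abc123",
    "example.com",
    "2024-01-01T00:00:00",
    "2024-01-03T00:00:00",
)
print(json.dumps(query, indent=2))
```

Printing the body makes it easy to compare against what a client actually sends when checking whether a report has already been saved.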
