Commit 68fa946

Author: Gerit Wagner
drop records with empty titles in block (not prep)
Rationale:

- The `prep()` method is not expected to remove records.
- The change prevents errors in the following scenario:

      # When users replace records_df with the prepared records
      records_df = prep(records_df)
      actual_blocked_df = block(records_df)
      matched_df = match(actual_blocked_df)
      duplicate_id_sets = cluster(matched_df)
      # records_df would be missing the records without titles,
      # effectively producing false positives (FPs):
      merged_df = merge(records_df, duplicate_id_sets=duplicate_id_sets)
      # This error may easily go unnoticed.

When records are removed in the `block()` method instead, this error can be prevented: actual_blocked_df has a different structure than records_df, so mis-assignments would raise errors. The resulting merged_df is formatted (prepared), but no records are missing.
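The scenario in the rationale can be reproduced with a small sketch. It uses plain dicts rather than DataFrames, and every function in it (`prep_dropping`, `prep_keeping`, `find_duplicate_id_sets`, `merge`) is a hypothetical stand-in written for illustration, not the bib_dedupe API. `find_duplicate_id_sets` collapses the block, match, and cluster steps into one exact-title comparison, which is an assumption made to keep the example short.

```python
# Hypothetical stand-ins for illustration only; not the bib_dedupe API.

def prep_dropping(records):
    """Old behavior: normalize titles and drop records without a title."""
    return [{**r, "title": r["title"].lower()} for r in records if r["title"]]


def prep_keeping(records):
    """New behavior: normalize titles, return every input record."""
    return [{**r, "title": r["title"].lower() or None} for r in records]


def find_duplicate_id_sets(records):
    """Simplified block/match/cluster: untitled records are skipped here."""
    titled = [r for r in records if r["title"]]
    return [
        [a["ID"], b["ID"]]
        for i, a in enumerate(titled)
        for b in titled[i + 1:]
        if a["title"] == b["title"]
    ]


def merge(records, duplicate_id_sets):
    """Keep the first record of each duplicate set and all other records."""
    redundant = {rid for ids in duplicate_id_sets for rid in ids[1:]}
    return [r for r in records if r["ID"] not in redundant]


records = [
    {"ID": "1", "title": "Deduplication"},
    {"ID": "2", "title": "deduplication"},
    {"ID": "3", "title": ""},  # record without a title
]

# Old flow: the user overwrites the records with the prepared ones,
# so the untitled record is silently lost before merging.
prepared_old = prep_dropping(records)
ids_old = [r["ID"] for r in merge(prepared_old, find_duplicate_id_sets(prepared_old))]

# New flow: prep keeps all records; only duplicate detection ignores record 3.
prepared_new = prep_keeping(records)
ids_new = [r["ID"] for r in merge(prepared_new, find_duplicate_id_sets(prepared_new))]

print(ids_old)  # ['1']
print(ids_new)  # ['1', '3']
```

In the old flow record 3 disappears from the merged output as if it had been a duplicate. In the new flow it survives the merge because only the duplicate-detection step ignores it.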
1 parent e6b08b8 commit 68fa946

2 files changed: 8 additions, 5 deletions

bib_dedupe/block.py

Lines changed: 8 additions & 0 deletions
@@ -18,6 +18,7 @@
 from bib_dedupe.constants.fields import TITLE_SHORT
 from bib_dedupe.constants.fields import VOLUME
 from bib_dedupe.constants.fields import YEAR
+from bib_dedupe.constants.fields import TITLE
 
 block_fields_list = [
     {AUTHOR_FIRST, YEAR},
@@ -241,6 +242,13 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:
     )
     start_time = time.time()
 
+    if records_df[TITLE].isnull().any():
+        verbose_print.print(
+            "Warning: Some records have empty title field. These records will not be considered."
+        )
+        records_df = records_df.dropna(subset=[TITLE])
+
+
     pairs_df = pd.DataFrame(columns=["ID_1", "ID_2", "require_title_overlap"])
     pairs_df = pairs_df.astype(
         {"ID_1": str, "ID_2": str, "require_title_overlap": bool}

bib_dedupe/prep.py

Lines changed: 0 additions & 5 deletions
@@ -138,11 +138,6 @@ def __general_prep(records_df: pd.DataFrame) -> pd.DataFrame:
         records_df[column] = records_df[column].replace(
             ["#NAME?", "UNKNOWN", ""], np.nan
         )
-    if records_df[TITLE].isnull().any():
-        verbose_print.print(
-            "Warning: Some records have empty title field. These records will not be considered."
-        )
-        records_df = records_df.dropna(subset=[TITLE])
 
     # if columns are of type float, we need to avoid casting "3.0" to "30"
     for col in records_df.columns:
