Commit 68fa946
Gerit Wagner
drop records with empty titles in block (not prep)
Rationale:
- The `prep()` method is not expected to remove records
- The changes prevent errors in the following scenario:
# When users replace records_df with the prepared records:
records_df = prep(records_df)
actual_blocked_df = block(records_df)
matched_df = match(actual_blocked_df)
duplicate_id_sets = cluster(matched_df)
# records_df would then be missing the records without titles,
# effectively producing false positives (FPs):
merged_df = merge(records_df, duplicate_id_sets=duplicate_id_sets)
# This error can easily go unnoticed.
When records are instead removed in the `block()` method, this error
is prevented: `actual_blocked_df` has a different structure, so
mistakenly assigning it to `records_df` would raise errors. The
resulting `merged_df` is formatted (prepared), but no records are missing.

1 parent e6b08b8 commit 68fa946
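
The rationale above can be illustrated with a minimal pandas sketch. It is not the project's actual code: the `prep()` and `block()` bodies below are hypothetical stand-ins, and the `ID` and `title` column names are assumptions. It only shows the intended division of labour, where `prep()` keeps every record and `block()` is the step that excludes records with empty titles.

```python
import pandas as pd


def prep(records_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical prep step: normalizes titles but never removes records."""
    prepared = records_df.copy()
    prepared["title"] = prepared["title"].fillna("").str.strip().str.lower()
    return prepared


def block(records_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical block step: drops empty titles, then pairs records by title."""
    candidates = records_df[records_df["title"] != ""]
    pairs = candidates.merge(candidates, on="title", suffixes=("_1", "_2"))
    # Keep each unordered pair once and skip self-pairs.
    return pairs[pairs["ID_1"] < pairs["ID_2"]][["ID_1", "ID_2", "title"]]


records_df = pd.DataFrame(
    {"ID": ["r1", "r2", "r3"], "title": ["Deep Learning ", "deep learning", None]}
)
records_df = prep(records_df)   # all three records survive, including r3
blocked_df = block(records_df)  # candidate pairs only; r3 is excluded here
```

Because `blocked_df` holds pairs (`ID_1`, `ID_2`) rather than records, accidentally passing it where `records_df` is expected would fail on the missing `ID` column instead of silently losing records.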
2 files changed, +8 -5 lines changed

Diff summary (line contents not preserved):
- First file: 1 line added (new line 21) and 7 lines added (new lines 245-251)
- Second file: 5 lines removed (original lines 141-145)