Skip to content

Commit 1f95046

Browse files
author
Gerit Wagner
committed
Merge branch 'main' into ext_full_tests
2 parents ce74d14 + 75f5993 commit 1f95046

File tree

2 files changed

+9
-9
lines changed

2 files changed

+9
-9
lines changed

README.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -19,13 +19,13 @@ Unlike traditional deduplication methods, BibDedupe focuses on entity resolution
1919

2020
## Features
2121

22-
- **Automated Duplicate Linking with Zero False Positives**: BibDedupe automates the duplicate linking process with a focus on eliminating false positives.
23-
- **Preprocessing Approach**: BibDedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviation or translations.
24-
- **Entity Resolution**: BibDedupe does not simply delete duplicates, but it links duplicates to resolve the entitity and integrates the data. This allows for validation, and undo operations.
25-
- **Programmatic Access**: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
26-
- **Transparent and Reproducible Rules**: BibDedupe's blocking and matching rules are transparent and easily reproducible to promote reproducibility in deduplication processes.
27-
- **Continuous Benchmarking**: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
28-
- **Efficient and Parallel Computation**: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
22+
- **Automated duplicate linking with zero false positives**: BibDedupe automates the duplicate linking process with a focus on eliminating false positives.
23+
- **Preprocessing approach**: BibDedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviation or translations.
24+
- **Entity resolution**: BibDedupe does not simply delete duplicates, but it links duplicates to resolve the entitity and integrates the data. This allows for validation, and undo operations.
25+
- **Programmatic access**: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
26+
- **Transparent and reproducible rules**: BibDedupe's blocking and matching rules are transparent and easily reproducible to promote reproducibility in deduplication processes.
27+
- **Continuous benchmarking**: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
28+
- **Efficient and parallel computation**: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
2929

3030
## Documentation
3131

bib_dedupe/match_conditions.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -89,8 +89,8 @@ def both_entrytypes(entrytype: str) -> str:
8989
f"({au09_ti09_ctXX} & {match(VOLUME, PAGES)})",
9090
f"({au09_ti09_ctXX} & {match(PAGES, YEAR)} & {non_contradicting(VOLUME, NUMBER, DOI)})",
9191

92-
# DOI-exact is very strong; don't require container_title (often missing for misc/preprints)
93-
f"(({match(DOI)} & ~(doi_1 == '' | doi_2 == '')) & ({TITLE} > 0.95) & ({AUTHOR} > 0.9) & ({YEAR} > 0.9))",
92+
# DOI-exact match; when container-titles are non-contradicting (may be missing)
93+
f"(({match(DOI)} & ~(doi_1 == '' | doi_2 == '')) & ({TITLE} > 0.95) & ({AUTHOR} > 0.9) & ({YEAR} > 0.9)) & {non_contradicting(CONTAINER_TITLE)} ",
9494

9595
# no TITLE
9696
f"({au10_tiXX_ct10} & {match(VOLUME, NUMBER, PAGES, YEAR)} & {non_contradicting(DOI)} & ({ABSTRACT} > 0.95 | {non_contradicting(ABSTRACT)}))", # typically for number-mismatches in title

0 commit comments

Comments
 (0)