Commit 613bd12: Merge branch 'main' into cli

Author: Gerit Wagner
Parents: 9e43cbb + 97d71d4

File tree: 119 files changed (+879, -244582 lines)


.github/workflows/evaluate.yml (0 additions, 86 deletions)

This file was deleted.

.github/workflows/tests.yml (44 additions, 5 deletions)

```diff
@@ -1,18 +1,57 @@
 name: Run Tests

 on:
-  - push
-  - pull_request
+  push:
+  pull_request:

 jobs:
-  test-full-deps:
+  test:
+    name: Quick tests (${{ matrix.platform }}, py${{ matrix.python-version }})
     strategy:
       matrix:
         platform: [ubuntu-latest, windows-latest]
         python-version: ['3.10', '3.11', '3.12', '3.13']
     runs-on: ${{ matrix.platform }}
     steps:
       - uses: actions/checkout@v4
+        with:
+          submodules: recursive
+          fetch-depth: 0
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install uv and dependencies
+        run: |
+          pip install uv
+          uv venv
+          uv pip install -e .[dev] || echo "No dev extra"
+          echo "Dependencies installed successfully"
+
+      - name: Setup git
+        run: |
+          git config --global user.name "CoLRev update"
+          git config --global user.email "actions@users.noreply.github.com"
+          git config --global url.https://github.com/.insteadOf git://github.com/
+
+      - name: Run tests (excluding full_test.py)
+        run: uv run pytest --ignore=tests/full_test.py
+
+  full-test:
+    name: Full test (${{ matrix.platform }}, py${{ matrix.python-version }})
+    needs: test
+    strategy:
+      matrix:
+        platform: [ubuntu-latest, windows-latest]
+        python-version: ['3.10', '3.11', '3.12', '3.13']
+    runs-on: ${{ matrix.platform }}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          submodules: recursive
+          fetch-depth: 0

       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v4
@@ -32,5 +71,5 @@ jobs:
           git config --global user.email "actions@users.noreply.github.com"
           git config --global url.https://github.com/.insteadOf git://github.com/

-      - name: Run tests
-        run: uv run pytest
+      - name: Run full_test.py
+        run: uv run pytest -q tests/full_test.py
```

.gitmodules (3 additions, 0 deletions)

```diff
@@ -0,0 +1,3 @@
+[submodule "tests/ldd-full-benchmark"]
+	path = tests/ldd-full-benchmark
+	url = git@github.com:CoLRev-Environment/ldd-full-benchmark.git
```

CHANGELOG.md (9 additions, 0 deletions)

```diff
@@ -17,6 +17,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0).
 ### Fixed
 -->

+## 0.11.0 - 2025-11-21
+
+- Extract evaluation to separate repository (to be published soon)
+- Blocking: cleanup to ensure consistent use of ID_1 and ID_2
+- Refactoring to prevent FutureWarnings by pandas
+- Extend match conditions for records with missing fields (e.g., based on GROBID extraction)
+- Drop records with empty titles in block instead of prep to prevent subtle errors
+- Prevents same-source merges in connected components
+
 ## 0.10.0 - 2024-11-05

 - Fix and silence pandas Future warnings
```

README.md (12 additions, 8 deletions)

```diff
@@ -8,7 +8,9 @@
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Ftests.yml?label=tests)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/tests.yml)
 [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Fdocs.yml?label=docs)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/docs.yml)
-[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Fevaluate.yml?label=continuous%20evaluation)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/evaluate.yml)
+<!--
+[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/literature-deduplication-benchmarks/actions/workflows/ldd-full.yml?label=continuous%20evaluation)](https://github.com/CoLRev-Environment/literature-deduplication-benchmarks/actions/workflows/ldd-full.yml)
+-->

 </div>

@@ -19,13 +21,15 @@ Unlike traditional deduplication methods, BibDedupe focuses on entity resolution

 ## Features

-- **Automated Duplicate Linking with Zero False Positives**: BibDedupe automates the duplicate linking process with a focus on eliminating false positives.
-- **Preprocessing Approach**: BibDedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviation or translations.
-- **Entity Resolution**: BibDedupe does not simply delete duplicates, but it links duplicates to resolve the entitity and integrates the data. This allows for validation, and undo operations.
-- **Programmatic Access**: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
-- **Transparent and Reproducible Rules**: BibDedupe's blocking and matching rules are transparent and easily reproducible to promote reproducibility in deduplication processes.
-- **Continuous Benchmarking**: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
-- **Efficient and Parallel Computation**: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
+- **Automated duplicate linking with zero false positives**: BibDedupe automates the duplicate linking process with a focus on eliminating false positives.
+- **Preprocessing approach**: BibDedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviation or translations.
+- **Entity resolution**: BibDedupe does not simply delete duplicates, but it links duplicates to resolve the entitity and integrates the data. This allows for validation, and undo operations.
+- **Programmatic access**: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
+- **Transparent and reproducible rules**: BibDedupe's blocking and matching rules are transparent and easily reproducible to promote reproducibility in deduplication processes.
+- **Continuous benchmarking**: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
+- **Efficient and parallel computation**: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
+
+Regular benchmarks are available [here](https://github.com/CoLRev-Environment/literature-deduplication-benchmarks).

 ## Documentation
```

bib_dedupe/bib_dedupe.py (1 addition, 1 deletion)

```diff
@@ -166,7 +166,7 @@ def merge(
     if verbosity_level is not None:
         verbose_print.verbosity_level = verbosity_level

-    if matched_df:
+    if matched_df is not None:
        duplicate_id_sets = bib_dedupe.cluster.get_connected_components(matched_df)

        if not duplicate_id_sets:
```
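The `matched_df is not None` fix above guards against a pandas pitfall: truth-testing a DataFrame raises rather than reporting emptiness, so `if matched_df:` would crash whenever a frame is actually passed. A minimal sketch of the failure mode, on invented data (not from the commit):

```python
import pandas as pd

matched_df = pd.DataFrame({"ID_1": ["r1"], "ID_2": ["r2"]})

# `if matched_df:` is ambiguous for pandas objects and raises ValueError.
try:
    if matched_df:
        pass
except ValueError:
    print("truth value is ambiguous")

# The corrected check distinguishes "no frame supplied" from "empty frame":
if matched_df is not None:
    print(f"{len(matched_df)} matched pair(s)")
```

An explicit `None` check also keeps empty-but-present results flowing into the subsequent clustering step, instead of silently skipping them.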

bib_dedupe/block.py (38 additions, 7 deletions)

```diff
@@ -15,6 +15,7 @@
 from bib_dedupe.constants.fields import NUMBER
 from bib_dedupe.constants.fields import PAGES
 from bib_dedupe.constants.fields import SEARCH_SET
+from bib_dedupe.constants.fields import TITLE
 from bib_dedupe.constants.fields import TITLE_SHORT
 from bib_dedupe.constants.fields import VOLUME
 from bib_dedupe.constants.fields import YEAR
@@ -85,7 +86,9 @@ def create_pairs_for_block_fields(

     pairs = (
         non_empty_df.groupby("block_hash", group_keys=True)["ID"]
-        .apply(lambda x: pd.DataFrame(list(combinations(x, 2)), columns=["ID1", "ID2"]))
+        .apply(
+            lambda x: pd.DataFrame(list(combinations(x, 2)), columns=["ID_1", "ID_2"])
+        )
         .reset_index(drop=True)
     )
     pairs["block_rule"] = "-".join(block_fields)
```
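The change in `create_pairs_for_block_fields` renames the pair columns to `ID_1`/`ID_2`. The underlying groupby-plus-combinations pair generation can be sketched in isolation on toy data (the `block_hash` values here are hypothetical placeholders, not the library's real hashes):

```python
from itertools import combinations

import pandas as pd

# Toy records: r1 and r2 share a block hash, so they form one candidate pair;
# r3 sits alone in its block and yields no pair.
df = pd.DataFrame(
    {
        "ID": ["r1", "r2", "r3"],
        "block_hash": ["h1", "h1", "h2"],
    }
)

pairs = (
    df.groupby("block_hash", group_keys=True)["ID"]
    # Within each block, emit all unordered ID pairs.
    .apply(
        lambda x: pd.DataFrame(list(combinations(x, 2)), columns=["ID_1", "ID_2"])
    )
    .reset_index(drop=True)
)
print(pairs)  # single row: ID_1 == "r1", ID_2 == "r2"
```

Because `combinations` only pairs records inside the same block, the candidate set stays far smaller than the full quadratic cross-product over all records.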
```diff
@@ -208,6 +211,23 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:

     Returns:
         pd.DataFrame: The dataframe after blocking operation.
+
+
+    Output table structure (columns, in order):
+        block_rule,
+        ID_1, ENTRYTYPE_1, author_1, year_1, title_1, volume_1, number_1,
+        pages_1, abstract_1, doi_1, series_1, search_set_1, container_title_1,
+        author_full_1, author_first_1, title_short_1, container_title_short_1,
+        ID_2, ENTRYTYPE_2, author_2, year_2, title_2, volume_2, number_2,
+        pages_2, abstract_2, doi_2, series_2, search_set_2, container_title_2,
+        author_full_2, author_first_2, title_short_2, container_title_short_2
+
+    Column meanings:
+        - ID_1 / ID_2: The two record identifiers that form the pair.
+        - block_rule: Name/description of the blocking rule that surfaced this pair;
+          use an empty string if not applicable.
+        - *_1 / *_2: Field values of the left/right record in the pair, respectively.
+          These mirror the original record schema (e.g., author, year, title, etc.)
     """
     INSTRUCTION = "(please run prep(records_df) and pass the prepared df)"
     assert (
@@ -222,8 +242,16 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:
     )
     start_time = time.time()

-    pairs_df = pd.DataFrame(columns=["ID1", "ID2", "require_title_overlap"])
-    pairs_df = pairs_df.astype({"ID1": str, "ID2": str, "require_title_overlap": bool})
+    if records_df[TITLE].isnull().any():
+        verbose_print.print(
+            "Warning: Some records have empty title field. These records will not be considered."
+        )
+        records_df = records_df.dropna(subset=[TITLE])
+
+    pairs_df = pd.DataFrame(columns=["ID_1", "ID_2", "require_title_overlap"])
+    pairs_df = pairs_df.astype(
+        {"ID_1": str, "ID_2": str, "require_title_overlap": bool}
+    )
     if cpu == 1:
         for field in block_fields_list:
             pairs_df = pd.concat(
@@ -242,15 +270,15 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:
         pairs_df = pd.concat(results, ignore_index=True)

     # title overlap is only required when there is no blocked pair that requires it.
-    pairs_df["require_title_overlap"] = pairs_df.groupby(["ID1", "ID2"])[
+    pairs_df["require_title_overlap"] = pairs_df.groupby(["ID_1", "ID_2"])[
         "require_title_overlap"
     ].transform("all")
-    pairs_df = pairs_df.drop_duplicates(subset=["ID1", "ID2"])
+    pairs_df = pairs_df.drop_duplicates(subset=["ID_1", "ID_2"])

     pairs_df = pd.merge(
         pairs_df,
         records_df.add_suffix("_1"),
-        left_on="ID1",
+        left_on="ID_1",
         right_on="ID_1",
         how="left",
         suffixes=("", "_1"),
@@ -259,7 +287,7 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:
     pairs_df = pd.merge(
         pairs_df,
         records_df.add_suffix("_2"),
-        left_on="ID2",
+        left_on="ID_2",
         right_on="ID_2",
         how="left",
         suffixes=("", "_2"),
@@ -273,4 +301,7 @@ def block(records_df: pd.DataFrame, cpu: int = -1) -> pd.DataFrame:
     verbose_print.print(f"Blocked pairs reduced to {pairs_df.shape[0]:,} pairs")
     end_time = time.time()
     verbose_print.print(f"Block completed after: {end_time - start_time:.2f} seconds")
+
+    pairs_df.drop(columns=["require_title_overlap"], inplace=True)
+
     return pairs_df
```
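The `require_title_overlap` aggregation and the suffixed merges in `block()` can be sketched end-to-end on invented records (the field set is reduced to `title` for brevity; the real output carries the full column list from the docstring):

```python
import pandas as pd

# Candidate pairs, possibly produced by several blocking rules.
# The same pair (a, b) was surfaced by two rules with different requirements.
pairs_df = pd.DataFrame(
    {
        "ID_1": ["a", "a"],
        "ID_2": ["b", "b"],
        "block_rule": ["year-volume", "doi"],
        "require_title_overlap": [True, False],
    }
)

# A pair needs title overlap only if *every* rule that produced it requires it:
# groupby + transform("all") broadcasts the per-pair AND back onto each row.
pairs_df["require_title_overlap"] = pairs_df.groupby(["ID_1", "ID_2"])[
    "require_title_overlap"
].transform("all")
pairs_df = pairs_df.drop_duplicates(subset=["ID_1", "ID_2"])

records_df = pd.DataFrame({"ID": ["a", "b"], "title": ["T1", "T2"]})

# Attach left/right record fields via suffixed merges against the record table.
pairs_df = pd.merge(
    pairs_df, records_df.add_suffix("_1"), left_on="ID_1", right_on="ID_1", how="left"
)
pairs_df = pd.merge(
    pairs_df, records_df.add_suffix("_2"), left_on="ID_2", right_on="ID_2", how="left"
)
print(pairs_df[["ID_1", "ID_2", "require_title_overlap", "title_1", "title_2"]])
```

Since one of the two rules did not require title overlap, the surviving row for (a, b) carries `require_title_overlap == False`, and the merges populate `title_1`/`title_2` from the record table.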
