Commit e33c136

author
Gerit Wagner
committed
draft
1 parent b7b5cf3 commit e33c136

6 files changed

+148
-9
lines changed

CONTRIBUTING.md

Lines changed: 48 additions & 8 deletions
@@ -7,6 +7,47 @@ You can contribute in many ways:
 
 ## Types of Contributions
 
+### Report a duplicate error (false positive or false negative)
+
+Provide the case in the following format so that we can add it to `tests/test_cases.json`:
+
+```json
+{
+  "id": "abrahao_parigi_gupta_cook_2017_pnas_short_vs_full",
+  "note": "Same paper; record_b uses abbreviated author formatting and omits venue fields; record_a includes DOI.",
+
+  "record_a": {
+    "ENTRYTYPE": "article",
+    "ID": "1",
+    "doi": "10.1073/PNAS.1604234114",
+    "author": "Abrahao, Bruno and Parigi, Paolo and Gupta, Alok and Cook, Karen S.",
+    "title": "Reputation offsets trust judgments based on social biases among Airbnb users",
+    "journal": "Proceedings of the National Academy of Sciences",
+    "number": "37",
+    "pages": "9848--9853",
+    "volume": "114",
+    "year": "2017"
+  },
+  "record_b": {
+    "ENTRYTYPE": "article",
+    "ID": "2",
+    "author": "B. Abrahao; P. Parigi; A. Gupta; K. S. Cook",
+    "year": "2017",
+    "title": "Reputation offsets trust judgments based on social biases among Airbnb users"
+  },
+
+  "expected_duplicate": true
+}
+```
+
+### Fixing duplicate errors
+
+All changes to the deduplication logic (`prep`, `sim`, `match`) should be accompanied by a test case in the pull request.
+
+TODO:
+- Before merging, run the ldd-full tests to determine how the changes affect overall performance. (TBD: Locally? How will the run be triggered? How do we ensure that the right version/branch is tested? How are the results added to the pull request? Do we want to consider performance implications?)
+- Consider the possibility of schema inconsistency.
+
 ### Report Bugs
 
 Report bugs at https://github.com/CoLRev-Environment/bib-dedupe/issues.
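
The test-case format introduced in this hunk can be checked before a case is submitted. The following is a minimal sketch and not part of bib-dedupe: the helper name `validate_test_case` and the set of required keys are assumptions inferred from the example case (`id`, `record_a`, `record_b`, `expected_duplicate`); the project's test harness may expect different fields.

```python
import json

# Keys inferred from the example case in CONTRIBUTING.md; this is an
# assumption, not a schema enforced by the project's tests.
REQUIRED_KEYS = {"id", "record_a", "record_b", "expected_duplicate"}


def validate_test_case(case: dict) -> list:
    """Return a list of problems found in one test case (empty if none)."""
    problems = []

    missing = REQUIRED_KEYS - case.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    for name in ("record_a", "record_b"):
        if name not in case:
            continue  # already reported as a missing key
        record = case[name]
        if not isinstance(record, dict):
            problems.append(f"{name} must be an object")
        elif "ID" not in record:
            problems.append(f"{name} has no ID field")

    if "expected_duplicate" in case and not isinstance(
        case["expected_duplicate"], bool
    ):
        problems.append("expected_duplicate must be true or false")

    return problems


# Abbreviated version of the example case, parsed from JSON text.
case = json.loads(
    """
    {
      "id": "abrahao_parigi_gupta_cook_2017_pnas_short_vs_full",
      "record_a": {"ENTRYTYPE": "article", "ID": "1", "year": "2017"},
      "record_b": {"ENTRYTYPE": "article", "ID": "2", "year": "2017"},
      "expected_duplicate": true
    }
    """
)
print(validate_test_case(case))
```

Returning a list of problems rather than raising on the first one lets a contributor fix all issues in a case in one pass.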
@@ -51,13 +92,13 @@ Ready to contribute? Here's how to set up BibDedupe for local development.
 1. Fork the `bib-dedupe` repo on GitHub.
 2. Clone your fork locally:
 
-```
+```sh
 git clone git@github.com:your_name_here/bib-dedupe.git
 ```
 
 3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
 
-```
+```sh
 mkvirtualenv bib-dedupe
 cd bib-dedupe/
 pip3 install poetry
@@ -66,7 +107,7 @@ Ready to contribute? Here's how to set up BibDedupe for local development.
 
 4. Create a branch for local development:
 
-```
+```sh
 git checkout -b name-of-your-bugfix-or-feature
 ```
 
@@ -75,14 +116,14 @@ Ready to contribute? Here's how to set up BibDedupe for local development.
 5. When you're done making changes, check that your changes pass the
 tests and pre-commit hooks:
 
-```
+```sh
 pytest
 pre-commit run -a
 ```
 
 6. Commit your changes and push your branch to GitHub:
 
-```
+```sh
 git add .
 git commit -m "Your detailed description of your changes."
 git push origin name-of-your-bugfix-or-feature
@@ -98,9 +139,8 @@ Before you submit a pull request, check that it meets these guidelines:
 2. If the pull request adds functionality, the docs should be updated. Put
 your new functionality into a function with a docstring, and add the
 feature to the list in README.rst.
-3. The pull request should work for Python 3.5, 3.6, 3.7 and 3.8, and for PyPy. Check
-https://travis-ci.com/CoLRev-Ecosystem/bib-dedupe/pull_requests
-and make sure that the tests pass for all supported Python versions.
+3. The pull request should work for the Python versions specified in the `pyproject.toml`.
+Make sure that the tests pass for all supported Python versions.
 
 ## Coding standards
 

docs/architecture.rst

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
Architecture
2+
====================================
3+
4+
.. mermaid::
5+
6+
flowchart LR
7+
8+
A0["Input: records_df"]
9+
--> A1["prep(records_df)<br>• prep_schema<br>• prep_author<br>• prep_title<br>• prep_abstract<br>• prep_container_title<br>• prep_doi<br>• prep_volume<br>• prep_number<br>• prep_pages<br>• prep_year"]
10+
11+
O2["(manual review outside code)"]
12+
13+
subgraph API["Public API (call sequence)"]
14+
A1 --> A2["block(prep_df)<br>Uses block.py to create candidate record pairs"]
15+
16+
A2 --> A3["match(pairs_df)<br>Uses match.py to compute similarities and classify pairs"]
17+
18+
A3 --> A4["cluster(matched_df)<br>Uses cluster.py to build connected components (duplicate groups)"]
19+
20+
A4 --> A5["merge(records_df, duplicate_id_sets)<br>Uses merge.py to combine records within each group"]
21+
22+
subgraph Manual["Optional manual review (human-in-the-loop)"]
23+
O1["export_maybe(records_df, matched_df)<br>Write uncertain pairs via maybe_cases.export"]
24+
O3["import_maybe(matched_df)<br>Read decisions via maybe_cases.import"]
25+
end
26+
end
27+
28+
O1 --> O2 --> O3
29+
30+
A5 --> A6["Output: merged records_df"]
31+
32+
%% optional linkage back into the main flow
33+
A3 -. "optional manual review" .-> O1
34+
O3 -. "returns updated matched_df" .-> A4
35+
36+
37+
Runtime of individual steps
38+
39+
cd docs
40+
python benchmark_runtime_detailed.py
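
The diagram describes the `cluster` step as building connected components (duplicate groups) from matched pairs. As an illustration of that idea only, here is a minimal union-find sketch in plain Python. It does not use or reproduce bib-dedupe's `cluster.py`; the function name `connected_components` and the input format (tuples of record IDs) are assumptions made for this example.

```python
def connected_components(pairs):
    """Group record IDs into duplicate sets.

    `pairs` is an iterable of (id_a, id_b) tuples that were judged to be
    duplicates. Records linked directly or transitively end up in one group.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)

    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


matched_pairs = [("1", "2"), ("2", "3"), ("4", "5")]
print(connected_components(matched_pairs))
# → [['1', '2', '3'], ['4', '5']]
```

Records "1" and "3" are never compared directly, yet they land in the same group because both match "2". This transitivity is why a single false-positive match can merge two otherwise unrelated groups.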

docs/benchmark.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Benchmark
2+
====================================
3+
4+
TODO

docs/benchmark_runtime_detailed.py

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
import time
2+
import statistics as stats
3+
import pandas as pd
4+
5+
import bib_dedupe.bib_dedupe as bd
6+
from pathlib import Path
7+
8+
BENCHMARK_DIR = Path("../tests/ldd-full-benchmark")
9+
10+
def timed(label, fn, *args, **kwargs):
11+
t0 = time.perf_counter()
12+
out = fn(*args, **kwargs)
13+
dt = time.perf_counter() - t0
14+
return out, dt
15+
16+
17+
def benchmark_pipeline(records_df, *, cpu=-1, repeats=5, warmup=1):
18+
# warmup (important for caches, process pools, etc.)
19+
for _ in range(warmup):
20+
prepped = bd.prep(records_df, verbosity_level=0, cpu=cpu)
21+
pairs = bd.block(prepped, verbosity_level=0, cpu=cpu)
22+
_ = bd.match(pairs, verbosity_level=0, cpu=cpu)
23+
24+
prep_times, block_times, match_times = [], [], []
25+
for _ in range(repeats):
26+
prepped, t_prep = timed("prep", bd.prep, records_df, verbosity_level=0, cpu=cpu)
27+
pairs, t_block = timed("block", bd.block, prepped, verbosity_level=0, cpu=cpu)
28+
matched, t_match = timed("match", bd.match, pairs, verbosity_level=0, cpu=cpu)
29+
30+
prep_times.append(t_prep)
31+
block_times.append(t_block)
32+
match_times.append(t_match)
33+
34+
def summ(xs):
35+
return {
36+
"n": len(xs),
37+
"mean_s": stats.mean(xs),
38+
"median_s": stats.median(xs),
39+
"min_s": min(xs),
40+
"max_s": max(xs),
41+
}
42+
43+
return {
44+
"prep": summ(prep_times),
45+
"block": summ(block_times),
46+
"match_total": summ(match_times),
47+
}
48+
49+
50+
dataset = "cardiac"
51+
df = pd.read_csv(BENCHMARK_DIR / dataset / "records_pre_merged.csv")
52+
53+
print(benchmark_pipeline(df, cpu=-1, repeats=10, warmup=2))
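
The benchmarking pattern used in this script (untimed warm-up runs, repeated timed runs, summary statistics) can be tried without the benchmark data or bib-dedupe itself. A minimal sketch with a stand-in workload follows; `benchmark` and `workload` are hypothetical names introduced for this example and are not part of the project.

```python
import statistics
import time


def benchmark(fn, *args, repeats=5, warmup=1, **kwargs):
    """Time fn(*args, **kwargs) and summarize wall-clock durations in seconds."""
    # Untimed warm-up runs, so one-off costs (caches, imports) are excluded.
    for _ in range(warmup):
        fn(*args, **kwargs)

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)

    return {
        "n": len(durations),
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def workload(n):
    """Stand-in for a pipeline step; sums the first n squares."""
    return sum(i * i for i in range(n))


summary = benchmark(workload, 100_000, repeats=3, warmup=1)
print(summary)
```

When comparing runs, the median is usually the more robust figure, since a single slow repeat (for example, caused by other load on the machine) shifts the mean but not the median.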

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 templates_path = ["_templates"]
 exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
 
-extensions = ["sphinx.ext.autodoc", "sphinx_copybutton"]
+extensions = ["sphinx.ext.autodoc", "sphinx_copybutton", "sphinxcontrib.mermaid"]
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -95,3 +95,5 @@ For advanced use cases, it is also possible to complete and customize each step
    installation
    usage
    api
+   architecture
+   benchmark
