
Commit 3aed092

Merge pull request #4 from zbMATHOpen/feat/add_by_file
Feat/add by file
2 parents: 5b82d36 + 8dab05f

8 files changed: +86 additions, -32 deletions

README.md

Lines changed: 25 additions & 18 deletions

````diff
@@ -1,6 +1,6 @@
 ## Update package for the zbMATH Links API
 
-The purpose of this package is to populate and update the database used by another package produced at [zbMATH](https://zbmath.org/), namely the zbMATH Links API `zbmath-links-api`, available [here](https://github.com/zbMATHOpen/linksApi).
+The purpose of this package is to populate and update the database used by another package produced at [zbMATH](https://zbmath.org/), namely the zbMATH Links API `zbmath-links-api`, available [here](https://github.com/zbMATHOpen/linksApi).
 The usage of the present package is mainly described in the README file of the `zbmath-links-api` package.
 
 Here we provide some simple instructions to install and use this package.
@@ -14,8 +14,8 @@ On a first install:
 pip install -e .
 ```
 
-This will install the package, `update-zblinks-api`, in the [virtual environment](https://docs.python.org/3/tutorial/venv.html).
-
+This will install the package, `update-zblinks-api`, in the [virtual environment](https://docs.python.org/3/tutorial/venv.html).
+
 
 2) Fill in the `config_template.ini` and save it as `config.ini`.
 
@@ -27,47 +27,54 @@ On a first install:
 (iii) The API-KEY is the one used by the API package `zbmath-links-api`.
 
 
-3) The package has two entry points:
+3) The package has three entry points:
 
 (i) To scrape all zbMATH partners (i.e., to obtain all their links) and update the database used by the package `zbmath-links-api`, use the command
 
 ```
 update-api
 ```
-
-This will automatically add new links, delete links that no longer exist, and edit links that have been modified.
-
-**Remark 1.** The present version of the package works with the [Digital Library of Mathematical Functions](https://dlmf.nist.gov/) (DLMF) as zbMATH partner.
+
+This will automatically add new links, delete links that no longer exist, and edit links that have been modified.
+
+**Remark 1.** The present version of the package works with the [Digital Library of Mathematical Functions](https://dlmf.nist.gov/) (DLMF) as zbMATH partner.
 Therefore, one can use the command
-
+
 ```
 update-api -p DLMF
 ```
-
+
 to update the DLMF dataset managed by `zbmath-links-api`.
 In the near future, some scraping scripts for other partners will be integrated into this package, and the command
-
+
 ```
 update-api
 ```
-
+
 will do an automatic update of all links managed by `zbmath-links-api` for all partners.
 
 **Remark 2.** To generate CSV files which can later be used to update the database manually (without writing to the database directly), use the command
-
+
 ```
 update-api --file
 ```
-
-This creates three CSV files: `new_links.csv`, `to_edit.csv`, `delete.csv` with the obvious contents, contained in the `update_zblinks_api/results` folder.
+
+This creates three CSV files in the `update_zblinks_api/results` folder: `{partner}_new_links.csv`, `{partner}_to_edit.csv`, and `{partner}_delete_links.csv`, containing the links to add, edit, and delete, respectively.
 
 (ii) Use the command
 
 ```
-csv-initial -p DLMF
+csv-initial -p <partner>
 ```
-
-to create two csv files with real DLMF data up to the year 2020: `DLMF_deids_table_init.csv` (to be inserted into the table `document_external_ids`) and `DLMF_source_table_init.csv` (to be inserted into the table `source`).
+
+to create two csv files with real historical partner data: `{partner}_deids_table_init.csv` (to be inserted into the table `document_external_ids`) and `{partner}_source_table_init.csv` (to be inserted into the table `source`).
 These files are contained in the `update_zblinks_api/results` folder.
 
+(iii) Use the command
+
+```
+csv-to-db
+```
+
+to read the csv files produced by `update-api --file` and write their contents to the database.
````
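
Taken together, these README changes describe a two-step, file-mediated workflow: `update-api --file` writes per-partner CSVs into `results/`, and `csv-to-db` later replays them against the database. Below is a minimal sketch of that round trip for the new-links file, assuming pandas; the column names match the dataframes in `update_with_api.py` further down, but the row values are placeholders.

```python
import os

import pandas as pd

os.makedirs("results", exist_ok=True)
partner = "dlmf"

# Step 1 -- what `update-api --file` does for each partner: dump the pending
# link changes to per-partner CSVs instead of posting them to the API.
df_new = pd.DataFrame({
    "document": ["<zbl-document-code>"],   # placeholder value
    "external_id": ["<partner-link-id>"],  # placeholder value
    "title": ["<link title>"],             # placeholder value
})
df_new.to_csv(f"results/{partner}_new_links.csv", index=False)

# Step 2 -- what `csv-to-db` does: read the same file back and hand each row
# to the API helpers (post_request in the real package; print stands in here).
for _, row in pd.read_csv(f"results/{partner}_new_links.csv").fillna("").iterrows():
    print("would add link:", row["document"], "->", row["external_id"])
```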

results/README.txt

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,3 +1,3 @@
 If the --file option is chosen, the results will be stored here.
 Three files will be created: "delete_links.csv", "new_links.csv", and "to_edit.csv"
-with the obvious contents in each file.
+with the obvious contents in each file.
````

setup.cfg

Lines changed: 1 addition & 0 deletions

````diff
@@ -31,6 +31,7 @@ where = src
 console_scripts =
     update-api = update_zblinks_api.update_with_api:update
     csv-initial = update_zblinks_api.matrix_table_datasets:create_matrix_table_datasets
+    csv-to-db = update_zblinks_api.update_with_api:use_files_to_update
 
 [pycodestyle]
 max-line-length = 79
````
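
The added `console_scripts` entry means `csv-to-db` is only a thin wrapper around one importable function; roughly (a sketch of the wiring, not of setuptools internals):

```python
# Approximately what running `csv-to-db` amounts to:
from update_zblinks_api.update_with_api import use_files_to_update

if __name__ == "__main__":
    use_files_to_update()
```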

src/update_zblinks_api/__init__.py

Lines changed: 2 additions & 0 deletions

````diff
@@ -3,6 +3,8 @@
 
 # tuple of all partners for zblinks API
 partners = ("DLMF",)
+partners = tuple(p.lower() for p in partners)
+
 
 config = configparser.ConfigParser()
 config.read("config.ini")
````
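
The added line normalizes the canonical partner names to lowercase at import time, which is what the `"dlmf"` comparisons introduced elsewhere in this commit rely on. In isolation:

```python
# Same normalization as above, shown standalone:
partners = ("DLMF",)
partners = tuple(p.lower() for p in partners)

assert partners == ("dlmf",)  # downstream code can now compare lowercase names
```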

src/update_zblinks_api/dlmf_scraping/historical/scrape_dlmf_historical.py

Lines changed: 3 additions & 3 deletions

````diff
@@ -76,12 +76,12 @@ def create_source_table_dataset(df_hist):
     df_hist = df_hist.rename(columns={"external_id": "id"})
 
     df_hist["url"] = "https://dlmf.nist.gov/" + df_hist["id"]
-    df_hist["partner"] = "DLMF"
+    df_hist["partner"] = "dlmf"
 
     df_hist["id_scheme"] = "DLMF scheme"
     df_hist["type"] = "DLMF bibliographic entry"
 
-    df_hist = df_hist.drop_duplicates()
+    df_hist = df_hist.drop_duplicates(subset=["id"])
 
     column_order = ["id", "id_scheme", "type", "url", "title", "partner"]
     df_hist = df_hist.reindex(columns=column_order)
@@ -106,7 +106,7 @@ def get_df_dlmf_initial():
         columns=(["document", "external_id", "date", "title"]))
     for year in range(2008, 2021):
         df_scrape = get_df_dlmf(year)
-        df_new, df_edit, df_delete = separate_links("DLMF", df_main, df_scrape)
+        df_new, df_edit, df_delete = separate_links("dlmf", df_main, df_scrape)
         df_new["date"] = year
         df_main = pd.concat([df_main, df_new]).drop_duplicates(keep=False)
````
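
The move to `drop_duplicates(subset=["id"])` is not cosmetic: two scraped rows can share an `id` while differing in another column such as `title`, and a plain `drop_duplicates()` keeps both. A small illustration with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["<dlmf-id>", "<dlmf-id>"],     # same id twice (placeholder value)
    "title": ["old title", "new title"],  # ...but different titles
})

print(len(df.drop_duplicates()))               # 2: the rows are not identical
print(len(df.drop_duplicates(subset=["id"])))  # 1: one source row per id
```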

src/update_zblinks_api/helpers/source_helpers.py

Lines changed: 2 additions & 2 deletions

````diff
@@ -75,7 +75,7 @@ def remove_lonely_sources(this_partner):
         LEFT OUTER JOIN document_external_ids
         ON src.id = document_external_ids.external_id
         AND src.partner = document_external_ids.type
-        WHERE partner = %(partner_arg)s
+        WHERE src.partner = %(partner_arg)s
         AND document_external_ids.external_id IS NULL
     """
 
@@ -92,7 +92,7 @@ def remove_lonely_sources(this_partner):
         AND partner = %(partner_arg)s
     """
 
-    data = {"id_list": lonely_id_tuple, "partner_arg": "DLMF"}
+    data = {"id_list": lonely_id_tuple, "partner_arg": this_partner}
 
     with connection.cursor() as cursor:
         cursor.execute(delete_request, data)
````
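
Both statements in this function use named `%(...)s` placeholders with a dict of values, and the id list travels as a Python tuple. The sketch below shows that calling convention only, assuming a psycopg2 connection; the diff does not show the full DELETE statement, so the DSN, table, and ids here are all illustrative.

```python
import psycopg2  # assumption: PostgreSQL accessed via psycopg2

connection = psycopg2.connect("dbname=zb_links")  # illustrative DSN

delete_request = """
    DELETE FROM source
    WHERE id IN %(id_list)s
    AND partner = %(partner_arg)s
"""

# psycopg2 adapts a Python tuple to a parenthesised list, so IN %(id_list)s
# expands to IN ('<id-1>', '<id-2>') at execution time.
data = {"id_list": ("<id-1>", "<id-2>"), "partner_arg": "dlmf"}

with connection.cursor() as cursor:
    cursor.execute(delete_request, data)
connection.commit()
```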

src/update_zblinks_api/matrix_table_datasets.py

Lines changed: 4 additions & 3 deletions

````diff
@@ -26,7 +26,7 @@ def create_deids_table_dataset(partner, df_hist):
     Parameters
     ----------
     partner : str
-        partner from which the initial datasets are to come.
+        partner (in lowercase) from which the initial datasets are to come.
     df_hist : dataframe
         contains columns: "document" (or "zbl_code"), "external_id",
         "date" (as int year).
@@ -35,7 +35,7 @@ def create_deids_table_dataset(partner, df_hist):
 
     df_hist = df_hist.rename(columns={"date": "matched_at"})
 
-    df_hist["type"] = partner.lower()
+    df_hist["type"] = partner
 
     df_hist["matched_at"] = pd.to_datetime(df_hist["matched_at"], format="%Y")
     df_hist["matched_at"] = (
@@ -75,8 +75,9 @@ def create_matrix_table_datasets(partner):
         partner from which the initial datasets are to come.
 
     """
+    partner = partner.lower()
 
     # this also creates the initial dataset for the zb_links.source table
-    df_init_partner = hist_scrape_dict[partner.lower()]()
+    df_init_partner = hist_scrape_dict[partner]()
 
     create_deids_table_dataset(partner, df_init_partner)
````
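
For context on the `matched_at` lines kept in this hunk: the historical date arrives as a bare year and is widened to a timestamp with the `format="%Y"` parse. Standalone, with made-up years:

```python
import pandas as pd

df_hist = pd.DataFrame({"matched_at": ["2008", "2020"]})  # made-up years
df_hist["matched_at"] = pd.to_datetime(df_hist["matched_at"], format="%Y")

print(df_hist["matched_at"].tolist())
# [Timestamp('2008-01-01 00:00:00'), Timestamp('2020-01-01 00:00:00')]
```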

src/update_zblinks_api/update_with_api.py

Lines changed: 48 additions & 5 deletions

````diff
@@ -153,6 +153,8 @@ def separate_links(partner, df_ext_partner, df_scrape):
         those entries from df_ext_partner which are to be deleted.
 
     """
+    partner = partner.lower()
+
     df_edit = pd.DataFrame(
         columns=(["document", "external_id", "title", "previous_ext_id"])
     )
@@ -169,7 +171,7 @@ def separate_links(partner, df_ext_partner, df_scrape):
     ).drop_duplicates(subset=["document", "external_id"], keep=False)
 
     # to update:
-    if partner == "DLMF":
+    if partner == "dlmf":
         df_new, df_edit, df_delete = dlmf_helpers.update(
             df_ext_partner, df_new, df_delete
         )
@@ -223,7 +225,7 @@ def scrape(partner):
 @click.option(
     "--file", is_flag=True,
     help="Use this option to write the data to csv files"
-    " instead of writing to the matrix"
+    " instead of writing to the matrix; "
     "new_links.csv, to_edit.csv, delete_links.csv will be created"
 )
 def update(file):
@@ -245,9 +247,9 @@ def update(file):
     )
 
     if file:
-        df_new.to_csv("results/new_links.csv", index=False)
-        df_edit.to_csv("results/to_edit.csv", index=False)
-        df_delete.to_csv("results/delete_links.csv", index=False)
+        df_new.to_csv(f"results/{partner}_new_links.csv", index=False)
+        df_edit.to_csv(f"results/{partner}_to_edit.csv", index=False)
+        df_delete.to_csv(f"results/{partner}_delete_links.csv", index=False)
     else:
         df_new = df_new.fillna("")
         df_edit = df_edit.fillna("")
@@ -263,3 +265,44 @@ def update(file):
             delete_request(row, partner)
 
     source_helpers.remove_lonely_sources(partner)
+
+
+def use_files_to_update():
+    """
+    For each partner, inserts the data from the csv files
+    {partner}_new_links.csv, {partner}_to_edit.csv, and
+    {partner}_delete_links.csv into the database.
+    These files need to be located in the results folder.
+
+    Notes
+    -----
+    Files which are not found in the results folder are
+    skipped with an error message.
+
+
+    """
+    for partner in partners:
+        insert_file = f"results/{partner}_new_links.csv"
+        try:
+            df_insert = pd.read_csv(insert_file)
+            df_insert = df_insert.fillna("")
+            for _, row in df_insert.iterrows():
+                post_request(row, partner)
+        except FileNotFoundError:
+            click.echo(f"Error: could not find {insert_file}.")
+
+        try:
+            edit_file = f"results/{partner}_to_edit.csv"
+            df_edit = pd.read_csv(edit_file)
+            for _, row in df_edit.iterrows():
+                update_request(row, partner)
+        except FileNotFoundError:
+            click.echo(f"Error: could not find {edit_file}.")
+
+        try:
+            delete_file = f"results/{partner}_delete_links.csv"
+            df_delete = pd.read_csv(delete_file)
+            for _, row in df_delete.iterrows():
+                delete_request(row, partner)
+        except FileNotFoundError:
+            click.echo(f"Error: could not find {delete_file}.")
````
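
The three try/except blocks in `use_files_to_update` repeat one pattern; the sketch below is a compact equivalent, shown only to make the per-file flow explicit. `post_request`, `update_request`, and `delete_request` are the package's own helpers; note the original fills NaNs only in the new-links frame, while this version does so uniformly.

```python
import click
import pandas as pd

def apply_csv(path, request_fn, partner):
    """Feed every row of one results CSV to the matching request helper."""
    try:
        for _, row in pd.read_csv(path).fillna("").iterrows():
            request_fn(row, partner)
    except FileNotFoundError:
        click.echo(f"Error: could not find {path}.")

# usage, mirroring use_files_to_update:
# for partner in partners:
#     apply_csv(f"results/{partner}_new_links.csv", post_request, partner)
#     apply_csv(f"results/{partner}_to_edit.csv", update_request, partner)
#     apply_csv(f"results/{partner}_delete_links.csv", delete_request, partner)
```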
