Commit 161aeed

Merge pull request #1 from golnazads/master
the first
2 parents 5c76c5a + d41ed17 commit 161aeed

168 files changed, +31132 -8 lines changed

.coveragerc

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+
+[run]
+source = adsrefpipe
+omit = adsrefpipe/test*

.github/workflows/python_actions.yml

Lines changed: 6 additions & 3 deletions
@@ -22,7 +22,6 @@ jobs:
 POSTGRES_USER: postgres
 POSTGRES_HOST: localhost
 POSTGRES_PORT: 5432
-
 ports:
 - 5432:5432
 # Set health checks to wait until postgres has started
@@ -32,6 +31,12 @@ jobs:
 --health-timeout 5s
 --health-retries 5
 
+rabbitmq:
+image: rabbitmq:3.11.13
+ports:
+- 15672:15672
+- 5672:5672
+
 steps:
 - uses: actions/checkout@v2
 - uses: actions/setup-python@v2
@@ -43,11 +48,9 @@ jobs:
 python -m pip install --upgrade setuptools pip
 pip install -r requirements.txt
 pip install -r dev-requirements.txt
-
 - name: Test with pytest
 run: |
 py.test
-
 - name: Upload coverage data to coveralls.io
 run: coveralls
 env:

.gitignore

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
*.py[co]
2+
local_config.py
3+
venv
4+
# Mac specific files
5+
.DS_Store

README.md

Lines changed: 86 additions & 5 deletions
@@ -7,23 +7,104 @@ reference resolver processing pipeline
 
 ## Short summary
 
-This pipeline ...
+This pipeline processes source reference files. XML files are parsed first, and the parsed references are then sent to the reference resolver to be matched against Solr records. Raw source files are sent directly to the reference resolver, which both parses and matches them.
 
 
+## Required software
+
+- RabbitMQ and PostgreSQL
+
+
 ## Setup (recommended)
 
 $ virtualenv python
 $ source python/bin/activate
 $ pip install -r requirements.txt
 $ pip install -r dev-requirements.txt
 $ vim local_config.py # edit, edit
-
-### Config options for users
-*
+$ ./start-celery.sh
 
 
 ## Queues
-*
+- task_process_reference: references from the input file are queued for processing one at a time
+
+## Command lines:
+
+### To run diagnostics:
+- Supply a list of bibcodes, a list of source files, or both
+```
+python run.py DIAGNOSTICS -b <list of bibcodes separated by spaces>
+python run.py DIAGNOSTICS -s <list of source filenames separated by spaces>
+python run.py DIAGNOSTICS -b <list of bibcodes separated by spaces> -s <list of source filenames separated by spaces>
+```
+
+- To check if a source file can be processed by the pipeline (parser is included), use the command
+```
+python run.py DIAGNOSTICS -p <source filename>
+```
+
+If diagnostics is run without any parameters, the record count of each of the four tables (Reference, History, Resolved, and Compare) is displayed.
+
+### To resolve references:
+
+- There are six options:
+
+1. To process specific source files, regardless of format (i.e., raw or any flavor of XML), use the command
+```
+python run.py RESOLVE -s <list of source filenames separated by spaces>
+```
+
+2. To recursively search a directory and all its subdirectories for reference files with a given extension, and queue them all for processing, use the command
+```
+python run.py RESOLVE -p <source files path> -e <source files extension>
+```
+
+3. To reprocess existing references based on a confidence cutoff value, use the command
+```
+python run.py RESOLVE -c <confidence cutoff>
+```
+where all references with a score lower than the cutoff are queued for reprocessing.
+
+4. To reprocess existing references based on the resolved bibcode's bibstem, use the command
+```
+python run.py RESOLVE -b <resolved reference bibstem>
+```
+where all references with this bibstem are queued for reprocessing.
+
+5. To reprocess existing references based on the resolved bibcode's year, use the command
+```
+python run.py RESOLVE -y <resolved reference year>
+```
+where all references with this year are queued for reprocessing.
+
+6. To reprocess existing references that were queued but not resolved because of a reference service issue, use the command
+```
+python run.py RESOLVE -f
+```
+where any references that were queued but not resolved are reprocessed.
+
+Note that an optional parameter can be combined with options 2 through 5 to filter results on time. Include the parameter
+
+-d <days>
+to filter on time. For option 2, the filter applies to the source file: if the file's timestamp falls within the past *days* days, the file is queued for processing. For options 3 through 5, the filter applies to the resolved reference runs: references processed within the past *days* days are queued for reprocessing.
+
+### To query database:
+
+- To list the source files processed for a specified publisher, use the command
+```
+python run.py STATS -p <publisher>
+```
+
+- To see the resolved records for a specific source bibcode or filename, use the command
+```
+python run.py STATS -b <source bibcode>
+python run.py STATS -s <source filename>
+```
+
+- To see the number of rows in the four main tables, use the command
+```
+python run.py STATS -c
+```
 
 
 ## Maintainers
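
The README's second RESOLVE option (a recursive directory search by file extension, optionally limited to recently modified files with `-d <days>`) could be sketched roughly as follows. This is a minimal illustration, not the pipeline's own code: the function name `find_source_files`, its parameters, and the use of file modification time as "the timestamp of the file" are all assumptions.

```python
import os
import time


def find_source_files(path, extension, days=None, now=None):
    """Recursively collect files under `path` whose names end with `extension`.

    Hypothetical helper, not part of adsrefpipe. If `days` is given, only
    files modified within the past `days` days are kept (assumption: the
    README's file timestamp means the modification time).
    """
    now = time.time() if now is None else now
    # Oldest acceptable modification time, in seconds since the epoch.
    cutoff = None if days is None else now - days * 86400
    matches = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith(extension):
                continue
            full_path = os.path.join(dirpath, name)
            if cutoff is not None and os.path.getmtime(full_path) < cutoff:
                continue  # too old for the requested time window
            matches.append(full_path)
    return sorted(matches)
```

Under these assumptions, `find_source_files("/some/dir", ".raw", days=7)` would return the `.raw` files modified in the last week, which a caller could then queue one by one.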

__init__.py

Whitespace-only changes.

adsrefpipe/__init__.py

Whitespace-only changes.
