This pipeline processes source reference files. XML files are parsed first and then sent to the reference resolver to be matched against Solr records. Raw files are sent to the reference resolver directly, where they are both parsed and matched.
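
The routing described above can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: `parse_xml`, `send_to_resolver`, and `process_source` are hypothetical stand-ins rather than the pipeline's actual functions, and detecting XML by the `.xml` suffix is a simplification.

```python
from pathlib import Path
from typing import Dict, List


def parse_xml(text: str) -> List[str]:
    """Hypothetical parser stand-in: treats each non-empty line as one reference."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def send_to_resolver(references: List[str], parsed: bool) -> Dict[str, object]:
    """Hypothetical resolver call: only reports what would be sent."""
    return {"parsed": parsed, "count": len(references)}


def process_source(filename: str, text: str) -> Dict[str, object]:
    """Route a source file by type (assumed here to be decided by its suffix)."""
    if Path(filename).suffix.lower() == ".xml":
        # XML sources are parsed locally, then sent for matching only.
        return send_to_resolver(parse_xml(text), parsed=True)
    # Raw sources are sent unparsed; the resolver parses and matches them.
    return send_to_resolver(text.splitlines(), parsed=False)
```
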

## Required software

- RabbitMQ and PostgreSQL

## Setup (recommended)

    $ virtualenv python
    $ source python/bin/activate
    $ pip install -r requirements.txt
    $ pip install -r dev-requirements.txt
    $ vim local_config.py # edit, edit
    $ ./start-celery.sh

## Queues

- task_process_reference: references from the input file are queued for processing one at a time
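
As a rough illustration of "one reference at a time", the sketch below puts each reference on a queue as its own task. It uses the standard library `queue.Queue` purely as a stand-in for the real task queue, and `queue_references` and the message layout are hypothetical, not taken from the pipeline's code.

```python
from queue import Queue


def queue_references(references, task_queue):
    """Put each reference on the queue as a separate task; return how many were queued."""
    count = 0
    for reference in references:
        # One message per reference, so each can be processed independently.
        task_queue.put({"task": "task_process_reference", "reference": reference})
        count += 1
    return count
```
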
## Command lines:
### To run diagnostics:
- Supply a list of bibcodes, a list of source files, or both
```
python run.py DIAGNOSTICS -b <list of bibcodes separated by spaces>
python run.py DIAGNOSTICS -s <list of source filenames separated by spaces>
python run.py DIAGNOSTICS -b <list of bibcodes separated by spaces> -s <list of source filenames separated by spaces>
```

- To check whether a source file can be processed by the pipeline (that is, whether a parser for it is included), use the command
```
python run.py DIAGNOSTICS -p <source filename>
```

If diagnostics is run without any parameters, the count of records in each of the four tables (Reference, History, Resolved, and Compare) is displayed.

### To resolve references:
- There are six options:
1. To process specific source files, regardless of format (i.e., raw or any flavor of XML), use the command
```
python run.py RESOLVE -s <list of source filenames separated by spaces>
```

2. To recursively search a directory and all of its subdirectories for reference files with a given extension, and queue them all for processing, use the command
where all the references having this year shall be queued for reprocessing.
6. To reprocess existing references that were queued but not resolved because of a reference service issue, use the command
```
python run.py RESOLVE -f
```
where any references that were queued but not resolved shall be reprocessed.

Note that an optional parameter can be combined with options 2 through 5 to filter results by time. Include the parameter

    -d <days>

to filter on time. For option 2, the parameter is applied to the source file: if the file's timestamp falls within the past *days* days, the file is queued for processing. For options 3 through 5, it is applied to the resolved references: if they were processed within the past *days* days, they are queued for reprocessing.
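
For the directory search (the second option), the combination of a recursive search and the time filter could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline's code: `find_recent_files` is a hypothetical name, and the file's modification time is assumed to be the timestamp that is compared.

```python
import os
import time


def find_recent_files(root, extension, days=None, now=None):
    """Recursively collect files under root whose names end with extension.

    If days is given, keep only files modified within the past `days` days.
    """
    now = time.time() if now is None else now
    cutoff = None if days is None else now - days * 86400  # seconds per day
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: "timestamp" means the file's modification time.
            if cutoff is None or os.path.getmtime(path) >= cutoff:
                matches.append(path)
    return sorted(matches)
```

Passing `days=None` returns every matching file, which corresponds to running the search without the time filter.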
### To query database:
- To get a list of source files processed from a specified publisher, use the command
```
python run.py STATS -p <publisher>
```

- To see the results of resolved records for a specific source bibcode or filename, use the command
```
python run.py STATS -b <source bibcode>
python run.py STATS -s <source filename>
```

- To see the number of rows in the four main tables, use the command