Commit a58cb95

Merge pull request #2 from wellcometrust/ivyleavedtoadflax-patch-1
Prepare train command
2 parents 2dc9225 + ee7660a commit a58cb95

7 files changed: +212 -58 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ sync_model_to_s3:
 # the wheel is granted with the --acl public-read flag.

 .PHONY: dist
-dist: embedding model
+dist:
 	-rm build/bin build/bdist.linux-x86_64 -r
 	-rm deep_reference_parser-20* -r
 	-rm dist/*

README.md

Lines changed: 179 additions & 10 deletions
@@ -1,13 +1,56 @@
+[![Build Status](https://travis-ci.org/wellcometrust/deep_reference_parser.svg?branch=master)](https://travis-ci.org/wellcometrust/deep_reference_parser)
+
 # Deep Reference Parser

-This repo contains a Bi-direction Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. The model itself is based on the work of Rodrigues et al. (2018), although the implemention here differs significantly.
+Deep Reference Parser is a Bi-directional Long Short-Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references and extract their constituent parts (e.g. author, year, publication, volume, etc.).
+
+The intention for this project, like that of Rodrigues et al. (2018), is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection, reference component detection, and reference type classification.
+
+### Current status:
+
+|Component|Individual|MultiTask|
+|---|---|---|
+|Spans|✔️ Implemented|❌ Not Implemented|
+|Components|❌ Not Implemented|❌ Not Implemented|
+|Type|❌ Not Implemented|❌ Not Implemented|
+
+### The model
+
+The model itself is based on the work of [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), although the implementation here differs significantly. The main differences are:
+
+* We use a combination of the training data used by Rodrigues et al. (2018) and data that we have labelled ourselves. No Rodrigues et al. data are included in the test and validation sets.
+* We also use a new word embedding that has been trained on documents relevant to medicine.
+* Whereas Rodrigues et al. split documents on lines and sent the lines to the model, we combine the lines of the document together and then send larger chunks to the model, giving it more context to work with when training and predicting.
+* Whilst the model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
+* Hyperparameters are passed to the model in a config (.ini) file. This is to keep track of experiments, but also because it is difficult to save the model with the CRF architecture, so it is necessary to rebuild (not re-train!) the model object each time you want to use it. Storing the hyperparameters in a config file makes this easier.
+* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
+* The model includes a command line interface inspired by [SpaCy](https://github.com/explosion/spaCy); functions can be called from the command line with `python -m deep_reference_parser` ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/predict.py)).
+* Python version updated to 3.7, along with dependencies (although there is more to do).

-## Just show me the references!!!
+### Performance

-If you want to try out the model the quick way, there is a pre-packaged wheel containing the latest word embedding and weights available on S3. The following commands will get you started:
+#### Span detection
+
+|token|f1|support|
+|---|---|---|
+|b-r|0.9364|2472|
+|e-r|0.9312|2424|
+|i-r|0.9833|92398|
+|o|0.9561|32666|
+|weighted avg|0.9746|129959|
+
+#### Computing requirements
+
+Models are trained on AWS instances using CPU only.
+
+|Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
+|---|---|---|---|---|
+|Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
+
+## tl;dr: Just get me to the references!

 ```
-# Download the wheel from s3
+# Install from github

 pip install git+git://github.com/wellcometrust/deep_reference_parser.git#egg=deep_reference_parser
@@ -22,21 +65,147 @@ EOF


 # Run the model. This will take a little time while the weights and embeddings
-# are downloaded - be patient!
+# are downloaded. The weights are about 300MB, and the embeddings 950MB.

 python -m deep_reference_parser predict --verbose "$(cat references.txt)"
 ```

-# Training your own models
+## The longer guide
+
+### Installation
+
+The package can be installed from GitHub for now. Future versions may be available on PyPI.
+
+```
+pip install git+git://github.com/wellcometrust/deep_reference_parser.git#egg=deep_reference_parser
+```
+
+### Config files
+
+The package uses config files to store hyperparameters for the models.

-To train your own models you will need to define the model hyperparameters in a config file.
+A [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which describes the parameters of the best performing model ships with the package:
+
+```
+[DEFAULT]
+version = 2019.12.0
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/2019.12.0_train.tsv
+policy_test = data/2019.12.0_test.tsv
+policy_valid = data/2019.12.0_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/2020.2.0/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 10
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
+[evaluate]
+out_file = evaluation_data.tsv
+```
+
+### Getting help
+
+To get a list of the available commands run `python -m deep_reference_parser`
+
+```
+$ python -m deep_reference_parser
+Using TensorFlow backend.
+
+ℹ Available commands
+train, predict
+```
+
+For additional help, you can pass a command with the `-h`/`--help` flag:
+
+```
+$ python -m deep_reference_parser predict --help
+Using TensorFlow backend.
+usage: deep_reference_parser predict [-h]
+[-c]
+[-t] [-v]
+text
+
+positional arguments:
+text Plaintext from which to extract references
+
+optional arguments:
+-h, --help show this help message and exit
+-c --config-file Path to config file
+-t, --tokens Output tokens instead of complete references
+-v, --verbose Output more verbose results
+
+```
+
+### Training your own models
+
+To train your own models you will need to define the model hyperparameters in a config file like the one above. The config file is then passed to the train command as the only argument. Note that the `output_path` defined in the config file will be created if it does not already exist.

 ```
 python -m deep_reference_parser train test.ini
 ```

+Data must be prepared in the following tab separated format (tsv). We may publish further tools in the future to assist in the preparation of data following annotation. In this case the data for reference span detection follows an IOBE schema.
+
+You must provide the train/test/validation data splits in this format in pre-prepared files that are defined in the config file.
+
+```
+References o
+1 o
+The b-r
+potency i-r
+of i-r
+history i-r
+was i-r
+on i-r
+display i-r
+at i-r
+a i-r
+workshop i-r
+held i-r
+in i-r
+February i-r
+```
+
+### Making predictions
+
+If you wish to use the latest model that we have trained, you can simply run:
+
+```
+python -m deep_reference_parser predict <input text>
+```
+
+If you wish to use a custom model that you have trained, you must specify the config file which defines the hyperparameters for that model using the `-c` flag:
+
+```
+python -m deep_reference_parser predict -c new_model.ini <input text>
+```
+
+Use the `-t` flag to return the raw token predictions, and the `-v` flag to return everything in a much more user-friendly format.
+
+Note that the model makes predictions at the token level, but a naive splitting is performed by simply splitting on the `b-r` tags.

-# Developing the package
+### Developing the package further

 To create a local virtual environment and activate it:

@@ -48,10 +217,10 @@ make virtualenv
 source ./build/virtualenv/bin/activate
 ```

-## Get the embeddings and model artefacts
+## Get the data, models, and embeddings

 ```
-make models embeddings
+make data models embeddings
 ```

 ## Testing

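The README text in this diff describes training data as tab-separated token/label pairs and says that references are produced by naively splitting token-level predictions on the `b-r` tag. A minimal sketch of that idea follows. The function names `read_tsv_tokens` and `split_on_b_r` are hypothetical and are not taken from the package; the real logic lives in `tokens_to_references.py` and may differ. Dropping tokens labelled `o` is an assumption made here for illustration.

```python
import csv
import io


def read_tsv_tokens(text):
    """Parse 'token<TAB>label' lines into (token, label) pairs (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Blank lines come back as empty rows and are skipped.
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def split_on_b_r(pairs):
    """Start a new reference at every 'b-r' label (hypothetical helper).

    Assumption: tokens labelled 'o' lie outside any reference and are dropped.
    """
    references = []
    current = []
    for token, label in pairs:
        if label == "b-r":
            if current:
                references.append(" ".join(current))
            current = [token]
        elif label != "o" and current:
            current.append(token)
    if current:
        references.append(" ".join(current))
    return references


# Made-up sample in the IOBE-style format shown in the README.
sample = (
    "References\to\n1\to\nThe\tb-r\npotency\ti-r\nof\ti-r\nhistory\te-r\n"
    "Second\tb-r\nitem\te-r\n"
)
pairs = read_tsv_tokens(sample)
references = split_on_b_r(pairs)
print(references)
```

Because the split relies only on `b-r`, a missed `b-r` prediction merges two references into one, which is why the README calls the approach naive.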
deep_reference_parser/configs/2019.12.0.ini

Lines changed: 4 additions & 9 deletions
@@ -1,21 +1,16 @@
 [DEFAULT]
 version = 2019.12.0
-train_script = scripts/train.py
-data_prep_script = scripts/prepare_data.py

 [data]
 test_proportion = 0.25
 valid_proportion = 0.25
-data_path = data/processed/annotated/deep_reference_parser/
+data_path = data/
 respect_line_endings = 0
 respect_doc_endings = 1
 line_limit = 250
-rodrigues_train = data/rodrigues/clean_train.txt
-rodrigues_test =
-rodrigues_valid =
-policy_train = data/processed/annotated/deep_reference_parser/2019.12.0_train.tsv
-policy_test = data/processed/annotated/deep_reference_parser/2019.12.0_test.tsv
-policy_valid = data/processed/annotated/deep_reference_parser/2019.12.0_valid.tsv
+policy_train = data/2019.12.0_train.tsv
+policy_test = data/2019.12.0_test.tsv
+policy_valid = data/2019.12.0_valid.tsv
 s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/

 [build]

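As an illustration of how an `.ini` file like this one can be read, here is a minimal sketch using the standard library's `configparser`. The package's own `get_config` helper is not shown in this commit, so this is an assumption about one reasonable approach and not a description of its implementation. The config text is a shortened excerpt of the file above.

```python
import configparser

# Shortened excerpt of the config shown in the diff above.
CONFIG_TEXT = """
[DEFAULT]
version = 2019.12.0

[data]
test_proportion = 0.25
valid_proportion = 0.25
data_path = data/
respect_line_endings = 0
respect_doc_endings = 1
line_limit = 250
policy_train = data/2019.12.0_train.tsv
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG_TEXT)

# Typed getters convert the raw strings stored in the file.
test_proportion = cfg.getfloat("data", "test_proportion")
line_limit = cfg.getint("data", "line_limit")
respect_line_endings = cfg.getboolean("data", "respect_line_endings")  # "0" parses as False
policy_train = cfg.get("data", "policy_train")

# Keys under [DEFAULT] are visible from every other section.
version = cfg.get("data", "version")

print(version, test_proportion, line_limit, respect_line_endings, policy_train)
```

Since `[DEFAULT]` values are inherited by every section, the `version` key only needs to be defined once.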
deep_reference_parser/deep_reference_parser.py

Lines changed: 1 addition & 0 deletions
@@ -118,6 +118,7 @@ def __init__(self, X_train=None, X_test=None, X_valid=None,
         self.padding_style = padding_style

         self.output_path = output_path
+        os.makedirs(self.output_path, exist_ok=True)
         self.weights_path = os.path.join(output_path, "weights.h5")


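The added `os.makedirs(..., exist_ok=True)` call is what lets the train command create `output_path` on first use without failing on later runs. A small standalone sketch of that behaviour, run in a temporary directory so nothing is left on disk (the directory names are only illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Illustrative nested path; intermediate directories do not exist yet.
    output_path = os.path.join(tmp, "models", "2020.2.0")

    # Creates every missing intermediate directory.
    os.makedirs(output_path, exist_ok=True)
    # A second call is a no-op; without exist_ok=True it would raise FileExistsError.
    os.makedirs(output_path, exist_ok=True)

    weights_path = os.path.join(output_path, "weights.h5")
    created = os.path.isdir(output_path)

print(created, os.path.basename(weights_path))
```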
deep_reference_parser/predict.py

Lines changed: 6 additions & 6 deletions
@@ -4,23 +4,21 @@
 Run predictions from a pre-trained model
 """

+import itertools
 import os

 import en_core_web_sm
-import pkg_resources
 import plac
 import spacy
 import wasabi
-import itertools

 from deep_reference_parser import __file__
-from deep_reference_parser.__version__ import __model_version__
+from deep_reference_parser.common import LATEST_CFG, download_model_artefact
 from deep_reference_parser.deep_reference_parser import DeepReferenceParser
 from deep_reference_parser.logger import logger
 from deep_reference_parser.model_utils import get_config
 from deep_reference_parser.reference_utils import break_into_chunks
 from deep_reference_parser.tokens_to_references import tokens_to_references
-from deep_reference_parser.common import download_model_artefact, LATEST_CFG

 msg = wasabi.Printer(icons={"check":"\u2023"})

@@ -53,7 +51,7 @@ def __init__(self, config_file):
                     download_model_artefact(artefact, S3_SLUG)
                     msg.good(f"Found {artefact}")
                 except:
-                    msg.fail(f"Could not download {artefact}")
+                    msg.fail(f"Could not download {S3_SLUG}{artefact}")
                     logger.exception()

         # Check on word embedding and download if not exists
@@ -65,7 +63,7 @@ def __init__(self, config_file):
                     download_model_artefact(WORD_EMBEDDINGS, S3_SLUG)
                     msg.good(f"Found {WORD_EMBEDDINGS}")
                 except:
-                    msg.fail(f"Could not download {WORD_EMBEDDINGS}")
+                    msg.fail(f"Could not download {S3_SLUG}{WORD_EMBEDDINGS}")
                     logger.exception()


@@ -142,6 +140,7 @@ def split(self, text, return_tokens=False, verbose=False):

             msg.good(f"Found {len(refs)} references.")
             msg.info("Printing found references:")
+
             for ref in refs:
                 msg.text(ref, icon="check", spaced=True)

@@ -162,5 +161,6 @@ def split(self, text, return_tokens=False, verbose=False):
 def predict(text, config_file=LATEST_CFG, tokens=False, verbose=False):
     predictor = Predictor(config_file)
     out = predictor.split(text, return_tokens=tokens, verbose=verbose)
+
     if not verbose:
         print(out)

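One point worth checking in the context lines above: if `logger` is a standard `logging.Logger` (its definition in `deep_reference_parser.logger` is not part of this diff, so this is an assumption), then `Logger.exception` requires a message argument and the bare `logger.exception()` calls would raise a `TypeError` inside the `except` block. The sketch below shows one way the download error handling could be written. The helper `fetch_or_report` and the stand-in download functions are hypothetical and exist only for illustration.

```python
import logging

logger = logging.getLogger("download_sketch")


def fetch_or_report(download, artefact, s3_slug):
    """Call download(artefact, s3_slug); log the traceback and return False on failure."""
    try:
        download(artefact, s3_slug)
    except Exception:
        # Logger.exception needs a message; the active traceback is appended automatically.
        logger.exception("Could not download %s%s", s3_slug, artefact)
        return False
    return True


def failing_download(artefact, s3_slug):
    """Stand-in for a download that fails (hypothetical)."""
    raise OSError("simulated network failure")


def working_download(artefact, s3_slug):
    """Stand-in for a download that succeeds (hypothetical)."""
    return None


failed = fetch_or_report(failing_download, "weights.h5", "https://example.invalid/")
succeeded = fetch_or_report(working_download, "weights.h5", "https://example.invalid/")
print(failed, succeeded)
```

Catching `Exception` instead of using a bare `except:` also lets `KeyboardInterrupt` propagate, so a long download can still be cancelled with Ctrl-C.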