Merge pull request #2 from wellcometrust/ivyleavedtoadflax-patch-1

ivyleavedtoadflax · web-flow · commit a58cb95e547d · 2020-02-12T11:58:51.000-03:00
Prepare train command
diff --git a/Makefile b/Makefile
@@ -116,7 +116,7 @@ sync_model_to_s3:
 # the wheel is granted with the --acl public-read flag.
 
 .PHONY: dist
-dist: embedding model
+dist:
 	-rm build/bin build/bdist.linux-x86_64 -r
 	-rm deep_reference_parser-20* -r
 	-rm dist/*
diff --git a/README.md b/README.md
@@ -1,13 +1,56 @@
+[![Build Status](https://travis-ci.org/wellcometrust/deep_reference_parser.svg?branch=master)](https://travis-ci.org/wellcometrust/deep_reference_parser)
+
 # Deep Reference Parser
 
-This repo contains a Bi-direction Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. The model itself is based on the work of Rodrigues et al. (2018), although the implemention here differs significantly.
+Deep Reference Parser is a Bidirectional Long Short-Term Memory (BiLSTM) deep neural network with a stacked Conditional Random Field (CRF) for identifying references in text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models that find references and extract their constituent parts (e.g. author, year, publication, volume).
+
+The intention for this project, like that of Rodrigues et al. (2018), is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection, reference component detection, and reference type classification.
+
+### Current status:
+
+|Component|Individual|MultiTask|
+|---|---|---|
+|Spans|✔️ Implemented|❌ Not Implemented|
+|Components|❌ Not Implemented|❌ Not Implemented|
+|Type|❌ Not Implemented|❌ Not Implemented|
+
+### The model
+
+The model itself is based on the work of [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), although the implementation here differs significantly. The main differences are:
+
+* We use a combination of the training data used by Rodrigues et al. (2018) and data that we have labelled ourselves. No Rodrigues et al. data are included in the test and validation sets.
+* We also use a new word embedding that has been trained on documents relevant to medicine.
+* Whereas Rodrigues et al. split documents on lines and sent the lines to the model, we combine the lines of the document and then send larger chunks to the model, giving it more context to work with when training and predicting.
+* Whilst the model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
+* Hyperparameters are passed to the model in a config (.ini) file. This is to keep track of experiments, but also because it is difficult to save the model with the CRF architecture, so it is necessary to rebuild (not re-train!) the model object each time you want to use it. Storing the hyperparameters in a config file makes this easier.
+* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
+* The model includes a command line interface inspired by [SpaCy](https://github.com/explosion/spaCy); functions can be called from the command line with `python -m deep_reference_parser` ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/predict.py)).
+* The Python version has been updated to 3.7, along with the dependencies (although there is more to do).
 
-## Just show me the references!!!
+### Performance
 
-If you want to try out the model the quick way, there is a pre-packaged wheel containing the latest word embedding and weights available on S3. The following commands will get you started:
+#### Span detection
+
+|token|f1|support|
+|---|---|---|
+|b-r|0.9364|2472|
+|e-r|0.9312|2424|
+|i-r|0.9833|92398|
+|o|0.9561|32666|
+|weighted avg|0.9746|129959|
+
+#### Computing requirements
+
+Models are trained on AWS instances using CPU only.
+
+|Model|Time taken (hh:mm:ss)|Instance type|Instance cost (per hour)|Total cost|
+|---|---|---|---|---|
+|Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
+
+## tl;dr: Just get me to the references!
 
 ```
-# Download the wheel from s3
+# Install from github
 
 pip install git+git://github.com/wellcometrust/deep_reference_parser.git#egg=deep_reference_parser
 
@@ -22,21 +65,147 @@ EOF
 
 
 # Run the model. This will take a little time while the weights and embeddings 
-# are downloaded - be patient!
+# are downloaded. The weights are about 300MB, and the embeddings 950MB.
 
 python -m deep_reference_parser predict --verbose "$(cat references.txt)"
 ```
 
-# Training your own models
+## The longer guide
+
+### Installation
+
+The package can be installed from GitHub for now. Future versions may be available on PyPI.
+
+```
+pip install git+git://github.com/wellcometrust/deep_reference_parser.git#egg=deep_reference_parser
+```
+
+### Config files
+
+The package uses config files to store hyperparameters for the models. 
 
-To train your own models you will need to define the model hyperparameters in a config file.
+A [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which describes the parameters of the best performing model ships with the package:
+
+```
+[DEFAULT]
+version = 2019.12.0
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/2019.12.0_train.tsv
+policy_test = data/2019.12.0_test.tsv
+policy_valid = data/2019.12.0_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/2020.2.0/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 10
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
+[evaluate]
+out_file = evaluation_data.tsv
+```
+
+### Getting help
+
+To get a list of the available commands, run `python -m deep_reference_parser`:
+
+```
+$ python -m deep_reference_parser
+Using TensorFlow backend.
+
+ℹ Available commands
+train, predict
+```
+
+For additional help, you can pass a command with the `-h`/`--help` flag:
+
+```
+$ python -m deep_reference_parser predict --help
+Using TensorFlow backend.
+usage: deep_reference_parser predict [-h]
+                                     [-c]
+                                     [-t] [-v]
+                                     text
+
+positional arguments:
+  text                  Plaintext from which to extract references
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -c, --config-file     Path to config file
+  -t, --tokens          Output tokens instead of complete references
+  -v, --verbose         Output more verbose results
+
+```
+
+### Training your own models
+
+To train your own models you will need to define the model hyperparameters in a config file like the one above. The config file is then passed to the train command as the only argument. Note that the `output_path` defined in the config file will be created if it does not already exist.
 
 ```
 python -m deep_reference_parser train test.ini
 ```
 
+Data must be prepared in a tab-separated (tsv) format, as in the example below. We may publish further tools in the future to assist in preparing data after annotation. The data for reference span detection follow an IOBE schema.
+
+You must provide the train/test/validation data splits in this format in pre-prepared files that are defined in the config file.
+
+```
+References	o
+1	o
+The	b-r
+potency	i-r
+of	i-r
+history	i-r
+was	i-r
+on	i-r
+display	i-r
+at	i-r
+a	i-r
+workshop	i-r
+held	i-r
+in	i-r
+February	i-r
+```
+
+### Making predictions
+
+If you wish to use the latest model that we have trained, you can simply run:
+
+```
+python -m deep_reference_parser predict <input text>
+```
+
+If you wish to use a custom model that you have trained, you must specify the config file which defines the hyperparameters for that model using the `-c` flag:
+
+```
+python -m deep_reference_parser predict -c new_model.ini <input text>
+```
+
+Use the `-t` flag to return the raw token predictions, and the `-v` flag to return everything in a much more user-friendly format.
+
+Note that the model makes predictions at the token level; references are then produced by naively splitting the token stream on the `b-r` tags.
 
-# Developing the package
+### Developing the package further
 
 To create a local virtual environment and activate it:
 
@@ -48,10 +217,10 @@ make virtualenv
 source ./build/virtualenv/bin/activate
 ```
 
-## Get the embeddings and model artefacts
+## Get the data, models, and embeddings
 
 ```
-make models embeddings
+make data models embeddings
 ```
 
 ## Testing
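
The README text added in this diff describes a tab-separated `token<TAB>label` training format and a naive split of predictions on `b-r` tags. A minimal sketch of both ideas follows. The helper names `read_tsv` and `split_on_b_r` are hypothetical, the sample labels are made up for illustration, and dropping `o` tokens is an assumption; this is not the package's `tokens_to_references` implementation.

```python
import csv
import io


def read_tsv(text):
    """Parse `token<TAB>label` lines into (token, label) pairs, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def split_on_b_r(pairs):
    """Start a new reference at each `b-r` tag; tokens labelled `o` are dropped (assumption)."""
    references, current = [], []
    for token, label in pairs:
        if label == "o":
            continue
        if label == "b-r" and current:
            # A new reference begins, so close the one being collected.
            references.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        references.append(" ".join(current))
    return references


# Made-up sample in the IOBE scheme (b-r, i-r, e-r, o).
sample = (
    "References\to\n1\to\nThe\tb-r\npotency\ti-r\nof\ti-r\nhistory\te-r\n"
    "2\to\nA\tb-r\nworkshop\te-r\n"
)
print(split_on_b_r(read_tsv(sample)))
```

Because the split keys only on `b-r`, a missed `b-r` prediction merges two references into one, which is the sense in which the README calls the approach naive.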
diff --git a/deep_reference_parser/configs/2019.12.0.ini b/deep_reference_parser/configs/2019.12.0.ini
@@ -1,21 +1,16 @@
 [DEFAULT]
 version = 2019.12.0
-train_script = scripts/train.py
-data_prep_script = scripts/prepare_data.py
 
 [data]
 test_proportion = 0.25
 valid_proportion = 0.25
-data_path = data/processed/annotated/deep_reference_parser/
+data_path = data/
 respect_line_endings = 0
 respect_doc_endings = 1
 line_limit = 250
-rodrigues_train = data/rodrigues/clean_train.txt
-rodrigues_test = 
-rodrigues_valid = 
-policy_train = data/processed/annotated/deep_reference_parser/2019.12.0_train.tsv
-policy_test = data/processed/annotated/deep_reference_parser/2019.12.0_test.tsv
-policy_valid = data/processed/annotated/deep_reference_parser/2019.12.0_valid.tsv
+policy_train = data/2019.12.0_train.tsv
+policy_test = data/2019.12.0_test.tsv
+policy_valid = data/2019.12.0_valid.tsv
 s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
 
 [build]
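
To illustrate how values in a config file like this one can be read, here is a small sketch using Python's standard `configparser`. It is an assumption that the package's own `get_config` helper behaves similarly; the `CONFIG_TEXT` string is an abridged copy of the sections in this diff, inlined so the example runs on its own.

```python
import configparser

# Abridged copy of the [DEFAULT] and [data] sections from the config in this diff.
CONFIG_TEXT = """
[DEFAULT]
version = 2019.12.0

[data]
test_proportion = 0.25
valid_proportion = 0.25
data_path = data/
respect_line_endings = 0
respect_doc_endings = 1
line_limit = 250
policy_train = data/2019.12.0_train.tsv
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG_TEXT)

data = cfg["data"]
# All values are stored as strings; the typed getters convert them.
line_limit = data.getint("line_limit")
test_proportion = data.getfloat("test_proportion")
respect_doc_endings = data.getboolean("respect_doc_endings")  # "1" parses as True
# Keys defined in [DEFAULT] are visible from every other section.
version = data["version"]

print(line_limit, test_proportion, respect_doc_endings, version)
```

Note that paths such as `data_path` are plain strings relative to wherever the command is run, so moving them (as this hunk does) changes where the data files are expected.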
diff --git a/deep_reference_parser/deep_reference_parser.py b/deep_reference_parser/deep_reference_parser.py
@@ -118,6 +118,7 @@ def __init__(self, X_train=None, X_test=None, X_valid=None,
         self.padding_style = padding_style
 
         self.output_path = output_path
+        os.makedirs(self.output_path, exist_ok=True)
         self.weights_path = os.path.join(output_path, "weights.h5")
 
 
diff --git a/deep_reference_parser/predict.py b/deep_reference_parser/predict.py
@@ -4,23 +4,21 @@
 Run predictions from a pre-trained model
 """
 
+import itertools
 import os
 
 import en_core_web_sm
-import pkg_resources
 import plac
 import spacy
 import wasabi
-import itertools
 
 from deep_reference_parser import __file__
-from deep_reference_parser.__version__ import __model_version__
+from deep_reference_parser.common import LATEST_CFG, download_model_artefact
 from deep_reference_parser.deep_reference_parser import DeepReferenceParser
 from deep_reference_parser.logger import logger
 from deep_reference_parser.model_utils import get_config
 from deep_reference_parser.reference_utils import break_into_chunks
 from deep_reference_parser.tokens_to_references import tokens_to_references
-from deep_reference_parser.common import download_model_artefact, LATEST_CFG
 
 msg = wasabi.Printer(icons={"check":"\u2023"})
 
@@ -53,7 +51,7 @@ def __init__(self, config_file):
                     download_model_artefact(artefact, S3_SLUG)
                     msg.good(f"Found {artefact}")
                 except:
-                    msg.fail(f"Could not download {artefact}")
+                    msg.fail(f"Could not download {S3_SLUG}{artefact}")
                     logger.exception()
 
         # Check on word embedding and download if not exists
@@ -65,7 +63,7 @@ def __init__(self, config_file):
                 download_model_artefact(WORD_EMBEDDINGS, S3_SLUG)
                 msg.good(f"Found {WORD_EMBEDDINGS}")
             except:
-                msg.fail(f"Could not download {WORD_EMBEDDINGS}")
+                msg.fail(f"Could not download {S3_SLUG}{WORD_EMBEDDINGS}")
                 logger.exception()
 
 
@@ -142,6 +140,7 @@ def split(self, text, return_tokens=False, verbose=False):
 
                     msg.good(f"Found {len(refs)} references.")
                     msg.info("Printing found references:")
+
                     for ref in refs:
                         msg.text(ref, icon="check", spaced=True)
 
@@ -162,5 +161,6 @@ def split(self, text, return_tokens=False, verbose=False):
 def predict(text, config_file=LATEST_CFG, tokens=False, verbose=False):
     predictor = Predictor(config_file)
     out = predictor.split(text, return_tokens=tokens, verbose=verbose)
+
     if not verbose:
         print(out)
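
A possible follow-up to the download error handling touched in `predict.py`: if `logger` is a standard-library `logging.Logger` (an assumption, since `deep_reference_parser.logger` is not shown in this diff), then `logger.exception()` needs a message argument, and calling it with none raises a `TypeError` inside the `except` block. The sketch below shows one way the pattern could look; `fetch_artefact` and `failing_download` are hypothetical names, and `download` stands in for `download_model_artefact`.

```python
import logging

logger = logging.getLogger("download_sketch")


def fetch_artefact(artefact, s3_slug, download):
    """Try to download one artefact; log the full URL and traceback on failure.

    `download` is a stand-in for the package's `download_model_artefact`.
    """
    try:
        download(artefact, s3_slug)
    except Exception:
        # Logger.exception requires a message; it appends the active traceback.
        logger.exception("Could not download %s%s", s3_slug, artefact)
        return False
    return True


def failing_download(artefact, s3_slug):
    """Simulate a failed download for the demonstration."""
    raise OSError("simulated network failure")


ok = fetch_artefact("weights.h5", "https://example.invalid/", failing_download)
print(ok)
```

Catching `Exception` rather than using a bare `except:` lets `KeyboardInterrupt` propagate, so a user can still cancel a long download.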
diff --git a/deep_reference_parser/train.py b/deep_reference_parser/train.py
diff --git a/references.txt b/references.txt