Commit d77b92f

Optimisation (#5)

* Upgrade libraries and use KeyedVectors to load word vectors
* Use gensim native saved vectors instead
* Added tests and CI
* Updated with links to new word embeddings and some code cleaning

1 parent 1e4a870 commit d77b92f

File tree

15 files changed: +126 −96 lines


.gitignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,3 +1,3 @@
 *.pyc
-
+.pytest_cache
 .venv
```

.travis.yml

Lines changed: 8 additions & 0 deletions

```diff
@@ -0,0 +1,8 @@
+language: python
+cache: pip
+python:
+- "2.7"
+install:
+- pip install -r requirements/test.txt
+script:
+- pytest
```

Dockerfile

Lines changed: 1 addition & 1 deletion

```diff
@@ -6,7 +6,7 @@ RUN apt-get update \
     && apt-get install -y libopenblas-dev \
     && apt-get clean
 
-RUN pip install --no-cache-dir Theano==0.10.0beta4 numpy==1.13.3 gensim==0.13.2
+RUN pip install --no-cache-dir Theano==1.0.2 numpy==1.14.5 gensim==3.5.0
 
 RUN echo "[global]\nfloatX = float32" >> ~/.theanorc
 RUN echo "[blas]\nldflags = -lblas -lgfortran" >> ~/.theanorc
```

README.md

Lines changed: 10 additions & 5 deletions

````diff
@@ -1,5 +1,7 @@
 ## Neural ParsCit
 
+[![Build Status](https://travis-ci.com/WING-NUS/Neural-ParsCit.svg?branch=master)](https://travis-ci.com/WING-NUS/Neural-ParsCit)
+
 Neural ParsCit is a citation string parser which parses reference strings into its component tags such as Author, Journal, Location, Date, etc. Neural ParsCit uses Long Short Term Memory (LSTM), a deep learning model to parse the reference strings. This deep learning algorithm is chosen as it is designed to perform sequence-to-sequence labeling tasks such as ours. Input to the model are word embeddings which are vector representation of words. We provide word embeddings as well as character embeddings as input to the network.
 
 
@@ -15,14 +17,20 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```
 
+### Word Embeddings
+
+The word embeddings does not come with this repository. You can obtain the [word embeddings](http://wing.comp.nus.edu.sg/~wing.nus/resources/NParsCit/vectors.tar.gz) and the [word frequency](http://wing.comp.nus.edu.sg/~wing.nus/resources/NParsCit/freq) from WING website.
+
+You will need to extract the content of the word embedding archive (`vectors.tar.gz`) to the root directory for this repository by running `tar xfz vectors.tar.gz`.
+
 ### Using Docker
 
 1. Build the image: `docker build -t theano-gensim - < Dockerfile`
 1. Run the repo mounted to the container: `docker run -it -v /path/to/Neural-ParsCit:/usr/src --name np theano-gensim:latest /bin/bash`
 
 ## Parse citation strings
 
-The fastest way to use the parser is to run state-of-the-art pretrained model as follows:
+The fastest way to use the parser is to run state-of-the-art pre-trained model as follows:
 
 ```
 ./run.py --model_path models/neuralParsCit/ --pre_emb <vectors.bin> --run shell
@@ -50,10 +58,7 @@ There are many parameters you can tune (CRF, dropout rate, embedding dimension,
 
 Input files for the training script have to follow the following format: each word of the citation string and its corresponding tag has to be on a separate line. All citation strings must be separated by a blank line.
 
-
-If you want to use the word embeddings trained on ACM refrences, and the freq., please download from WING homepage: http://wing.comp.nus.edu.sg/?page_id=158 (currently not avaible due to space issue, mail animesh@comp.nus.edu.sg, animeshprasad3@gmail.com for a copy)
-
-Details about the training data, experiments can be found in the following article. Traning data and CRF baseline can be downloaded from https://github.com/knmnyn/ParsCit. Please consider citing following piblication(s) if you use Neural ParsCit:
+Details about the training data, experiments can be found in the following article. Training data and CRF baseline can be downloaded from https://github.com/knmnyn/ParsCit. Please consider citing following publication(s) if you use Neural ParsCit:
 ```
 @article{animesh2018neuralparscit,
 title={Neural ParsCit: A Deep Learning Based Reference String Parser},
````

loader.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -172,7 +172,7 @@ def augment_with_pretrained(dictionary, ext_emb_path, words):
     # if len(ext_emb_path) > 0
     #])
 
-    pretrained = gensim.models.word2vec.Word2Vec.load_word2vec_format(ext_emb_path, binary=True)
+    pretrained = gensim.models.KeyedVectors.load_word2vec_format(ext_emb_path, binary=True)
 
     # We either add every word in the pretrained file,
     # or only words given in the `words` list to which
```
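The change above is the only functional edit in `loader.py`: since gensim 1.0, word2vec-format I/O lives on `KeyedVectors` rather than the `Word2Vec` model class. The surrounding function keeps folding the pretrained vocabulary into the model dictionary. A minimal sketch of that augmentation logic, with a plain set standing in for the loaded vectors' vocabulary (the function body here is illustrative, not the repository's exact code):

```python
import re

def augment_with_pretrained(dictionary, pretrained_vocab, words=None):
    """Add words that have a pretrained vector to the model dictionary.

    `dictionary` maps word -> count; `pretrained_vocab` stands in for the
    vocabulary of a loaded KeyedVectors object. If `words` is None, every
    pretrained word is added; otherwise only listed words whose surface,
    lowercased, or digit-normalized form has a vector are added.
    """
    if words is None:
        for word in pretrained_vocab:
            dictionary.setdefault(word, 0)
    else:
        for word in words:
            if any(form in pretrained_vocab for form in
                   (word, word.lower(), re.sub(r'\d', '0', word.lower()))):
                dictionary.setdefault(word, 0)
    return dictionary

vocab = {'smith', 'journal', '0000'}
d = augment_with_pretrained({'Smith': 3}, vocab,
                            words=['Smith', 'Journal', '1999', 'xyz'])
# 'Journal' enters via lowercasing, '1999' via digit normalization;
# 'xyz' has no matching vector and is skipped.
```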

model.py

Lines changed: 32 additions & 26 deletions

```diff
@@ -1,17 +1,20 @@
+from __future__ import print_function
+import logging
+import cPickle
 import os
 import re
 import numpy as np
 import scipy.io
 import theano
 import theano.tensor as T
-import codecs
-import cPickle
-import gensim
+from gensim.models import KeyedVectors
 
 from utils import shared, set_values, get_name
 from nn import HiddenLayer, EmbeddingLayer, DropoutLayer, LSTM, forward
 from optimization import Optimization
 
+logging.basicConfig(format="%(asctime)-15s %(message)s", level=logging.INFO)
+logger = logging.getLogger
 
 class Model(object):
     """
@@ -88,7 +91,7 @@ def save(self):
         """
         Write components values to disk.
         """
-        print "Saving parameter values to disk"
+        logging.info("Saving parameter values to disk")
         for name, param in self.components.items():
             param_path = os.path.join(self.model_path, "%s.mat" % name)
             if hasattr(param, 'params'):
@@ -97,7 +100,7 @@ def save(self):
                 param_values = {name: param.get_value()}
             #No need to save embeding values as they are never updated
             #directly use the pretrained embeddings file
-            if name=='word_layer':
+            if name == 'word_layer':
                 continue
             else:
                 scipy.io.savemat(param_path, param_values)
@@ -109,7 +112,7 @@ def reload(self):
         for name, param in self.components.items():
             param_path = os.path.join(self.model_path, "%s.mat" % name)
             #load word layer during build from pretrained embeddings file.
-            if name=='word_layer':
+            if name == 'word_layer':
                 continue
             else:
                 param_values = scipy.io.loadmat(param_path)
@@ -133,7 +136,7 @@ def build(self,
              cap_dim,
              training=True,
              **kwargs
-              ):
+             ):
         """
         Build the network.
         """
@@ -163,23 +166,20 @@ def build(self,
         input_dim = 0
         inputs = []
 
-        #
         # Word inputs
-        #
         if word_dim:
             input_dim += word_dim
-            word_layer = EmbeddingLayer(n_words, word_dim, name='word_layer')
+            word_layer = EmbeddingLayer(n_words, word_dim, name='word_layer', train=training)
             word_input = word_layer.link(word_ids)
             inputs.append(word_input)
             # Initialize with pretrained embeddings
             if pre_emb and training:
                 new_weights = word_layer.embeddings.get_value()
-                print 'Loading pretrained embeddings from %s...' % pre_emb
-                pretrained = {}
+                logging.info("Loading pretrained embeddings from %s...", pre_emb)
                 emb_invalid = 0
 
                 #use gensim models as pretrained embeddings
-                pretrained = gensim.models.word2vec.Word2Vec.load_word2vec_format(pre_emb, binary=True)
+                pretrained = KeyedVectors.load(pre_emb, mmap='r')
 
                 # for i, line in enumerate(codecs.open(pre_emb, 'r', 'cp850')):
                 #     line = line.rstrip().split()
@@ -196,30 +196,26 @@ def build(self,
                 c_lower = 0
                 c_zeros = 0
                 # Lookup table initialization
-                for i in xrange(n_words):
+                for i in range(n_words):
                     word = self.id_to_word[i]
                     if word in pretrained:
                         new_weights[i] = pretrained[word]
                         c_found += 1
                     elif word.lower() in pretrained:
                         new_weights[i] = pretrained[word.lower()]
                         c_lower += 1
-                    elif re.sub('\d', '0', word.lower()) in pretrained:
+                    elif re.sub(r'\d', '0', word.lower()) in pretrained:
                         new_weights[i] = pretrained[
-                            re.sub('\d', '0', word.lower())
+                            re.sub(r'\d', '0', word.lower())
                         ]
                         c_zeros += 1
                 word_layer.embeddings.set_value(new_weights)
                 # print 'Loaded %i pretrained embeddings.' % len(pretrained)
-                print ('%i / %i (%.4f%%) words have been initialized with '
-                       'pretrained embeddings.') % (
-                           c_found + c_lower + c_zeros, n_words,
-                           100. * (c_found + c_lower + c_zeros) / n_words
-                )
-                print ('%i found directly, %i after lowercasing, '
-                       '%i after lowercasing + zero.') % (
-                           c_found, c_lower, c_zeros
-                )
+                logging.info('%i / %i (%.4f%%) words have been initialized with '
+                             'pretrained embeddings.', c_found + c_lower + c_zeros,
+                             n_words, 100. * (c_found + c_lower + c_zeros) / n_words)
+                logging.info('%i found directly, %i after lowercasing, '
+                             '%i after lowercasing + zero.', c_found, c_lower, c_zeros)
 
         #
         # Chars inputs
@@ -384,7 +380,7 @@ def build(self,
         lr_method_parameters = {}
 
         # Compile training function
-        print 'Compiling...'
+        logging.info('Compiling...')
        if training:
             updates = Optimization(clip=5.0).get_updates(lr_method_name, cost, params, **lr_method_parameters)
             f_train = theano.function(
@@ -412,3 +408,13 @@ def build(self,
         )
 
         return f_train, f_eval
+
+    @staticmethod
+    def load_word_embeddings(embeddings, mode='r'):
+        if isinstance(embeddings, KeyedVectors):
+            return embeddings
+        else:
+            if os.path.isfile(embeddings) and os.path.isfile(embeddings + 'vectors.npy'):
+                return KeyedVectors.load(embeddings, mmap=mode)
+            else:
+                raise IOError("{embeddings} cannot be found.".format(embeddings=embeddings))
```
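The initialization loop in `Model.build` above tries three lookups per vocabulary word: the surface form, the lowercased form, then the lowercased form with digits mapped to `0`. A standalone sketch of that fallback order (the function name `lookup_embedding` is illustrative, and a plain dict stands in for the loaded `KeyedVectors`):

```python
import re

def lookup_embedding(word, pretrained):
    """Return (vector, how), using the same fallback order as Model.build:
    exact match, then lowercase, then lowercase with digits zeroed."""
    if word in pretrained:
        return pretrained[word], 'found'
    lower = word.lower()
    if lower in pretrained:
        return pretrained[lower], 'lower'
    zeroed = re.sub(r'\d', '0', lower)
    if zeroed in pretrained:
        return pretrained[zeroed], 'zeros'
    return None, 'missing'

vectors = {'acm': [0.1], '0000': [0.2]}
print(lookup_embedding('ACM', vectors))   # matched after lowercasing
print(lookup_embedding('1998', vectors))  # matched after zeroing digits
```

Digit zeroing is why years and page numbers still get a sensible vector: every all-digit token of the same length shares one embedding.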

nn.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -59,18 +59,20 @@ class EmbeddingLayer(object):
     Output: tensor of dimension (dim*, output_dim)
     """
 
-    def __init__(self, input_dim, output_dim, name='embedding_layer'):
+    def __init__(self, input_dim, output_dim, name='embedding_layer', train=True):
         """
         Typically, input_dim is the vocabulary size,
         and output_dim the embedding dimension.
         """
         self.input_dim = input_dim
         self.output_dim = output_dim
         self.name = name
+        self.train = train
 
         # Randomly generate weights
         self.embeddings = shared((input_dim, output_dim),
-                                 self.name + '__embeddings')
+                                 self.name + '__embeddings',
+                                 train=self.train)
 
         # Define parameters
         self.params = [self.embeddings]
```
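The new `train` flag lets the `shared` helper mark the embedding matrix as frozen, so the word layer can be excluded from gradient updates when its weights come from read-only, memory-mapped vectors. A toy sketch of that pattern (the `SharedParam` class and `trainable` attribute are illustrative stand-ins; the repository's actual `shared` helper in `utils.py` is not reproduced here):

```python
import numpy as np

class SharedParam(object):
    """Stand-in for a Theano shared variable carrying a trainable flag."""
    def __init__(self, shape, name, train=True):
        self.value = np.zeros(shape, dtype='float32')
        self.name = name
        self.trainable = train

class EmbeddingLayer(object):
    def __init__(self, input_dim, output_dim, name='embedding_layer', train=True):
        self.embeddings = SharedParam((input_dim, output_dim),
                                      name + '__embeddings', train=train)
        self.params = [self.embeddings]

def trainable_params(layers):
    # The optimizer would only receive parameters whose flag is set.
    return [p for layer in layers for p in layer.params if p.trainable]

frozen = EmbeddingLayer(10, 4, name='word_layer', train=False)
tuned = EmbeddingLayer(10, 4, name='char_layer', train=True)
print(len(trainable_params([frozen, tuned])))  # 1
```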

requirements/dev.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -1 +1,4 @@
 -r prod.txt
+pylint==1.9.2
+pytest==3.5.1
+ipython==5.7.0
```

requirements/prod.txt

Lines changed: 3 additions & 3 deletions

```diff
@@ -1,3 +1,3 @@
-gensim==0.13.2
-theano==0.10.b4
-numpy==1.13.3
+gensim==3.5.0
+theano==1.0.2
+numpy==1.14.5
```

requirements/test.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -0,0 +1,3 @@
+-r prod.txt
+pylint==1.9.2
+pytest==3.5.1
```
