Commit f48e42c

Merge pull request #117 from inspirehep/keras-2-compatibility
Release Magpie 2.0
2 parents: 1ca6ea1 + 0b7f6f9

File tree: 7 files changed (+85, -57 lines)


README.md

Lines changed: 12 additions & 10 deletions

@@ -4,10 +4,9 @@ Magpie is a deep learning tool for multi-label text classification. It learns on
 
 ## Very short introduction
 ```
->>> from magpie import MagpieModel
->>> magpie = MagpieModel()
+>>> magpie = Magpie()
 >>> magpie.init_word_vectors('/path/to/corpus', vec_dim=100)
->>> magpie.train('/path/to/corpus', ['label1', 'label2', 'label3'], nb_epochs=3)
+>>> magpie.train('/path/to/corpus', ['label1', 'label2', 'label3'], epochs=3)
 Training...
 >>> magpie.predict_from_text('Well, that was quick!')
 [('label1', 0.96), ('label3', 0.65), ('label2', 0.21)]
@@ -24,9 +23,9 @@ $ ls data/hep-categories
 
 Before you train the model, you need to build appropriate word vector representations for your corpus. In theory, you can train them on a different corpus or reuse already trained ones ([tutorial](http://rare-technologies.com/word2vec-tutorial/)), however Magpie enables you to do that as well.
 ```python
-from magpie import MagpieModel
+from magpie import Magpie
 
-magpie = MagpieModel()
+magpie = Magpie()
 magpie.train_word2vec('data/hep-categories', vec_dim=100)
 ```
 
@@ -41,10 +40,10 @@ You would usually want to combine those two steps, by simply running:
 magpie.init_word_vectors('data/hep-categories', vec_dim=100)
 ```
 
-If you plan to reuse the trained word representations, you might want to save them and pass in the constructor to `MagpieModel` next time. For the training, just type:
+If you plan to reuse the trained word representations, you might want to save them and pass in the constructor to `Magpie` next time. For the training, just type:
 ```python
 labels = ['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']
-magpie.train('data/hep-categories', labels, test_ratio=0.2, nb_epochs=30)
+magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=30)
 ```
 By providing the `test_ratio` argument, the model splits data into train & test datasets (in this example into 80/20 ratio) and evaluates itself after every epoch displaying it's current loss and accuracy. The default value of `test_ratio` is 0 meaning that all the data will be used for training.
 
@@ -63,7 +62,7 @@ Trained models can be used for prediction with methods:
 ('Theory-HEP', 0.20917746)]
 ```
 ## Saving & loading the model
-A `MagpieModel` object consists of three components - the word2vec mappings, a scaler and a `keras` model. In order to train Magpie you can either provide the word2vec mappings and a scaler in advance or let the program compute them for you on the training data. Usually you would want to train them yourself on a full dataset and reuse them afterwards. You can use the provided functions for that purpose:
+A `Magpie` object consists of three components - the word2vec mappings, a scaler and a `keras` model. In order to train Magpie you can either provide the word2vec mappings and a scaler in advance or let the program compute them for you on the training data. Usually you would want to train them yourself on a full dataset and reuse them afterwards. You can use the provided functions for that purpose:
 
 ```python
 magpie.save_word2vec_model('/save/my/embeddings/here')
@@ -74,7 +73,7 @@ magpie.save_model('/save/my/model/here.h5')
 When you want to reinitialize your trained model, you can run:
 
 ```python
-magpie = MagpieModel(
+magpie = Magpie(
     keras_model='/save/my/model/here.h5',
     word2vec_model='/save/my/embeddings/here',
     scaler='/save/my/scaler/here',
@@ -87,9 +86,12 @@ or just pass the objects directly!
 
 The package is not on PyPi, but you can get it directly from GitHub:
 ```
-$ pip install git+https://github.com/inspirehep/magpie.git@v1.0
+$ pip install git+https://github.com/inspirehep/magpie.git@v2.0
 ```
 If you encounter any problems with the installation, make sure to install the correct versions of dependencies listed in `setup.py` file.
 
+## Magpie v1.0 vs v2.0
+Magpie v1.0 depends on Keras v1.X, while Magpie v2.0 on Keras v2.X. You can install and use either of those, but bear in mind that only v2.0 will be developed in the future. If you have troubles with installation, make sure that both Magpie and Keras have the same major version.
+
 ## Contact
 If you have any problems, feel free to open an issue. We'll do our best to help :+1:
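The API changes in this diff are mechanical renames: `MagpieModel` becomes `Magpie`, and the `nb_epochs` keyword follows Keras 2's rename to `epochs`. As an illustration only (not part of Magpie or Keras), a keyword-renaming shim like the sketch below is one way a library can keep old call sites working during such a transition; all names here are hypothetical:

```python
import warnings


def accept_legacy_kwargs(**renames):
    """Decorator mapping deprecated keyword names (e.g. nb_epochs)
    onto their new names (e.g. epochs), with a deprecation warning."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for old, new in renames.items():
                if old in kwargs:
                    warnings.warn(
                        "'%s' is deprecated, use '%s'" % (old, new),
                        DeprecationWarning,
                    )
                    kwargs[new] = kwargs.pop(old)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@accept_legacy_kwargs(nb_epochs='epochs')
def train(epochs=1):
    # stand-in for a training method; just echoes the resolved value
    return epochs


print(train(nb_epochs=3))  # old keyword is forwarded to the new name -> 3
```

The commit instead opted for a clean break (old keywords raise `TypeError`), which is simpler but requires callers to pin matching major versions, as the README's "Magpie v1.0 vs v2.0" note advises.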

magpie/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-from .main import MagpieModel
+from .main import Magpie

magpie/config.py

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@
 
 # Training parameters
 BATCH_SIZE = 64
-NB_EPOCHS = 1
+EPOCHS = 1
 
 # Number of tokens to save from the abstract, zero padded
 SAMPLE_LENGTH = 200

magpie/main.py

Lines changed: 13 additions & 9 deletions

@@ -1,5 +1,6 @@
 from __future__ import unicode_literals, print_function, division
 
+import math
 import os
 import sys
 from six import string_types
@@ -9,13 +10,13 @@
 
 from magpie.base.document import Document
 from magpie.base.word2vec import train_word2vec, fit_scaler
-from magpie.config import NN_ARCHITECTURE, BATCH_SIZE, EMBEDDING_SIZE, NB_EPOCHS
+from magpie.config import NN_ARCHITECTURE, BATCH_SIZE, EMBEDDING_SIZE, EPOCHS
 from magpie.nn.input_data import get_data_for_model
 from magpie.nn.models import get_nn_model
 from magpie.utils import save_to_disk, load_from_disk
 
 
-class MagpieModel(object):
+class Magpie(object):
 
     def __init__(self, keras_model=None, word2vec_model=None, scaler=None,
                  labels=None):
@@ -38,7 +39,7 @@ def __init__(self, keras_model=None, word2vec_model=None, scaler=None,
 
     def train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
               nn_model=NN_ARCHITECTURE, batch_size=BATCH_SIZE, test_ratio=0.0,
-              nb_epochs=NB_EPOCHS, verbose=1):
+              epochs=EPOCHS, verbose=1):
         """
         Train the model on given data
         :param train_dir: directory with data files. Text files should end with
@@ -51,7 +52,7 @@ def train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
         :param batch_size: size of one batch
         :param test_ratio: the ratio of samples that will be withheld from training
             and used for testing. This can be overridden by test_dir.
-        :param nb_epochs: number of epochs to train
+        :param epochs: number of epochs to train
         :param verbose: 0, 1 or 2. As in Keras.
 
         :return: History object
@@ -99,7 +100,7 @@ def train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
             x_train,
             y_train,
             batch_size=batch_size,
-            nb_epoch=nb_epochs,
+            epochs=epochs,
             validation_data=test_data,
             validation_split=test_ratio,
             callbacks=callbacks or [],
@@ -108,7 +109,7 @@ def train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
 
     def batch_train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
                     nn_model=NN_ARCHITECTURE, batch_size=BATCH_SIZE,
-                    nb_epochs=NB_EPOCHS, verbose=1):
+                    epochs=EPOCHS, verbose=1):
         """
         Train the model on given data
         :param train_dir: directory with data files. Text files should end with
@@ -119,7 +120,7 @@ def batch_train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
         :param callbacks: objects passed to the Keras fit function as callbacks
         :param nn_model: string defining the NN architecture e.g. 'crnn'
        :param batch_size: size of one batch
-        :param nb_epochs: number of epochs to train
+        :param epochs: number of epochs to train
         :param verbose: 0, 1 or 2. As in Keras.
 
         :return: History object
@@ -163,10 +164,13 @@ def batch_train(self, train_dir, vocabulary, test_dir=None, callbacks=None,
             scaler=self.scaler,
         )
 
+        nb_of_files = len({filename[:-4] for filename in os.listdir(train_dir)})
+        steps_per_epoch = math.ceil(nb_of_files / batch_size)
+
         return self.keras_model.fit_generator(
             train_generator,
-            len({filename[:-4] for filename in os.listdir(train_dir)}),
-            nb_epochs,
+            steps_per_epoch=steps_per_epoch,
+            epochs=epochs,
             validation_data=test_data,
             callbacks=callbacks or [],
             verbose=verbose,
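The last hunk of `main.py` tracks a real semantic change in Keras 2: `fit_generator`'s second argument used to count samples per epoch, while `steps_per_epoch` counts generator *batches* per epoch, hence the new ceiling division over the file count. The arithmetic can be sanity-checked in isolation (the file counts below are made up for illustration):

```python
import math


def steps_per_epoch(nb_of_files, batch_size):
    # Keras 2 counts generator batches per epoch; a partial final
    # batch still needs its own step, hence the ceiling.
    return math.ceil(nb_of_files / batch_size)


print(steps_per_epoch(1000, 64))  # 1000 / 64 = 15.625 -> 16 steps
print(steps_per_epoch(128, 64))   # exact multiple -> 2 steps
```

Passing the raw file count (the Keras 1 behavior) to `steps_per_epoch` would make each "epoch" iterate `batch_size` times too much data, which is why the commit computes the quotient explicitly.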

magpie/nn/models.py

Lines changed: 31 additions & 30 deletions

@@ -1,8 +1,6 @@
-from keras.layers.convolutional import MaxPooling1D, Convolution1D
-from keras.layers.core import Flatten, Dropout, Dense, Merge
-from keras.layers.normalization import BatchNormalization
-from keras.layers.recurrent import GRU
-from keras.models import Sequential
+from keras.layers import Input, Dense, GRU, Dropout, BatchNormalization, \
+    MaxPooling1D, Conv1D, Flatten, Concatenate
+from keras.models import Model
 
 from magpie.config import SAMPLE_LENGTH
 
@@ -18,31 +16,33 @@ def get_nn_model(nn_model, embedding, output_length):
 
 def cnn(embedding_size, output_length):
     """ Create and return a keras model of a CNN """
+
     NB_FILTER = 256
     NGRAM_LENGTHS = [1, 2, 3, 4, 5]
 
-    conv_layers = []
+    conv_layers, inputs = [], []
+
     for ngram_length in NGRAM_LENGTHS:
-        ngram_layer = Sequential()
-        ngram_layer.add(Convolution1D(
+        current_input = Input(shape=(SAMPLE_LENGTH, embedding_size))
+        inputs.append(current_input)
+
+        convolution = Conv1D(
             NB_FILTER,
             ngram_length,
-            input_dim=embedding_size,
-            input_length=SAMPLE_LENGTH,
-            init='lecun_uniform',
+            kernel_initializer='lecun_uniform',
             activation='tanh',
-        ))
-        pool_length = SAMPLE_LENGTH - ngram_length + 1
-        ngram_layer.add(MaxPooling1D(pool_length=pool_length))
-        conv_layers.append(ngram_layer)
+        )(current_input)
 
-    model = Sequential()
-    model.add(Merge(conv_layers, mode='concat'))
+        pool_size = SAMPLE_LENGTH - ngram_length + 1
+        pooling = MaxPooling1D(pool_size=pool_size)(convolution)
+        conv_layers.append(pooling)
 
-    model.add(Dropout(0.5))
-    model.add(Flatten())
+    merged = Concatenate()(conv_layers)
+    dropout = Dropout(0.5)(merged)
+    flattened = Flatten()(dropout)
+    outputs = Dense(output_length, activation='sigmoid')(flattened)
 
-    model.add(Dense(output_length, activation='sigmoid'))
+    model = Model(inputs=inputs, outputs=outputs)
 
     model.compile(
         loss='binary_crossentropy',
@@ -57,20 +57,21 @@ def rnn(embedding_size, output_length):
     """ Create and return a keras model of a RNN """
     HIDDEN_LAYER_SIZE = 256
 
-    model = Sequential()
+    inputs = Input(shape=(SAMPLE_LENGTH, embedding_size))
 
-    model.add(GRU(
+    gru = GRU(
         HIDDEN_LAYER_SIZE,
-        input_dim=embedding_size,
-        input_length=SAMPLE_LENGTH,
-        init='glorot_uniform',
-        inner_init='normal',
+        input_shape=(SAMPLE_LENGTH, embedding_size),
+        kernel_initializer="glorot_uniform",
+        recurrent_initializer='normal',
         activation='relu',
-    ))
-    model.add(BatchNormalization())
-    model.add(Dropout(0.1))
+    )(inputs)
+
+    batch_normalization = BatchNormalization()(gru)
+    dropout = Dropout(0.1)(batch_normalization)
+    outputs = Dense(output_length, activation='sigmoid')(dropout)
 
-    model.add(Dense(output_length, activation='sigmoid'))
+    model = Model(inputs=inputs, outputs=outputs)
 
     model.compile(
         loss='binary_crossentropy',
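In both the Keras 1 and Keras 2 versions of `cnn`, the pooling window is `SAMPLE_LENGTH - ngram_length + 1`: the output length of an unpadded ('valid') 1D convolution, so max-pooling over it collapses each filter's activations to a single value per filter before concatenation. A small standalone check of that arithmetic:

```python
SAMPLE_LENGTH = 200  # as set in magpie.config


def conv_output_length(sample_length, kernel_size):
    # a 'valid' (no padding) convolution of width k over n timesteps
    # produces n - k + 1 positions
    return sample_length - kernel_size + 1


for ngram_length in [1, 2, 3, 4, 5]:
    pool_size = conv_output_length(SAMPLE_LENGTH, ngram_length)
    # pooling over all positions leaves exactly one timestep per filter
    print(ngram_length, pool_size)
```

This is why the commit could rename `pool_length` to `pool_size` without changing the value: only the Keras keyword changed, not the geometry.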

magpie/tests/test_api.py

Lines changed: 25 additions & 4 deletions

@@ -2,22 +2,43 @@
 import os
 import unittest
 
+from magpie import Magpie
+
 # This one is hacky, but I'm too lazy to do it properly!
 PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
 DATA_DIR = os.path.join(PROJECT_DIR, 'data', 'hep-categories')
 
 class TestAPI(unittest.TestCase):
     """ Basic integration test """
-    def test_integrity(self):
+    def test_cnn_train(self):
+        # Get them labels!
+        with io.open(DATA_DIR + '.labels', 'r') as f:
+            labels = {line.rstrip('\n') for line in f}
+
+        # Run the model
+        model = Magpie()
+        model.init_word_vectors(DATA_DIR, vec_dim=100)
+        history = model.train(DATA_DIR, labels, nn_model='cnn', test_ratio=0.3, epochs=3)
+        assert history is not None
+
+        # Do a simple prediction
+        predictions = model.predict_from_text("Black holes are cool!")
+        assert len(predictions) == len(labels)
+
+        # Assert the hell out of it!
+        for lab, val in predictions:
+            assert lab in labels
+            assert 0 <= val <= 1
+
+    def test_rnn_batch_train(self):
         # Get them labels!
         with io.open(DATA_DIR + '.labels', 'r') as f:
             labels = {line.rstrip('\n') for line in f}
 
         # Run the model
-        from magpie import MagpieModel
-        model = MagpieModel()
+        model = Magpie()
         model.init_word_vectors(DATA_DIR, vec_dim=100)
-        history = model.train(DATA_DIR, labels, test_ratio=0.3, nb_epochs=3)
+        history = model.batch_train(DATA_DIR, labels, nn_model='rnn', epochs=3)
         assert history is not None
 
         # Do a simple prediction

setup.py

Lines changed: 2 additions & 2 deletions

@@ -22,7 +22,7 @@
     # Versions should comply with PEP440. For a discussion on single-sourcing
     # the version across setup.py and the project code, see
     # https://packaging.python.org/en/latest/single_source_version.html
-    version='1.0',
+    version='2.0',
 
     description='Automatic text classification tool',
     # long_description=long_description,
@@ -73,7 +73,7 @@
         'scipy~=0.18',
         'gensim~=0.13',
         'scikit-learn~=0.18',
-        'keras~=1.2.2',
+        'keras~=2.0',
         'h5py~=2.6',
     ],
 
