Ensemble NER approach - queries #8274
-
Hi, I'm currently stuck with an approach to extract named entities. Some useful details:

The goal is to predict the following labels: therapeutic areas, product / drug names, diseases, chemicals, and geography (countries, continents, and a few abbreviations such as US, USA and EU).

My approach is to build something like an ensemble of NER models. Composite NER-1 identifies the labels therapeutic areas, product / drug names, and geography. Then I plan to package all the NER models together in sequence to identify all the labels.

For the composite NER: first I generate the training and test sets using the EntityRuler approach, then train a combined NER model from scratch that identifies all 3 labels. The performance metrics look too good, so I'm a bit perplexed about what is happening here. I'm aware that this approach may incorporate flawed context into the data, but I thought it was a middle way to get annotated data quickly and train an ML-based NER.

Total text samples: approximately 10,000, with varied lengths of 200 to 1500 words (English). I include every sample in the split even when it contains no entity for any label, i.e. 1500 samples have no relevant entities. Data split: 75% (train set), 25% (test or hold-out set).

The config.cfg file (section headers only):

```
[system]
[nlp]
[components]
[components.ner]
[components.ner.model]
[components.ner.model.tok2vec]
[components.tok2vec]
[components.tok2vec.model]
[components.tok2vec.model.embed]
[components.tok2vec.model.encode]
[corpora]
[corpora.dev]
[corpora.train]
[training]
[training.batcher]
[training.batcher.size]
[training.logger]
[training.optimizer]
[training.score_weights]
[pretraining]
[initialize]
[initialize.components]
[initialize.tokenizer]
```

The snapshot of the training process: [screenshot not included]

Query-1: Currently it appears that the model is probably overfitting, or learning the rules but unable to take context into account when predicting the labels. If I want to enhance the generalisation capability of the model, how can I do it? (Manually annotating the data is my last preference; a human-in-the-loop setup is planned but currently not possible.)

Query-2: Which evaluation scheme is used to compute the performance metrics (precision, recall, etc.) during training? Is it based on IOB tags or BILOU tags? Is ents_f a weighted F-score?

Query-3: Is the training process appropriate in your experience? What does the tok2vec loss signify, and why does it fluctuate so much? The NER loss does decrease but still fluctuates. Is there a way to plot the losses?

Query-4: If I move ahead and add the trained (composite NER-1) and pre-trained (en_ner_bc5cdr_md) models to a single pipeline and package it for deployment, how do I do that, and is there a blog post that covers it?

I wanted to write out the thought process behind the approach so that the queries have some background. Please let me know if you need further information from my side; any suggestions are welcome. Thanks.
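For illustration, the generation step looks roughly like the following; the labels and patterns shown are placeholders for my real term lists:

```python
import spacy
from spacy.tokens import DocBin

# Blank English pipeline with only an entity_ruler; its patterns produce
# the "silver" annotations the NER model is later trained on.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GEO", "pattern": "EU"},
    {"label": "GEO", "pattern": [{"LOWER": "germany"}]},
    {"label": "DRUG", "pattern": [{"LOWER": "aspirin"}]},
])

texts = ["Aspirin is widely prescribed in the EU."]  # ~10,000 samples in practice
doc_bin = DocBin()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)  # doc.ents was set by the entity_ruler
doc_bin.to_disk("./train.spacy")  # used as the training corpus in config.cfg
```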
-
While background on a problem is helpful, this post is a lot of information. Honestly, it would be easier for us to help you if you broke things down into more specific questions. One question about your system: what do you mean by "composite NER"? I'm not clear what "composite" means when referring to a single component.
Since you generated the training data using rules, it sounds like your model is just memorizing some keywords. Keep in mind that your test set was annotated by the same rules, so high scores there mostly measure how well the model reproduces them. If that's good enough for your project, maybe an explicit list of keywords and rule-based tagging is all you need. If not, you should look into data augmentation: by replacing keywords or changing their spelling slightly, you can make the model rely more on context. Take a look at the augmentation section in the docs and libraries like nlpaug as a starting point. Note that the process you used to create artificial training data is called "weak supervision", so searching for that term will turn up the common issues its users have encountered.
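For example, one way to implement keyword swapping is a custom augmenter. This is a minimal sketch, assuming hypothetical label names and replacement lists; in practice you'd reuse the terms from your EntityRuler patterns:

```python
import random
import spacy
from spacy.training import Example

# Hypothetical replacement vocabulary per label.
REPLACEMENTS = {
    "GEO": ["US", "EU", "Germany", "Southeast Asia"],
    "DRUG": ["aspirin", "ibuprofen", "metformin"],
}

@spacy.registry.augmenters("entity_swap.v1")
def create_entity_swap_augmenter(level: float):
    def augment(nlp, example):
        yield example  # always keep the original example
        ref = example.reference
        if not ref.ents or random.random() >= level:
            return
        # Rebuild the text with randomly swapped entity surface forms,
        # tracking the new character offsets as we go.
        pieces, ents, last = [], [], 0
        for ent in ref.ents:
            pieces.append(ref.text[last:ent.start_char])
            new = random.choice(REPLACEMENTS.get(ent.label_, [ent.text]))
            start = sum(len(p) for p in pieces)
            pieces.append(new)
            ents.append((start, start + len(new), ent.label_))
            last = ent.end_char
        pieces.append(ref.text[last:])
        doc = nlp.make_doc("".join(pieces))
        yield Example.from_dict(doc, {"entities": ents})
    return augment
```

It would then be hooked up in config.cfg:

```ini
[corpora.train.augmenter]
@augmenters = "entity_swap.v1"
level = 0.3
```

with the registering module passed to training as `spacy train config.cfg --code augment.py`.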
For plotting the losses and other training metrics, check the Weights & Biases integration.
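The hookup lives in the [training.logger] block of config.cfg; roughly like this (the project name is a placeholder, it needs `pip install wandb`, and the logger version string depends on your spaCy release):

```ini
[training.logger]
@loggers = "spacy.WandbLogger.v2"
project_name = "composite-ner"
```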
On a quick look it looks fine. If you use the quickstart as a base you should be OK.
The tok2vec layer has its own loss values because it's a layer in the model that's being trained. Since you're only training NER, that's the only objective, so the tok2vec loss is just what gets backpropagated from the NER objective. I wouldn't worry about it fluctuating.
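Concretely, in a quickstart-style config the ner component doesn't own its encoder; it listens to the shared tok2vec component, which is why the training log has a separate tok2vec loss column. The relevant blocks look roughly like this (the exact width setting depends on your generated config):

```ini
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```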
We don't have a specific guide for multiple NER models because it should basically just work. The things to watch out for: the components need separate names in the config (they can't both be just "ner"), and order matters, since NER components won't overwrite existing entities, so models earlier in your pipeline take precedence. This thread covers a pipeline similar to what you're planning; there's also a sketch below.
The issue here is that an NER layer really needs to be combined with the tok2vec layer it was trained with. If you're using a pretrained NER model without updating it, you need to use the `replace_listeners` functionality so the sourced component keeps its own copy of that tok2vec layer.
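A minimal sketch of the combination, assuming your trained pipeline is saved at ./composite_ner_1, en_ner_bc5cdr_md is installed, and its ner uses a shared tok2vec listener (all names and paths here are placeholders):

```python
import spacy

# Load the pretrained biomedical pipeline and give its ner a private copy
# of the tok2vec weights, so it no longer listens for a tok2vec component
# that won't exist in the combined pipeline.
bc5cdr = spacy.load("en_ner_bc5cdr_md")
bc5cdr.replace_listeners("tok2vec", "ner", ["model.tok2vec"])

# Load your own pipeline and source the biomedical ner into it under a
# unique name. Earlier components keep precedence for overlapping spans.
nlp = spacy.load("./composite_ner_1")
nlp.add_pipe("ner", source=bc5cdr, name="ner_bc5cdr", last=True)

nlp.to_disk("./combined_pipeline")  # then `spacy package` it for deployment
```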