Train NER based on existing model or add custom trained NER to existing model error #7149
In spaCy < 3.0 I was able to train the NER component within the trained en_core_web_sm model:
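Roughly like this - a minimal sketch using the spaCy 2.x API (the training example below is a made-up placeholder, not the real annotations):

```python
import random
import spacy

# spaCy 2.x style: update the existing NER of a pretrained pipeline in place.
# TRAIN_DATA is a hypothetical placeholder for the real training data.
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup", {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# Disable the other pipes so only the NER weights are updated.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
```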
Specifically, I need the tagger and the parser of the en_core_web_lg model. These components can be added with the corresponding source and then inserted into frozen_components in the training section of the config file (I will provide my full config at the end of this question):
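The relevant sections would look roughly like this (a sketch, not my full config):

```ini
[components.tagger]
source = "en_core_web_lg"

[components.parser]
source = "en_core_web_lg"

[training]
frozen_components = ["tagger","parser"]
```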
When I'm debugging, the following error occurs:

When I put the tagger into the disabled components in the nlp section of the config file, or if I delete everything related to the tagger, debugging and training work. However, when applying the trained model to a text loaded into a doc, only the trained NER works and none of the other components do - e.g. the parser predicts that everything is ROOT.

I also tried to train the NER model on its own and then add it to the loaded en_core_web_sm model:
This leads to the following error:

```
Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer
```

Does anyone have a suggestion how I can either train my NER with the en_core_web_sm model or how I could integrate my trained component? Here's the entire config file:
Your Environment

Operating System: Windows 10
I think there are a few different issues going on here:
Freezing the sourced tagger & parser should indeed work like this, and you are right to try and disconnect the `tok2vec` layer with the `replace_listeners` function. However, it's not entirely working like you think. The `tagger` and `parser` just know that they're listening to an upstream `tok2vec` component, and by creating a new one with `factory = "tok2vec"`, you're basically still hooking them up to a tok2vec component - one that gets trained for the NER and no longer produces the representations they were trained on.
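For reference, a sketch of the relevant config sections (assuming the tagger and parser are sourced from en_core_web_lg as described in the question):

```ini
[components.tagger]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]

[components.parser]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]
```

With `replace_listeners`, each sourced component gets its own copy of the original tok2vec layer baked into its model, so it no longer depends on any `tok2vec` component in the new pipeline.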
You'll see that once you do this - freezing the tagger/parser and disconnecting them from their original `tok2vec` - their predictions stay intact.

Now, that leaves the problem of how to deal with the NER. There are basically three options.
One option is to define the NER with an internal (non-listener) tok2vec. This means that the NER will just use its internal tok2vec layer, and in fact training the NER will NOT impact the parser/tagger - even if you don't freeze those or use `replace_listeners`.
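A sketch of an NER with an internal tok2vec in the config (the layer sizes here are illustrative defaults, not tuned values):

```ini
[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
```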
Another option is to create a new `tok2vec` component that only the NER listens to. Again, this will avoid conflicts with the tok2vec of the tagger/parser, but again it means training a new tok2vec layer from scratch.
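A sketch of that setup (the NER model itself stays as in the previous sketch; only its tok2vec sublayer changes, and `width` must match the upstream tok2vec output, which is 96 for the default model):

```ini
[components.tok2vec]
factory = "tok2vec"

# The ner model is defined as in the previous sketch; only its tok2vec
# sublayer becomes a listener connected to the component above.
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "tok2vec"
```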
Your final point is a little bit unclear to me. I can't reproduce it. Are you perhaps trying to run the pretrained tagger on gold data that contains labels that are not included in its label set?
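To check that, you could compare the tagger's label set against the labels in your gold data - a quick sketch:

```python
import spacy

# Print the labels the pretrained tagger actually knows.
nlp = spacy.load("en_core_web_lg")
print(nlp.get_pipe("tagger").labels)
```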