Using NER component from en_core_web_trf in custom pipeline #13120
-
Hi everyone, I have successfully trained a transformer pipeline with custom textcat_multilabel, spancat, and NER components. In addition, I also want the NER component from `en_core_web_trf` to be available in my pipeline, and I have attempted this with both the config and the code approach. Essentially, I start with a clean `en_core_web_trf` model, then load my custom model and merge its components into the `en_core_web_trf` pipeline.
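Roughly, a merge along these lines can be done with `nlp.add_pipe(..., source=...)`; this is just an illustrative sketch with placeholder paths and component names, not my exact code:

```python
import spacy

# Illustrative sketch only: the custom model path and renamed component are placeholders.
trf_nlp = spacy.load("en_core_web_trf")
custom_nlp = spacy.load("./my_custom_model")

# Copy the custom components over by sourcing them from the custom pipeline;
# the custom NER is renamed so it doesn't clash with en_core_web_trf's "ner".
for name in ["textcat_multilabel", "spancat"]:
    trf_nlp.add_pipe(name, source=custom_nlp)
trf_nlp.add_pipe("ner", source=custom_nlp, name="custom_ner")

# Save the merged pipeline to disk.
trf_nlp.to_disk("./merged_model")
```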
When I use the merged model, the output directory I save it to is over 2 GB, which is a bit surprising, since my custom model used shared transformer layers before I merged it with `en_core_web_trf`. After running the above, each of my components (spancat, textcat, and NER) had a much larger model in its corresponding subdirectory than before the merge. I expected the `en_core_web_trf` model to keep its own transformer, but I thought the shared embedding layer from my custom model's training would stay; instead, it seems each component ends up with an independent transformer after I run the above code. I did also try starting with my custom model and adding `en_core_web_trf`'s NER to it, but that didn't yield good predictions at all.

Question 1: I understand if the `en_core_web_trf` model needs to keep its own transformer, but is there a way to keep my shared embedding layer for my custom-trained components?

Question 2: Is the NER training data for `en_core_web_trf` available somewhere so that I can train my own NER component that includes both the spaCy labels and my own custom ones? I'm mostly looking for MONEY, DATE, and CARDINAL, as I think ORG and PERSON are easier to find. Is this a viable approach, or should I just focus on what I am doing above?

Thank you in advance for any guidance!
-
Don't use `replace_listeners`; instead, create your custom pipeline with a custom transformer name. These are the kinds of changes you'd need in your config (shown just for `transformer` + `ner` as an example). The main thing is to give the new transformer component a custom name and to also specify that name in `upstream` for the components listening to it:

```diff
--- /tmp/ner_orig.cfg	2023-11-09 08:28:52.807778529 +0100
+++ /tmp/ner.cfg	2023-11-09 08:28:42.927768314 +0100
@@ -10,7 +10,7 @@
 [nlp]
 lang = "en"
-pipeline = ["transformer","ner"]
+pipeline = ["custom_transformer","ner"]
 batch_size = 128
 disabled = []
 before_creation = null
@@ -41,29 +41,29 @@
 @architectures = "spacy-transformers.TransformerListener.v1"
 grad_factor = 1.0
 pooling = {"@layers":"reduce_mean.v1"}
-upstream = "*"
+upstream = "custom_transformer"
 
-[components.transformer]
+[components.custom_transformer]
 factory = "transformer"
 max_batch_items = 4096
 set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
 
-[components.transformer.model]
+[components.custom_transformer.model]
 @architectures = "spacy-transformers.TransformerModel.v3"
 name = "roberta-base"
 mixed_precision = false
 
-[components.transformer.model.get_spans]
+[components.custom_transformer.model.get_spans]
 @span_getters = "spacy-transformers.strided_spans.v1"
 window = 128
 stride = 96
 
-[components.transformer.model.grad_scaler_config]
+[components.custom_transformer.model.grad_scaler_config]
 
-[components.transformer.model.tokenizer_config]
+[components.custom_transformer.model.tokenizer_config]
 use_fast = true
 
-[components.transformer.model.transformer_config]
+[components.custom_transformer.model.transformer_config]
 
 [corpora]
```
When you combine the pipelines, source the `transformer` and `ner` components from `en_core_web_trf`, and the renamed `custom_transformer` plus your custom components from your own pipeline.

Just as a note, you can't interleave the separate groups of transformer+components in your pipeline or it can lead to gibberish, but if you keep the components from each pipeline grouped together it should be fine.

The NER data is from OntoNotes, which can be licensed from the LDC, but it's pricey and definitely not worth it for those categories. I've always wondered whether simpler rule-based matching libraries, especially for times/dates, would do just as well.
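Sketched in code, the combination step might look something like the following; the paths and component names are placeholders, and it assumes the custom pipeline was retrained with the `custom_transformer` naming from the diff above:

```python
import spacy

# Placeholder paths/names; assumes the custom pipeline has a "custom_transformer"
# component with its listeners pointing at it, as in the config diff above.
custom_nlp = spacy.load("./my_custom_model")
trf_nlp = spacy.load("en_core_web_trf")

combined = spacy.blank("en")

# Group 1: the custom transformer followed by the components that listen to it.
for name in ["custom_transformer", "textcat_multilabel", "spancat", "ner"]:
    combined.add_pipe(name, source=custom_nlp)

# Group 2: en_core_web_trf's transformer and NER, kept contiguous and not
# interleaved with group 1; the sourced NER is renamed to avoid a name clash.
combined.add_pipe("transformer", source=trf_nlp)
combined.add_pipe("ner", source=trf_nlp, name="trf_ner")

combined.to_disk("./combined_model")
```

The key point, as noted above, is that each transformer stays immediately in front of its own group of listener components.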
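On the rule-based idea, something like spaCy's own `entity_ruler` (or a dedicated date/number parsing library) can cover a lot of DATE/MONEY-style mentions. A toy sketch with made-up, far-from-exhaustive patterns:

```python
import spacy

# Toy example of rule-based DATE/MONEY matching with the entity_ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MONEY", "pattern": [{"TEXT": "$"}, {"LIKE_NUM": True}]},
    {"label": "DATE", "pattern": [{"SHAPE": "dd/dd/dddd"}]},
])

doc = nlp("The $1200 invoice is due on 01/15/2024.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('$1200', 'MONEY'), ('01/15/2024', 'DATE')]
```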