Using NER component from en_core_web_trf in custom pipeline #13120
-
Hi everyone, I have successfully trained a transformer pipeline with custom textcat_multilabel, spancat, and NER components. In addition, I also want the NER component from `en_core_web_trf` to be available in my pipeline, and I have attempted this with both the config and the code approach. Essentially, I start with a clean `en_core_web_trf` model, then load my custom model and merge its components into the `en_core_web_trf` pipeline.
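Roughly, a merge along these lines can be done with `nlp.add_pipe(..., source=...)`; this is just an illustrative sketch with placeholder paths and component names, not my exact code:

```python
import spacy

# Illustrative sketch only: the custom model path and renamed component are placeholders.
trf_nlp = spacy.load("en_core_web_trf")
custom_nlp = spacy.load("./my_custom_model")

# Copy the custom components over by sourcing them from the custom pipeline;
# the custom NER is renamed so it doesn't clash with en_core_web_trf's "ner".
for name in ["textcat_multilabel", "spancat"]:
    trf_nlp.add_pipe(name, source=custom_nlp)
trf_nlp.add_pipe("ner", source=custom_nlp, name="custom_ner")

# Save the merged pipeline to disk.
trf_nlp.to_disk("./merged_model")
```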
When I use the merged model, the output directory I save it to is over 2 GB, which is a bit surprising, since my custom model used shared transformer layers before I merged it with `en_core_web_trf`. After running the above, each of my components (spancat, textcat, and NER) had a much larger model in its corresponding subdirectory than before the merge. I expected the `en_core_web_trf` model to keep its own transformer, but I thought the shared embedding layer from my custom model's training would stay; instead, it seems each component ends up with an independent transformer after I run the above code. I did also try starting with my custom model and adding `en_core_web_trf`'s NER to it, but that didn't yield good predictions at all.

Question 1: I understand if the `en_core_web_trf` model needs to keep its own transformer, but is there a way to keep my shared embedding layer for my custom-trained components?

Question 2: Is the NER training data for `en_core_web_trf` available somewhere so that I can train my own NER component that includes both the spaCy labels and my own custom ones? I'm mostly looking for MONEY, DATE, and CARDINAL, as I think ORG and PERSON are easier to find. Is this a viable approach, or should I just focus on what I am doing above?

Thank you in advance for any guidance!
-
Don't use `replace_listeners`; instead, create your custom pipeline with a custom transformer name. These are the kinds of changes you'd need in your config (shown just for `transformer` + `ner` as an example). The main thing is to give the new transformer component a custom name and to also specify that name in `upstream` for the components listening to it:

```diff
--- /tmp/ner_orig.cfg	2023-11-09 08:28:52.807778529 +0100
+++ /tmp/ner.cfg	2023-11-09 08:28:42.927768314 +0100
@@ -10,7 +10,7 @@
 [nlp]
 lang = "en"
-pipeline = ["transformer","ner"]
+pipeline = ["custom_transformer","ner"]
 batch_size = 128
 disabled = []
 before_creation = null
@@ -41,29 +41,29 @@
 @architectures = "spacy-transformers.TransformerListener.v1"
 grad_factor = 1.0
 pooling = {"@layers":"reduce_mean.v1"}
-upstream = "*"
+upstream = "custom_transformer"
 
-[components.transformer]
+[components.custom_transformer]
 factory = "transformer"
 max_batch_items = 4096
 set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
 
-[components.transformer.model]
+[components.custom_transformer.model]
 @architectures = "spacy-transformers.TransformerModel.v3"
 name = "roberta-base"
 mixed_precision = false
 
-[components.transformer.model.get_spans]
+[components.custom_transformer.model.get_spans]
 @span_getters = "spacy-transformers.strided_spans.v1"
 window = 128
 stride = 96
 
-[components.transformer.model.grad_scaler_config]
+[components.custom_transformer.model.grad_scaler_config]
 
-[components.transformer.model.tokenizer_config]
+[components.custom_transformer.model.tokenizer_config]
 use_fast = true
 
-[components.transformer.model.transformer_config]
+[components.custom_transformer.model.transformer_config]
 
 [corpora]
```
When you combine the pipelines, source the `transformer` and `ner` components from `en_core_web_trf`, and the renamed `custom_transformer` plus your custom components from your own pipeline.

Just as a note, you can't interleave the separate groups of transformer+components in your pipeline or it can lead to gibberish, but if you keep the components from each pipeline grouped together it should be fine.

The NER data is from OntoNotes, which can be licensed from the LDC, but it's pricey and definitely not worth it for those categories. I've always wondered whether simpler rule-based matching libraries, especially for times/dates, would do just as well.
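Sketched in code, the combination step might look something like the following; the paths and component names are placeholders, and it assumes the custom pipeline was retrained with the `custom_transformer` naming from the diff above:

```python
import spacy

# Placeholder paths/names; assumes the custom pipeline has a "custom_transformer"
# component with its listeners pointing at it, as in the config diff above.
custom_nlp = spacy.load("./my_custom_model")
trf_nlp = spacy.load("en_core_web_trf")

combined = spacy.blank("en")

# Group 1: the custom transformer followed by the components that listen to it.
for name in ["custom_transformer", "textcat_multilabel", "spancat", "ner"]:
    combined.add_pipe(name, source=custom_nlp)

# Group 2: en_core_web_trf's transformer and NER, kept contiguous and not
# interleaved with group 1; the sourced NER is renamed to avoid a name clash.
combined.add_pipe("transformer", source=trf_nlp)
combined.add_pipe("ner", source=trf_nlp, name="trf_ner")

combined.to_disk("./combined_model")
```

The key point, as noted above, is that each transformer stays immediately in front of its own group of listener components.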
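On the rule-based idea, something like spaCy's own `entity_ruler` (or a dedicated date/number parsing library) can cover a lot of DATE/MONEY-style mentions. A toy sketch with made-up, far-from-exhaustive patterns:

```python
import spacy

# Toy example of rule-based DATE/MONEY matching with the entity_ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MONEY", "pattern": [{"TEXT": "$"}, {"LIKE_NUM": True}]},
    {"label": "DATE", "pattern": [{"SHAPE": "dd/dd/dddd"}]},
])

doc = nlp("The $1200 invoice is due on 01/15/2024.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('$1200', 'MONEY'), ('01/15/2024', 'DATE')]
```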