Adding lemmatizer and ner to pipeline #7091
-
The attribute ruler doesn't include rules by default, so you'll need to add rules that map/copy […] If you have separate training corpora, it works best to have separate […] We usually train separate pipelines for each corpus (so no freezing, just one config with tagger+parser, one config with ner) and then use the […]
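The reply above mentions that the attribute ruler ships with no rules. A minimal sketch of adding a tag-to-POS mapping rule (assuming spaCy v3; the `"NN"` → `"NOUN"` mapping is just an illustrative example, not a full Swedish tag map):

```python
import spacy

# Blank Swedish pipeline; in practice the tagger would be sourced from a
# trained model and the attribute_ruler added after it.
nlp = spacy.blank("sv")
ruler = nlp.add_pipe("attribute_ruler")

# Copy the fine-grained tag "NN" to the coarse POS "NOUN" (example mapping;
# use the tagset your sourced tagger actually produces).
ruler.add(patterns=[[{"TAG": "NN"}]], attrs={"POS": "NOUN"})

doc = nlp.make_doc("hus")
doc[0].tag_ = "NN"   # normally set by the tagger component
doc = ruler(doc)
print(doc[0].pos_)   # -> NOUN
```

With such rules in place, `token.pos_` is filled in, which is what the rule-based lemmatizer needs.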
-
Hi @adrianeboyd, as you can imagine I managed to train my model following your instructions, thank you for your help :) Now I have another question. I want to make a spaCy project with my pipeline so that you might consider adding Swedish to your core models. I was looking at the existing transformer models, and even uncompressed they are at most ~500 MB. My model would have one transformer for the tagger+parser and one for ner, resulting in almost 1 GB. Thank you again!
-
I'm trying to train a pipeline for Swedish that will do tagging, parsing, lemmatizing, sentence segmentation and ner.
I have to use different datasets for the tagger and parser on the one hand and the ner component on the other hand because I don't have everything annotated on the same data.
I have a couple of questions:
1. I can't seem to add a lemmatizer to the already trained tagger. I sourced the tagger from a trained model and I added an `AttributeRuler` to the pipeline, but I still get the warning that the lemmatizer won't work. Does the lemmatizer need to be added to the pipeline at the same time as the tagger? I am hypothesizing that since the tagger is frozen in my pipeline, the `AttributeRuler` isn't doing anything and `token.pos_` is still empty.
2. I keep getting a warning that the performance of the tagger and parser will be degraded if I freeze them and keep training the transformer alone. As I understand it, I need the transformer for the ner component, so I can't freeze it as well. I tried to use `replace_listeners = ["model.tok2vec"]` to make a copy and decouple the tagger and parser from the transformer, but it doesn't seem to be working. What is it that I am not getting?