spaCy Project: Part-of-speech Tagging & Dependency Parsing #9916

kanayer · 2021-12-21T08:43:58Z

I am currently working with this spaCy project template. Is it possible to train a lemmatizer along with the existing POS tagger, morphologizer, and dependency parser? Currently, the pipeline in the config file is:
["tok2vec","tagger","morphologizer","parser"]

The components field in the config file looks like this:

[components]

[components.morphologizer]
factory = "morphologizer"

[components.morphologizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

If it is possible to train a lemmatizer together with the rest of the pipeline, could you please help me understand what kind of component to write into the config file and what kind of changes need to be made?

Answered by adrianeboyd

adrianeboyd · 2021-12-21T09:08:30Z

There is not currently a trainable lemmatizer in the core spaCy library. There's an experimental trainable edit tree lemmatizer in development, see:

https://explosion.ai/blog/edit-tree-lemmatizer

There's a UD benchmark project that uses it (also with a trainable tokenizer that I wouldn't recommend outside of benchmarking, see https://explosion.ai/blog/ud-benchmarks-v3-2):

https://github.com/explosion/projects/tree/v3/benchmarks/ud_benchmark

I would currently guess that the edit tree lemmatizer could move into the core library in v3.3.0, but we haven't made an official decision yet.
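
For readers unfamiliar with the term, the idea behind an edit tree can be sketched in a few lines of plain Python. This is a simplified illustration of the general technique only, not spaCy's implementation: the names `Subst`, `Match`, `build_tree`, and `apply_tree` are hypothetical, and the longest common substring is found with the standard library's `difflib`. A tree records how to turn a form into its lemma (keep the shared middle, rewrite the prefix and suffix), so many form/lemma pairs collapse into the same tree and a model can learn to predict a tree per token.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace the exact string `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Inner node: keep the middle of the form, rewrite prefix and suffix."""
    prefix_len: int  # characters of the form before the kept substring
    suffix_len: int  # characters of the form after the kept substring
    left: "Tree"
    right: "Tree"


Tree = Union[Subst, Match]


def build_tree(form: str, lemma: str) -> Tree:
    """Build an edit tree that maps `form` to `lemma`."""
    m = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if m.size == 0:
        # Nothing in common: fall back to a plain substitution.
        return Subst(form, lemma)
    left = build_tree(form[: m.a], lemma[: m.b])
    right = build_tree(form[m.a + m.size :], lemma[m.b + m.size :])
    return Match(m.a, len(form) - m.a - m.size, left, right)


def apply_tree(tree: Tree, form: str) -> Optional[str]:
    """Apply a tree to a new form; return None if the tree does not fit."""
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    if len(form) < tree.prefix_len + tree.suffix_len:
        return None
    mid_end = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[: tree.prefix_len])
    right = apply_tree(tree.right, form[mid_end:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len : mid_end] + right


tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))   # talk
print(apply_tree(tree, "running"))  # None (the "-ed" rule does not apply)
```

Because `build_tree("talked", "talk")` yields the same tree as `build_tree("walked", "walk")`, lemmatization can be treated as choosing among a limited set of trees seen in training.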

12 replies

kanayer Dec 22, 2021
Author

Wow, the benchmark model looks really cool! If I have my own dataset with lemmas, POS tags, head info, and dependency labels in CoNLL-U format (bigger than KAIST or GSD), can I re-train the model? Can I put my dataset into the assets folder of the ud_benchmark project that I cloned using python -m spacy project clone benchmarks/ud_benchmark?

adrianeboyd Dec 22, 2021

As long as the paths/names work out for the steps in project.yml, it should work fine.

Because of the current defaults for ko, you have to have mecab installed even though it isn't used in the end. I hope to improve this in v3.3, but it's still on my to-do list.
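
Before placing a custom CoNLL-U file in the project's assets folder, it may help to confirm that it actually carries the annotations mentioned above (lemmas, POS tags, heads, dependency labels). The following is a minimal standard-library sketch, not part of the ud_benchmark project: `read_conllu` and `missing_annotations` are hypothetical helper names, and the check assumes the standard 10-column CoNLL-U layout with `_` marking an unspecified value.

```python
from typing import Dict, Iterable, List, Sequence

# The ten tab-separated columns of the CoNLL-U format, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(lines: Iterable[str]) -> List[List[Token]]:
    """Parse CoNLL-U lines into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def missing_annotations(
    sentences: List[List[Token]],
    required: Sequence[str] = ("lemma", "upos", "head", "deprel"),
) -> Dict[str, int]:
    """Count tokens whose required columns are unspecified ("_").

    Note: a literal underscore token may legitimately have "_" as its
    lemma, so a small non-zero lemma count is not necessarily an error.
    """
    counts = {name: 0 for name in required}
    for sent in sentences:
        for tok in sent:
            for name in required:
                if tok[name] == "_":
                    counts[name] += 1
    return counts


sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
print(missing_annotations(read_conllu(sample.splitlines())))
# {'lemma': 1, 'upos': 0, 'head': 0, 'deprel': 0}
```

For a real file, the same functions can be called on an open file handle, for example `read_conllu(open(path, encoding="utf8"))`, where `path` is a placeholder for wherever the dataset lives.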

kanayer Dec 22, 2021
Author

Wow, this is really awesome! I'm fascinated by the spaCy NLP developers team!
Do you think this will work on the Thai dataset?
How much GPU memory did you need to train these models with big datasets?
Also, is it normal that displaCy throws an error when I use it with the text analyzed by ko_udv25_koreankaist_trf?

adrianeboyd Dec 22, 2021

Typically ~12-15GB, but with the default settings it depends on the size of the dev set mainly. You can always adjust the batch sizes if necessary, though.

The only UD Thai corpus I see is way too small.

displacy works fine on my end?

adrianeboyd Dec 22, 2021

The built-in huggingface demo doesn't show anything, though, because it only supports NER.
