How to achieve the best accuracy with experimental_edit_tree_lemmatizer #10662
-
In the `edit_tree_lemmatizer` blog post the presented results are all above 94%, but training on my machine only achieves 89-90%. The evaluation results are:

Are there any other tweaks I can apply? Thank you!
-
Double-check that you've included the vectors in `initialize.vectors` and have enabled vectors in the `tok2vec` with `include_static_vectors = true`?
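For reference, the two settings end up in the training config roughly like this. This is only a sketch assuming the default `MultiHashEmbed` tok2vec embedding layer from the quickstart configs; `${paths.vectors}` is a placeholder that you'd typically override at training time, e.g. `--paths.vectors it_core_news_lg`:

```ini
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
# the switch that actually mixes the static vectors into the tok2vec
include_static_vectors = true

[initialize]
# pipeline package or path whose vectors to load, e.g. it_core_news_lg
vectors = ${paths.vectors}
```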
For Dutch this isn't the issue, but for Italian the drop is probably mainly related to tokenization. The `TOK` score of 96-97 is relatively low, and every tokenization error turns into at least one lemmatization error.

`Doc` objects in spaCy don't support multiword tokens, so in the provided trained pipelines and the examples in that blog post, we merge multiword tokens when converting UD corpora. We also group sentences into paragraph-sized chunks, so our typical conversion with both options is:
```
spacy convert -T -n 10 file.conllu .
```
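(`-T` is short for `--merge-subtokens` and `-n 10` for `--n-sents 10`, i.e. ten sentences per output doc.) As a concrete example (the UD file names are just illustrative), converting both splits of an Italian treebank might look like:

```
mkdir -p corpus
python -m spacy convert -T -n 10 it_isdt-ud-train.conllu corpus/
python -m spacy convert -T -n 10 it_isdt-ud-dev.conllu corpus/
```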
There's a longer explanation of the merged multiword tokens about halfway through this post: https://explosion.ai/blog/ud-benchmarks-v3-2

With merged multiword tokens and the …