Problems with reproducing the training of the it_core_news_sm spaCy pipeline #12878
-
I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm. This pipeline is trained on two datasets:
Where can I find more info about the data used for training? Did they use both the training and dev sets to train the pipeline? Did they group sentences together, as suggested by the `spacy convert` command? So far I have trained on just the UD_Italian-ISDT dataset for POS tagging (coarse-grained and fine-grained), parsing, lemmatization and morphological analysis, using the training config file available here. I used the train set for training and the validation set to evaluate the pipeline, and I get results far lower than those reported here. Here are my results:
Could someone help me with this? Where can I find more info about the training setup, and what could be causing these scores?
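For context, my workflow looks roughly like this (file and directory names are illustrative; the config is the one linked above):

```
python -m spacy convert it_isdt-ud-train.conllu ./corpus
python -m spacy convert it_isdt-ud-dev.conllu ./corpus
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/it_isdt-ud-train.spacy \
    --paths.dev ./corpus/it_isdt-ud-dev.spacy
```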
Replies: 1 comment 1 reply
-
The discrepancy is probably due to the conversion settings for UD_Italian-ISDT. If you check `token_acc`, it's probably much lower than for it_core_news_sm. spaCy's rule-based tokenizer can't handle UD multiword tokens like `nell'` -> `in l'`: the tokenizer can only split the characters in the original text, not modify the underlying text, so spaCy's UD-based pipelines use the merged token `nell'` instead of splitting it into `in l'`.
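As a quick check, `spacy evaluate` reports token accuracy as `TOK`; a minimal sketch, assuming the trained model was saved to `./output` and the dev set was converted to `./corpus/it_isdt-ud-dev.spacy` (both paths are placeholders):

```
python -m spacy evaluate ./output/model-best ./corpus/it_isdt-ud-dev.spacy
```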
To merge multiword tokens with `spacy convert`, use the `--merge-subtokens` option: `python -m spacy convert -n 10 --merge-subtokens train.conllu .`
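For example, a sketch of converting both ISDT splits with these settings (file names are placeholders; `-n 10` groups 10 sentences into each doc, which is the sentence grouping mentioned in the question):

```
python -m spacy convert -n 10 --merge-subtokens it_isdt-ud-train.conllu ./corpus
python -m spacy convert -n 10 --merge-subtokens it_isdt-ud-dev.conllu ./corpus
```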
The default `it` tokenizer settings are customized for this conversion. More details related to UD tokenization: https://explosion.ai/blog/ud-benchmarks-v3-2