Problems with reproducing the training of the it_core_news_sm spaCy pipeline #12878
-
I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm. This pipeline is trained on two datasets:
Where can I find more info about the data used for training? Did they use both the training and dev sets to train the pipeline? Did they group sentences together, as suggested by the `spacy convert` command? So far I have trained on just the UD_Italian-ISDT dataset for POS tagging (coarse-grained and fine-grained), parsing, lemmatization and morphological analysis, using the training config file available here. I used the train set for training and the validation set to evaluate the pipeline, and I get results far lower than those reported here. Here are my results:
Could someone help me with this? Where can I find more info about the training setup, and what could be causing these scores?
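For context, my workflow looks roughly like this (file and directory names are illustrative; the config is the one linked above):

```
python -m spacy convert it_isdt-ud-train.conllu ./corpus
python -m spacy convert it_isdt-ud-dev.conllu ./corpus
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/it_isdt-ud-train.spacy \
    --paths.dev ./corpus/it_isdt-ud-dev.spacy
```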
Replies: 1 comment 1 reply
-
The discrepancy is probably due to the conversion settings for UD_Italian-ISDT. If you check `token_acc`, it's probably much lower than for it_core_news_sm. spaCy's rule-based tokenizer can't handle UD multiword tokens like `nell'` -> `in l'`: the tokenizer can only split the characters in the original text, not modify the underlying text, so spaCy's UD-based pipelines use the merged token `nell'` instead of splitting it into `in l'`.
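As a quick check, `spacy evaluate` reports token accuracy as `TOK`; a minimal sketch, assuming the trained model was saved to `./output` and the dev set was converted to `./corpus/it_isdt-ud-dev.spacy` (both paths are placeholders):

```
python -m spacy evaluate ./output/model-best ./corpus/it_isdt-ud-dev.spacy
```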
To merge multiword tokens with `spacy convert`, use the `--merge-subtokens` option: `python -m spacy convert -n 10 --merge-subtokens train.conllu .`
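For example, a sketch of converting both ISDT splits with these settings (file names are placeholders; `-n 10` groups 10 sentences into each doc, which is the sentence grouping mentioned in the question):

```
python -m spacy convert -n 10 --merge-subtokens it_isdt-ud-train.conllu ./corpus
python -m spacy convert -n 10 --merge-subtokens it_isdt-ud-dev.conllu ./corpus
```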
The default `it` tokenizer settings are customized for this conversion. More details related to UD tokenization: https://explosion.ai/blog/ud-benchmarks-v3-2