The discrepancy is probably due to the conversion settings for UD_Italian-ISDT. If you check token_acc for your pipeline, it's probably much lower than for it_core_news_sm.

spaCy's rule-based tokenizer can't handle UD multiword tokens like nell' -> in l'. The tokenizer can only split the characters in the original text, not modify the underlying text, so spaCy's UD-based pipelines use the merged token nell' instead of splitting it into in l'.
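To make the constraint concrete, here is a minimal stdlib-only sketch. It does not use spaCy, and the helper name is_nondestructive is hypothetical. It checks whether a proposed tokenization could come from a tokenizer that only splits the original characters: every token has to appear in the text, in order, with nothing but whitespace in between.

```python
def is_nondestructive(text, tokens):
    """Return True if `tokens` are consecutive substrings of `text`,
    separated only by whitespace (hypothetical helper, not a spaCy API)."""
    pos = 0
    for tok in tokens:
        # Skip any whitespace between tokens.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        # The next token must match the text exactly at this position.
        if not text.startswith(tok, pos):
            return False
        pos += len(tok)
    # Nothing except whitespace may remain after the last token.
    return text[pos:].strip() == ""


text = "nell'acqua"
print(is_nondestructive(text, ["nell'", "acqua"]))      # merged token
print(is_nondestructive(text, ["in", "l'", "acqua"]))   # UD syntactic words
```

The merged form nell' passes the check, while the UD split in l' does not, because "in" never appears in the original string.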

To merge multiword tokens with spacy convert, use the --merge-subtokens option:

python -m spacy convert -n 10 --merge-subtokens train.conllu .
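As a rough illustration of what merging means at the surface level, the sketch below reads one CoNLL-U sentence and keeps the multiword-token form (the line with a range ID such as 1-2) instead of the syntactic words it covers. This is only an assumption-laden sketch, not spaCy's converter: the function name surface_tokens and the sample sentence are made up for illustration, and it ignores how tags, lemmas and dependencies are combined.

```python
def surface_tokens(conllu_sentence):
    """Return surface forms for one CoNLL-U sentence, preferring
    multiword-token range lines over the words they cover."""
    tokens = []
    skip_until = 0  # highest word ID covered by the last range line
    for line in conllu_sentence.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line such as "1-2": keep its form, skip the covered words.
            skip_until = int(tok_id.split("-")[1])
            tokens.append(form)
        elif "." in tok_id:
            continue  # empty nodes (e.g. "3.1") have no surface form
        elif int(tok_id) > skip_until:
            tokens.append(form)
    return tokens


# Hypothetical sample with only the ID, FORM and LEMMA columns filled in.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ("1-2", "nell'", "_"),
        ("1", "in", "in"),
        ("2", "l'", "il"),
        ("3", "acqua", "acqua"),
    ]
)
print(surface_tokens(SAMPLE))
```

For the sample, the words with IDs 1 and 2 are dropped in favor of the range line, so the result contains nell' and acqua.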

The default tokenizer settings for Italian (it) are customized to match this conversion.

More details related to UD tokenization: https://explosio…

Answer selected by alavista-zan