How to achieve the best accuracy with experimental_edit_tree_lemmatizer #10662
-
In the `edit_tree_lemmatizer` blog post the presented results are all above 94%, but training on my machine only achieves 89-90%. The evaluation results are:

Are there any other tweaks I can apply? Thank you!
-
Double-check that you've included the vectors in `initialize.vectors` and have enabled vectors in the `tok2vec` with `include_static_vectors = true`?
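For reference, the two settings end up in the training config roughly like this. This is only a sketch assuming the default `MultiHashEmbed` tok2vec embedding layer from the quickstart configs; `${paths.vectors}` is a placeholder that you'd typically override at training time, e.g. `--paths.vectors it_core_news_lg`:

```ini
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
# the switch that actually mixes the static vectors into the tok2vec
include_static_vectors = true

[initialize]
# pipeline package or path whose vectors to load, e.g. it_core_news_lg
vectors = ${paths.vectors}
```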
For Dutch this isn't the issue, but for Italian the drop is probably mainly related to tokenization. The `TOK` score of 96-97 is relatively low, and every tokenization error turns into at least one lemmatization error.

`Doc` objects in spaCy don't support multiword tokens, so in the provided trained pipelines and the examples in that blog post, we merge multiword tokens when converting UD corpora. We also group sentences into paragraph-sized chunks, so our typical conversion with both options is:
```
spacy convert -T -n 10 file.conllu .
```
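(`-T` is short for `--merge-subtokens` and `-n 10` for `--n-sents 10`, i.e. ten sentences per output doc.) As a concrete example (the UD file names are just illustrative), converting both splits of an Italian treebank might look like:

```
mkdir -p corpus
python -m spacy convert -T -n 10 it_isdt-ud-train.conllu corpus/
python -m spacy convert -T -n 10 it_isdt-ud-dev.conllu corpus/
```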
There's a longer explanation of the merged multiword tokens about halfway through this post: https://explosion.ai/blog/ud-benchmarks-v3-2

With merged multiword tokens and the …